SCRAPPY(/ˈSkreɪpaɪ/)是一个应用程序框架,用于抓取网站和提取结构化数据,这些数据可用于广泛的有用应用程序,如数据挖掘、信息处理或历史存档。 尽管Scrapy最初是为 web scraping 它还可以用于使用API提取数据(例如 Amazon Associates Web Services )或者作为一个通用的网络爬虫。 浏览示例蜘蛛¶为了向您展示Scrapy带来的内容,我们将以最简单的方式运行蜘蛛,向您介绍Scrapy Spider的示例。 以下是一个蜘蛛代码,它从网站http://quotes.toscrape.com上抓取著名的引语,按照以下页码: import scrapy
class QuotesSpider(scrapy.Spider):
name = 'quotes'
start_urls = [
'http://quotes.toscrape.com/tag/humor/',
]
def parse(self, response):
for quote in response.css('div.quote'):
yield {
'author': quote.xpath('span/small/text()').get(),
'text': quote.css('span.text::text').get(),
}
next_page = response.css('li.next a::attr("href")').get()
if next_page is not None:
yield response.follow(next_page, self.parse)
把它放在一个文本文件中,命名为 scrapy runspider quotes_spider.py -o quotes.jl
完成后,您将 {"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}
{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}
{"author": "Garrison Keillor", "text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d"}
... |
Archiver|手机版|笨鸟自学网 ( 粤ICP备20019910号 )
GMT+8, 2024-12-21 21:12 , Processed in 0.039855 second(s), 17 queries .