How to run our spider

To put our spider to work, go to the project's top-level directory and run:

scrapy crawl quotes
This command runs the spider named quotes that we just added, sending requests to the quotes.toscrape.com domain. You will get output similar to this (trimmed here for brevity):
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
...
Now, check the files in the current directory. You should notice that two new files have been created: quotes-1.html and quotes-2.html, with the content for the respective URLs, as our parse method instructs.

Note: if you are wondering why we haven't parsed the HTML yet, hold on, we will cover that soon.

What just happened under the hood?

Scrapy schedules the scrapy.Request objects returned by the start_requests method of the Spider. Upon receiving a response for each one, it instantiates Response objects and calls the callback method associated with the request (in this case, the parse method), passing the response as an argument.

A shortcut to the start_requests method

Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can just define a start_urls class attribute with a list of URLs. This list will then be used by the default implementation of start_requests() to create the initial requests for your spider:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # With start_urls defined, the default start_requests() builds the
    # initial requests for us; each response is handed to parse().
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # e.g. 'http://quotes.toscrape.com/page/1/' -> '1'
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
The parse() method will be called to handle each of the requests for those URLs, even though we haven't explicitly told Scrapy to do so. This happens because parse() is Scrapy's default callback method, which is called for requests without an explicitly assigned callback.
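For comparison, here is a minimal sketch of the explicit version that the start_urls shortcut replaces: a start_requests() method that yields the scrapy.Request objects itself. The callback=self.parse argument is written out for clarity; it could be omitted, since parse() is the default callback.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            # Yield one Request per URL; Scrapy schedules them and calls
            # self.parse with each Response as it arrives.
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

Both versions behave identically for this spider; start_urls is simply less code when the initial requests need no special handling.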