本文介绍了调试spider的最常用技术。请考虑下面的蜘蛛: import scrapy
from myproject.items import MyItem
class MySpider(scrapy.Spider):
name = 'myspider'
start_urls = (
'http://example.com/page1',
'http://example.com/page2',
)
def parse(self, response):
# <processing code not shown>
# collect `item_urls`
for item_url in item_urls:
yield scrapy.Request(item_url, self.parse_item)
def parse_item(self, response):
# <processing code not shown>
item = MyItem()
# populate `item` fields
# and extract item_details_url
yield scrapy.Request(item_details_url, self.parse_details, cb_kwargs={'item': item})
def parse_details(self, response, item):
# populate more `item` fields
return item
基本上,这是一个简单的爬行器,它解析两页项目(Start_URL)。项也有一个包含附加信息的详细信息页面,因此我们使用 解析命令¶检查蜘蛛输出的最基本方法是使用 为了查看从特定URL中获取的项目: $ scrapy parse --spider=myspider -c parse_item -d 2 <item_url>
[ ... scrapy log lines crawling example.com spider ... ]
>>> STATUS DEPTH LEVEL 2 <<<
# Scraped Items ------------------------------------------------------------
[{'url': <item_url>}]
# Requests -----------------------------------------------------------------
[]
使用 $ scrapy parse --spider=myspider -c parse_item -d 2 -v <item_url>
[ ... scrapy log lines crawling example.com spider ... ]
>>> DEPTH LEVEL: 1 <<<
# Scraped Items ------------------------------------------------------------
[]
# Requests -----------------------------------------------------------------
[<GET item_details_url>]
>>> DEPTH LEVEL: 2 <<<
# Scraped Items ------------------------------------------------------------
[{'url': <item_url>}]
# Requests -----------------------------------------------------------------
[]
检查从一个开始的项目,也可以很容易地实现使用:: $ scrapy parse --spider=myspider -d 3 'http://example.com/page1' |
Archiver|手机版|笨鸟自学网 ( 粤ICP备20019910号 )
GMT+8, 2025-1-22 21:12 , Processed in 0.017500 second(s), 17 queries .