
Spiders

XMLFeedSpider example

These spiders are easy to use. Let's look at an example:

from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.getall()))

        item = TestItem()
        item['id'] = node.xpath('@id').get()
        item['name'] = node.xpath('name').get()
        item['description'] = node.xpath('description').get()
        return item

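In short, the code above creates a spider that downloads a feed from the given start_urls, iterates over each of its item tags, logs them, and stores some sample data from each one in an Item.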
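To make the node iteration concrete, here is a minimal sketch of the same idea using only the standard library's xml.etree.ElementTree. It is not how Scrapy implements XMLFeedSpider; the feed contents and the extract_items helper are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed contents, invented for illustration.
SAMPLE_FEED = """<rss>
  <channel>
    <item id="1"><name>First</name><description>First entry</description></item>
    <item id="2"><name>Second</name><description>Second entry</description></item>
  </channel>
</rss>
"""


def extract_items(xml_text, itertag="item"):
    """Return one dict per <itertag> element found anywhere in the document."""
    root = ET.fromstring(xml_text)
    items = []
    # root.iter(tag) walks the whole tree, so nesting depth does not matter.
    for node in root.iter(itertag):
        items.append({
            "id": node.get("id"),                         # attribute value
            "name": node.findtext("name"),                # text of child <name>
            "description": node.findtext("description"),  # text of child <description>
        })
    return items


items = extract_items(SAMPLE_FEED)
for item in items:
    print(item)
```

One difference from the spider above: node.xpath('name').get() returns the serialized name element including its tags, whereas findtext() in this sketch returns only the text inside it.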

CSVFeedSpider

class scrapy.spiders.CSVFeedSpider

This spider is very similar to XMLFeedSpider, except that it iterates over rows instead of nodes. The method called in each iteration is parse_row().

delimiter: A string with the separator character for each field in the CSV file. Defaults to ',' (comma).

quotechar: A string with the enclosure character for each field in the CSV file. Defaults to '"' (quotation mark).

headers: A list of the column names in the CSV file.

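parse_row(response, row): Receives a response and a dict (representing one row) with a key for each provided (or detected) header of the CSV file. This spider also lets you override the adapt_response and process_results methods for pre-processing and post-processing.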
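The difference between provided and detected headers can be sketched with the standard library's csv module. This is an illustration under stated assumptions, not Scrapy's own parser: the sample data and the rows_as_dicts helper are hypothetical, and only the parameter names mirror the attributes listed above.

```python
import csv
import io

# Hypothetical CSV contents, invented for illustration.
SAMPLE_WITH_HEADER = "id,name,description\n1,First,First entry\n2,Second,Second entry\n"
SAMPLE_NO_HEADER = "1,First,First entry\n2,Second,Second entry\n"


def rows_as_dicts(text, delimiter=",", quotechar='"', headers=None):
    """Parse CSV text into one dict per row.

    With headers=None, csv.DictReader takes the column names from the
    first row ("detected" headers); otherwise every row is treated as data.
    """
    reader = csv.DictReader(
        io.StringIO(text),
        fieldnames=headers,
        delimiter=delimiter,
        quotechar=quotechar,
    )
    return [dict(row) for row in reader]


detected = rows_as_dicts(SAMPLE_WITH_HEADER)
provided = rows_as_dicts(SAMPLE_NO_HEADER, headers=["id", "name", "description"])
print(detected == provided)
```

Each resulting dict has the same shape as the row argument described for parse_row(): one key per column name, with the field values as strings.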

CSVFeedSpider example

Let's look at an example similar to the previous one, but using CSVFeedSpider:

from scrapy.spiders import CSVFeedSpider
from myproject.items import TestItem

class MySpider(CSVFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.csv']
    delimiter = ';'
    quotechar = "'"
    headers = ['id', 'name', 'description']

    def parse_row(self, response, row):
        self.logger.info('Hi, this is a row!: %r', row)

        item = TestItem()
        item['id'] = row['id']
        item['name'] = row['name']
        item['description'] = row['description']
        return item
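The example sets delimiter to ';' and quotechar to a single quote. As a rough sketch of what that combination means for the input, the snippet below parses a hypothetical feed with the standard library's csv module (not Scrapy's parser); the sample rows are invented for illustration.

```python
import csv
import io

# Hypothetical contents for feed.csv: fields separated by ';', with
# single quotes around any field that itself contains a ';'.
SAMPLE = "1;'Widget; large';'A description'\n2;Gadget;Plain\n"

reader = csv.DictReader(
    io.StringIO(SAMPLE),
    fieldnames=["id", "name", "description"],  # same names as headers above
    delimiter=";",
    quotechar="'",
)
rows = [dict(row) for row in reader]

for row in rows:
    print(row)
```

The quote character lets a field contain the delimiter: the name in the first row keeps its embedded ';' instead of being split into two columns.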