笨鸟编程-零基础入门Pyhton教程

 找回密码
 立即注册

蜘蛛

发布者: 笨鸟自学网



SiteMapSpider示例

最简单的示例:使用 parse 回叫:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    def parse(self, response):
        pass # ... scrape item here ...

使用特定回调处理某些URL,使用其他回调处理其他URL::

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    sitemap_rules = [
        ('/product/', 'parse_product'),
        ('/category/', 'parse_category'),
    ]

    def parse_product(self, response):
        pass # ... scrape product ...

    def parse_category(self, response):
        pass # ... scrape category ...

遵循中定义的站点地图 robots.txt 文件,仅跟踪其URL包含 /sitemap_shop ::

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]
    sitemap_follow = ['/sitemap_shops']

    def parse_shop(self, response):
        pass # ... scrape shop here ...

将SiteMapSpider与其他URL源合并::

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]

    other_urls = ['http://www.example.com/about']

    def start_requests(self):
        requests = list(super(MySpider, self).start_requests())
        requests += [scrapy.Request(x, self.parse_other) for x in self.other_urls]
        return requests

    def parse_shop(self, response):
        pass # ... scrape shop here ...

    def parse_other(self, response):
        pass # ... scrape other here ...
123456789
上一篇:命令行工具下一篇:选择器

Archiver|手机版|笨鸟自学网 ( 粤ICP备20019910号 )

GMT+8, 2024-9-8 09:56 , Processed in 0.026456 second(s), 17 queries .

© 2001-2020

返回顶部