
Spiders

2022-2-21 06:01 | Published by: 笨鸟自学网

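Summary: A spider is a class that defines how a particular site (or group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from its pages (i.e. scrape items). In other words, for a particular site (or, in some cases, a group of sites), a spider defines the crawling and parsing of pages ...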

start_requests()

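This method must return an iterable containing the first Requests to crawl for this spider. Scrapy calls it when the spider is opened for scraping. Scrapy calls it only once, so it is safe to implement start_requests() as a generator.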

The default implementation generates Request(url, dont_filter=True) for each URL in start_urls.
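The default behavior described above can be pictured with the following minimal sketch. It deliberately avoids importing Scrapy: `Request` and `SketchSpider` are hypothetical stand-ins invented for illustration, not Scrapy's real classes, and the two URLs are placeholders. The point is only the shape of the method: a generator that yields one request per entry in start_urls, each with dont_filter=True.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Request:
    """Hypothetical stand-in for scrapy.Request (only two fields modeled)."""
    url: str
    dont_filter: bool = False


class SketchSpider:
    """Hypothetical stand-in for scrapy.Spider."""
    start_urls: List[str] = [
        "http://www.example.com/1.html",
        "http://www.example.com/2.html",
    ]

    def start_requests(self) -> Iterator[Request]:
        # A generator is fine here because the method is called only once.
        for url in self.start_urls:
            # dont_filter=True marks the initial requests so that a
            # duplicate filter would not drop them.
            yield Request(url, dont_filter=True)


# Consume the generator the way a caller would: by iterating over it.
initial_requests = list(SketchSpider().start_requests())
```

Because the method is a generator, no request object is built until the caller starts iterating, which is why a single call is enough.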

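If you want to change the Requests used to start scraping a domain, this is the method to override. For example, if you need to start by logging in with a POST request, you could do the following: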

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass

parse(response)

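This is the default callback Scrapy uses to process downloaded responses when their requests do not specify a callback.

The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow. Other Request callbacks have the same requirements as the Spider class.

This method, as well as any other Request callback, must return an iterable of Request and/or item objects.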

Parameters: response (Response) -- the response to parse

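log(message[, level, component]): Wrapper that sends a log message through the Spider's logger, kept for backward compatibility. For more information, see Logging from Spiders.

closed(reason): Called when the spider closes. This method provides a shortcut for connecting to the spider_closed signal.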

Let's look at an example:

import scrapy


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)

Returning multiple Requests and items from a single callback:

import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').getall():
            yield {"title": h3}

        for href in response.xpath('//a/@href').getall():
            yield scrapy.Request(response.urljoin(href), self.parse)
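The call to response.urljoin(href) in the callback above turns each extracted href into an absolute URL before a new Request is built. To my understanding it resolves the href against the URL of the response in much the same way as the standard library's urllib.parse.urljoin. The sketch below uses only the standard library to show that resolution; the base URL and the sample hrefs are illustrative assumptions, not values from a real page.

```python
from urllib.parse import urljoin

# Assumed URL of the page whose links are being followed.
base_url = "http://www.example.com/1.html"

# Assumed hrefs of the kinds commonly found in <a> tags.
hrefs = [
    "2.html",                     # relative to the current directory
    "/about",                     # relative to the site root
    "sub/page.html",              # relative path with a subdirectory
    "http://other.example/page",  # already absolute
]

# Resolve every href against the base URL.
absolute_urls = [urljoin(base_url, href) for href in hrefs]

for url in absolute_urls:
    print(url)
```

An href that is already absolute passes through unchanged, so the same call is safe for every link on the page.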

Instead of start_urls you can use start_requests() directly; to give the data more structure you can use Item objects:

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').getall():
            yield MyItem(title=h3)

        for href in response.xpath('//a/@href').getall():
            yield scrapy.Request(response.urljoin(href), self.parse) 
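The spider above imports MyItem from myproject.items, but that module is not shown. One possible definition is sketched below as a plain dataclass, assuming Scrapy 2.2 or later, where dataclass objects are accepted as items. The single title field is inferred from the MyItem(title=h3) call; the original project may well define the item differently, for example as a scrapy.Item subclass.

```python
# Hypothetical contents of myproject/items.py
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class MyItem:
    """One scraped heading; only the field used by the spider is declared."""
    title: Optional[str] = None


# Build an item the same way the parse() callback does.
item = MyItem(title="<h3>Example heading</h3>")

# Convert to a plain dict, e.g. for a quick inspection.
item_as_dict = asdict(item)
```

Declaring fields up front means a misspelled field name fails at construction time instead of silently producing an extra key, which is the main gain over yielding bare dicts.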
