
Scrapy Tutorial

2022-2-21 05:57 | Publisher: 笨鸟自学网

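Summary: In this tutorial, we assume that Scrapy is already installed on your system. If it is not, see the installation guide. We will scrape quotes.toscrape.com (http://quotes.toscrape.com/), a website that lists quotes from famous authors. This tutorial will walk you through ...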

Using spider arguments

You can provide command-line arguments to your spiders by using the -a option when running them:

scrapy crawl quotes -O quotes-humor.json -a tag=humor

These arguments are passed to the spider's __init__ method and become spider attributes by default.
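To make the "arguments become attributes" behaviour concrete, here is a minimal sketch that runs without Scrapy. The class SketchSpider is a hypothetical, simplified stand-in for scrapy.Spider written only for illustration; it is not Scrapy's actual code, and it assumes each -a key=value pair arrives as one keyword argument.

```python
class SketchSpider:
    """Simplified stand-in for scrapy.Spider (illustration only)."""

    name = "quotes"

    def __init__(self, **kwargs):
        # Copy every keyword argument (one per `-a key=value`) onto the instance,
        # so that `-a tag=humor` ends up readable as `self.tag`.
        self.__dict__.update(kwargs)


spider = SketchSpider(tag="humor")
print(spider.tag)                       # humor
print(getattr(spider, "author", None))  # None, because no such argument was passed
```

Values passed with -a on the command line arrive as strings, so a spider that expects a number or a list needs to convert the value itself.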

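In this example, the value provided for the tag argument is available via self.tag. You can use this to make your spider fetch only quotes with a specific tag, building the URL based on the argument: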

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

If you pass the tag=humor argument to this spider, you will notice that it only visits URLs from the humor tag, such as http://quotes.toscrape.com/tag/humor.
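The URL-building step in start_requests can also be pulled out into a small pure function, which makes it easy to check without running a crawl. The following is a minimal sketch under stated assumptions: build_start_url and BASE_URL are hypothetical names, not part of Scrapy, and the percent-encoding via urllib.parse.quote is an addition that the spider above does not perform.

```python
from urllib.parse import quote

BASE_URL = 'http://quotes.toscrape.com/'


def build_start_url(tag=None):
    """Return the listing URL, narrowed to one tag when a tag is given.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    if tag is None:
        return BASE_URL
    # Percent-encode the tag so characters such as spaces stay URL-safe.
    return BASE_URL + 'tag/' + quote(tag, safe='')


print(build_start_url())         # http://quotes.toscrape.com/
print(build_start_url('humor'))  # http://quotes.toscrape.com/tag/humor
```

Inside the spider, start_requests could then yield scrapy.Request(build_start_url(getattr(self, 'tag', None)), self.parse), which keeps the behaviour for tag=humor the same as before.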

You can learn more about handling spider arguments in the spider arguments section (spiderargs) of the Scrapy documentation.

Next steps

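This tutorial covers only the basics of Scrapy, and there are many other features not mentioned here. See the "What else?" section (topics-whatelse) of the "Scrapy at a glance" chapter (intro-overview) for a quick overview of the most important ones.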

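You can continue with the basic concepts section (section-basics) to learn more about the command-line tool, spiders, selectors, and other things this tutorial has not covered, such as modeling the scraped data. If you prefer to work with an example project, see the examples section (intro-examples).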
