
Common Practices

2022-2-21 06:24 | Publisher: 笨鸟自学网

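Summary: This section documents common practices when using Scrapy. These cover many topics and usually do not fall under any other specific section. Run Scrapy from a script: you can use the API to run Scrapy from a script, instead of the typical way of running Scrapy via scrapy crawl. Remember that Scrapy is built on Twis ...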

Running multiple spiders in the same process

By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.

Here is an example that runs multiple spiders simultaneously:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

settings = get_project_settings()
process = CrawlerProcess(settings)
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished

The MySpider1/MySpider2 example again, this time using CrawlerRunner:

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
settings = get_project_settings()
runner = CrawlerRunner(settings)
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run() # the script will block here until all crawling jobs are finished

The same MySpider1/MySpider2 example, but running the spiders sequentially by chaining the deferreds:

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
settings = get_project_settings()
runner = CrawlerRunner(settings)

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished

See also

Run Scrapy from a script.
