Creating a project¶
Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you'd like to store your code and run:

scrapy startproject tutorial
This will create a tutorial directory with the following contents:

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
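Everything under the inner tutorial/ package is an ordinary Python module. For example, the generated settings.py holds project-wide configuration; an abridged sketch of what it typically contains is shown below (the exact generated contents vary by Scrapy version):

# tutorial/settings.py (abridged sketch, not the full generated file)
BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

# Respect robots.txt rules by default
ROBOTSTXT_OBEY = True

BOT_NAME and the spider module settings are filled in from the project name you passed to startproject; ROBOTSTXT_OBEY makes the crawler respect robots.txt by default.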
Our first Spider¶
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass scrapy.Spider and define the initial requests to make.

This is the code for our first Spider. Save it in a file under your project's spiders/ directory:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods: name identifies the Spider within the project, start_requests() returns an iterable of Requests from which the Spider will begin to crawl, and parse() is the callback that handles the response downloaded for each of those requests, here writing the page body to a local HTML file.
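The usual way to run this spider is the scrapy crawl quotes command from the project's top-level directory, but it can also be driven from a plain Python script through Scrapy's CrawlerProcess API. A minimal sketch follows; the module path tutorial.spiders.quotes_spider is an assumption about where you saved the file above:

# run_quotes.py -- minimal sketch; assumes the spider above was saved as
# tutorial/spiders/quotes_spider.py and that this script is run from the
# project's top-level directory (next to scrapy.cfg).
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from tutorial.spiders.quotes_spider import QuotesSpider

# Load the project settings (settings.py) and start the crawl in-process.
process = CrawlerProcess(get_project_settings())
process.crawl(QuotesSpider)
process.start()  # blocks here until the crawl is finished

Either way, after the run you should find quotes-1.html and quotes-2.html in the directory you started from, written by the parse() callback.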