Scrapy Tutorial

XPath: a brief intro

Besides CSS, Scrapy selectors also support XPath expressions:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'

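XPath expressions are very powerful and are the foundation of Scrapy selectors. In fact, CSS selectors are converted to XPath under the hood. You can see this if you read the text representation of the selector objects in the shell closely.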

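While perhaps not as popular as CSS selectors, XPath expressions offer more power because, besides navigating the structure, they can also look at the content. Using XPath, you can select things like: the link that contains the text "Next Page". This makes XPath a very good fit for scraping, and we encourage you to learn XPath even if you already know how to construct CSS selectors; it will make scraping much easier.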

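We won't cover much of XPath here, but you can read more about using XPath with Scrapy selectors in the selectors topic of the Scrapy documentation. To learn more about XPath, we recommend the tutorial "Learn XPath through examples" (http://zvon.org/comp/r/tut-XPath_1.html) and the tutorial "How to think in XPath" (http://plasmasturm.org/log/xpath101/).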

Extracting quotes and authors

Now that you know a bit about selection and extraction, let's complete our spider by writing the code to extract the quotes from the web page.

Each quote in http://quotes.toscrape.com is represented by HTML elements that look like this:

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

Let's open up the Scrapy shell and play a bit to find out how to extract the data we want:

$ scrapy shell 'http://quotes.toscrape.com'

We get a list of selectors for the quote HTML elements with:

>>> response.css("div.quote")
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 ...]

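Each of the selectors returned by the query above allows us to run further queries over its sub-elements. Let's assign the first selector to a variable, so that we can run our CSS selectors directly on a particular quote: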

>>> quote = response.css("div.quote")[0]

Now, let's extract the text, author and tags from that quote using the quote object we just created:

>>> text = quote.css("span.text::text").get()
>>> text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").get()
>>> author
'Albert Einstein'

Given that the tags are a list of strings, we can use the .getall() method to get all of them:

>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']

Having figured out how to extract each bit, we can now iterate over all the quote elements and put them together into a Python dictionary:

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").get()
...     author = quote.css("small.author::text").get()
...     tags = quote.css("div.tags a.tag::text").getall()
...     print(dict(text=text, author=author, tags=tags))
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
...