Python Crawlers: The CrawlSpider Spider

Author: 我家自动化 | 2024-05-20 22:16:42

1: Introduction to CrawlSpider
2: CrawlSpider basics
3: A CrawlSpider example
4: CrawlSpider summary
5: A CrawlSpider case study

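1: Introduction to CrawlSpider

Scrapy ships several spider classes; the two used in this series are Spider and CrawlSpider. The earlier post "Python爬虫之Scrapy框架的使用" (on using the Scrapy framework) relied on the Spider class: after parsing each page, we extracted the URL of the next page ourselves and sent a new request for it.
Sometimes the requirement is different: crawl every URL that satisfies some condition. CrawlSpider covers this case. It inherits from Spider and adds one feature on top: you define rules for the URLs to crawl, and from then on Scrapy requests every matching URL it encounters, with no need to yield a Request by hand.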

2: CrawlSpider basics

2.1 Creating a CrawlSpider

Until now, spiders were created with scrapy genspider [spider name] [domain]. To create a CrawlSpider, select the crawl template with the -t option instead:

scrapy genspider -t crawl [spider name] [domain]

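2.2 LinkExtractor: the link extractor

With a LinkExtractor you no longer have to pick out the URLs you want and send the requests yourself. That work is delegated to the LinkExtractor, which finds every URL matching its rules on each crawled page so that crawling proceeds automatically. A brief overview of the LinkExtractor class: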

class scrapy.linkextractors.LinkExtractor(
    allow=(),
    deny=(),
    allow_domains=(),
    deny_domains=(),
    deny_extensions=None,
    restrict_xpaths=(),
    tags=('a', 'area'),
    attrs=('href',),      # one-element tuple; ('href') would be a plain string
    canonicalize=False,   # default in current Scrapy releases; older docs show True
    unique=True,
    process_value=None
)

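Main parameters:

allow: allowed URLs. Every URL that matches this regular expression (or any one of a list of them) is extracted.
deny: forbidden URLs. No URL that matches this regular expression is extracted; deny takes precedence over allow.
allow_domains: allowed domains. Only URLs in the domains listed here are extracted.
deny_domains: forbidden domains. URLs in the domains listed here are never extracted.
restrict_xpaths: XPath expressions that restrict the regions of the page in which links are looked for. Works together with allow to filter links.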
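
To make the interplay of these parameters concrete, here is a minimal sketch in plain Python (standard library only). It is a simplified model, not Scrapy's implementation: the helper name url_passes and the sample article URLs are made up for illustration, and restrict_xpaths is not modeled. It assumes two details from the Scrapy documentation: patterns are searched for anywhere in the URL, and deny takes precedence over allow.

```python
import re
from urllib.parse import urlparse


def url_passes(url, allow=(), deny=(), allow_domains=(), deny_domains=()):
    """Hypothetical helper: decide whether a URL would be kept by the filters."""

    def in_domains(host, domains):
        # A host belongs to a domain if it equals it or is a subdomain of it.
        return any(host == d or host.endswith('.' + d) for d in domains)

    host = (urlparse(url).hostname or '').lower()
    if allow_domains and not in_domains(host, allow_domains):
        return False
    if deny_domains and in_domains(host, deny_domains):
        return False
    # An empty allow tuple means "allow everything".
    if allow and not any(re.search(p, url) for p in allow):
        return False
    # deny wins over allow.
    if any(re.search(p, url) for p in deny):
        return False
    return True


urls = [
    'http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=2',
    'http://www.wxapp-union.com/article-1-1.html',   # made-up article URL
    'http://other.example.com/article-2-1.html',     # made-up foreign domain
]
kept = [u for u in urls
        if url_passes(u, allow=(r'article-.+\.html',),
                      allow_domains=('wxapp-union.com',))]
print(kept)
```

Only the second URL survives: the first does not match allow, and the third is outside allow_domains.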

2.3 The Rule class

Rule defines a crawling rule for the spider. A brief overview of the class:

class scrapy.spiders.Rule(
    link_extractor,
    callback=None,
    cb_kwargs=None,
    follow=None,
    process_links=None,
    process_request=None
)

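Main parameters:

link_extractor: a LinkExtractor object that defines which links this rule applies to.
callback: the callback to run for responses to URLs that match this rule. CrawlSpider uses parse internally to implement its own logic, so do not name your callback parse or override that method.
follow: whether links extracted from the response under this rule should be followed in turn. If omitted, it defaults to True when callback is None and to False otherwise.
process_links: a function that receives the links obtained from link_extractor; use it to filter out links that should not be crawled.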
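
How callback and follow interact is easiest to see in a toy model. The sketch below crawls a made-up in-memory site instead of the network; SITE, RULES, and crawl are hypothetical names, and the logic is a simplification of what CrawlSpider does. It assumes that follow defaults to True when no callback is given and that a link matched by several rules is handled by the first one.

```python
import re
from collections import deque

# Made-up site: URL -> links found on that page.
SITE = {
    '/list?page=1': ['/list?page=2', '/article-1.html'],
    '/list?page=2': ['/list?page=1', '/article-2.html'],
    '/article-1.html': ['/article-3.html'],
    '/article-2.html': [],
    '/article-3.html': [],
}

# (allow pattern, callback name or None, follow or None)
RULES = [
    (r'/list\?page=\d+', None, None),               # follow defaults to True
    (r'/article-\d+\.html', 'parse_detail', False),  # parse, do not follow
]


def crawl(start_url):
    """Return the URLs that were handed to a callback, in crawl order."""
    parsed = []
    seen = {start_url}                   # duplicate filter
    queue = deque([(start_url, True)])   # (url, extract links from it?)
    while queue:
        url, follow = queue.popleft()
        if not follow:
            continue
        for link in SITE[url]:
            if link in seen:
                continue
            for pattern, callback, rule_follow in RULES:
                if re.search(pattern, link):
                    seen.add(link)
                    if callback is not None:
                        parsed.append(link)
                    if rule_follow is None:
                        rule_follow = callback is None
                    queue.append((link, rule_follow))
                    break                # first matching rule wins
    return parsed


print(crawl('/list?page=1'))
```

In this model /article-3.html is never reached: it is linked only from an article page, and the article rule sets follow to False.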

3: A CrawlSpider example

3.1 Create the project and the spider

scrapy startproject wxapp
cd wxapp
scrapy genspider -t crawl wxapp_spider wxapp-union.com

3.2 Define the rules for the URLs to crawl

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from wxapp.items import WxappItem


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    rules = (
        # Article list pagination URLs. follow=True keeps extracting matching
        # links from every list page reached this way, not just the first.
        Rule(LinkExtractor(allow=r'.+mod=list&catid=2&page=\d+'), follow=True),
        # Article detail page URLs: parse them, but do not follow their links.
        Rule(LinkExtractor(allow=r'.+article-.+\.html'), callback='parse_detail', follow=False),
    )

    # The method name must match the callback string given in the Rule above.
    def parse_detail(self, response):
        title = response.xpath('//h1/text()').get()
        authors = response.xpath('//p[@class="authors"]')
        author = authors.xpath('.//a/text()').get()
        pub_time = authors.xpath('.//span/text()').get()
        content = response.xpath('//td[@id="article_content"]//text()').getall()
        content = ''.join(content).strip()
        item = WxappItem(title=title, author=author, pub_time=pub_time, content=content)
        yield item

3.3 Define the data fields to save

import scrapy
class WxappItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    pub_time = scrapy.Field()
    content = scrapy.Field()

3.4 Save the scraped data

from scrapy.exporters import JsonLinesItemExporter
class WxappPipeline:
    def __init__(self):
        self.fp = open('wx.json', 'wb')
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii = False, encoding = 'utf-8')
    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
    def close_spider(self, spider):
        self.fp.close()
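
For readers unfamiliar with the JSON Lines format this pipeline produces, the sketch below writes the same kind of output using only the standard library. It illustrates the file format, not the internals of JsonLinesItemExporter; write_json_lines and the sample records are made up.

```python
import io
import json


def write_json_lines(records, fp):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    for record in records:
        # ensure_ascii=False writes characters such as Chinese text as-is
        # instead of \uXXXX escapes.
        fp.write(json.dumps(record, ensure_ascii=False) + '\n')


buffer = io.StringIO()  # stands in for a file opened with encoding='utf-8'
write_json_lines(
    [{'title': '小程序教程', 'author': 'a'}, {'title': 'second', 'author': 'b'}],
    buffer,
)
print(buffer.getvalue())
```

Because every line is a complete JSON document, the file can be appended to during the crawl and read back one line at a time.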

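4: CrawlSpider summary

When writing rules, remember to escape . and ? in regular expressions; otherwise the pattern will not match what you expect.
A CrawlSpider looks for every link on a page that matches a rule's regular expression. If the crawler should enter those links and keep extracting from them, set follow to True. If you only need the data behind the link, set a callback and set follow to False.
At run time the spider does not fetch pages one by one in order. Requests are scheduled and downloaded concurrently, so the order looks somewhat random.
Scrapy filters out duplicate requests on its own, so there is no need to worry about the same URL being crawled twice.
LinkExtractor and Rule are both required; together they determine where the crawler goes.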
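
The first point is easy to verify with the standard re module. In the sketch below the URL is the start URL from the example in section 3; the patterns themselves are illustrative.

```python
import re

url = 'http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1'

# Unescaped: "p?" means "an optional p", so the literal "?" in the URL
# is never matched and the search fails.
unescaped = re.search(r'portal.php?mod=list', url)

# Escaped: "\." and "\?" match the literal characters.
escaped = re.search(r'portal\.php\?mod=list', url)

# re.escape builds the escaped pattern from a literal string.
auto = re.search(re.escape('portal.php?mod=list'), url)

print(unescaped is None, escaped is not None, auto is not None)  # True True True
```

An unescaped . is less dangerous than an unescaped ?, since it still matches a literal dot, but it also matches any other character and can let unwanted URLs through.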

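5: A CrawlSpider case study

Note that the case below uses two item classes. The data comes from different pages, and the requests that CrawlSpider builds from its rules do not carry data from one callback to the next by default, so the data from each kind of page is collected in its own item class.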

5.1 Spider file

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from SunPro.items import DetailItem, SunproItem


class SunSpider(CrawlSpider):
    name = 'sun'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1']

    rules = (
        # Pagination links. follow=True applies the link extractors again to
        # every page reached through the links extracted here.
        Rule(LinkExtractor(allow=r'id=1&page=\d+'), callback='parse_item', follow=True),
        # Detail pages. Pagination URLs also contain id=1, but a link matched
        # by several rules is handled by the first matching rule only.
        Rule(LinkExtractor(allow=r'id=\d+'), callback='parse_detail', follow=False),
    )

    # Requests built from the rules carry no data (such as meta) from one
    # callback to the other, so each callback fills its own item class.
    def parse_item(self, response):
        # Do not include tbody in the XPath.
        li_list = response.xpath('//ul[@class="title-state-ul"]/li')
        for li in li_list:
            detail_url = li.xpath('./span[@class="state3"]/a/@href').extract_first()
            # detail_url = urljoin(response.url, detail_url)
            title = li.xpath('./span[@class="state3"]/a/text()').extract_first()
            item = SunproItem(title=title)
            yield item

    def parse_detail(self, response):
        new_id = response.xpath('//div[contains(@class, "focus-date-list")]/span[4]//text()').extract_first()
        item = DetailItem(new_id=new_id)
        yield item
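
The two allow patterns in this spider overlap: a pagination URL contains id=1 and therefore also matches id=\d+. According to the Scrapy documentation, when several rules match the same link, the first one in rules is used. The sketch below models only that rule selection with the standard library; first_matching_rule is a hypothetical helper, and the detail URL is a made-up example of the shape such a link might have.

```python
import re

# (allow pattern, callback name), in the same order as in the spider.
RULES = [
    (r'id=1&page=\d+', 'parse_item'),
    (r'id=\d+', 'parse_detail'),
]


def first_matching_rule(url):
    """Return the callback of the first rule whose pattern is found in url."""
    for pattern, callback in RULES:
        if re.search(pattern, url):
            return callback
    return None


page_url = 'http://wz.sun0769.com/political/index/politicsNewest?id=1&page=2'
detail_url = 'http://wz.sun0769.com/political/politics/index?id=123456'  # made up

print(first_matching_rule(page_url))    # parse_item
print(first_matching_rule(detail_url))  # parse_detail
```

Swapping the two rules would send the pagination pages to parse_detail, so the order of the rules matters here.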

5.2 Items file

import scrapy
class SunproItem(scrapy.Item):
    title = scrapy.Field()
class DetailItem(scrapy.Item):
    new_id = scrapy.Field()

5.3 Pipeline file

from SunPro.items import DetailItem


class SunproPipeline:
    def process_item(self, item, spider):
        print(item)
        # Dispatch on the item type.
        if isinstance(item, DetailItem):
            print(item['new_id'])
        else:
            print(item['title'])
        return item
