I haven't updated my notes for a while. Partly it's because work has been busy and I've been learning a lot of new things, and partly because I got lazy around the New Year and quite a few things happened, so my mind simply wasn't on this. I hope I can get back into gear from here on: good good study, day day up!
To briefly explain my understanding: scrapy-redis can be seen as an extension of Scrapy. Its main additions are distributed crawling, persistent deduplication, resumable crawls, and incremental crawling. So when you crawl frequently updated sites, such as news sites, you can crawl incrementally and automatically skip links that have already been crawled, which is very convenient.
Installing Redis is very simple, and there are plenty of installation walkthroughs online. Here is one I recommend:
https://blog.csdn.net/weixin_43288034/article/details/107041599
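Once Redis is installed and running, a quick way to confirm that Scrapy will be able to reach it is to ping it with the redis-py client (the same client scrapy-redis uses); a minimal check, assuming Redis is listening on the default 127.0.0.1:6379:

from redis import Redis

# Quick sanity check that Redis is reachable; adjust host/port if yours differs
conn = Redis(host='127.0.0.1', port=6379)
print(conn.ping())  # True if Redis is up and accepting connections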
You only need to add a few Redis configuration parameters to settings.py:
# Redis host and port
# REDIS_HOST = 'localhost'
# REDIS_PORT = 6379
# Enable the scrapy-redis scheduler
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
# All spiders share the same dedup filter through Redis
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
# Persist the request queue in Redis, so the crawl can be paused and resumed
SCHEDULER_PERSIST = True
# Request scheduling strategy; a priority queue by default
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
The commented-out lines are optional; the Redis host and port can also be specified in the spider file.
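If you would rather keep the Redis address alongside the spider instead of in settings.py, one option is Scrapy's per-spider custom_settings, which scrapy-redis reads just like the project settings; a minimal sketch:

import scrapy


class NewsSpider(scrapy.Spider):
    name = 'news'
    # Per-spider overrides; scrapy-redis picks up REDIS_HOST / REDIS_PORT from settings
    custom_settings = {
        'REDIS_HOST': '127.0.0.1',
        'REDIS_PORT': 6379,
    }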
With the basic configuration done, we can start writing the spider.
Target site: https://cn.nytimes.com/
This is a foreign news site, and we want to crawl all of its news content by category.
We can see that the news is divided into many sections.
Clicking through the different sections, the URL changes in a regular pattern, so we can be fairly sure these pages are plain GET requests. Next, let's look at where the article links and the article content are.
From this we can conclude that the data we need sits in the static page and can be requested directly.
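A quick way to double-check this before writing the spider is to fetch one section page directly and confirm that the article links are already in the raw HTML; a rough sketch (the headers are a bare minimum, and since this site generally needs a proxy to reach, the request may have to go through one):

import requests
from parsel import Selector

# Rough check that the section page is server-rendered; a proxy may be required
resp = requests.get('https://cn.nytimes.com/world/',
                    headers={'User-Agent': 'Mozilla/5.0'})
sel = Selector(text=resp.text)
print(sel.xpath('//h3//a/@href').getall()[:5])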
First, the article links on the section pages, which can be extracted with XPath:
def parse(self, response):
    # Section name, e.g. 'world', taken from the section URL
    type_url = response.url.replace('https://cn.nytimes.com/', '').replace('/', '')
    content_url = response.xpath('//h3//a/@href').getall()
    for i in content_url:
        s = response.urljoin(i)
        # Append 'dual/' to request the bilingual version of the article
        a = urljoin(s, 'dual/')
        yield response.follow(url=a, callback=self.parse_detail, meta={'Type': type_url})
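For reference, urljoin with the relative segment 'dual/' simply appends it to the article URL, which on this site points at the bilingual version of the article; a small illustration with a made-up article path:

from urllib.parse import urljoin

# The article URL below is hypothetical, only to show the join behaviour
s = 'https://cn.nytimes.com/world/20210301/example-article/'
print(urljoin(s, 'dual/'))
# -> https://cn.nytimes.com/world/20210301/example-article/dual/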
Then pagination:
nextpage_url = response.xpath('//*[@class="next"]//a/@href').get()
if nextpage_url:
    yield response.follow(url=nextpage_url, callback=self.parse, dont_filter=True)
Note: this is the simplest way to paginate; the dont_filter=True argument keeps the pagination URL from being filtered out by the Redis dedup filter.
Next is the article detail page:
def parse_detail(self, response):
    item = {}
    type_url = str(response.meta['Type'])
    # Strip any digits (page numbers) from the section name
    number = re.findall(r'\d+', type_url)
    if number:
        number = ''.join(number)
    else:
        number = ''
    type_url = type_url.replace(number, '').strip()
    item['url'] = response.url
    item['type_url'] = type_url
    # Collect every paragraph of the article body
    news_paragragh_list = []
    content = response.xpath('//*[@class="article-paragraph"]')
    for i in content:
        paragragh = ''.join(i.xpath('.//text()').extract())
        news_paragragh_list.append(paragragh.strip())
    item['page'] = news_paragragh_list
    yield item
With that, the spider file is basically complete.
Full spider code:
import scrapy
import re
from redis import Redis
from scrapy import cmdline
from urllib.parse import urljoin


class NewsSpider(scrapy.Spider):
    name = 'news'
    start_urls = [
        'https://cn.nytimes.com/world/',
        'https://cn.nytimes.com/china/',
        'https://cn.nytimes.com/business/',
        'https://cn.nytimes.com/lens/',
        'https://cn.nytimes.com/technology/',
        'https://cn.nytimes.com/science/',
        'https://cn.nytimes.com/health/',
        'https://cn.nytimes.com/education/',
        'https://cn.nytimes.com/travel/',
        'https://cn.nytimes.com/culture/',
        'https://cn.nytimes.com/style/',
        'https://cn.nytimes.com/real-estate/',
        'https://cn.nytimes.com/opinion/',
    ]
    # Create a Redis connection object
    conn = Redis(host='127.0.0.1', encoding='utf-8', port=6379)

    def parse(self, response):
        # Section name, e.g. 'world', taken from the section URL
        type_url = response.url.replace('https://cn.nytimes.com/', '').replace('/', '')
        content_url = response.xpath('//h3//a/@href').getall()
        for i in content_url:
            s = response.urljoin(i)
            # Append 'dual/' to request the bilingual version of the article
            a = urljoin(s, 'dual/')
            yield response.follow(url=a, callback=self.parse_detail, meta={'Type': type_url})
        # Pagination
        nextpage_url = response.xpath('//*[@class="next"]//a/@href').get()
        if nextpage_url:
            yield response.follow(url=nextpage_url, callback=self.parse, dont_filter=True)

    def parse_detail(self, response):
        item = {}
        type_url = str(response.meta['Type'])
        # Strip any digits (page numbers) from the section name
        number = re.findall(r'\d+', type_url)
        if number:
            number = ''.join(number)
        else:
            number = ''
        type_url = type_url.replace(number, '').strip()
        item['url'] = response.url
        item['type_url'] = type_url
        # Collect every paragraph of the article body
        news_paragragh_list = []
        content = response.xpath('//*[@class="article-paragraph"]')
        for i in content:
            paragragh = ''.join(i.xpath('.//text()').extract())
            news_paragragh_list.append(paragragh.strip())
        item['page'] = news_paragragh_list
        yield item


if __name__ == '__main__':
    cmdline.execute(["scrapy", "crawl", "news"])
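Because SCHEDULER_PERSIST = True, the request queue and the dedup fingerprints stay in Redis between runs, so the crawl can be stopped and resumed and already-seen article URLs are skipped on the next run. A quick way to see this is to inspect the keys scrapy-redis creates; a minimal sketch assuming the default key names '<spider name>:requests' and '<spider name>:dupefilter':

from redis import Redis

conn = Redis(host='127.0.0.1', port=6379)
# Keys created by scrapy-redis for the 'news' spider (default naming assumed)
print(conn.keys('news:*'))
# The default priority queue is a sorted set of pending requests
print(conn.zcard('news:requests'))
# The dupefilter is a set of request fingerprints already seen
print(conn.scard('news:dupefilter'))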
The items.py file:
import scrapy


class DemoItem(scrapy.Item):
    paragragh = scrapy.Field()
    type_url = scrapy.Field()
    page = scrapy.Field()
We store the crawled data in MongoDB.
import pymongo


class DemoPipeline:
    def __init__(self):
        # Connect to MongoDB
        self.client = pymongo.MongoClient(host='localhost', port=27017)
        # Database
        self.db = self.client['NewYorknewspaper']
        # Collection
        self.table = self.db['text']

    def process_item(self, item, spider):
        self.table.insert_one(dict(item))
        return item
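A small optional refinement is to open and close the MongoDB connection through Scrapy's pipeline hooks, so the client is released when the crawl finishes; a sketch of the same pipeline written that way:

import pymongo


class DemoPipeline:
    def open_spider(self, spider):
        # Connect when the crawl starts
        self.client = pymongo.MongoClient(host='localhost', port=27017)
        self.table = self.client['NewYorknewspaper']['text']

    def close_spider(self, spider):
        # Release the connection when the crawl ends
        self.client.close()

    def process_item(self, item, spider):
        self.table.insert_one(dict(item))
        return item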
Note: you could also use the scrapy_mongodb library; it only requires configuring the corresponding parameters in settings.py.
Only the parts of settings.py that were further modified are listed here:
# Request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-encoding': 'gzip, deflate, br',
    'Accept-language': 'zh-CN,zh;q=0.9',
    'Cookie': 'add your browser cookies here',
}

# Two downloader middlewares are enabled here. Since this site needs a proxy to be
# reached, one middleware sets a random User-Agent and the other sets the proxy.
DOWNLOADER_MIDDLEWARES = {
    'demo.middlewares.RandomUserAgentMiddleware': 543,
    'demo.middlewares.ProxyMiddleware': 100,
}

# Enable the item pipeline
ITEM_PIPELINES = {
    'demo.pipelines.DemoPipeline': 300,
}

# AutoThrottle (limits the crawl rate to reduce pressure on the target server)
AUTOTHROTTLE_ENABLED = True
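The two middlewares referenced above are not shown in this post; a minimal sketch of what they might look like (the user-agent strings and the proxy address are placeholders you would replace with your own):

import random


class RandomUserAgentMiddleware:
    # Placeholder UA list; extend with real browser strings as needed
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)


class ProxyMiddleware:
    def process_request(self, request, spider):
        # Placeholder proxy address; replace with a proxy you actually have
        request.meta['proxy'] = 'http://127.0.0.1:7890'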
Accumulating knowledge is a long and lonely process. Improve a little every day, and one day you will reach your goal. Believe in the light!