
Combining Scrapy with MongoDB: Refactoring the Source for a Rock-Solid Fingerprint Storage Mechanism!


This article walks through combining Scrapy with MongoDB to build a robust fingerprint storage mechanism, which also relieves the memory pressure on Redis. We will dig into how to modify the Scrapy-Redis source so it can be configured flexibly for different scenarios. You are welcome to read along and join the discussion!

Special note: the articles on this public account are for academic research only and must not be used for any unlawful purpose; if there is any infringement, please contact the author for removal.


This is the 937th technical post shared by 「进击的Coder」.

Author: TheWeiJun

Source: 逆向与爬虫的故事

Reading this article takes about 17 minutes.




Contents


I. Introduction

II. Architecture Overview

III. Source Code Analysis

IV. Source Code Rewrite

V. Summary



I. Introduction

When collecting data with Scrapy-Redis, running out of Redis memory is a common headache: once too many fingerprints accumulate in Redis, it may crash and lose fingerprints, which undermines the stability of the whole crawler. How should we deal with this kind of problem? In this article I share a solution: modify the Scrapy-Redis source to introduce MongoDB as persistent storage, which solves the problem at its root. Read on as we walk through how the solution is implemented, and what it gains and costs us.

II. Architecture Overview

1. Before diving into the source code, we need to understand the architecture of Scrapy and of Scrapy-Redis. Compared with plain Scrapy, where exactly has Scrapy-Redis made its changes? With that question in mind, let's look at the two architecture diagrams:

[Figure 1: Scrapy architecture diagram]

[Figure 2: Scrapy-Redis architecture diagram]

2. Comparing Figure 2 with Figure 1, we can see that Scrapy-Redis adds Redis on top of the Scrapy architecture and, building on Redis, extends four components: Scheduler, DupeFilter, ItemPipeline, and BaseSpider. This is also why three keys appear in Redis: spider:requests, spider:items, and spider:dupefilter. Next, let's move on to the source code analysis and see how the Scrapy-Redis fingerprint handling can be reworked.


III. Source Code Analysis

1. Starting from the scrapy-redis source: whenever we use scrapy-redis, we add the following configuration to the settings module:

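The settings screenshot originally shown here is not reproduced; the sketch below lists the three standard scrapy-redis options the text refers to (the option names are the stock scrapy-redis settings, and the Redis URL is just an example):

    # Typical scrapy-redis configuration in settings.py
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # requests pushed to / popped from Redis
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # request fingerprints kept in a Redis set
    SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.PriorityQueue"  # priority queue for pending requests
    REDIS_URL = "redis://localhost:6379"                        # example connection string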

Summary: these three settings handle, respectively, pushing/popping requests, request fingerprints, and request priority in Redis. If we want to change the fingerprint module, we need to rewrite the RFPDupeFilter class so that large numbers of fingerprints can be stored in MongoDB. Let's now analyze its source code.

2. Reading the RFPDupeFilter source. The full RFPDupeFilter listing is as follows:

    import logging
    import time

    from scrapy.dupefilters import BaseDupeFilter
    from scrapy.utils.request import request_fingerprint

    from . import defaults
    from .connection import get_redis_from_settings

    logger = logging.getLogger(__name__)


    # TODO: Rename class to RedisDupeFilter.
    class RFPDupeFilter(BaseDupeFilter):
        """Redis-based request duplicates filter.

        This class can also be used with default Scrapy's scheduler.
        """

        logger = logger

        def __init__(self, server, key, debug=False):
            """Initialize the duplicates filter.

            Parameters
            ----------
            server : redis.StrictRedis
                The redis server instance.
            key : str
                Redis key Where to store fingerprints.
            debug : bool, optional
                Whether to log filtered requests.
            """
            self.server = server
            self.key = key
            self.debug = debug
            self.logdupes = True

        @classmethod
        def from_settings(cls, settings):
            """Returns an instance from given settings.

            This uses by default the key ``dupefilter:<timestamp>``. When using the
            ``scrapy_redis.scheduler.Scheduler`` class, this method is not used as
            it needs to pass the spider name in the key.

            Parameters
            ----------
            settings : scrapy.settings.Settings

            Returns
            -------
            RFPDupeFilter
                A RFPDupeFilter instance.
            """
            server = get_redis_from_settings(settings)
            # XXX: This creates one-time key. needed to support to use this
            # class as standalone dupefilter with scrapy's default scheduler
            # if scrapy passes spider on open() method this wouldn't be needed
            # TODO: Use SCRAPY_JOB env as default and fallback to timestamp.
            key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
            debug = settings.getbool('DUPEFILTER_DEBUG')
            return cls(server, key=key, debug=debug)

        @classmethod
        def from_crawler(cls, crawler):
            """Returns instance from crawler.

            Parameters
            ----------
            crawler : scrapy.crawler.Crawler

            Returns
            -------
            RFPDupeFilter
                Instance of RFPDupeFilter.
            """
            return cls.from_settings(crawler.settings)

        def request_seen(self, request):
            """Returns True if request was already seen.

            Parameters
            ----------
            request : scrapy.http.Request

            Returns
            -------
            bool
            """
            fp = self.request_fingerprint(request)
            # This returns the number of values added, zero if already exists.
            added = self.server.sadd(self.key, fp)
            return added == 0

        def request_fingerprint(self, request):
            """Returns a fingerprint for a given request.

            Parameters
            ----------
            request : scrapy.http.Request

            Returns
            -------
            str
            """
            return request_fingerprint(request)

        @classmethod
        def from_spider(cls, spider):
            settings = spider.settings
            server = get_redis_from_settings(settings)
            dupefilter_key = settings.get("SCHEDULER_DUPEFILTER_KEY", defaults.SCHEDULER_DUPEFILTER_KEY)
            key = dupefilter_key % {'spider': spider.name}
            debug = settings.getbool('DUPEFILTER_DEBUG')
            return cls(server, key=key, debug=debug)

        def close(self, reason=''):
            """Delete data on close. Called by Scrapy's scheduler.

            Parameters
            ----------
            reason : str, optional
            """
            self.clear()

        def clear(self):
            """Clears fingerprints data."""
            self.server.delete(self.key)

        def log(self, request, spider):
            """Logs given request.

            Parameters
            ----------
            request : scrapy.http.Request
            spider : scrapy.spiders.Spider
            """
            if self.debug:
                msg = "Filtered duplicate request: %(request)s"
                self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            elif self.logdupes:
                msg = ("Filtered duplicate request %(request)s"
                       " - no more duplicates will be shown"
                       " (see DUPEFILTER_DEBUG to show all duplicates)")
                self.logger.debug(msg, {'request': request}, extra={'spider': spider})
                self.logdupes = False

3. The key logic of the scrapy-redis dupefilter.py source is interpreted below:


Interpretation: inside request_seen, self.request_fingerprint computes a SHA1 hash over the request and yields a 40-character fingerprint fp. The fingerprint is then added to the Redis set with sadd: if the fingerprint was not in the set, sadd returns 1, so `added == 0` evaluates to False and the request is treated as new; if the fingerprint already exists, sadd returns 0, `added == 0` evaluates to True, and the request is treated as a duplicate. Next, let's see how the scheduler uses this result to make the final deduplication decision.
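To make the `added == 0` logic concrete, here is a minimal illustration of the sadd semantics, not part of the original article; it assumes a Redis instance on localhost and uses a throwaway key:

    import redis

    r = redis.StrictRedis(host="localhost", port=6379)
    print(r.sadd("demo:dupefilter", "some_fp"))  # 1 -> fingerprint was new, so request_seen() returns False
    print(r.sadd("demo:dupefilter", "some_fp"))  # 0 -> fingerprint already stored, request_seen() returns True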

4. Now let's look at the Scheduler source and see how it handles enqueueing a request:

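The screenshot that stood here walked through Scheduler.enqueue_request; the sketch below paraphrases that logic from the scrapy-redis scheduler (details can differ slightly between versions):

    def enqueue_request(self, request):
        # Drop the request if deduplication is enabled and the dupefilter has already seen it.
        if not request.dont_filter and self.df.request_seen(request):
            self.df.log(request, self.spider)
            return False
        # Otherwise record the stat and push the request onto the Redis-backed queue.
        if self.stats:
            self.stats.inc_value('scheduler/enqueued/redis', spider=self.spider)
        self.queue.push(request)
        return True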

Interpretation: from enqueue_request we can see the relevant logic: if deduplication is enabled for the request and request_seen returns True, the request is not enqueued; otherwise it is pushed onto the queue and the corresponding stats counter is incremented.

Summary: at this point it is clear that we only need to modify the request_seen method to complete the fingerprint refactor of scrapy-redis; by plugging in MongoDB, every crawler gets persistent fp fingerprint storage. Long story short, let's move on to rewriting the source.


IV. Source Code Rewrite

1. First, configure the MongoDB-related parameters in settings (the names must match what the dupefilter reads later, MONGO_URI and MONGO_DB):

    MONGO_URI = "mongodb://localhost:27017"
    MONGO_DB = "crawler"

2. Next, subclass and rewrite BaseDupeFilter to build a custom deduplication module, MongoRFPDupeFilter, as follows:

    import logging
    import time

    from pymongo import MongoClient
    from scrapy.dupefilters import BaseDupeFilter
    from scrapy.utils.request import request_fingerprint
    from scrapy_redis import defaults

    logger = logging.getLogger(__name__)


    class MongoRFPDupeFilter(BaseDupeFilter):
        """MongoDB-based request duplicates filter.

        This class can also be used with default Scrapy's scheduler.
        """

        logger = logger

        def __init__(self, key, debug=False, settings=None):
            self.key = key
            self.debug = debug
            self.logdupes: bool = True
            self.mongo_uri = settings.get('MONGO_URI')
            self.mongo_db = settings.get('MONGO_DB')
            self.client = MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]
            # One collection per dupefilter key; fingerprints are stored as _id,
            # which MongoDB indexes (uniquely) by default.
            self.collection = self.db[self.key]

        @classmethod
        def from_settings(cls, settings):
            key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
            debug = settings.getbool('DUPEFILTER_DEBUG')
            return cls(key=key, debug=debug, settings=settings)

        @classmethod
        def from_crawler(cls, crawler):
            """Returns instance from crawler."""
            return cls.from_settings(crawler.settings)

        def request_seen(self, request):
            """Returns True if request was already seen."""
            fp = self.request_fingerprint(request)
            # Look the fingerprint up by _id; insert it with the crawl date if it is new.
            if self.collection.find_one({'_id': fp}):
                return True
            self.collection.insert_one(
                {'_id': fp, "crawl_time": time.strftime("%Y-%m-%d")})
            return False

        def request_fingerprint(self, request):
            return request_fingerprint(request)

        @classmethod
        def from_spider(cls, spider):
            settings = spider.settings
            dupefilter_key = settings.get("SCHEDULER_DUPEFILTER_KEY", defaults.SCHEDULER_DUPEFILTER_KEY)
            key = dupefilter_key % {'spider': spider.name}
            debug = settings.getbool('DUPEFILTER_DEBUG')
            return cls(key=key, debug=debug, settings=settings)

        def close(self, reason=''):
            """Called by Scrapy's scheduler on close.

            Fingerprints are deliberately NOT deleted here, because persisting
            them across runs is the point of this filter; only the MongoDB
            connection is closed. Call clear() explicitly to reset.
            """
            self.client.close()

        def clear(self):
            """Clears fingerprints data by dropping the collection."""
            self.collection.drop()

        def log(self, request, spider):
            """Logs given request."""
            if self.debug:
                msg = "Filtered duplicate request: %(request)s"
                self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            elif self.logdupes:
                msg = ("Filtered duplicate request %(request)s"
                       " - no more duplicates will be shown"
                       " (see DUPEFILTER_DEBUG to show all duplicates)")
                self.logger.debug(msg, {'request': request}, extra={'spider': spider})
                self.logdupes = False

3. Third, point the DUPEFILTER_CLASS setting in the settings file at the new MongoRFPDupeFilter module:

    # Ensure all spider instances use MongoDB for duplicate filtering
    DUPEFILTER_CLASS = "test_scrapy.dupfilter.MongoRFPDupeFilter"
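For reference, the settings used in this walkthrough collected in one place; the module path test_scrapy.dupfilter is the example project path from the article, so adjust it to your own project layout:

    # settings.py: all the pieces together
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.PriorityQueue"
    REDIS_URL = "redis://localhost:6379"

    # Custom MongoDB-backed dupefilter instead of scrapy_redis.dupefilter.RFPDupeFilter
    DUPEFILTER_CLASS = "test_scrapy.dupfilter.MongoRFPDupeFilter"
    MONGO_URI = "mongodb://localhost:27017"
    MONGO_DB = "crawler"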

4. Write a test spider (the spider code is omitted here) and inspect the fp results directly in the MongoDB collection. An illustrative sample of what gets stored is shown below:

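The original screenshot is not reproduced; the snippet below shows roughly how to inspect the stored fingerprints with pymongo and what a document looks like (the collection name and values are illustrative, following the spider:dupefilter key pattern):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    for doc in client["crawler"]["myspider:dupefilter"].find().limit(3):
        print(doc)
    # e.g. {'_id': 'a94a8fe5ccb19ba61c4c0873d391e987982fbbd3', 'crawl_time': '2024-05-20'}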

Summary: that completes the whole flow. From now on, no matter how many crawlers we develop, they all store request fp fingerprints in MongoDB by default. Finally, let's compare the pros and cons of the scrapy-redis and scrapy+mongodb fingerprint approaches:

  • scrapy-redis: fast, but when the fingerprint set grows too large, running out of memory can bring Redis down, and memory is expensive

  • scrapy + mongo: not as fast as Redis, but it can store huge numbers of fingerprints, and disk is cheap


V. Summary

Dear readers, thank you for exploring and learning together with me on this public account. To bring us closer, I have enabled the comment feature. This is not only a place to share knowledge but also a corner where we grow together. Feel free to leave your thoughts, questions, or suggestions in the comments so we can discuss, learn, and improve together. I look forward to sharing sparks of insight with you and lighting the road ahead. Thanks again for your company; let's keep learning and growing together!

That's all for this article. Stay tuned for the next one; see you there!

