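MasterSpider builds requests from the URLs in start_urls and fetches the responses
MasterSpider parses each response, extracts the URLs of the target pages, de-duplicates them with redis, and builds the queue of requests waiting to be crawled
SlaveSpider reads the pending queue from redis and builds requests
SlaveSpider sends the requests and receives the responses for the target pages
SlaveSpider parses each response, extracts the target data, and writes it to the production database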
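The five steps above can be sketched without Redis or Scrapy. This is only an illustration: a Python set and deque stand in for the Redis de-duplication structure and the pending queue, and the names master_enqueue and slave_drain are hypothetical, not part of scrapy-redis.

```python
from collections import deque

# In-memory stand-ins for the Redis structures (assumption: a set for
# de-duplication and a deque for the pending queue).
seen_urls = set()
pending = deque()


def master_enqueue(urls):
    """Master side: push only URLs that have not been seen before."""
    for url in urls:
        if url not in seen_urls:
            seen_urls.add(url)
            pending.appendleft(url)  # plays the role of LPUSH


def slave_drain(fetch):
    """Slave side: pop URLs until the queue is empty and 'fetch' each one."""
    results = []
    while pending:
        url = pending.pop()  # take from the other end, oldest first
        results.append(fetch(url))
    return results


# The duplicate URL is queued only once.
master_enqueue(["http://example.com/a", "http://example.com/b", "http://example.com/a"])
items = slave_drain(lambda url: {"url": url})
```

In the real setup the two sides run as separate processes, possibly on different machines, and only share the Redis server.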
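Scrapy's default global concurrency limit is not suited to crawling a large number of sites at once, so it should be raised. How far to raise it depends on how much CPU the crawler can use. 100 is a reasonable starting value.
The best approach, though, is to run some tests and find out how the Scrapy process's CPU usage relates to the concurrency. For best performance, pick a concurrency that keeps CPU usage at 80%-90%.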
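One way to turn such test results into a setting is sketched below. The function pick_concurrency and the measurement pairs are hypothetical; the numbers are made up for illustration and do not come from a real crawl.

```python
def pick_concurrency(samples, low=80.0, high=90.0):
    """Return the largest tested concurrency whose CPU usage lies in [low, high].

    samples: list of (concurrency, cpu_percent) pairs measured by hand.
    Returns None if no sample falls inside the band.
    """
    in_band = [c for c, cpu in samples if low <= cpu <= high]
    return max(in_band) if in_band else None


# Hypothetical measurements, not taken from a real crawl.
measurements = [(50, 42.0), (100, 63.5), (150, 81.0), (200, 88.5), (250, 97.0)]
chosen = pick_concurrency(measurements)
```

The chosen value would then go into settings.py as CONCURRENT_REQUESTS.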
Redis remote connection (comment out the bind line in the Redis configuration file):
# bind
On Windows, installing Scrapy with pip may run into problems. Using Anaconda is recommended; otherwise, stick with Linux.
conda install scrapy
pip install scrapy
conda install scrapy-redis
pip install scrapy-redis
Before starting, we need to know some of the scrapy-redis settings. Note: these settings go in the Scrapy project's settings.py.
# -*- coding: utf-8 -*-

# Scrapy settings for companyNews project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'companyNews'

SPIDER_MODULES = ['companyNews.spiders']
NEWSPIDER_MODULE = 'companyNews.spiders'

#----------------------- Logging -----------------------------------
# Log file name
#LOG_FILE = "dg.log"
# Log level
LOG_LEVEL = 'WARNING'

# Obey robots.txt rules
# robots.txt follows the Robots protocol and is stored on the site's server. It tells
# search-engine crawlers which directories of the site should not be crawled and indexed.
# When this is enabled, Scrapy requests the site's robots.txt first and uses it
# to decide the crawl scope for that site.
# ROBOTSTXT_OBEY = True

# ------------------------ Global concurrency settings -------------------------------
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32
# Item concurrency (default: 100)
# CONCURRENT_ITEMS = 100
# The download delay setting will honor only one of:
# Concurrent requests per domain
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
# Maximum concurrent requests per IP: 0 means the setting is ignored
# CONCURRENT_REQUESTS_PER_IP = 0

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY affects CONCURRENT_REQUESTS: with a delay set, the configured
# concurrency cannot be fully used.
#DOWNLOAD_DELAY = 3

# Disable cookies (enabled by default)
# COOKIES_ENABLED = True
# COOKIES_DEBUG = True

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'haoduofuli (+http://www.yourdomain.com)'

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'companyNews.middlewares.UserAgentmiddleware': 401,
}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'companyNews.middlewares.UserAgentmiddleware': 400,
    # 'companyNews.middlewares.CookieMiddleware': 700,
}

MYEXT_ENABLED = True  # enable the extension
IDLE_NUMBER = 10      # allowed idle duration in time units; one unit is 5 s

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# Activate extensions by listing them in EXTENSIONS
EXTENSIONS = {
    # 'scrapy.extensions.telnet.TelnetConsole': None,
    'companyNews.extensions.RedisSpiderSmartIdleClosedExensions': 500,
}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
# Note: give the custom pipeline a higher priority (a lower number) than
# RedisPipeline so that it processes each item first.
ITEM_PIPELINES = {
    # Register RedisPipeline so that scraped items are stored in Redis
    # 'scrapy_redis.pipelines.RedisPipeline': 400,
    'companyNews.pipelines.companyNewsPipeline': 300,  # custom pipeline, register as needed (optional)
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# ---------------- Scrapy ships with a built-in HTTP cache, configured as follows -----------------
# Turn the cache on
#HTTPCACHE_ENABLED = True
# Cache expiration time (in seconds)
#HTTPCACHE_EXPIRATION_SECS = 0
# Cache directory (default: .scrapy/httpcache)
#HTTPCACHE_DIR = 'httpcache'
# Status codes to ignore
#HTTPCACHE_IGNORE_HTTP_CODES = []
# Cache storage backend (file system)
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

#----------------- Scrapy-Redis distributed crawling settings --------------------------
# Enables scheduling storing requests queue in redis.
# Use the Scrapy-Redis scheduler instead of Scrapy's own.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis.
# Use the Scrapy-Redis duplicate filter instead of Scrapy's own.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Requests are serialized with pickle by default, but this can be changed to
# something similar. Note from the author: works on Python 2.x, not on 3.x.
# SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"

# How Scrapy-Redis takes requests out of the request collection; pick one of three:
# (1) by request priority (the default), (2) FIFO queue (first in, first out),
# (3) LIFO stack (last in, first out).
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
# Alternative queues
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

# Don't cleanup redis queues, allows to pause/resume crawls.
# The request records in redis are kept, so a restarted spider does not
# re-crawl pages it has already crawled.
#SCHEDULER_PERSIST = True

#---------------------- Redis address -------------------------------------
# Specify the full Redis URL for connecting (optional).
# If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
# REDIS_URL = 'redis://root:password@host:port'
REDIS_URL = 'redis://root:123456@'

# Custom redis parameters (connection timeout and the like)
REDIS_PARAMS = {'db': 2}

# Specify the host and port to use when connecting to Redis (optional).
#REDIS_HOST = ''
#REDIS_PORT = 6379
#REDIS_PASS = '19940225'

# REDIRECT_ENABLED = False
#
# HTTPERROR_ALLOWED_CODES = [302, 301]
#
# DEPTH_LIMIT = 3

#------------------------------------------------------------------------------------------------
# Maximum idle time, to keep the distributed spider from closing while it waits.
# This only takes effect when the queue class above is SpiderQueue or SpiderStack.
# It may also block for the same amount of time when the spider first starts
# (because the queue is empty).
# SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Redis key under which the item pipeline stores serialized items
# REDIS_ITEMS_KEY = '%(spider)s:items'

# Items are serialized with ScrapyJSONEncoder by default.
# You can use any importable path to a callable object.
# REDIS_ITEMS_SERIALIZER = 'json.dumps'

# Custom redis client class
# REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'

# If True, redis 'spop' is used to read start URLs.
# Useful for avoiding duplicates in the start URL list. With this option on,
# URLs must be added with sadd, otherwise a type error is raised.
# REDIS_START_URLS_AS_SET = False

# Default start_urls key for RedisSpider and RedisCrawlSpider
# REDIS_START_URLS_KEY = '%(name)s:start_urls'

# Use an encoding other than utf-8 for redis
# REDIS_ENCODING = 'latin1'
First, create a useragent.py in the project to hold a batch of User-Agent strings (you can find more online, or use the ready-made ones below).
# -*- coding: utf-8 -*-
agents = [
    "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/ Safari/532.5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/2009042316 Firefox/3.0.10",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv: Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv: Gecko/20091201 Firefox/3.5.6 GTB5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv: Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",
    "Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/",
    "Mozilla/2.02E (Win95; U)",
    "Mozilla/3.01Gold (Win95; I)",
    "Mozilla/4.8 [en] (Windows NT 5.1; U)",
    "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
    "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
    "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522 (KHTML, like Gecko) Safari/419.3",
    "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
]
import json    # JSON handling
import redis   # Python client for redis
import random  # random selection
from .useragent import agents  # the list defined above
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware  # User-Agent middleware
from scrapy.downloadermiddlewares.retry import RetryMiddleware  # retry middleware


class UserAgentmiddleware(UserAgentMiddleware):
    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request
        agent = random.choice(agents)
        request.headers["User-Agent"] = agent
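The second line of the class defines process_request(self, request, spider). Why this function? Because Scrapy calls this method for every request that passes through the middleware.
That is one middleware done. Easy, right?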
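The rotation logic can be tried outside Scrapy. The sketch below is an illustration only: FakeRequest is a hypothetical stand-in for scrapy.Request that carries nothing but a headers dict, and the three agent strings are placeholders for the real list in useragent.py.

```python
import random

# A short stand-in pool; the real list lives in useragent.py.
agents = ["agent-one/1.0", "agent-two/2.0", "agent-three/3.0"]


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: only carries headers."""

    def __init__(self):
        self.headers = {}


def set_random_user_agent(request, pool, rng=random):
    """Same idea as process_request above: pick one agent and set the header."""
    request.headers["User-Agent"] = rng.choice(pool)
    return request


# Simulate 20 outgoing requests and collect which agents were used.
requests_seen = [set_random_user_agent(FakeRequest(), agents) for _ in range(20)]
used = {r.headers["User-Agent"] for r in requests_seen}
```

Every simulated request ends up with one of the pooled strings, which is all the middleware has to guarantee.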
import requests
import json
import redis
import logging
from .settings import REDIS_URL  # read REDIS_URL from settings.py
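First we store the login accounts and passwords in the redis database as key:value pairs. Avoid db0, which scrapy-redis uses by default; keep the accounts and passwords in a db of their own.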
import requests import json import redis import logging from .settings import REDIS_URL logger = logging.getLogger(__name__) ##使用REDIS_URL链接Redis数据库, deconde_responses=True这个参数必须要,数据会变成byte形式 完全没法用 reds = redis.Redis.from_url(REDIS_URL, db=2, decode_responses=True) login_url = 'http://haoduofuli.pw/wp-login.php' ##获取Cookie def get_cookie(account, password): s = requests.Session() payload = { 'log': account, 'pwd': password, 'rememberme': "forever", 'wp-submit': "登录", 'redirect_to': "http://http://www.haoduofuli.pw/wp-admin/", 'testcookie': "1" } response = s.post(login_url, data=payload) cookies = response.cookies.get_dict() logger.warning("获取Cookie成功!(账号为:%s)" % account) return json.dumps(cookies)
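The function logs in by submitting the form with the requests module, obtains the Cookie, and returns it serialized as JSON. (Without serialization it would be stored in Redis as plain text, and the Cookie read back later could not be used directly.)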
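Why the JSON step matters can be seen with the standard library alone. The cookie values below are made up; the point is only that json.dumps output can be loaded back into a dict, while the plain str() form of a dict cannot.

```python
import json

cookies = {"sessionid": "abc123", "logged_in": "1"}  # made-up values

as_text = str(cookies)         # Python repr: single quotes, not valid JSON
as_json = json.dumps(cookies)  # what get_cookie returns

restored = json.loads(as_json)  # round-trips back to the same dict

# Trying to load the plain repr fails, because JSON requires double quotes.
try:
    json.loads(as_text)
    repr_is_loadable = True
except json.JSONDecodeError:
    repr_is_loadable = False
```

The restored dict is what the middleware later assigns to request.cookies.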
def init_cookie(red, spidername):
    redkeys = reds.keys()
    for user in redkeys:
        password = reds.get(user)
        if red.get("%s:Cookies:%s--%s" % (spidername, user, password)) is None:
            cookie = get_cookie(user, password)
            red.set("%s:Cookies:%s--%s" % (spidername, user, password), cookie)
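Using the redis connection created above, we read all keys in redis db2 (we set these to the account names), then read the value of each key (I set these to the passwords).
We then check whether a Cookie already exists for this spider and account. If it does not, we call get_cookie with the account and password read from redis and store the returned cookie.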
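The "only log in when no cookie is stored yet" logic can be exercised without a Redis server. In this sketch FakeRedis is a hypothetical dict-backed stand-in exposing just keys/get/set, fake_login replaces get_cookie, and the account store and login function are passed in as parameters rather than taken from module globals as in the original.

```python
class FakeRedis:
    """Hypothetical dict-backed stand-in exposing keys/get/set only."""

    def __init__(self, data=None):
        self.data = dict(data or {})

    def keys(self):
        return list(self.data)

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value


def init_cookie(accounts, cookie_store, spidername, login):
    """For each account without a stored cookie, log in and store the result."""
    for user in accounts.keys():
        password = accounts.get(user)
        key = "%s:Cookies:%s--%s" % (spidername, user, password)
        if cookie_store.get(key) is None:
            cookie_store.set(key, login(user, password))


calls = []


def fake_login(user, password):
    """Records which accounts were logged in and returns a dummy cookie."""
    calls.append(user)
    return '{"session": "for-%s"}' % user


accounts = FakeRedis({"alice": "pw1", "bob": "pw2"})
store = FakeRedis({"demo:Cookies:alice--pw1": '{"session": "old"}'})
init_cookie(accounts, store, "demo", fake_login)
```

Only bob triggers a login here, because alice already has a cookie stored under her key.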
class CookieMiddleware(RetryMiddleware):
    def __init__(self, settings, crawler):
        RetryMiddleware.__init__(self, settings)
        # decode_responses=True makes redis return str instead of bytes
        self.rconn = redis.from_url(settings['REDIS_URL'], db=1, decode_responses=True)
        init_cookie(self.rconn, crawler.spider.name)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings, crawler)

    def process_request(self, request, spider):
        redisKeys = self.rconn.keys()
        while len(redisKeys) > 0:
            elem = random.choice(redisKeys)
            if spider.name + ':Cookies' in elem:
                cookie = json.loads(self.rconn.get(elem))
                request.cookies = cookie
                request.meta["accountText"] = elem.split("Cookies:")[-1]
                break
            else:
                # Drop keys that are not cookies of this spider so the loop always ends
                redisKeys.remove(elem)
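Lines two and three of the class deserve a note. This is overriding the parent's __init__. What is it for?
Without going deep: once a subclass defines its own __init__, the parent's __init__ no longer runs by itself. Calling RetryMiddleware.__init__(self, settings) inside the new __init__ keeps the parent's initialization while letting us accept an extra argument (crawler) and add our own state. Scrapy builds the middleware through the from_crawler class method, which is where the crawler gets passed in: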
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings, crawler)
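The same pattern in miniature, with no Scrapy involved. Base, Child, and the dict used as fake_crawler are hypothetical names for illustration; the sketch only shows a subclass that extends __init__ and a class-method factory that decides which arguments __init__ receives.

```python
class Base:
    def __init__(self, settings):
        self.max_retry = settings.get("RETRY_TIMES", 2)


class Child(Base):
    def __init__(self, settings, crawler):
        Base.__init__(self, settings)  # run the parent's setup first
        self.crawler = crawler         # then add the extra state

    @classmethod
    def from_crawler(cls, crawler):
        # The framework calls this factory; it decides what __init__ receives.
        return cls(crawler["settings"], crawler)


# A plain dict stands in for the crawler object here.
fake_crawler = {"settings": {"RETRY_TIMES": 5}, "name": "demo"}
mw = Child.from_crawler(fake_crawler)
```

The instance ends up with both the parent's attribute (max_retry) and the subclass's own (crawler).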
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
import json
import redis
import random
from .useragent import agents
# remove_cookie and update_cookie are not shown in this article; import them
# as well once you have written them in cookies.py.
from .cookies import init_cookie
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
from scrapy.downloadermiddlewares.retry import RetryMiddleware
import logging

logger = logging.getLogger(__name__)


class UserAgentmiddleware(UserAgentMiddleware):
    def process_request(self, request, spider):
        agent = random.choice(agents)
        request.headers["User-Agent"] = agent


class CookieMiddleware(RetryMiddleware):
    def __init__(self, settings, crawler):
        RetryMiddleware.__init__(self, settings)
        # decode_responses=True makes redis return str instead of bytes
        self.rconn = redis.from_url(settings['REDIS_URL'], db=1, decode_responses=True)
        init_cookie(self.rconn, crawler.spider.name)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings, crawler)

    def process_request(self, request, spider):
        redisKeys = self.rconn.keys()
        while len(redisKeys) > 0:
            elem = random.choice(redisKeys)
            if spider.name + ':Cookies' in elem:
                cookie = json.loads(self.rconn.get(elem))
                request.cookies = cookie
                request.meta["accountText"] = elem.split("Cookies:")[-1]
                break
            else:
                # Drop keys that are not cookies of this spider so the loop always ends
                redisKeys.remove(elem)

    #def process_response(self, request, response, spider):
        #"""
        #I removed what follows; you can try to finish this part yourself.
        #This is where you need to check whether the cookie has expired
        #and act on it, for example update the cookie or delete accounts
        #that no longer work.
        #If you cannot write it, no problem: the program still runs normally.
        #"""
# coding: utf-8
from scrapy import Item, Field
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider
from scrapy.linkextractors import LinkExtractor
from redis import Redis
from time import time
from urllib.parse import urlparse, parse_qs, urlencode


class MasterSpider(RedisCrawlSpider):
    name = 'ebay_master'
    redis_key = 'ebay:start_urls'

    ebay_main_lx = LinkExtractor(allow=(r'http://www.ebay.com/sch/allcategories/all-categories', ))
    ebay_category2_lx = LinkExtractor(allow=(r'http://www.ebay.com/sch/[^\s]*/\d+/i.html',
                                             r'http://www.ebay.com/sch/[^\s]*/\d+/i.html\?_ipg=\d+&_pgn=\d+',
                                             r'http://www.ebay.com/sch/[^\s]*/\d+/i.html\?_pgn=\d+&_ipg=\d+',))

    rules = (
        Rule(ebay_category2_lx, callback='parse_category2', follow=False),
        Rule(ebay_main_lx, callback='parse_main', follow=False),
    )

    def __init__(self, *args, **kwargs):
        domain = kwargs.pop('domain', '')
        # self.allowed_domains = filter(None, domain.split(','))
        super(MasterSpider, self).__init__(*args, **kwargs)

    def parse_main(self, response):
        # Follow every top-level category link
        data = response.xpath("//div[@class='gcma']/ul/li/a[@class='ch']")
        for d in data:
            try:
                item = LinkItem()
                item['name'] = d.xpath("text()").extract_first()
                item['link'] = d.xpath("@href").extract_first()
                yield self.make_requests_from_url(item['link'] + r"?_fsrp=1&_pppn=r1&scp=ce2")
            except Exception:
                pass

    def parse_category2(self, response):
        # Collect product links on a listing page and queue the new ones for the slaves
        data = response.xpath("//ul[@id='ListViewInner']/li/h3[@class='lvtitle']/a[@class='vip']")
        redis = Redis()
        for d in data:
            # item = LinkItem()
            try:
                self._filter_url(redis, d.xpath("@href").extract_first())
            except Exception:
                pass
        try:
            next_page = response.xpath("//a[@class='gspr next']/@href").extract_first()
        except Exception:
            pass
        else:
            # yield self.make_requests_from_url(next_page)
            new_url = self._build_url(response.url)
            redis.lpush("test:new_url", new_url)
            # yield self.make_requests_from_url(new_url)
            # yield Request(url, headers=self.headers, callback=self.parse2)

    def _filter_url(self, redis, url, key="ebay_slave:start_urls"):
        # pfadd returns 1 when the HyperLogLog changed, i.e. the url counts as new
        is_new_url = bool(redis.pfadd(key + "_filter", url))
        if is_new_url:
            redis.lpush(key, url)

    def _build_url(self, url):
        # Build the url of the next listing page (or of page 1 with 200 items per page)
        parse = urlparse(url)
        query = parse_qs(parse.query)
        base = parse.scheme + '://' + parse.netloc + parse.path

        if '_ipg' not in query.keys() or '_pgn' not in query.keys() or '_skc' in query.keys():
            new_url = base + "?" + urlencode({"_ipg": "200", "_pgn": "1"})
        else:
            new_url = base + "?" + urlencode({"_ipg": query['_ipg'][0], "_pgn": int(query['_pgn'][0]) + 1})
        return new_url


class LinkItem(Item):
    name = Field()
    link = Field()
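The pagination helper _build_url above depends only on the standard library, so it can be tried on its own. The sketch below is a standalone copy named build_url; the category URL is a hypothetical example in the shape the spider's rules match, not one taken from the source.

```python
from urllib.parse import urlparse, parse_qs, urlencode


def build_url(url):
    """Standalone copy of _build_url: reset or advance the page parameters."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    base = parts.scheme + "://" + parts.netloc + parts.path
    # No paging parameters yet (or an _skc offset present): start at page 1.
    if "_ipg" not in query or "_pgn" not in query or "_skc" in query:
        return base + "?" + urlencode({"_ipg": "200", "_pgn": "1"})
    # Otherwise keep the page size and move to the next page.
    return base + "?" + urlencode({"_ipg": query["_ipg"][0],
                                   "_pgn": int(query["_pgn"][0]) + 1})


first = build_url("http://www.ebay.com/sch/Books/267/i.html")
second = build_url(first)
```

Calling the function repeatedly on its own output walks through the listing pages one at a time.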
MasterSpider inherits from RedisCrawlSpider in the scrapy-redis component. Compared with a plain Scrapy spider, it differs as follows:
# coding: utf-8
from scrapy import Item, Field
from scrapy_redis.spiders import RedisSpider


class SlaveSpider(RedisSpider):
    name = "ebay_slave"
    redis_key = "ebay_slave:start_urls"

    def parse(self, response):
        item = ProductItem()
        item["price"] = response.xpath("//span[contains(@id,'prcIsum')]/text()").extract_first()
        item["item_id"] = response.xpath("//div[@id='descItemNumber']/text()").extract_first()
        item["seller_name"] = response.xpath("//span[@class='mbg-nw']/text()").extract_first()
        item["sold"] = response.xpath("//span[@class='vi-qtyS vi-bboxrev-dsplblk vi-qty-vert-algn vi-qty-pur-lnk']/a/text()").extract_first()
        item["cat_1"] = response.xpath("//li[@class='bc-w'][1]/a/span/text()").extract_first()
        item["cat_2"] = response.xpath("//li[@class='bc-w'][2]/a/span/text()").extract_first()
        item["cat_3"] = response.xpath("//li[@class='bc-w'][3]/a/span/text()").extract_first()
        item["cat_4"] = response.xpath("//li[@class='bc-w'][4]/a/span/text()").extract_first()
        yield item


class ProductItem(Item):
    name = Field()
    price = Field()
    sold = Field()
    seller_name = Field()
    pl_id = Field()
    cat_id = Field()
    cat_1 = Field()
    cat_2 = Field()
    cat_3 = Field()
    cat_4 = Field()
    item_id = Field()
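SlaveSpider inherits from RedisSpider. Its attributes and methods are much simpler than MasterSpider's (no rules, among other things), but the overall function is similar.
SlaveSpider reads the target-page URLs queued under ebay_slave:start_urls, builds requests from them, parses the target data out of each response, and outputs it as a ProductItem.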
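One of the most common anti-crawling measures is limiting by IP. To avoid the worst case, the data can be crawled through proxy servers. In Scrapy, setting a proxy only requires adding a proxy value to the Request object's meta attribute before the request is sent: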
1. -------------------------------
import base64


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        proxy = ''  # proxy server
        request.meta['proxy'] = proxy
        proxy_user_pass = b'test:test'  # username:password (as bytes)
        request.headers['Proxy-Authorization'] = b'Basic ' + base64.b64encode(proxy_user_pass)

2. -------------------------------
import base64
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware


class ProxyMiddleware(HttpProxyMiddleware):
    def process_request(self, request, spider):
        ip = ''  # proxy address as host:port (left blank, as in variant 1)
        proxy = 'http://%s' % ip
        request.meta['proxy'] = proxy
        proxy_user_pass = b'test:test'
        request.headers['Proxy-Authorization'] = b'Basic ' + base64.b64encode(proxy_user_pass)
'<project name>.<directory alongside spiders>.<file name>.ProxyMiddleware': 543,
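Alternatively, a large number of IP proxies can be gathered into a proxy pool and picked at random for each request, which helps against stricter IP limits. The approach is similar to the User-Agent pool.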
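A proxy pool along these lines might look like the sketch below. It is an assumption-laden illustration, not the author's code: RandomProxyMiddleware and FakeRequest are hypothetical names, the addresses come from a documentation-only IP range, and FakeRequest stands in for scrapy.Request so the logic can run without Scrapy.

```python
import random

# Hypothetical proxy addresses (documentation-only range), for illustration only.
PROXY_POOL = [
    "http://192.0.2.10:8080",
    "http://192.0.2.11:8080",
    "http://192.0.2.12:8080",
]


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: only carries meta."""

    def __init__(self):
        self.meta = {}


class RandomProxyMiddleware:
    """Downloader-middleware-shaped class that assigns a random proxy per request."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.proxies)


mw = RandomProxyMiddleware(PROXY_POOL)
req = FakeRequest()
mw.process_request(req, spider=None)
```

In a real project the class would be registered in DOWNLOADER_MIDDLEWARES and the pool would need health checks, since dead proxies are common.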
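A Bloom Filter can be used to test whether an element is in a set. Its strengths are space efficiency and query time, both far better than general-purpose approaches. Its weaknesses are a certain false-positive rate and the difficulty of deleting elements.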
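To make the idea concrete, here is a minimal Bloom filter sketch. The sizes and the use of SHA-256 with an index prefix to derive several hash positions are choices made for this illustration, not something prescribed by the source.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array plus several hash positions per item."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes positions by hashing the item with different prefixes.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(("%d:%s" % (i, item)).encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter()
for url in ["http://example.com/a", "http://example.com/b"]:
    bf.add(url)
```

An added item is always reported as present. An item that was never added is usually reported as absent, but can occasionally be reported as present, and bits cannot be cleared safely, which is the deletion difficulty mentioned above.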
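Both algorithms are suitable choices. HyperLogLog is used as the example here.
Redis already provides a data structure that supports HyperLogLog, so we only need to operate on that structure.
The _filter_url method of MasterSpider implements the URL filtering: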
def _filter_url(self, redis, url, key="ebay_slave:start_urls"):
    is_new_url = bool(redis.pfadd(key + "_filter", url))
    if is_new_url:
        redis.lpush(key, url)
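When redis.pfadd() runs, it tries to add the url to the HyperLogLog structure. It returns 1 if the structure's internal state changed, which the code treats as a new url, and 0 otherwise, which it treats as an url already seen. That result decides whether the url is pushed onto the pending queue. Because HyperLogLog is a probabilistic cardinality estimator, a small share of genuinely new urls can return 0 and be skipped.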
