
Scrapy Distributed Crawler Example

 

Author: lizhonglin

github: https://github.com/Leezhonglin/

blog: https://leezhonglin.github.io/

 

After studying the Scrapy framework for quite a while, I wrote a distributed crawler by hand to check what I have learned. It is meant purely as a technical reference for learning.

 

Overview of the main features:

 

(Crawler) renrenchesipder [project source]

Crawler learning

Notes:

Runtime environment

  • Python 3.6.5

  • Scrapy

Databases used for data storage

  • Redis

  • MongoDB

Python libraries required by the project

pip install scrapy

pip install pymongo

pip install redis

pip install scrapy_redis

 

How to run the project:

First install the required software and Python libraries listed above and set up a virtual environment. Redis and MongoDB must both be running.

On macOS:

Starting Redis:

redis-server &   # start the server; the trailing & keeps it running in the background

redis-cli        # start the client

Starting MongoDB:

In a terminal, run mongod to start the server and mongo to start the client.
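Both services have to be reachable before either spider is started. As a quick sanity check, here is a throwaway sketch of my own (not part of the project), assuming the default local ports that the settings below use:

import pymongo
import redis

# Redis: ping() returns True when redis-server is listening on 127.0.0.1:6379
r = redis.Redis(host='127.0.0.1', port=6379)
print(r.ping())

# MongoDB: server_info() raises if mongod is not running on 127.0.0.1:27017
client = pymongo.MongoClient('127.0.0.1', 27017, serverSelectionTimeoutMS=2000)
print(client.server_info()['version'])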

 

The project contains two folders, master and slave. The two folders are configured differently, and the files are commented in detail.

Scrapy topics this project touches on:

  • random User-Agent

  • an IP proxy pool

  • distributed crawling

  • XPath

  • regular expressions

  • data storage

  • splitting functionality into components, and so on

 

Project file overview:

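A rough sketch of the layout, reconstructed from the files listed below (the run-script file names are not shown in the post):

master/
└── renrenchesipder/               # discovers car detail-page URLs and pushes them into Redis
    ├── spiders/
    ├── items.py
    ├── pipelines.py
    └── settings.py
slave/
└── renrenchesipder/               # pops URLs from Redis, parses detail pages, writes to MongoDB
    ├── spiders/
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── utils/useragentsource.py   # USER_AGENT_LIST and PROXY

Each side also has a small run script that calls scrapy crawl renrenche.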

Next, the contents of each file in the project.

 

Files for the distributed slave

The slave spider's main file

from scrapy_redis.spiders import RedisSpider
from scrapy import Selector

from renrenchesipder.items import RenrenchesipderItem


class RenRenCheSipder(RedisSpider):
    # spider name
    name = 'renrenche'

    # Redis key holding the queue of URLs to crawl
    redis_key = 'renrenche:start_urls'

    # parse a car detail page
    def parse(self, response):
        res = Selector(response)
        items = RenrenchesipderItem()
        items['id'] = res.xpath('//div[@class="detail-wrapper"]/@data-encrypt-id').extract()[0]
        # title
        items['title'] = res.xpath('//div[@class="title"]/h1/text()').extract()[0]
        # asking price
        items['price'] = res.xpath('//div[@class="middle-content"]/div/p[2]/text()').extract()[0]
        # new-car (market) price
        items['new_car_price'] = res.xpath('//div[@class="middle-content"]/div/div[1]/span/text()').extract()[0]
        # down payment
        down_payment = res.xpath('//div[@class="list"]/p[@class="money detail-title-right-tagP"]/text()')
        # monthly payment
        monthly_payment = res.xpath('//*[@id="basic"]/div[2]/div[2]/div[1]/div[3]/div[2]/p[5]/text()')
        # only present when the car can be bought in instalments
        if down_payment and monthly_payment:
            items['staging_info'] = [down_payment.extract()[0], monthly_payment.extract()[0]]
        # service fee
        items['service_fee'] = res.xpath('//*[@id="js-service-wrapper"]/div[1]/p[2]/strong/text()').extract()[0]
        # included services
        items['service'] = res.xpath('//*[@id="js-box-service"]/table/tr/td/table/tr/td/text()').extract()
        # registration date, mileage, transfer-out info
        items['info'] = res.xpath('//*[@id="basic"]/div[2]/div[2]/div[1]/div[4]/ul/li/div/p/strong/text()').extract()
        # engine displacement
        items['displacement'] = \
            res.xpath('//*[@id="basic"]/div[2]/div[2]/div[1]/div[4]/ul/li[4]/div/strong/text()').extract()[0]
        # city where the car is registered
        items['registration_city'] = res.xpath('//*[@id="car-licensed"]/@licensed-city').extract()[0]
        # listing (car source) number
        items['options'] = \
            res.xpath('//*[@id="basic"]/div[2]/div[2]/div[1]/div[5]/p/text()').extract()[0].strip().split(":")[1]
        # only set the image field if there is a picture
        if res.xpath('//div[@class="info-recommend"]/div/img/@src'):
            # car image
            items['car_img'] = res.xpath('//div[@class="info-recommend"]/div/img/@src').extract()[0]
        # city where the car is located
        items['city'] = res.xpath('//div[@rrc-event-scope="city"]/a[@class="choose-city"]/text()').extract()[0].strip()
        # car colour
        items['color'] = res.xpath('//div[@class="card-table"]/table/tr/td[2]/text()').extract()[0]

        yield items
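Because the slave is a RedisSpider, it sits idle until URLs appear under the key renrenche:start_urls. Normally the master fills that queue, but for a quick standalone test you can push a detail-page URL by hand; a minimal redis-py sketch (the URL is only a placeholder):

import redis

r = redis.Redis(host='127.0.0.1', port=6379)
# REDIS_START_URLS_AS_SET is False in the settings below, so the key is a plain list
detail_url = 'https://www.renrenche.com/...'   # replace with a real car detail-page URL
r.lpush('renrenche:start_urls', detail_url)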

 

The slave spider's items

import scrapy


class RenrenchesipderItem(scrapy.Item):
    '''Item describing one car listing'''

    # MongoDB collection name
    collection = 'car_info'
    # field definitions
    id = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    new_car_price = scrapy.Field()
    staging_info = scrapy.Field()
    service_fee = scrapy.Field()
    service = scrapy.Field()
    info = scrapy.Field()
    displacement = scrapy.Field()
    registration_city = scrapy.Field()
    options = scrapy.Field()
    car_img = scrapy.Field()
    city = scrapy.Field()
    color = scrapy.Field()

 

The slave spider's pipelines

from scrapy.conf import settings   # legacy settings shim; newer Scrapy exposes settings via from_crawler
import pymongo

from renrenchesipder.items import RenrenchesipderItem


class RenrenchesipderPipeline(object):

    def process_item(self, item, spider):
        return item


class PymongoPiperline(object):
    """Pipeline that stores items in MongoDB."""

    def __init__(self):
        self.MONGODB_HOST = settings['MONGODB_HOST']
        self.MONGODB_PORT = settings['MONGODB_PORT']
        self.MONGODB_DB = settings['MONGODB_DB']
        # create the client
        conn = pymongo.MongoClient(host=self.MONGODB_HOST, port=self.MONGODB_PORT)
        # select the database
        db = conn[self.MONGODB_DB]
        # select the collection
        self.colltection = db[RenrenchesipderItem.collection]

    def process_item(self, item, spider):
        # upsert keyed on the car id: insert if missing, otherwise update the existing document
        self.colltection.update({'id': item['id']}, {'$set': item}, True)
        return item
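A side note, not part of the original project: Collection.update() is the legacy pymongo call and no longer exists in pymongo 4. On a current pymongo the same upsert inside process_item would look roughly like this:

# equivalent upsert with the newer API (assumes pymongo >= 3.0)
self.colltection.update_one(
    {'id': item['id']},
    {'$set': dict(item)},   # dict() turns the Scrapy item into a plain mapping
    upsert=True,
)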

 

The slave spider's middlewares

import random

from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware
# (on Scrapy >= 1.0 the same class lives in scrapy.downloadermiddlewares.useragent)

from renrenchesipder.utils.useragentsource import PROXY, USER_AGENT_LIST


class ProxyMiddleware(object):

    def process_request(self, request, spider):
        # pick a random proxy from the pool
        proxy = random.choice(PROXY)
        # set the proxy address; for plain HTTP targets use 'http://%s' instead
        request.meta['proxy'] = 'https://%s' % proxy


class RandomUserAgent(UserAgentMiddleware):

    def process_request(self, request, spider):
        # pick a random User-Agent string
        user_agent = random.choice(USER_AGENT_LIST)
        # set it on the request headers
        request.headers.setdefault('User-Agent', user_agent)

 

The slave spider's settings

BOT_NAME = 'renrenchesipder'

SPIDER_MODULES = ['renrenchesipder.spiders']
NEWSPIDER_MODULE = 'renrenchesipder.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.3

# downloader middlewares: random User-Agent and the IP proxy pool
DOWNLOADER_MIDDLEWARES = {
    'renrenchesipder.middlewares.ProxyMiddleware': 543,
    'renrenchesipder.middlewares.RandomUserAgent': 544,
}

# item pipelines
ITEM_PIPELINES = {
    'renrenchesipder.pipelines.RenrenchesipderPipeline': 300,
    'renrenchesipder.pipelines.PymongoPiperline': 301,
}

# MongoDB constants
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DB = 'renrenche'

# Redis configuration
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379

# use the scrapy_redis scheduler, which keeps the request queue in Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# do not clear the Redis queues, so the crawl can be paused and resumed
SCHEDULER_PERSIST = True

# If True, start URLs are read from a Redis set with 'spop'.
# Useful to avoid duplicate start URLs; with this enabled URLs must be added with 'sadd',
# otherwise a type error occurs.
REDIS_START_URLS_AS_SET = False

# request de-duplication
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# request queue class (SpiderQueue is FIFO; scrapy_redis also provides a priority queue)
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"

 

The slave's utils file: User-Agent list and IP proxy pool

USER_AGENT_LIST = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.132 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0",
]

PROXY = [
    '173.82.219.113:3128',
    '92.243.6.37:80',
    '117.102.96.59:8080',
    '213.234.28.94:8080',
    '101.51.123.88:8080',
    '158.58.131.214:41258',
    '36.83.78.183:80',
    '103.56.30.128:8080',
    '185.231.209.251:41258',
    '178.22.250.244:53281',
    '89.216.76.253:53281',
    '179.124.59.240:53281',
    '36.74.207.47:8080',
    '104.237.252.30:8181',
    '183.89.1.16:8080',
    '202.183.201.7:8081',
    '140.227.73.83:3128',
    '191.33.95.123:8080',
    '103.208.181.10:53281',
    '77.46.239.33:8080',
    '94.74.191.82:80',
    '82.202.70.14:8080',
    '187.120.211.38:20183',
    '124.205.155.150:9090',
    '91.109.16.36:8080',
    '182.88.89.53:8123',
    '79.106.162.222:8080',
    '91.142.239.124:8080',
    '184.65.158.128:8080',
    '188.191.28.115:53281',
]

 

The slave's run file

from scrapy.cmdline import execute

# the last argument must match the spider's name attribute
execute(['scrapy', 'crawl', 'renrenche'])

 

Files for the distributed master

The master spider's main file

import re

from scrapy_redis.spiders import RedisSpider
from scrapy import Selector, Request

from renrenchesipder.items import MasterItem


class RenRenCheSipder(RedisSpider):
    name = 'renrenche'

    # site domain
    domain_url = 'https://www.renrenche.com'
    # restrict crawling to this domain
    allowed_domains = ['www.renrenche.com']

    def start_requests(self):
        yield Request(self.domain_url)

    # parse the list of cities
    def parse(self, response):
        res = Selector(response)
        city_url_list = res.xpath('//div[@class="area-city-letter"]/div/a[@class="province-item "]/@href')
        for city_url in city_url_list:
            city = city_url.extract()
            yield Request(self.domain_url + city, callback=self.parse_brand)

    # parse the list of brands
    def parse_brand(self, response):
        res = Selector(response)
        brand_url_list = res.xpath('//*[@id="brand_more_content"]/div/p/span/a')
        for a in brand_url_list:
            band_url = a.xpath('./@href').extract()[0]
            yield Request(self.domain_url + band_url, callback=self.parse_page_url)

    # parse a brand's listing pages and extract each car's detail-page URL
    def parse_page_url(self, response):
        # instantiate the item
        item = MasterItem()
        res = Selector(response)
        # all li elements on the page, also used to decide whether there is a next page
        li_list = res.xpath('//ul[@class="row-fluid list-row js-car-list"]/li')
        # only continue if the page has listing entries
        if li_list:
            for c in li_list:
                # get each car's URL, skipping the advertising a tag
                one_car_url = c.xpath('./a[@class="thumbnail"]/@href').extract()
                # only yield if a URL was found
                if one_car_url:
                    item['url'] = self.domain_url + one_car_url[0]
                    yield item

            # next page number
            page = response.meta.get('page', 2)
            url = response.url
            # strip any existing page segment, otherwise the URL would accumulate ../p1/p2/;
            # we only want a single page parameter
            url = re.sub(r'p\d+', '', url)
            # build the URL for the next page
            car_info_url = url + 'p{page}/'
            # request the next listing page with the same callback
            yield Request(car_info_url.format(page=page), meta={'page': page + 1}, callback=self.parse_page_url)
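To make the pagination step concrete, here is what the re.sub call does to a listing URL (the path below is made up for illustration, not taken from the site):

import re

url = 'https://www.renrenche.com/bj/audi/p3/'   # hypothetical current listing page
base = re.sub(r'p\d+', '', url)                 # -> 'https://www.renrenche.com/bj/audi//'
print(base + 'p{page}/'.format(page=4))         # -> 'https://www.renrenche.com/bj/audi//p4/'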

 

The master spider's items

import scrapy


class MasterItem(scrapy.Item):
    url = scrapy.Field()

The master spider's pipelines

from scrapy.conf import settings
import redis


class RenrenchesipderPipeline(object):

    def process_item(self, item, spider):
        return item


class MasterPipeline(object):

    def __init__(self):
        # read the connection settings
        self.REDIS_HOST = settings['REDIS_HOST']
        self.REDIS_PORT = settings['REDIS_PORT']
        # connect to Redis
        self.r = redis.Redis(host=self.REDIS_HOST, port=self.REDIS_PORT)

    def process_item(self, item, spider):
        # push the detail-page URL onto the queue that the slaves consume
        self.r.lpush('renrenche:start_urls', item['url'])
        return item
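While both sides are running, a throwaway check like this (again not part of the project) shows how many detail-page URLs are still waiting for the slaves:

import redis

r = redis.Redis(host='127.0.0.1', port=6379)
print(r.llen('renrenche:start_urls'))   # pending detail-page URLs pushed by the master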

The master's settings

BOT_NAME = 'renrenchesipder'

SPIDER_MODULES = ['renrenchesipder.spiders']
NEWSPIDER_MODULE = 'renrenchesipder.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.3

ITEM_PIPELINES = {
    'renrenchesipder.pipelines.RenrenchesipderPipeline': 300,
    'renrenchesipder.pipelines.MasterPipeline': 303,
}

# Redis configuration
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379

The master's run file

from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'renrenche'])

That completes the project.

Results:

(results screenshot omitted)

 
