
Scrapy distributed writing to MySQL (Python basics): steps to create a distributed scrapy-redis project and save to a database

Saving with scrapy_redis

1. Create a Scrapy project
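As a sketch, the usual commands look like this, assuming the project and spider names used later in this article (youyuan and youyuancom):

scrapy startproject youyuan
cd youyuan
scrapy genspider -t crawl youyuancom youyuan.com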

2. Install scrapy-redis

pip install scrapy-redis

3. Configure settings.py

3.1 Add ITEM_PIPELINES

ITEM_PIPELINES = {
    # scrapy-redis pipeline: serialized items are pushed into Redis
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
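If the project also defines its own item pipeline, the two can coexist; youyuan.pipelines.YouyuanPipeline below is the hypothetical class generated by startproject, and the lower number runs first:

ITEM_PIPELINES = {
    # hypothetical project pipeline (from the startproject template); runs first
    'youyuan.pipelines.YouyuanPipeline': 300,
    # scrapy-redis pipeline then pushes the processed item into Redis
    'scrapy_redis.pipelines.RedisPipeline': 400,
}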

3.2 Add the scrapy-redis settings

""" scrapy-redis配置 """# Enables scheduling storing requests queue in redis.SCHEDULER = "scrapy_redis.scheduler.Scheduler"# Ensure all spiders share same duplicates filter through redis.DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# 调度器启用Redis存储Requests队列#SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# 确保所有的爬虫实例使用Redis进行重复过滤#DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# 将Requests队列持久化到Redis,可支持暂停或重启爬虫#SCHEDULER_PERSIST = True

# Requests的调度策略,默认优先级队列#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

3.3 Add the Redis connection settings

# Specify the Redis host and port
REDIS_HOST = 'ip'
REDIS_PORT = port
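Alternatively, scrapy-redis also accepts a single REDIS_URL setting, which takes precedence over REDIS_HOST/REDIS_PORT; the credentials below are placeholders:

# e.g. redis://[:password]@host:port/db
REDIS_URL = 'redis://:password@ip:6379/0'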

4. Modify the spider file

4.1 Change the CrawlSpider parent class to RedisCrawlSpider

4.2 Set redis_key, the Redis key used to push crawl tasks to the spider

4.3 Set dynamic allowed domains in __init__, or use the allowed_domains class attribute; pick one of the two

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from youyuan.items import YouyuanItem
from scrapy_redis.spiders import RedisCrawlSpider


#class YouyuancomSpider(CrawlSpider):
class YouyuancomSpider(RedisCrawlSpider):
    name = 'youyuancom'
    allowed_domains = ['youyuan.com']
    # In distributed scrapy_redis mode, redis_key replaces start_urls
    #start_urls = ['http://www.youyuan.com/find/zhejiang/mm18-0/advance-0-0-0-0-0-0-0/p1/']
    redis_key = "YouyuancomSpider:start_urls"

    rules = (
        # list pages: follow pagination links
        Rule(LinkExtractor(allow=r'youyuan.com/find/zhejiang/mm18-0/p\d+/')),
        # profile pages: parse each personal profile
        Rule(LinkExtractor(allow=r'/\d+-profile/'), callback='parse_personitem', follow=True),
    )

    # scrapy-redis dynamic allowed domains
    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop('domain', '')
        self.allowed_domains = list(filter(None, domain.split(',')))
        super(YouyuancomSpider, self).__init__(*args, **kwargs)

    def parse_personitem(self, response):
        item = YouyuanItem()
        item["username"] = response.xpath("//div[@class='con']/dl[@class='personal_cen']/dd/div/strong/text()").extract()
        item["introduce"] = response.xpath("//div[@class='con']/dl[@class='personal_cen']/dd/p/text()").extract()
        item["imgsrc"] = response.xpath("//div[@class='con']/dl[@class='personal_cen']/dt/img/@src").extract()
        item["persontag"] = response.xpath("//div[@class='pre_data']/ul/li/p/text()").extract()
        item["sourceUrl"] = response.url
        yield item

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item
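For reference, a minimal sketch of youyuan/items.py; the field names come from the spider above, but the class body itself is reconstructed, not from the original article:

import scrapy


class YouyuanItem(scrapy.Item):
    # fields populated in parse_personitem
    username = scrapy.Field()
    introduce = scrapy.Field()
    imgsrc = scrapy.Field()
    persontag = scrapy.Field()
    sourceUrl = scrapy.Field()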

5. From inside the spider project directory, run scrapy runspider xx.py
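For example, with the spider file from step 4 (a sketch; the file is assumed to live under youyuan/spiders/), and optionally passing the domain argument consumed in __init__ via the -a option:

scrapy runspider youyuan/spiders/youyuancom.py
scrapy runspider youyuan/spiders/youyuancom.py -a domain=youyuan.com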

6. Push crawl tasks from the Redis side

lpush redis_key url

redis_key is the value set in the spider file.
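For this spider, using the redis_key and the start URL from section 4, the push looks like this in redis-cli:

lpush YouyuancomSpider:start_urls http://www.youyuan.com/find/zhejiang/mm18-0/advance-0-0-0-0-0-0-0/p1/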

Notes:

1. In scrapy runspider xx.py, xx is your spider file, i.e. the one created by scrapy genspider xx.

2. If you created the project directly in PyCharm and import items.py using from ..items import xxItems, you will get the error "attempted relative import with no known parent package", which roughly means the parent package cannot be found.

Solutions:

1. Select your spider subproject, right-click Mark Directory as, then choose Sources Root.

2. Modify your spider file, changing from ..items import xxItems to from <project name>.items import xxItems; the IDE gives no autocompletion for this, so type it in manually.

Then run the runspider command again.

If the spider starts up and then sits idle, it is waiting to receive tasks.


3. If the scrapy-redis log shows

DEBUG: Filtered offsite request to xxx

you need to set the following in settings.py (disabling OffsiteMiddleware stops Scrapy from dropping these requests):

SPIDER_MIDDLEWARES = {
    'youyuan.middlewares.YouyuanSpiderMiddleware': 543,
    # set this one to None
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
}

=====================================

Saving from Redis into MySQL

Install the MySQL driver; for Python 3.7 that is pymysql: pip install pymysql

This part is fairly simple, so here is the code directly:

import redis
import json
from pymysql import connect


def process_item():
    # connect to the Redis database
    rediscli = redis.Redis(host="", port=6379, db=0)
    # connect to the MySQL database
    mysqlcli = connect(host='127.0.0.1', port=3306, user='root',
                       password='root', database='test', charset='utf8')
    offset = 0
    while True:
        # block until an item is available, then pop it out of Redis
        source, data = rediscli.blpop("youyuancom:items")
        # create a MySQL cursor object for executing SQL statements
        cursor = mysqlcli.cursor()
        sql = "insert into scrapyredis_youyuan(username,persontag,imgsrc,url) values(%s,%s,%s,%s)"
        jsonitem = json.loads(data)

        # xpath().extract() returns lists, so flatten each field to a plain string
        def as_text(value):
            return ",".join(value) if isinstance(value, list) else value

        params = [as_text(jsonitem["username"]), as_text(jsonitem["persontag"]),
                  as_text(jsonitem["imgsrc"]), as_text(jsonitem["sourceUrl"])]  # parameterized query
        cursor.execute(sql, params)
        mysqlcli.commit()
        cursor.close()
        offset += 1
        print("Saved to database: " + str(offset))


if __name__ == "__main__":
    process_item()
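For completeness, a minimal sketch of creating the target table scrapyredis_youyuan; only the column names come from the insert statement above, the column types are assumptions:

from pymysql import connect

# hypothetical one-off setup script; column types are assumed, not from the article
ddl = """
create table if not exists scrapyredis_youyuan (
    id int primary key auto_increment,
    username varchar(255),
    persontag text,
    imgsrc text,
    url varchar(500)
)
"""

conn = connect(host='127.0.0.1', port=3306, user='root',
               password='root', database='test', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute(ddl)
conn.commit()
conn.close()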
