
Python Crawler, From Noob to Pro (Part 6): Scrapy, Scraping Dangdang Book Data


Use Scrapy to crawl Dangdang: enter a search keyword (e.g. python, C++, java) and the number of result pages to fetch, scrape each book's title, author, price, comment count and other details, download the matching cover images, and draw a horizontal bar chart to show the most popular books at a glance.

Topics covered:

1. Using Scrapy

2. Submitting a form with scrapy.FormRequest()

3. Saving the data to MongoDB and writing it to an .xlsx spreadsheet

4. Setting the Referer header to avoid anti-scraping checks

5. Downloading images with ImagesPipeline

6. Picking the 10 most-commented books and drawing a horizontal bar chart

 

Full source code:

entrypoint.py

from scrapy.cmdline import execute

execute(["scrapy", "crawl", "dangdang"])

items.py

import scrapy


class DangdangSpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Book title
    book_name = scrapy.Field()
    # Author
    author = scrapy.Field()
    # Publisher
    publisher = scrapy.Field()
    # Price
    price = scrapy.Field()
    # Number of comments
    comments_num = scrapy.Field()
    # Cover image URLs
    image_url = scrapy.Field()
    # Search keyword
    book_key = scrapy.Field()
dangdang.py
# -*- coding: utf-8 -*-
import scrapy
from lxml import etree

from DangDang_Spider.items import DangdangSpiderItem


class DangdangSpider(scrapy.Spider):
    name = 'dangdang'
    allowed_domains = ['dangdang.com']
    start_urls = 'http://search.dangdang.com/'
    total_comments_num_list = []
    total_book_name_list = []

    # Send the search requests; turning the page only changes the value of page_index
    def start_requests(self):
        self.key = input("Enter the book keyword to search for: ")
        pages = input("Enter the number of result pages to fetch: ")
        # Keep asking until the input is an integer between 1 and 100
        while not pages.isdigit() or int(pages) <= 0 or int(pages) > 100:
            pages = input("Invalid input, please enter an integer between 1 and 100: ")
        form_data = {
            'key': self.key,
            'act': 'input',
            'page_index': '1'
        }
        for i in range(int(pages)):
            form_data['page_index'] = str(i + 1)
            # scrapy.FormRequest lets us attach form data; its default method is POST,
            # here it is switched to GET to match the actual search request
            yield scrapy.FormRequest(self.start_urls, formdata=form_data, method='GET', callback=self.parse)

    # Extract the data with XPath
    def parse(self, response):
        xml = etree.HTML(response.text)
        book_name_list = xml.xpath('//div[@id="search_nature_rg"]/ul//li/a/@title')
        author_list = xml.xpath('//div[@id="search_nature_rg"]/ul//li/p[@class="search_book_author"]/span[1]/a/@title')
        publisher_list = xml.xpath('//div[@id="search_nature_rg"]/ul//li/p[@class="search_book_author"]/span[3]/a/@title')
        price_list = xml.xpath('//div[@id="search_nature_rg"]/ul//li/p[@class="price"]/span[1]/text()')
        comments_num_list = xml.xpath('//div[@id="search_nature_rg"]/ul//li/p[@class="search_star_line"]/a/text()')
        image_url_list = xml.xpath('//div[@id="search_nature_rg"]/ul//li/a/img/@data-original')
        item = DangdangSpiderItem()
        item["book_name"] = book_name_list
        item['author'] = author_list
        item['publisher'] = publisher_list
        item['price'] = price_list
        item['comments_num'] = comments_num_list
        item['image_url'] = image_url_list
        item['book_key'] = self.key
        return item

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for DangDang_Spider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'DangDang_Spider'

SPIDER_MODULES = ['DangDang_Spider.spiders']
NEWSPIDER_MODULE = 'DangDang_Spider.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'DangDang_Spider (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'DangDang_Spider.middlewares.DangdangSpiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# Enable the downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'DangDang_Spider.middlewares.DangdangSpiderDownloaderMiddleware': 423,
    'DangDang_Spider.middlewares.DangdangSpiderRefererMiddleware': 1
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'DangDang_Spider.pipelines.MongoPipeline': 300,       # save the data to MongoDB
    'DangDang_Spider.pipelines.FilePipeline': 400,        # save the data to an Excel spreadsheet
    'DangDang_Spider.pipelines.SaveImagePipeline': 450,   # download images via Scrapy's built-in ImagesPipeline
    'DangDang_Spider.pipelines.PicturePipeline': 500      # pick the 10 most-commented books and plot them
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# With the settings below Scrapy caches every request it makes; when the same request is issued
# again, the cached response is returned instead of hitting the site, which speeds up local
# debugging and reduces the load on the website.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# MongoDB settings: host / port / database name / collection name
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'dangdang'
MONGODB_DOCNAME = 'dangdang_collection'

# Root directory for downloaded images
IMAGES_STORE = './book_image'
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.utils.project import get_project_settings   # read settings.py
import pymongo
from DangDang_Spider.items import DangdangSpiderItem
import openpyxl
import os
from scrapy.pipelines.images import ImagesPipeline
import scrapy
from scrapy.exceptions import DropItem
import matplotlib.pyplot as plt


# Save the data to MongoDB
class MongoPipeline(object):
    settings = get_project_settings()
    host = settings['MONGODB_HOST']
    port = settings['MONGODB_PORT']
    dbName = settings['MONGODB_DBNAME']
    collectionName = settings['MONGODB_DOCNAME']

    # Connect to the database before any items are processed
    def open_spider(self, spider):
        # Create the connection
        self.client = pymongo.MongoClient(host=self.host, port=self.port)
        # Select the database
        self.db = self.client[self.dbName]
        # Select the collection
        self.collection = self.db[self.collectionName]

    def process_item(self, item, spider):
        if isinstance(item, DangdangSpiderItem):
            # Regroup the parallel lists so that each document holds one book's fields
            book_name = item["book_name"]
            author = item['author']
            publisher = item['publisher']
            price = item['price']
            comments_num = item['comments_num']
            for book, au, pu, pr, co in zip(book_name, author, publisher, price, comments_num):
                data = {}
                data['book_name'] = book
                data['author'] = au
                data['publisher'] = pu
                data['price'] = pr
                data['comments_num'] = co
                self.collection.insert_one(data)
        return item

    # Close the connection once all items have been processed
    def close_spider(self, spider):
        self.client.close()


# Save the data to a spreadsheet
class FilePipeline(object):
    def __init__(self):
        if os.path.exists("当当.xlsx"):
            self.wb = openpyxl.load_workbook("当当.xlsx")   # open the existing file
            # To create a new sheet instead: ws = wb.create_sheet()
            self.ws = self.wb["Sheet"]                      # select the sheet by name
        else:
            self.wb = openpyxl.Workbook()                   # create a new workbook
            self.ws = self.wb.active                        # activate the worksheet
            self.ws.append(['书名', '作者', '出版社', '价格', '评论数'])
            self.ws.column_dimensions['A'].width = 55       # column widths
            self.ws.column_dimensions['B'].width = 55
            self.ws.column_dimensions['C'].width = 25
            self.ws.column_dimensions['D'].width = 10
            self.ws.column_dimensions['E'].width = 15

    def process_item(self, item, spider):
        # Take the length of each field list and sort them to find the shortest one,
        # so the index below never goes out of range
        data_count = [len(item['book_name']), len(item['author']), len(item['publisher']),
                      len(item['price']), len(item['comments_num'])]
        # sorted(): key decides what to sort by; reverse=True means descending, False ascending
        data_count_least = sorted(data_count, key=lambda data_num: int(data_num), reverse=False)[0]
        for i in range(data_count_least):
            line = [str(item['book_name'][i]), str(item['author'][i]), str(item['publisher'][i]),
                    str(item['price'][i]), str(item['comments_num'][i])]
            self.ws.append(line)
        self.wb.save("当当.xlsx")
        return item


# Download images with ImagesPipeline
class SaveImagePipeline(ImagesPipeline):
    # Request each image
    def get_media_requests(self, item, info):
        # Request the images one by one; meta carries the search keyword, the book title and
        # the file suffix, which is taken from the URL so the file type is always correct
        for i in range(len(item['image_url'])):
            yield scrapy.Request(url=item['image_url'][i],
                                 meta={'book_key': item['book_key'],
                                       'name': item['book_name'][i],
                                       'name_suffix': item['image_url'][i].split('.')[-1]})

    # Check whether the download succeeded
    def item_completed(self, results, item, info):
        # results is a list of (success, info) tuples; success is a bool (False: failed, True: succeeded)
        if not results[0][0]:
            raise DropItem('image download failed')   # if the first result failed, drop the item
        return item

    # Decide where each image is stored and how the file is named
    def file_path(self, request, response=None, info=None):
        # Build the file name from the meta data, e.g. 'xxx.jpg' or 'xxx.png';
        # .replace('/', '_') strips '/' from the title so it is not treated as a directory
        book_name = request.meta['name'].replace('/', '_') + '.' + request.meta['name_suffix']
        # Store the images in a sub-folder named after the search keyword
        file_name = u'{0}/{1}'.format(request.meta['book_key'], book_name)
        return file_name


# Pick the 10 most-commented books and draw a horizontal bar chart
class PicturePipeline(object):
    comments_num = []
    book_name = []
    book_name_sorted = []
    comments_num_ten = []

    def process_item(self, item, spider):
        self.get_plot(item['book_name'], item['comments_num'])
        return item

    def get_plot(self, name_list, comments_num_list):
        # Collect all the data seen so far
        for comment, name in zip(comments_num_list, name_list):
            self.comments_num.append(comment)
            self.book_name.append(name)
        # Build a dict mapping comment count (e.g. '1234条评论') to book title
        book_dict = dict(zip(self.comments_num, self.book_name))
        # Sort the dict keys by their numeric part, descending
        comments_num_sorted_list = sorted(book_dict.keys(), key=lambda num: int(num.split('条')[0]), reverse=True)
        # Take the 10 books with the most comments
        for i in range(10):
            for key in book_dict.keys():
                if comments_num_sorted_list[i] == key:
                    self.book_name_sorted.append(book_dict[key])
                    continue
        # Draw a horizontal bar chart with matplotlib.pyplot
        plt.rcParams['font.sans-serif'] = ['SimHei']      # use the SimHei font so Chinese renders correctly
        plt.rcParams['axes.unicode_minus'] = False        # render the minus sign correctly
        # Default figure is [6.0, 4.0] at dpi 100, i.e. 600*400 px; with the values below it becomes 2000*800 px
        plt.rcParams['figure.figsize'] = (10.0, 4.0)      # figure size
        plt.rcParams['figure.dpi'] = 200                  # resolution
        for i in range(10):
            self.comments_num_ten.append(int(comments_num_sorted_list[i].split('条')[0]))
        # The width values must not be str, hence the int(...) conversion above
        plt.barh(range(10), width=self.comments_num_ten, label='评论数', color='red', alpha=0.8, height=0.7)   # bars are drawn bottom-up
        # Print the exact value next to each bar; ha/va control horizontal/vertical alignment
        for y, x in enumerate(self.comments_num_ten):
            plt.text(x + 1500, y - 0.2, '%s' % x, ha='center', va='bottom')
        # Label the y ticks with the book titles
        plt.yticks(range(10), self.book_name_sorted, size=8)
        # Axis label
        plt.ylabel('书名')
        # Chart title
        plt.title('评论数前10的书籍')
        # Show the legend
        plt.legend()
        plt.show()

middlewares.py   

from scrapy import signals


# Set the Referer header on every request to avoid anti-scraping checks
class DangdangSpiderRefererMiddleware(object):
    def process_request(self, request, spider):
        referer = request.url
        if referer:
            request.headers['referer'] = referer

Tips:

1. Custom pipelines must be registered in settings.py:

ITEM_PIPELINES = {
    'DangDang_Spider.pipelines.MongoPipeline': 300,       # save the data to MongoDB
    'DangDang_Spider.pipelines.FilePipeline': 400,        # save the data to an Excel spreadsheet
    'DangDang_Spider.pipelines.SaveImagePipeline': 450,   # download images via Scrapy's built-in ImagesPipeline
    'DangDang_Spider.pipelines.PicturePipeline': 500      # pick the 10 most-commented books and plot them
}

2. When downloading images with ImagesPipeline, the image storage directory must be set in settings.py:

# Root directory for downloaded images
IMAGES_STORE = './book_image'

3. The Referer middleware used against anti-scraping checks must also be registered in settings.py; its priority is set to 1 so that it runs first:

# Enable the downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'DangDang_Spider.middlewares.DangdangSpiderDownloaderMiddleware': 423,
    'DangDang_Spider.middlewares.DangdangSpiderRefererMiddleware': 1
}

4. Horizontal bar charts are drawn with matplotlib.pyplot.barh(y, width, label=, height=0.8, color='red', align='center').

width is the length of each bar, i.e. the actual value it represents; if the values are str an error occurs, so they must be converted to numbers first.
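
For illustration, here is a minimal standalone sketch of that conversion, using made-up comment counts in the same '1234条评论' format the site returns (the titles and numbers are hypothetical, not scraped data):

import matplotlib.pyplot as plt

# Hypothetical scraped values: comment counts come back as strings like '1234条评论'
raw_counts = ['1234条评论', '987条评论', '456条评论']
titles = ['Book A', 'Book B', 'Book C']

# Take the numeric part and convert it to int; passing the raw strings as width is what breaks barh
counts = [int(c.split('条')[0]) for c in raw_counts]

plt.barh(range(len(counts)), width=counts, height=0.7, color='red', align='center', label='comments')
plt.yticks(range(len(titles)), titles)
plt.legend()
plt.show()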

5. When saving an image, specify its file type explicitly (.jpg, .png, etc., taken from the actual URL); otherwise the saved file has no recognizable extension and is awkward to open.

6. While saving images I noticed the files ended up scattered (they should all have been under the C++ folder, yet several extra folders appeared out of nowhere). Debugging showed that some titles contain a '/', which the file system treats as a directory separator; a simple replacement in the file name removes the problem.
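
A small sketch of the naming logic used in file_path above, covering both points 5 and 6; the title, keyword and URL here are hypothetical stand-ins for the values carried in request.meta:

# Hypothetical values standing in for request.meta data
book_key = 'C++'
title = 'C++ Primer Plus 第6版 上/下册'
image_url = 'http://img3m0.ddimg.cn/95/24/12345678-1_b_1.jpg'

suffix = image_url.split('.')[-1]       # take the real file type from the URL, e.g. 'jpg'
safe_title = title.replace('/', '_')    # '/' would otherwise be treated as a directory separator
file_name = u'{0}/{1}.{2}'.format(book_key, safe_title, suffix)
# -> 'C++/C++ Primer Plus 第6版 上_下册.jpg', stored under IMAGES_STORE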

Results (screenshots):

1. Project structure

2. Data written to the spreadsheet

3. Downloaded images

4. Horizontal bar chart

Open issue:

When PicturePipeline draws the chart, it raises: ValueError: shape mismatch: objects cannot be broadcast to a single shape

I have not found the cause yet; if anyone knows what is going on, please let me know. Thanks!
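
A likely cause (not verified against this exact run): book_name_sorted and comments_num_ten are class-level lists, so they keep growing with every item that passes through PicturePipeline, while plt.barh and plt.yticks are still given range(10); once a second page arrives, the lengths no longer match and matplotlib raises the broadcast error. A minimal sketch of that idea, rebuilding the top-10 lists from scratch on every call instead of appending to shared lists:

import matplotlib.pyplot as plt


class PicturePipeline(object):
    # Keep only the running history at class level; the top-10 lists are rebuilt per call
    comments_num = []
    book_name = []

    def process_item(self, item, spider):
        self.get_plot(item['book_name'], item['comments_num'])
        return item

    def get_plot(self, name_list, comments_num_list):
        # Accumulate everything seen so far (unchanged from the original)
        for comment, name in zip(comments_num_list, name_list):
            self.comments_num.append(comment)
            self.book_name.append(name)
        book_dict = dict(zip(self.comments_num, self.book_name))
        # Rebuild the top-10 lists locally on every call, so their length never exceeds 10
        top_keys = sorted(book_dict.keys(), key=lambda num: int(num.split('条')[0]), reverse=True)[:10]
        book_name_sorted = [book_dict[k] for k in top_keys]
        comments_num_ten = [int(k.split('条')[0]) for k in top_keys]
        # Give barh and yticks ranges of the same length as the data
        plt.barh(range(len(comments_num_ten)), width=comments_num_ten, color='red', alpha=0.8, height=0.7)
        plt.yticks(range(len(book_name_sorted)), book_name_sorted, size=8)
        plt.show()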

