
Scrapy pipelines: images and files

1. Scrapy's image pipeline is a convenient way to batch-download lists of image URLs quickly.

  1. Basic usage

(1) settings.py (note: the image pipeline uses the `IMAGES_*` settings, not `FILES_*`):

```python
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 300,
}
IMAGES_STORE = 'D:\\cnblogs'      # storage directory
IMAGES_URLS_FIELD = 'image_urls'  # item field holding the image URLs to download
IMAGES_RESULT_FIELD = 'images'    # item field that receives the download results
IMAGES_EXPIRES = 30               # cached images expire after 30 days
```

(2) Add the two fields in items.py:

```python
image_urls = scrapy.Field()
images = scrapy.Field()
```
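The two fields and the settings connect as follows: the pipeline reads URLs out of the field named by `IMAGES_URLS_FIELD` and, once the downloads finish, writes one result dict per success into the field named by `IMAGES_RESULT_FIELD` (by default `image_urls` and `images`). A stdlib-only sketch of that flow; `fill_results` is a hypothetical stand-in for the real download step, and the paths are illustrative:

```python
IMAGES_URLS_FIELD = "image_urls"    # ImagesPipeline's default URLs field
IMAGES_RESULT_FIELD = "images"      # ImagesPipeline's default results field

def fill_results(item: dict) -> dict:
    """Hypothetical stand-in for what the pipeline does after downloading."""
    item[IMAGES_RESULT_FIELD] = [
        # each result records the source url, the storage path, and a checksum
        {"url": url, "path": f"full/{i}.jpg", "checksum": "<md5 of file body>"}
        for i, url in enumerate(item[IMAGES_URLS_FIELD])
    ]
    return item

item = fill_results({"image_urls": ["https://example.com/a.png"]})
print(item["images"][0]["path"])  # full/0.jpg
```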

  2. Using the image pipeline by subclassing

    1. The simple case: one image per item

`item_completed(self, results, item, info)` runs after the images have been downloaded. For each image, `results` contains its URL, its storage `path`, and a `checksum` for integrity checking. Inside this method you can replace the default `path` with a `new_path` of your own, so images are saved under categorized names:

```python
import os

from scrapy.http import Request
from scrapy.pipelines.images import ImagesPipeline

from .settings import IMAGES_STORE


class DouyuproImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        image_url = item['vertical_src']
        yield Request(url=image_url)

    def item_completed(self, results, item, info):
        # old path, as assigned by the pipeline
        old_path = IMAGES_STORE + [x['path'] for ok, x in results if ok][0]
        # new path, based on the streamer's nickname
        new_path = IMAGES_STORE + item['nickname'] + '.jpg'
        try:
            os.renames(old_path, new_path)
        except Exception:
            # a file with this name already exists; skip the duplicate
            pass
        return item

'''
results = [
    (True, {'url': 'https://rpic.douyucdn.cn/asrpic/190725/6587811_2142.png/dy1',
            'path': 'full/ab811811c57efac2ef5a354265e692eb44e0adb6.jpg',
            'checksum': 'cb171aeba651caab1b7827da664ef7c0'})
]
'''
```

The download requests for the images are constructed in `get_media_requests`.
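The `results` unpacking used in `item_completed` is plain Python and can be checked in isolation. The sample below copies the dump shown in the comment above, plus one hypothetical failed entry (in a real run the failure value is a Twisted `Failure`, not a bare exception):

```python
# (success_flag, info_or_failure) pairs, as the pipeline passes them in
results = [
    (True, {'url': 'https://rpic.douyucdn.cn/asrpic/190725/6587811_2142.png/dy1',
            'path': 'full/ab811811c57efac2ef5a354265e692eb44e0adb6.jpg',
            'checksum': 'cb171aeba651caab1b7827da664ef7c0'}),
    (False, Exception('download failed')),  # hypothetical failure entry
]

# keep only the storage paths of successful downloads
paths = [info['path'] for ok, info in results if ok]
print(paths)  # ['full/ab811811c57efac2ef5a354265e692eb44e0adb6.jpg']
```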

    2. Changing the default image filename

If the images are not named from a field on the item, their order ends up scrambled.

```python
import hashlib

from scrapy.utils.python import to_bytes


def file_path(self, request, response=None, info=None):
    image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
    return 'full/%s.jpg' % image_guid
```

By default the filename is the SHA-1 digest of the URL; overriding this method is another way to change how images are named.
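That default scheme can be reproduced with the standard library alone, which makes it easy to predict where a given URL will land. A sketch, with Scrapy's `to_bytes` replaced by a plain `encode`:

```python
import hashlib

def default_image_path(url: str) -> str:
    # mirror the default file_path: SHA-1 hex digest of the URL, under full/
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'full/%s.jpg' % image_guid

path = default_image_path('https://example.com/pic.png')
print(path)  # full/<40 hex chars>.jpg
```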

The next example shows how to handle an item that carries several image links:

```python
from scrapy.http import Request
from scrapy.pipelines.images import ImagesPipeline


class ManhuaPipeline(ImagesPipeline):
    name = 0

    def get_media_requests(self, item, info):
        image_url_list = item['url_list']
        for url in image_url_list:
            yield Request(url=url, meta={'big_name': item['name']})
        self.name = 0

    def file_path(self, request, response=None, info=None):
        self.name += 1
        big_name = request.meta['big_name']
        file_path = big_name + '/' + str(self.name) + '.jpg'
        return file_path

'''
results = [
    (True, {'url': 'https://rpic.douyucdn.cn/asrpic/190725/6587811_2142.png/dy1',
            'path': 'full/ab811811c57efac2ef5a354265e692eb44e0adb6.jpg',
            'checksum': 'cb171aeba651caab1b7827da664ef7c0'}),
    (True, {'url': 'https://rpic.douyucdn.cn/asrpic/190725/6587811_2142.png/dy1',
            'path': 'full/ab811811c57efac2ef5a354265e692eb44e0adb6.jpg',
            'checksum': 'cb171aeba651caab1b7827da664ef7c0'}),
    (True, {'url': 'https://rpic.douyucdn.cn/asrpic/190725/6587811_2142.png/dy1',
            'path': 'full/ab811811c57efac2ef5a354265e692eb44e0adb6.jpg',
            'checksum': 'cb171aeba651caab1b7827da664ef7c0'})
]
'''
```
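A caveat with the shared `self.name` counter: requests finish in whatever order the downloader schedules them, so the counter ties filenames to completion order, not page order. The stdlib-only sketch below mimics that behavior by naming URLs in a shuffled completion order (the `CounterNamer` class is a hypothetical stand-in for the pipeline's counter logic):

```python
import random

class CounterNamer:
    """Mimics the shared counter above: names are assigned by completion order."""
    def __init__(self):
        self.name = 0

    def file_path(self, url):
        self.name += 1
        return f"{self.name}.jpg"

urls = [f"https://example.com/page{i}.png" for i in range(1, 6)]
completed = urls[:]
random.shuffle(completed)            # downloads finish in arbitrary order
namer = CounterNamer()
assigned = {url: namer.file_path(url) for url in completed}
# page1.png gets "1.jpg" only if it happens to finish first
print(assigned)
```

This is why the variant below threads the index through `request.meta` instead.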

When downloading something like comics, with lots of images that must stay in order:

```python
from scrapy.http import Request
from scrapy.pipelines.images import ImagesPipeline


class ManhuaPipeline(ImagesPipeline):
    name = 0

    def get_media_requests(self, item, info):
        image_url_list = item['url_list']
        for url in image_url_list:
            self.name += 1
            # carry the ordering index along with the request
            yield Request(url=url,
                          meta={'small_name': item['name'],
                                'big_name': item['big_name'],
                                'name': self.name})

    def file_path(self, request, response=None, info=None):
        small_name = request.meta['small_name']
        big_name = request.meta['big_name']
        name = request.meta['name']
        file_path = big_name + '/' + small_name + '/' + str(name) + '.jpg'
        print(file_path + ' downloaded!')
        return file_path
```
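One more caveat with bare integer names: directory listings sort lexicographically, so '10.jpg' comes before '2.jpg'. Zero-padding the counter keeps pages in reading order. A small sketch; the width of 4 is an arbitrary choice:

```python
def page_name(index: int, width: int = 4) -> str:
    # zero-pad the page number so string sorting matches numeric order
    return f"{index:0{width}d}.jpg"

names = sorted(page_name(i) for i in (10, 2, 1))
print(names)  # ['0001.jpg', '0002.jpg', '0010.jpg']
```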

2. File pipeline

(1) Enable the pipeline in settings.py:

```python
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 300,
}
```

(2) Add the two fields in items.py:

```python
file_urls = scrapy.Field()
files = scrapy.Field()
```

(3) Configure storage in settings.py:

```python
FILES_STORE = 'D:\\cnblogs'      # storage directory
FILES_URLS_FIELD = 'file_urls'   # item field holding the file URLs to download
FILES_RESULT_FIELD = 'files'     # item field that receives the download results
FILES_EXPIRES = 30               # cached files expire after 30 days
```
```python
from scrapy.http import Request
from scrapy.pipelines.files import FilesPipeline


class LolPipeline(FilesPipeline):
    base_url = 'https://qt.qq.com/php_cgi/cod_video/php/get_video_url.php?game_id=2103041&vid='

    def get_media_requests(self, item, info):
        url = self.base_url + item['vid']
        return Request(url=url, meta={'view': item['view'], 'name': item['game_name']})

    def file_path(self, request, response=None, info=None):
        view = request.meta['view']
        name = request.meta['name']
        path = str(view) + name + '.mp4'
        return path
```
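Because `file_path` here only does string work on `request.meta`, it can be unit-tested without running Scrapy at all. `FakeRequest` below is a hypothetical stand-in for `scrapy.http.Request`, exposing just the attribute the method reads:

```python
class FakeRequest:
    """Minimal stand-in exposing only the .meta attribute that file_path uses."""
    def __init__(self, meta):
        self.meta = meta

def file_path(request, response=None, info=None):
    # same logic as LolPipeline.file_path above, as a free function
    view = request.meta['view']
    name = request.meta['name']
    return str(view) + name + '.mp4'

print(file_path(FakeRequest({'view': 12345, 'name': 'pentakill'})))  # 12345pentakill.mp4
```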

Key point:

When some of the file URLs redirect, the download log will show status counts like:

```
'downloader/response_status_count/200': 42,
'downloader/response_status_count/302': 41,
```

In that case, add `MEDIA_ALLOW_REDIRECTS = True` to settings.py; by default the media pipelines do not follow redirects, so the 302 responses would otherwise be dropped.

  

 
