Downloading Images, Music, and Video with Python (requests, you-get, pycurl, wget, ffmpeg)

1. requests, you-get, pycurl, wget, ffmpeg

Downloading images with the requests library

Usage of the Python requests library: https://blog.csdn.net/freeking101/article/details/60868350

# -*- coding: utf-8 -*-
import requests


def download_img():
    print("downloading with requests")
    # test_url = 'http://www.pythontab.com/test/demo.zip'
    # r = requests.get(test_url)
    # with open("./demo.zip", "wb") as ff:
    #     ff.write(r.content)
    img_url = 'https://img9.doubanio.com/view/celebrity/s_ratio_celebrity/public/p28424.webp'
    r = requests.get(img_url)
    with open("./img.jpg", "wb") as ff:
        ff.write(r.content)


if __name__ == '__main__':
    download_img()

Downloading videos with the you-get library

Python example (downloading videos with you-get using multiple threads):

import os
import subprocess
from concurrent.futures import ThreadPoolExecutor, wait


def download(url):
    video_data_dir = './vide_data_dir'
    try:
        os.makedirs(video_data_dir)
    except BaseException as be:
        pass

    video_id = url.split('/')[-1]
    video_name = f'{video_data_dir}/{video_id}'
    command = f'you-get -o ./video_data -O {video_name} ' + url
    print(command)
    subprocess.call(command, shell=True)
    print(f"thread finished ---> {url}")


def main():
    url_list = [
        'https://www.bilibili.com/video/BV1Xz4y127Yo',
        'https://www.bilibili.com/video/BV1yt4y1Q7SS',
        'https://www.bilibili.com/video/BV1bW411n7fY',
    ]
    with ThreadPoolExecutor(max_workers=3) as pool:
        thread_id_list = [pool.submit(download, url) for url in url_list]
        wait(thread_id_list)


if __name__ == '__main__':
    main()

you-get help:

D:\> you-get --help
you-get: version 0.4.1555, a tiny downloader that scrapes the web.

usage: you-get [OPTION]... URL...

A tiny downloader that scrapes the web

optional arguments:
  -V, --version         Print version and exit
  -h, --help            Print this help message and exit

Dry-run options:
  (no actual downloading)
  -i, --info            Print extracted information
  -u, --url             Print extracted information with URLs
  --json                Print extracted URLs in JSON format

Download options:
  -n, --no-merge        Do not merge video parts
  --no-caption          Do not download captions (subtitles, lyrics, danmaku, ...)
  -f, --force           Force overwriting existing files
  --skip-existing-file-size-check
                        Skip existing file without checking file size
  -F STREAM_ID, --format STREAM_ID
                        Set video format to STREAM_ID
  -O FILE, --output-filename FILE
                        Set output filename
  -o DIR, --output-dir DIR
                        Set output directory
  -p PLAYER, --player PLAYER
                        Stream extracted URL to a PLAYER
  -c COOKIES_FILE, --cookies COOKIES_FILE
                        Load cookies.txt or cookies.sqlite
  -t SECONDS, --timeout SECONDS
                        Set socket timeout
  -d, --debug           Show traceback and other debug info
  -I FILE, --input-file FILE
                        Read non-playlist URLs from FILE
  -P PASSWORD, --password PASSWORD
                        Set video visit password to PASSWORD
  -l, --playlist        Prefer to download a playlist
  -a, --auto-rename     Auto rename same name different files
  -k, --insecure        ignore ssl errors

Playlist optional options:
  --first FIRST         the first number
  --last LAST           the last number
  --size PAGE_SIZE, --page-size PAGE_SIZE
                        the page size number

Proxy options:
  -x HOST:PORT, --http-proxy HOST:PORT
                        Use an HTTP proxy for downloading
  -y HOST:PORT, --extractor-proxy HOST:PORT
                        Use an HTTP proxy for extracting only
  --no-proxy            Never use a proxy
  -s HOST:PORT or USERNAME:PASSWORD@HOST:PORT, --socks-proxy HOST:PORT or USERNAME:PASSWORD@HOST:PORT
                        Use an SOCKS5 proxy for downloading

D:\>

Download a video from the command line: you-get https://www.bilibili.com/video/BV1Xz4y127Yo

Probe a video's real playback URLs (a Python sketch follows the commands below):

  -i, --info            Print extracted information
  -u, --url             Print extracted information with URLs
  --json                Print extracted URLs in JSON format

  • you-get -u https://www.bilibili.com/video/BV1Xz4y127Yo
  • you-get --json https://www.bilibili.com/video/BV1Xz4y127Yo
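
From Python, one convenient approach is to shell out to you-get --json and parse the result. Below is a minimal sketch, assuming you-get is installed and on PATH; the JSON layout (a "streams" dict whose entries carry a "src" list of URLs) follows you-get's usual output but can vary by site and version.

import json
import subprocess


def probe_real_urls(page_url):
    # Run `you-get --json <url>` and parse its JSON output.
    out = subprocess.run(
        ['you-get', '--json', page_url],
        capture_output=True, text=True, check=True
    ).stdout
    info = json.loads(out)
    # Collect (stream_id, [urls]) pairs; the "src" key is an assumption, not guaranteed.
    streams = [
        (stream_id, stream.get('src', []))
        for stream_id, stream in info.get('streams', {}).items()
    ]
    return info.get('title'), streams


if __name__ == '__main__':
    title, streams = probe_real_urls('https://www.bilibili.com/video/BV1Xz4y127Yo')
    print(title)
    for stream_id, src_list in streams:
        print(stream_id, src_list)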

Using the pycurl library

Install curl on Linux: yum install curl
Install the Python module: pip install pycurl

The pycurl module in detail: https://blog.csdn.net/xixihahalelehehe/article/details/105553488
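
As a quick reference alongside that post, here is a minimal pycurl download sketch (the image URL is only a placeholder):

import pycurl


def pycurl_download(url, filename):
    # pycurl writes the response body through WRITEDATA into the open file object.
    c = pycurl.Curl()
    with open(filename, 'wb') as f:
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEDATA, f)
        c.setopt(pycurl.FOLLOWLOCATION, True)
        c.perform()
        status = c.getinfo(pycurl.RESPONSE_CODE)
    c.close()
    return status


if __name__ == '__main__':
    # Placeholder URL
    print(pycurl_download('https://www.python.org/static/img/python-logo.png', 'logo.png'))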

Using wget

Use the wget command: wget http://www.robots.ox.ac.uk/~ankush/data.tar.gz
Calling the wget command from Python to download, for example:
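
A minimal sketch of shelling out to the wget binary (it assumes wget is installed; -c lets wget resume a partial download, -O names the output file):

import subprocess


def wget_download(url, out_file):
    # Invoke the system wget command; returns wget's exit code.
    return subprocess.call(['wget', '-c', '-O', out_file, url])


if __name__ == '__main__':
    wget_download('http://www.robots.ox.ac.uk/~ankush/data.tar.gz', 'data.tar.gz')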

Using Python's wget module: pip install wget

import os
import tempfile

import wget

url = 'https://p0.ifengimg.com/2019_30/1106F5849B0A2A2A03AAD4B14374596C76B2BDAB_w1000_h626.jpg'

# Get the file name from the URL
file_name = wget.filename_from_url(url)
print(file_name)  # 1106F5849B0A2A2A03AAD4B14374596C76B2BDAB_w1000_h626.jpg

# Download with the default file name; the file name is returned
file_name = wget.download(url)
print(file_name)  # 1106F5849B0A2A2A03AAD4B14374596C76B2BDAB_w1000_h626.jpg

# Download and rename the output file
target_name = 't1.jpg'
file_name = wget.download(url, out=target_name)
print(file_name)  # t1.jpg

# Download into the system temp directory
tmpdir = tempfile.gettempdir()
target_name = 't2.jpg'
file_name = wget.download(url, out=os.path.join(tmpdir, target_name))
print(file_name)  # /tmp/t2.jpg

Using ffmpeg

Download a clip straight from a URL, stream-copying the first five minutes:

ffmpeg -ss 00:00:00 -i "https://vd4.bdstatic.com/mda-na67uu3bf6v85cnm/sc/cae_h264/1641533845968105062/mda-na67uu3bf6v85cnm.mp4?v_from_s=hkapp-haokan-hbe&auth_key=1641555906-0-0-642c8f9b47d4c37cc64d307be88df29d&bcevod_channel=searchbox_feed&pd=1&pt=3&logid=0906397151&vid=8050108300345362998&abtest=17376_2&klogid=0906397151" -t 00:05:00 -c copy "test.mp4"

Setting header information with ffmpeg

ffmpeg -user_agent "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36" -headers "sec-ch-ua: 'Chromium';v='88', 'Google Chrome';v='88', ';Not A Brand';v='99'"$'\r\n'"sec-ch-ua-mobile: ?0"$'\r\n'"Upgrade-Insecure-Requests: 1"  -i http://127.0.0.1:3000

If you only need the UA, adding -user_agent alone is enough. When you also set other headers with -headers, join the multiple header lines with $'\r\n'; the server then receives the headers in the normal format.

Setting request headers, the UA, and a maximum output file size with ffmpeg

ffmpeg -headers $'Origin: https://xxx.com\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36\r\nReferer: https://xxx.com' -threads 0 -i '<url>' -c copy -y -f mpegts '<filename>.ts' -v trace

  • Add headers with -headers $'Header-One\r\nHeader-Two'.
  • Mind the argument order: the option does not take effect if it is placed at the very end of the command line (this only became obvious after inspecting the log output).
  • -v trace prints the headers actually being sent, which helps with debugging.
  • The UA can also be set with the standalone -user_agent option.
  • Put -fs 1024K before the output file name to cap the output at 1024 KB.
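
The same kind of invocation can be assembled from Python with subprocess. This is a minimal sketch under the assumption that ffmpeg is on PATH; the URL and header values are placeholders, header lines are joined with \r\n as described above, and -fs caps the output size.

import subprocess


def ffmpeg_grab(url, out_name, headers=None, max_size='1024K'):
    cmd = ['ffmpeg']
    if headers:
        # ffmpeg expects a single -headers argument with CRLF-separated lines
        cmd += ['-headers', '\r\n'.join(f'{k}: {v}' for k, v in headers.items()) + '\r\n']
    # -headers / -user_agent must come before -i to take effect
    cmd += ['-i', url, '-c', 'copy', '-fs', max_size, '-y', out_name]
    return subprocess.call(cmd)


if __name__ == '__main__':
    ffmpeg_grab(
        'https://example.com/video.m3u8',   # placeholder URL
        'out.ts',
        headers={'Referer': 'https://example.com', 'User-Agent': 'Mozilla/5.0'},
    )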

ffmpeg help: ffmpeg --help

Getting help:
    -h      -- print basic options
    -h long -- print more options
    -h full -- print all options (including all format and codec specific options, very long)
    -h type=name -- print all options for the named decoder/encoder/demuxer/muxer/filter/bsf/protocol
    See man ffmpeg for detailed description of the options.

Print help / information / capabilities:
-L                  show license
-h topic            show help
-? topic            show help
-help topic         show help
--help topic        show help
-version            show version
-buildconf          show build configuration
-formats            show available formats
-muxers             show available muxers
-demuxers           show available demuxers
-devices            show available devices
-codecs             show available codecs
-decoders           show available decoders
-encoders           show available encoders
-bsfs               show available bit stream filters
-protocols          show available protocols
-filters            show available filters
-pix_fmts           show available pixel formats
-layouts            show standard channel layouts
-sample_fmts        show available audio sample formats
-dispositions       show available stream dispositions
-colors             show available color names
-sources device     list sources of the input device
-sinks device       list sinks of the output device
-hwaccels           show available HW acceleration methods

Global options (affect whole program instead of just one file):
-loglevel loglevel  set logging level
-v loglevel         set logging level
-report             generate a report
-max_alloc bytes    set maximum size of a single allocated block
-y                  overwrite output files
-n                  never overwrite output files
-ignore_unknown     Ignore unknown stream types
-filter_threads     number of non-complex filter threads
-filter_complex_threads  number of threads for -filter_complex
-stats              print progress report during encoding
-max_error_rate maximum error rate  ratio of decoding errors (0.0: no errors, 1.0: 100% errors) above which ffmpeg returns an error instead of success.
-vol volume         change audio volume (256=normal)

Per-file main options:
-f fmt              force format
-c codec            codec name
-codec codec        codec name
-pre preset         preset name
-map_metadata outfile[,metadata]:infile[,metadata]  set metadata information of outfile from infile
-t duration         record or transcode "duration" seconds of audio/video
-to time_stop       record or transcode stop time
-fs limit_size      set the limit file size in bytes
-ss time_off        set the start time offset
-sseof time_off     set the start time offset relative to EOF
-seek_timestamp     enable/disable seeking by timestamp with -ss
-timestamp time     set the recording timestamp ('now' to set the current time)
-metadata string=string  add metadata
-program title=string:st=number...  add program with specified streams
-target type        specify target file type ("vcd", "svcd", "dvd", "dv" or "dv50" with optional prefixes "pal-", "ntsc-" or "film-")
-apad               audio pad
-frames number      set the number of frames to output
-filter filter_graph  set stream filtergraph
-filter_script filename  read stream filtergraph description from a file
-reinit_filter      reinit filtergraph on input parameter changes
-discard            discard
-disposition        disposition

Video options:
-vframes number     set the number of video frames to output
-r rate             set frame rate (Hz value, fraction or abbreviation)
-fpsmax rate        set max frame rate (Hz value, fraction or abbreviation)
-s size             set frame size (WxH or abbreviation)
-aspect aspect      set aspect ratio (4:3, 16:9 or 1.3333, 1.7777)
-vn                 disable video
-vcodec codec       force video codec ('copy' to copy stream)
-timecode hh:mm:ss[:;.]ff  set initial TimeCode value.
-pass n             select the pass number (1 to 3)
-vf filter_graph    set video filters
-ab bitrate         audio bitrate (please use -b:a)
-b bitrate          video bitrate (please use -b:v)
-dn                 disable data

Audio options:
-aframes number     set the number of audio frames to output
-aq quality         set audio quality (codec-specific)
-ar rate            set audio sampling rate (in Hz)
-ac channels        set number of audio channels
-an                 disable audio
-acodec codec       force audio codec ('copy' to copy stream)
-vol volume         change audio volume (256=normal)
-af filter_graph    set audio filters

Subtitle options:
-s size             set frame size (WxH or abbreviation)
-sn                 disable subtitle
-scodec codec       force subtitle codec ('copy' to copy stream)
-stag fourcc/tag    force subtitle tag/fourcc
-fix_sub_duration   fix subtitles duration
-canvas_size size   set canvas size (WxH or abbreviation)
-spre preset        set the subtitle options to the indicated preset

2. Showing download progress with the requests module

Showing download progress with the requests library: http://blog.csdn.net/supercooly/article/details/51046561

1. Background

The key request parameter is stream=True. By default, the response body is downloaded immediately when you make a request. With the stream parameter you can defer downloading the body until you access the Response.content attribute.

import json

import requests

tarball_url = 'https://github.com/kennethreitz/requests/tarball/master'
# Only the response headers have been downloaded at this point;
# the connection stays open and the body has not been fetched yet.
r = requests.get(tarball_url, stream=True)
print(json.dumps(dict(r.headers), ensure_ascii=False, indent=4))

# if int(r.headers['content-length']) < TOO_LONG:
#     content = r.content  # accessing Response.content triggers the download of the body
#     # ...
#     pass

You can go further and control the workflow with Response.iter_content and Response.iter_lines, or read from the underlying urllib3 urllib3.HTTPResponse via Response.raw (a sketch of the raw interface follows the example below).

from contextlib import closing

import requests

with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
    # Do things with the response here.
    pass
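
As a sketch of the raw interface mentioned above: with stream=True, Response.raw can be copied straight to disk with shutil.copyfileobj (the httpbin URL is just a convenient test endpoint):

import shutil
from contextlib import closing

import requests

with closing(requests.get('http://httpbin.org/image/png', stream=True)) as r:
    r.raw.decode_content = True  # ask urllib3 to undo gzip/deflate transfer encoding
    with open('image.png', 'wb') as f:
        shutil.copyfileobj(r.raw, f)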

Keep-alive (persistent connections)
Thanks to urllib3, persistent connections within a session are handled fully automatically: any request made in the same session automatically reuses the appropriate connection.

Note: the connection is only released back to the pool once all of the response body has been read, so either leave stream set to False or make sure you read the Response object's content attribute.
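
For example, a few requests made through one Session share the same connection pool, so repeated requests to the same host skip the extra TCP/TLS handshakes (a minimal illustration, not specific to downloading):

import requests

with requests.Session() as s:
    for i in range(3):
        r = s.get('http://httpbin.org/get')  # the same underlying connection is reused
        print(i, r.status_code)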

2. Downloading a file with a progress bar

In Python 3, print() ends with a newline by default (end='\n'); after each call the cursor moves to the next line, so the previous output can no longer be updated.

Change the end character to "\r" and the cursor returns to the start of the line without breaking to a new one; calling print() again then overwrites that line.

The end character can also be "\b", the backspace character, which moves the cursor back one column; several of them can be used to back up as far as needed.

When the line of output is finished, switch the end character back to "\n" (or simply fall back to the default).
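
A minimal illustration of the "\r" trick (time.sleep only simulates work being done):

import time

for i in range(101):
    print(f'progress: {i:3d}%', end='\r')  # rewrite the same line each time
    time.sleep(0.02)
print()  # finish with a normal newline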

Below is a formatted progress-bar display module. The code:

#!/usr/bin/env python3
import requests
from contextlib import closing

"""
Author: 微微寒
Link: https://www.zhihu.com/question/41132103/answer/93438156
Source: Zhihu
Copyright belongs to the author. For commercial reproduction please contact the
author for permission; for non-commercial reproduction please credit the source.
"""


class ProgressBar(object):

    def __init__(
        self, title, count=0.0, run_status=None, fin_status=None,
        total=100.0, unit='', sep='/', chunk_size=1.0
    ):
        super(ProgressBar, self).__init__()
        self.info = "[%s] %s %.2f %s %s %.2f %s"
        self.title = title
        self.total = total
        self.count = count
        self.chunk_size = chunk_size
        self.status = run_status or ""
        self.fin_status = fin_status or " " * len(self.status)
        self.unit = unit
        self.seq = sep

    def __get_info(self):
        # [title] status progress unit separator total unit
        _info = self.info % (
            self.title, self.status, self.count / self.chunk_size,
            self.unit, self.seq, self.total / self.chunk_size, self.unit
        )
        return _info

    def refresh(self, count=1, status=None):
        self.count += count
        # if status is not None:
        self.status = status or self.status
        end_str = "\r"
        if self.count >= self.total:
            end_str = '\n'
            self.status = status or self.fin_status
        print(self.__get_info(), end=end_str)


def main():
    with closing(requests.get("http://www.futurecrew.com/skaven/song_files/mp3/razorback.mp3", stream=True)) as response:
        chunk_size = 1024
        content_size = int(response.headers['content-length'])
        progress = ProgressBar(
            "razorback", total=content_size, unit="KB",
            chunk_size=chunk_size, run_status="downloading", fin_status="download complete"
        )
        # chunk_size = chunk_size < content_size and chunk_size or content_size
        with open('./file.mp3', "wb") as file:
            for data in response.iter_content(chunk_size=chunk_size):
                file.write(data)
                progress.refresh(count=len(data))


if __name__ == '__main__':
    main()

3. Writing resumable-download software in Python

A resumable download tool written in Python:
https://www.leavesongs.com/PYTHON/resume-download-from-break-point-tool-by-python.html

A Python download UI (progress bar, resumable downloads, multi-threaded multi-task downloads, etc.): https://blog.51cto.com/eddy72/2106091
Video downloads with resume support (concurrent, using aiohttp): https://www.cnblogs.com/baili-luoyun/p/10507608.html

Another approach is to call a download tool such as curl that already supports resuming.

I. How HTTP resumable downloads work

The idea behind HTTP resumable downloads is fairly simple: an HTTP request can carry a Range header, which specifies, in bytes, the range being requested, so that only that slice of the resource is downloaded.

When sending the request we choose the range of content we want, and the response carries exactly that much data. So when downloading, we can split the target file into many small blocks and download one block at a time (using Range to mark each block's extent) until all the blocks are done.

When the network drops or an error interrupts the download, we only need to record which blocks have already been downloaded and which have not. On the next attempt we put the ranges of the missing blocks into the Range header, and that gives us a resumable download.

Multi-threaded downloaders such as Thunder (迅雷) work on the same principle: split the target file into blocks, hand them to different threads to download, then stitch the pieces together and check integrity.
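
A quick sketch of what such a ranged request looks like with requests; httpbin's /range endpoint is used only as a convenient test target:

import requests

url = 'http://httpbin.org/range/10240'  # test URL that supports Range
r = requests.get(url, headers={'Range': 'bytes=0-1023'})
print(r.status_code)                    # 206 Partial Content if ranges are supported
print(r.headers.get('Content-Range'))   # e.g. "bytes 0-1023/10240"
print(len(r.content))                   # 1024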

II. Ways to download files in Python

We will keep using the requests library introduced earlier as the HTTP client.

First, have a look at this part of the documentation: Advanced Usage — Requests 2.27.1 documentation. When a request is made with stream=True the connection is not closed immediately; instead we read the body as a stream until everything has been read or Response.close is called.

So when downloading a large file, set stream to True and download it bit by bit, rather than waiting for the whole file before the call returns.

A Stack Overflow answer provides a simple download demo:

#!/usr/bin/env python3
import requests


def download_file(url):
    local_filename = url.split('/')[-1]
    # NOTE the stream=True parameter
    r = requests.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
                f.flush()
    return local_filename

This is essentially the core of our download code.

  • When downloading large files or large amounts of data with requests' get, use stream mode.
  • With the get function's stream parameter set to False, the download starts immediately and the whole body is kept in memory; for a very large file this can exhaust memory.
  • With stream set to True, nothing is downloaded until you iterate over the content with iter_content or iter_lines, or access the content attribute. Note that the connection has to stay open until the file has been downloaded.

iter_content: iterate over the content to be downloaded chunk by chunk
iter_lines: iterate over the content to be downloaded line by line

Downloading a large file with either of these functions avoids using too much memory, because only a small piece of data is fetched at a time.

Example code:

    r = requests.get(url_file, stream=True)
    f = open("file_path", "wb")
    for chunk in r.iter_content(chunk_size=512):
        if chunk:
            f.write(chunk)

III. Combining resumable downloads with large-file downloads

Now let's combine these two ideas into a small program: a downloader with resume support.

First, let's think through what needs attention and what features could be added:

1. User configurability: allow a custom cookie, referer, and user-agent, e.g. for download sites that only allow downloads after login.
2. Many servers do not support resuming; how do we detect that?
3. How should progress be displayed?
4. How do we learn the file's total size? With a HEAD request? And what if the server does not support HEAD?
5. How do we name the downloaded file? We also have to consider which characters Windows forbids in file names.
    (the header may carry a filename, the URL contains one, and the user may specify one as well)
6. How do we split the file into blocks, and should we add multi-threading?

Thinking it through raises quite a few questions, and some of them may not be solvable right away. Rough answers to each:

1. headers can be user-defined.
2. Before the real download, send a HEAD request and check whether the status code is 206 and whether the
    response carries a Content-Range header, to decide whether resuming is supported.
3. Skip a proper progress bar for now and just show the current downloaded size and the total size.
4. Extract the total size from Content-Range in the HEAD response, or fall back to Content-Length
   (when resuming is not supported the full Content-Length is returned). If HEAD is not supported or no
   length is available, set the total to 0 (anything that does not get in the way of the download is fine).
5. File-name priority: user-specified > Content-Disposition in the header > the name in the URL. To avoid
    complications I do the same as wget on Linux and ignore Content-Disposition.
    If the user does not specify a name to save to, the part after the last "/" in the URL is used as the file name.
6. For stability and simplicity, no multi-threading. Without threads the blocks can be made very small,
   e.g. 1 KB, downloading from the beginning and filling in one kilobyte at a time. That avoids a lot of
   trouble: when the download is interrupted we only need to check how many bytes are already on disk
   and continue from the byte after that.

With those questions settled, we can start writing. In truth, the questions were not all thought up in advance; most of them surfaced halfway through writing the code.

# NOTE: this excerpt (from py-wget, linked below) is Python 2 code.
def download(self, url, filename, headers={}):
    finished = False
    block = self.config['block']
    local_filename = self.remove_nonchars(filename)
    tmp_filename = local_filename + '.downtmp'

    if self.support_continue(url):  # resuming is supported
        try:
            with open(tmp_filename, 'rb') as fin:
                self.size = int(fin.read()) + 1
        except:
            self.touch(tmp_filename)
        finally:
            headers['Range'] = "bytes=%d-" % (self.size, )
    else:
        self.touch(tmp_filename)
        self.touch(local_filename)

    size = self.size
    total = self.total
    r = requests.get(url, stream=True, verify=False, headers=headers)

    if total > 0:
        print "[+] Size: %dKB" % (total / 1024)
    else:
        print "[+] Size: None"

    start_t = time.time()
    with open(local_filename, 'ab') as f:
        try:
            for chunk in r.iter_content(chunk_size=block):
                if chunk:
                    f.write(chunk)
                    size += len(chunk)
                    f.flush()
                    sys.stdout.write('\b' * 64 + 'Now: %d, Total: %s' % (size, total))
                    sys.stdout.flush()
            finished = True
            os.remove(tmp_filename)
            spend = int(time.time() - start_t)
            speed = int(size / 1024 / spend)
            sys.stdout.write('\nDownload Finished!\nTotal Time: %ss, Download Speed: %sk/s\n' % (spend, speed))
            sys.stdout.flush()
        except:
            import traceback
            print traceback.print_exc()
            print "\nDownload pause.\n"
        finally:
            if not finished:
                with open(tmp_filename, 'wb') as ftmp:
                    ftmp.write(str(size))

This is the download method. The initial if statement calls self.support_continue(url) to check whether resuming is supported. If it is, the method reads from a temporary file how many bytes have already been downloaded; if that file does not exist an exception is raised, size keeps its default of 0, and nothing has been downloaded yet.

It then requests the URL and downloads in a for loop. Exceptions have to be caught here: when one occurs the program must not exit, but should write the current downloaded byte count size into the temporary file. The next run reads that file and sets Range to bytes=(size+1)-, i.e. from the byte after the current one to the end, and continues downloading from there, which gives us the resume behaviour.

The method that checks whether resuming is supported also doubles as a way to obtain the target file's size:

def support_continue(self, url):
    headers = {
        'Range': 'bytes=0-4'
    }
    try:
        r = requests.head(url, headers=headers)
        crange = r.headers['content-range']
        self.total = int(re.match(ur'^bytes 0-4/(\d+)$', crange).group(1))
        return True
    except:
        pass
    try:
        self.total = int(r.headers['content-length'])
    except:
        self.total = 0
    return False

A regular expression extracts the total size from Content-Range; failing that, the code falls back to headers['content-length']; if that is unavailable too, the total is set to 0.

That is basically all of the core code; the rest is configuration. GitHub: py-wget/py-wget.py at master · phith0n/py-wget · GitHub

Run the program to fetch the latest emlog installation package.

Partway through I pressed Ctrl+C to interrupt the download, and afterwards it still carried on downloading, so the "resume" really works.

In my actual tests, though, quite a few requests could not be resumed, so for files that do not support resuming the tool simply downloads them again from scratch.

The downloaded archive extracts normally, which also confirms that the download is complete.


GitHub: a small downloader with resume support, py-wget: GitHub - phith0n/py-wget: small wget by python

Downloading a file with multiple threads

Example code:

# Tested under Python 3
import sys
import requests
import threading
import datetime

# Command-line argument: the URL of the file to download
url = sys.argv[1]


def Handler(start, end, url, filename):
    headers = {'Range': 'bytes=%d-%d' % (start, end)}
    r = requests.get(url, headers=headers, stream=True)
    # Write to the corresponding position in the file
    with open(filename, "r+b") as fp:
        fp.seek(start)
        var = fp.tell()
        fp.write(r.content)


def download_file(url, num_thread=5):
    r = requests.head(url)
    try:
        file_name = url.split('/')[-1]
        # Content-Length gives the size of the body; when the server uses
        # Connection: keep-alive it may not send Content-Length.
        file_size = int(r.headers['content-length'])
    except:
        print("Check the URL, or the server may not support multi-threaded download")
        return

    # Create a file the same size as the one to be downloaded
    fp = open(file_name, "wb")
    fp.truncate(file_size)
    fp.close()

    # Start the writer threads
    part = file_size // num_thread  # if it does not divide evenly, the last block gets the extra bytes
    for i in range(num_thread):
        start = part * i
        if i == num_thread - 1:  # the last block
            end = file_size
        else:
            end = start + part
        t = threading.Thread(target=Handler, kwargs={'start': start, 'end': end, 'url': url, 'filename': file_name})
        t.setDaemon(True)
        t.start()

    # Wait for all download threads to finish
    main_thread = threading.current_thread()
    for t in threading.enumerate():
        if t is main_thread:
            continue
        t.join()
    print('%s downloaded' % file_name)


if __name__ == '__main__':
    start = datetime.datetime.now().replace(microsecond=0)
    download_file(url)
    end = datetime.datetime.now().replace(microsecond=0)
    print("Time taken: ", end='')
    print(end - start)

Downloading images, music, and video with Python

Example code:

# -*- coding:utf-8 -*-
# NOTE: this example is Python 2 code (print statement, py2 map semantics).
import re
import requests
from contextlib import closing
from lxml import etree


class Spider(object):
    """ crawl image """

    def __init__(self):
        self.index = 0
        self.url = "http://www.xiaohuar.com"
        self.proxies = {"http": "http://172.17.18.80:8080", "https": "https://172.17.18.80:8080"}
        pass

    def download_image(self, image_url):
        real_url = self.url + image_url
        print "downloading the {0} image".format(self.index)
        with open("{0}.jpg".format(self.index), 'wb') as f:
            self.index += 1
            f.write(requests.get(real_url, proxies=self.proxies).content)
            pass
        pass

    def start_crawl(self):
        start_url = "http://www.xiaohuar.com/hua/"
        r = requests.get(start_url, proxies=self.proxies)
        if r.status_code == 200:
            temp = r.content.decode("gbk")
            html = etree.HTML(temp)
            links = html.xpath('//div[@class="item_t"]//img/@src')
            map(self.download_image, links)
            # next_page_url = html.xpath('//div[@class="page_num"]//a/text()')
            # print next_page_url[-1]
            # print next_page_url[-2]
            # print next_page_url[-3]
            next_page_url = html.xpath(u'//div[@class="page_num"]//a[contains(text(),"下一页")]/@href')
            page_num = 2
            while next_page_url:
                print "download {0} page images".format(page_num)
                r_next = requests.get(next_page_url[0], proxies=self.proxies)
                if r_next.status_code == 200:
                    html = etree.HTML(r_next.content.decode("gbk"))
                    links = html.xpath('//div[@class="item_t"]//img/@src')
                    map(self.download_image, links)
                    try:
                        next_page_url = html.xpath(u'//div[@class="page_num"]//a[contains(text(),"下一页")]/@href')
                    except BaseException as e:
                        next_page_url = None
                        print e
                    page_num += 1
                    pass
                else:
                    print "response status code : {0}".format(r_next.status_code)
                    pass
        else:
            print "response status code : {0}".format(r.status_code)
            pass


class ProgressBar(object):

    def __init__(self, title, count=0.0, run_status=None, fin_status=None, total=100.0, unit='', sep='/', chunk_size=1.0):
        super(ProgressBar, self).__init__()
        self.info = "[%s] %s %.2f %s %s %.2f %s"
        self.title = title
        self.total = total
        self.count = count
        self.chunk_size = chunk_size
        self.status = run_status or ""
        self.fin_status = fin_status or " " * len(self.status)
        self.unit = unit
        self.seq = sep

    def __get_info(self):
        # [title] status progress unit separator total unit
        _info = self.info % (self.title, self.status,
                             self.count / self.chunk_size, self.unit, self.seq, self.total / self.chunk_size, self.unit)
        return _info

    def refresh(self, count=1, status=None):
        self.count += count
        # if status is not None:
        self.status = status or self.status
        end_str = "\r"
        if self.count >= self.total:
            end_str = '\n'
            self.status = status or self.fin_status
        print self.__get_info(), end_str


def download_mp4(video_url):
    print video_url
    try:
        with closing(requests.get(video_url.strip().decode(), stream=True)) as response:
            chunk_size = 1024
            with open('./{0}'.format(video_url.split('/')[-1]), "wb") as f:
                for data in response.iter_content(chunk_size=chunk_size):
                    f.write(data)
                    f.flush()
    except BaseException as e:
        print e
        return


def mp4():
    proxies = {"http": "http://172.17.18.80:8080", "https": "https://172.17.18.80:8080"}
    url = "http://www.budejie.com/video/"
    r = requests.get(url)
    print r.url
    if r.status_code == 200:
        print "status_code:{0}".format(r.status_code)
        content = r.content
        video_urls_compile = re.compile("http://.*?\.mp4")
        video_urls = re.findall(video_urls_compile, content)
        print len(video_urls)
        # print video_urls
        map(download_mp4, video_urls)
    else:
        print "status_code:{0}".format(r.status_code)


def mp3():
    proxies = {"http": "http://172.17.18.80:8080", "https": "https://172.17.18.80:8080"}
    with closing(requests.get("http://www.futurecrew.com/skaven/song_files/mp3/razorback.mp3", proxies=proxies, stream=True)) as response:
        chunk_size = 1024
        content_size = int(response.headers['content-length'])
        progress = ProgressBar("razorback", total=content_size, unit="KB", chunk_size=chunk_size, run_status="downloading",
                               fin_status="download complete")
        # chunk_size = chunk_size < content_size and chunk_size or content_size
        with open('./file.mp3', "wb") as f:
            for data in response.iter_content(chunk_size=chunk_size):
                f.write(data)
                progress.refresh(count=len(data))


if __name__ == "__main__":
    t = Spider()
    t.start_crawl()
    mp3()
    mp4()
    pass


Another image-download example:

(GitHub: https://github.com/injetlee/Python/blob/master/爬虫集合/meizitu.py )

It creates the folders and crawls with multiple threads; the thread count is set to 5 and can be adjusted to suit your machine.

import requests
import os
import time
import threading
from bs4 import BeautifulSoup


def download_page(url):
    '''
    Download a page and return its HTML text.
    '''
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0"}
    r = requests.get(url, headers=headers)
    r.encoding = 'gb2312'
    return r.text


def get_pic_list(html):
    '''
    Get the list of photo sets on a page, then call get_pic for each one.
    '''
    soup = BeautifulSoup(html, 'html.parser')
    pic_list = soup.find_all('li', class_='wp-item')
    for i in pic_list:
        a_tag = i.find('h3', class_='tit').find('a')
        link = a_tag.get('href')
        text = a_tag.get_text()
        get_pic(link, text)


def get_pic(link, text):
    '''
    Fetch the images on the current page and save them.
    '''
    html = download_page(link)  # download the page
    soup = BeautifulSoup(html, 'html.parser')
    pic_list = soup.find('div', id="picture").find_all('img')  # all images on the page
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0"}
    create_dir('pic/{}'.format(text))
    for i in pic_list:
        pic_link = i.get('src')  # the image's own URL
        r = requests.get(pic_link, headers=headers)  # download the image, then save it to a file
        with open('pic/{}/{}'.format(text, link.split('/')[-1]), 'wb') as f:
            f.write(r.content)
        time.sleep(1)  # rest a little so as not to hammer the site and get banned


def create_dir(name):
    if not os.path.exists(name):
        os.makedirs(name)


def execute(url):
    page_html = download_page(url)
    get_pic_list(page_html)


def main():
    create_dir('pic')
    queue = [i for i in range(1, 72)]  # page numbers used to build the URLs
    threads = []
    while len(queue) > 0:
        for thread in threads:
            if not thread.is_alive():
                threads.remove(thread)
        while len(threads) < 5 and len(queue) > 0:  # maximum of 5 threads
            cur_page = queue.pop(0)
            url = 'http://meizitu.com/a/more_{}.html'.format(cur_page)
            thread = threading.Thread(target=execute, args=(url,))
            thread.setDaemon(True)
            thread.start()
            print('{} downloading page {}'.format(threading.current_thread().name, cur_page))
            threads.append(thread)


if __name__ == '__main__':
    main()

Crawling images with Python and requests

Crawl the campus-beauty photos from the Xiaohua site: http://www.xueshengmai.com/hua/

Crawling a whole site's images with the Scrapy framework and saving them locally ("Meizitu"): https://www.cnblogs.com/william126/p/6923017.html

Single-threaded version

The code:

# -*- coding: utf-8 -*-
import os
import requests
# from PIL import Image
from lxml import etree


class Spider(object):
    """ crawl image """

    def __init__(self):
        self.index = 0
        self.url = "http://www.xueshengmai.com"
        # self.proxies = {
        #     "http": "http://172.17.18.80:8080",
        #     "https": "https://172.17.18.80:8080"
        # }
        pass

    def download_image(self, image_url):
        real_url = self.url + image_url
        print("downloading the {0} image".format(self.index))
        with open("./{0}.jpg".format(self.index), 'wb') as f:
            self.index += 1
            try:
                r = requests.get(
                    real_url,
                    # proxies=self.proxies
                )
                if 200 == r.status_code:
                    f.write(r.content)
            except BaseException as e:
                print(e)
        pass

    def add_url_prefix(self, image_url):
        return self.url + image_url

    def start_crawl(self):
        start_url = "http://www.xueshengmai.com/hua/"
        r = requests.get(
            start_url,
            # proxies=self.proxies
        )
        if 200 == r.status_code:
            temp = r.content.decode("gbk")
            html = etree.HTML(temp)
            links = html.xpath('//div[@class="item_t"]//img/@src')
            # url_list = list(map(lambda image_url=None: self.url + image_url, links))
            ###################################################################
            # Python 2:
            # map(self.download_image, links)
            # Python 3: map returns a map object, so wrap it in list()
            list(map(self.download_image, links))
            ###################################################################
            next_page_url = html.xpath(u'//div[@class="page_num"]//a[contains(text(),"下一页")]/@href')
            page_num = 2
            while next_page_url:
                print("download {0} page images".format(page_num))
                r_next = requests.get(
                    next_page_url[0],
                    # proxies=self.proxies
                )
                if r_next.status_code == 200:
                    html = etree.HTML(r_next.content.decode("gbk"))
                    links = html.xpath('//div[@class="item_t"]//img/@src')
                    # Python 3: map returns a map object, so wrap it in list()
                    list(map(self.download_image, links))
                    try:
                        t_x_string = u'//div[@class="page_num"]//a[contains(text(),"下一页")]/@href'
                        next_page_url = html.xpath(t_x_string)
                    except BaseException as e:
                        next_page_url = None
                        # print(e)
                    page_num += 1
                    pass
                else:
                    print("response status code : {0}".format(r_next.status_code))
                    pass
        else:
            print("response status code : {0}".format(r.status_code))
            pass


if __name__ == "__main__":
    t = Spider()
    t.start_crawl()
    pause = input("press any key to continue")
    pass

抓取 "妹子图"  代码:

# coding=utf-8
import requests
import os
from lxml import etree
import sys

'''
reload(sys)
sys.setdefaultencoding('utf-8')
'''

platform = 'Windows' if os.name == 'nt' else 'Linux'
print(f'Current OS: [{platform}]')

# HTTP request headers
header = {
    # ':authority': 'www.mzitu.com',
    # ':method': 'GET',
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'referer': 'https://www.mzitu.com',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/75.0.3770.90 Safari/537.36'
}
site_url = 'http://www.mzitu.com'
url_prefix = 'http://www.mzitu.com/page/'
img_save_path = 'C:/mzitu/'


def get_page_max_num(page_html=None, flag=1):
    """
    :param page_html: HTML text of the page
    :param flag: which page this is. 1: the list page of all photo sets. 2: the photo page of a single set.
    :return:
    """
    # Find the maximum page number
    s_html = etree.HTML(page_html)
    xpath_string = '//div[@class="nav-links"]//a' if 1 == flag \
        else '//div[@class="pagenavi"]//a//span'
    display_page_link = s_html.xpath(xpath_string)
    # print(display_page_link[-1].text)
    max_num = display_page_link[-2].text if '下一页»' == display_page_link[-1].text \
        else display_page_link[-1].text
    return int(max_num)


def main():
    site_html = requests.get(site_url, headers=header).text
    page_max_num_1 = get_page_max_num(site_html)
    for page_num in range(1, page_max_num_1 + 1):
        page_url = f'{url_prefix}{page_num}'
        page_html = requests.get(page_url, headers=header).text
        s_page_html = etree.HTML(text=page_html)
        every_page_mm_url_list = s_page_html.xpath(
            '//ul[@id="pins"]//li[not(@class="box")]/span/a'
        )
        for tag_a in every_page_mm_url_list:
            mm_url = tag_a.get('href')
            title = tag_a.text.replace('\\', '').replace('/', '').replace(':', '')
            title = title.replace('*', '').replace('?', '').replace('"', '')
            title = title.replace('<', '').replace('>', '').replace('|', '')
            mm_dir = f'{img_save_path}{title}'
            if not os.path.exists(mm_dir):
                os.makedirs(mm_dir)
            print(f'[{title}] download started')
            mm_page_html = requests.get(mm_url, headers=header).text
            mm_page_max_num = get_page_max_num(mm_page_html, flag=2)
            for index in range(1, mm_page_max_num + 1):
                photo_url = f'{mm_url}/{index}'
                photo_html = requests.get(photo_url, headers=header).text
                s_photo_html = etree.HTML(text=photo_html)
                img_url = s_photo_html.xpath('//div[@class="main-image"]//img')[0].get('src')
                # print(img_url)
                r = requests.get(img_url, headers=header)
                if r.status_code == 200:
                    with open(f'{mm_dir}/{index}.jpg', 'wb') as f:
                        f.write(r.content)
                else:
                    print(f'status code : {r.status_code}')
            else:
                print(f'[{title}] download finished')
        print(f'Page [{page_num}] done')


if __name__ == '__main__':
    main()
    pass

After a successful run, a directory is created next to the script for each photo set, each containing that set's images.

Multi-threaded version

Starting with Python 3.2, the standard library provides the concurrent.futures module, which can use multiprocessing to achieve true parallelism. It ships with Python 3; under Python 2 it has to be installed separately.

The code:

# coding=utf-8
import requests
import os
from lxml import etree
import sys
from concurrent import futures

'''
reload(sys)
sys.setdefaultencoding('utf-8')
'''

platform = 'Windows' if os.name == 'nt' else 'Linux'
print(f'Current OS: [{platform}]')

# HTTP request headers
header = {
    # ':authority': 'www.mzitu.com',
    # ':method': 'GET',
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'referer': 'https://www.mzitu.com',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36'
}
site_url = 'http://www.mzitu.com'
url_prefix = 'http://www.mzitu.com/page/'
img_save_path = 'C:/mzitu/'


def get_page_max_num(page_html=None, flag=1):
    """
    :param page_html: HTML text of the page
    :param flag: which page this is. 1: the list page of all photo sets. 2: the photo page of a single set.
    :return:
    """
    # Find the maximum page number
    s_html = etree.HTML(page_html)
    xpath_string = '//div[@class="nav-links"]//a' if 1 == flag \
        else '//div[@class="pagenavi"]//a//span'
    display_page_link = s_html.xpath(xpath_string)
    # print(display_page_link[-1].text)
    max_num = display_page_link[-2].text if '下一页»' == display_page_link[-1].text \
        else display_page_link[-1].text
    return int(max_num)


def download_img(args_info):
    img_url, mm_dir, index = args_info
    r = requests.get(img_url, headers=header)
    if r.status_code == 200:
        with open(f'{mm_dir}/{index}.jpg', 'wb') as f:
            f.write(r.content)
    else:
        print(f'status code : {r.status_code}')


def main():
    # Worker count of the process pool (defaults to the number of CPUs)
    with futures.ProcessPoolExecutor() as process_pool_executor:
        site_html = requests.get(site_url, headers=header).text
        page_max_num_1 = get_page_max_num(site_html)
        for page_num in range(1, page_max_num_1 + 1):
            page_url = f'{url_prefix}{page_num}'
            page_html = requests.get(page_url, headers=header).text
            s_page_html = etree.HTML(text=page_html)
            every_page_mm_url_list = s_page_html.xpath(
                '//ul[@id="pins"]//li[not(@class="box")]/span/a'
            )
            for tag_a in every_page_mm_url_list:
                mm_url = tag_a.get('href')
                title = tag_a.text.replace('\\', '').replace('/', '').replace(':', '')
                title = title.replace('*', '').replace('?', '').replace('"', '')
                title = title.replace('<', '').replace('>', '').replace('|', '')
                mm_dir = f'{img_save_path}{title}'
                if not os.path.exists(mm_dir):
                    os.makedirs(mm_dir)
                print(f'[{title}] download started')
                mm_page_html = requests.get(mm_url, headers=header).text
                mm_page_max_num = get_page_max_num(mm_page_html, flag=2)
                for index in range(1, mm_page_max_num + 1):
                    photo_url = f'{mm_url}/{index}'
                    photo_html = requests.get(photo_url, headers=header).text
                    s_photo_html = etree.HTML(text=photo_html)
                    img_url = s_photo_html.xpath('//div[@class="main-image"]//img')[0].get('src')
                    # Submit a callable task; a Future object is returned
                    process_pool_executor.submit(download_img, (img_url, mm_dir, index))
                else:
                    print(f'[{title}] download finished')
            print(f'Page [{page_num}] done')


if __name__ == '__main__':
    main()
    pass
