
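Scraping Baidu Images with a Crawler: Downloading Many Images in Bulk

Hello everyone. This article explains how to use a Python crawler to download images from Baidu Images.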
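Step 1: Import the required libraries

Step 2: Fetch the page source

Decode the response with the "utf-8" encoding.

Step 3: Regular expression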
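
As a rough illustration of this step, the sketch below applies the same pattern that the full code uses to a made-up response string. The sample text and the example.com addresses are placeholders, not real Baidu output, so treat this as a sketch of the matching logic only.

```python
import re

# Hypothetical, shortened stand-in for the text returned by the acjson request.
sample_html = (
    '{"data":[{"thumbURL":"https://example.com/a.jpg","width":500},'
    '{"thumbURL":"https://example.com/b.jpg","width":640},{}]}'
)


def parse_pic_url(html):
    """Return every value that follows a thumbURL key in the response text."""
    # (.*?) is non-greedy, so each match stops at the first closing quote.
    return re.findall(r'thumbURL":"(.*?)"', html, re.S)


urls = parse_pic_url(sample_html)
print(urls)
```

Because the pattern only looks for the thumbURL key, it returns an empty list when the response contains no images, which lets the download loop simply do nothing for that page.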

Step 4: Get the image's binary content

Step 5: Save the image
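
The saving step can be tried without any network access. The sketch below writes placeholder bytes to a temporary folder; returning the path is an addition made here for illustration and is not part of the article's save_pic.

```python
import os
import tempfile


def save_pic(fold_name, content, pic_name):
    """Write raw image bytes to <fold_name>/<pic_name>.jpg and return the path."""
    path = os.path.join(fold_name, str(pic_name) + ".jpg")
    # "wb" writes the bytes unchanged; the with-block closes the file on exit.
    with open(path, "wb") as f:
        f.write(content)
    return path


# Placeholder bytes stand in for a real download.
demo_dir = tempfile.mkdtemp()
fake_image = b"\xff\xd8\xff\xe0 placeholder bytes"
saved_path = save_pic(demo_dir, fake_image, 0)
print(saved_path)
```

The with statement already closes the file, so no explicit close() call is needed afterwards.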

Step 6: Define a function that creates the folder
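
One possible variant of this helper is sketched below. It checks for the folder first instead of catching an exception, and it returns a flag so the caller can tell whether the folder is new; both choices are assumptions made for this example rather than the article's own version.

```python
import os
import tempfile


def create_fold(fold_name):
    """Create the folder if needed; return True only when it was newly created."""
    if os.path.isdir(fold_name):
        return False
    # exist_ok=True avoids an error if the folder appears between check and creation.
    os.makedirs(fold_name, exist_ok=True)
    return True


base = tempfile.mkdtemp()
target = os.path.join(base, "pandas")
first = create_fold(target)
second = create_fold(target)
print(first, second)
```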

Step 7: Define a main function that calls get_html

    # Define a main function that calls get_html
    def main():
        # Ask for the search keyword; it is also used as the folder name
        fold_name = input("Enter the keyword of the images to scrape: ")
        # Ask how many pages to fetch
        page_num = input("How many pages do you want to scrape? (0, 1, 2, 3, ...) ")
        # Create the folder
        create_fold(fold_name)
        # Running counter used as the image file name
        pic_name = 0
        # Loop over the pages
        for i in range(int(page_num)):
            # Endpoint only: the query string is supplied through params below,
            # so it is not repeated in the URL
            url = "https://image.baidu.com/search/acjson"
            headers = {
                "Accept": "text/plain, */*; q=0.01",
                "Accept-Encoding": "gzip, deflate",
                "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6",
                "Connection": "keep-alive",
                "Cookie": "BDqhfp=%E5%A4%A7%E7%86%8A%E7%8C%AB%E5%9B%BE%E7%89%87%26%26NaN-1undefined%26%261632%26%263; BIDUPSID=D076CA87E4CD25BA082EA0E9B5B9C82F; PSTM=1663428044; MAWEBCUID=web_fMcFGAgtkEbzDpinjKvUtGFDInsruypyhIDrXDSpxBBJoXftlZ; BAIDUID=D076CA87E4CD25BA568D2D9EF1AD5F5C:SL=0:NR=10:FG=1; indexPageSugList=%5B%22%E7%8C%AB%22%2C%22%26cl%3D2%26lm%3D-1%26ie%3Dutf-8%26oe%3Dutf-8%26adpicid%3D%26st%3D%26z%3D%26ic%3D%26hd%3D%26latest%3D%26copyright%3D%26word%3D%E5%A4%A7%E8%B1%A1%26s%3D%26se%3D%26tab%3D%26width%3D%26height%3D%26face%3D%26istype%3D%26qc%3D%26nc%3D%26fr%3D%26expermode%3D%26force%3D%26pn%3D30%26rn%3D30%22%2C%22%E6%80%A7%E6%84%9F%E7%BE%8E%E5%A5%B3%22%5D; ZFY=JujkjWiLPjOsSz:Ag1v0hFWlSBt4qjPC4L6bB4MDS6Jo:C; BAIDUID_BFESS=D076CA87E4CD25BA568D2D9EF1AD5F5C:SL=0:NR=10:FG=1; BDRCVFR[dG2JNJb_ajR]=mk3SLVN4HKm; userFrom=null; BDRCVFR[-pGxjrCMryR]=mk3SLVN4HKm; ab_sr=1.0.1_YTc4N2NiNWIyZWM5NTkzYzQ3MmZlNTI3Y2YyM2RiMTE3YmYwMTBiNzQ0YzhlZmJkZDY4YjJhZWU4NjVmMmQxZmJkYTcxODZkYTgwNjhhZDY5ZWZmYjg4Y2FmMGE5YTBmNjc3M2JhZDEwZTU1MTAyMTA1MjUxN2Y2NDNlMTJiNzhjNTIyYTQwNTg5ODNiMzc1MjRlZDdmNTVkMzdkOGJiOQ==",
                "Host": "image.baidu.com",
                "Referer": "https://image.baidu.com/search/index?tn=baiduimage&ct=201326592&lm=-1&cl=2&ie=gb18030&word=%B4%F3%D0%DC%C3%A8%CD%BC%C6%AC&fr=ala&ala=1&alatpl=normal&pos=0&dyTabStr=MTEsMCwxLDMsNiw1LDQsMiw3LDgsOQ%3D%3D",
                "Sec-Ch-Ua": '"Microsoft Edge";v="117", "Not;A=Brand";v="8", "Chromium";v="117"',
                "Sec-Ch-Ua-Mobile": "?0",
                "Sec-Ch-Ua-Platform": '"Windows"',
                "Sec-Fetch-Dest": "empty",
                "Sec-Fetch-Mode": "cors",
                "Sec-Fetch-Site": "same-origin",
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.43",
                "X-Requested-With": "XMLHttpRequest",
            }
            params = {
                "tn": "resultjson_com",
                "logid": "11637882045647848541",
                "ipn": "rj",
                "ct": "201326592",
                "fp": "result",
                "fr": "ala",
                "word": fold_name,
                "queryWord": fold_name,
                "cl": "2",
                "lm": "-1",
                "ie": "utf-8",
                "oe": "utf-8",
                # pn is the result offset: page 0 starts at 30, page 1 at 60, ...
                "pn": str((i + 1) * 30),
                "rn": "30",
                "gsm": "3c",
            }
            html = get_html(url, headers, params)
            # print(html)
            result = parse_pic_url(html)
            # Iterate over the list of image URLs
            for item in result:
                # print(item)
                # Get the binary content of the image
                pic_content = get_pic_content(item)
                # Save the image
                save_pic(fold_name, pic_content, pic_name)
                pic_name += 1
                # print(pic_content)  # binary content
                print("Saving image " + str(pic_name))

1. Folder name, number of pages, and image name

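2. url

Open Baidu Images and click Inspect to open the developer tools.

Refresh the page first, then open the Network tab, and finally select the Fetch/XHR filter.

Find the acjson request; its request URL is the one the crawler calls.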
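
The copied acjson address consists of an endpoint plus a long query string. A minimal sketch of separating the two with the standard library is shown below; the URL is a shortened version of the one used in this article, trimmed here for readability.

```python
from urllib.parse import parse_qsl, urlsplit

# Shortened version of a copied acjson request URL.
copied_url = (
    "https://image.baidu.com/search/acjson"
    "?tn=resultjson_com&ipn=rj&word=%E5%A4%A7%E7%86%8A%E7%8C%AB&is=&pn=60&rn=30"
)

parts = urlsplit(copied_url)
endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
# keep_blank_values=True keeps empty parameters such as "is=".
query = dict(parse_qsl(parts.query, keep_blank_values=True))

print(endpoint)
print(query)
```

This split matters because requests appends the params argument to any query string already present in the URL. Passing only the endpoint as url and everything else through params avoids sending the same key twice.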

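3. headers

Copy the request headers and format them as shown below: wrap the text on both sides of each ":" in double quotes "", and use single quotes '' around any value that itself contains double quotes.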

    "Accept": "text/plain, */*; q=0.01",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6",
    "Connection": "keep-alive",
    "Cookie": "BDqhfp=%E5%A4%A7%E7%86%8A%E7%8C%AB%E5%9B%BE%E7%89%87%26%26NaN-1undefined%26%261632%26%263; BIDUPSID=D076CA87E4CD25BA082EA0E9B5B9C82F; PSTM=1663428044; MAWEBCUID=web_fMcFGAgtkEbzDpinjKvUtGFDInsruypyhIDrXDSpxBBJoXftlZ; BAIDUID=D076CA87E4CD25BA568D2D9EF1AD5F5C:SL=0:NR=10:FG=1; indexPageSugList=%5B%22%E7%8C%AB%22%2C%22%26cl%3D2%26lm%3D-1%26ie%3Dutf-8%26oe%3Dutf-8%26adpicid%3D%26st%3D%26z%3D%26ic%3D%26hd%3D%26latest%3D%26copyright%3D%26word%3D%E5%A4%A7%E8%B1%A1%26s%3D%26se%3D%26tab%3D%26width%3D%26height%3D%26face%3D%26istype%3D%26qc%3D%26nc%3D%26fr%3D%26expermode%3D%26force%3D%26pn%3D30%26rn%3D30%22%2C%22%E6%80%A7%E6%84%9F%E7%BE%8E%E5%A5%B3%22%5D; ZFY=JujkjWiLPjOsSz:Ag1v0hFWlSBt4qjPC4L6bB4MDS6Jo:C; BAIDUID_BFESS=D076CA87E4CD25BA568D2D9EF1AD5F5C:SL=0:NR=10:FG=1; BDRCVFR[dG2JNJb_ajR]=mk3SLVN4HKm; userFrom=null; BDRCVFR[-pGxjrCMryR]=mk3SLVN4HKm; ab_sr=1.0.1_YTc4N2NiNWIyZWM5NTkzYzQ3MmZlNTI3Y2YyM2RiMTE3YmYwMTBiNzQ0YzhlZmJkZDY4YjJhZWU4NjVmMmQxZmJkYTcxODZkYTgwNjhhZDY5ZWZmYjg4Y2FmMGE5YTBmNjc3M2JhZDEwZTU1MTAyMTA1MjUxN2Y2NDNlMTJiNzhjNTIyYTQwNTg5ODNiMzc1MjRlZDdmNTVkMzdkOGJiOQ==",
    "Host": "image.baidu.com",
    "Referer": "https://image.baidu.com/search/index?tn=baiduimage&ct=201326592&lm=-1&cl=2&ie=gb18030&word=%B4%F3%D0%DC%C3%A8%CD%BC%C6%AC&fr=ala&ala=1&alatpl=normal&pos=0&dyTabStr=MTEsMCwxLDMsNiw1LDQsMiw3LDgsOQ%3D%3D",
    "Sec-Ch-Ua": '"Microsoft Edge";v="117", "Not;A=Brand";v="8", "Chromium";v="117"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-origin",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.43",
    "X-Requested-With": "XMLHttpRequest",
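
Adding the quotes by hand is error-prone for long header lists. As an alternative, the sketch below converts pasted header text into a dictionary. The helper name headers_from_text is hypothetical, and it assumes the copied text has one "Name: value" pair per line, which depends on the browser; the sample values are shortened.

```python
def headers_from_text(raw):
    """Turn 'Name: value' lines into a dict that can be passed as request headers."""
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ":authority".
        if not line or line.startswith(":"):
            continue
        # Split at the first colon only; values such as URLs contain colons too.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


raw_headers = """
Accept: text/plain, */*; q=0.01
Host: image.baidu.com
Referer: https://image.baidu.com/search/index?tn=baiduimage
Sec-Ch-Ua-Platform: "Windows"
"""

headers = headers_from_text(raw_headers)
print(headers)
```

Values that contain double quotes, such as Sec-Ch-Ua-Platform, need no special handling here because they never pass through Python string literal syntax.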

4. params

Open the Payload tab and copy all of the text, then delete from the code every row that has no value after the colon.
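
The cleanup described above can also be done in code. The following sketch drops empty entries from a copied payload and then fills in the keyword and the page offset. The function name build_params and the shortened payload are illustrative assumptions; the offset formula is the one used in this article's main function.

```python
def build_params(copied, keyword, page_index, page_size=30):
    """Drop empty payload entries, then set the keyword and the result offset."""
    params = {key: value for key, value in copied.items() if value != ""}
    params["word"] = keyword
    params["queryWord"] = keyword
    # Same offset formula as the article: page 0 -> 30, page 1 -> 60, ...
    params["pn"] = str((page_index + 1) * page_size)
    params["rn"] = str(page_size)
    return params


# Shortened example payload; "is" and "adpicid" have no value and are removed.
copied_payload = {"tn": "resultjson_com", "ipn": "rj", "is": "", "adpicid": "", "cl": "2"}
params = build_params(copied_payload, "大熊猫", 0)
print(params)
```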

5. Iterate over the list with a for loop

Step 8: Run the main() function

Full code

    # Import the required libraries
    import os
    import re

    import requests


    # Fetch the page source
    def get_html(url, headers, params):
        response = requests.get(url, headers=headers, params=params)
        # Set the encoding used to decode the response
        response.encoding = "utf-8"
        if response.status_code == 200:
            return response.text
        print("Failed to fetch the page source")
        # Return an empty string so that parse_pic_url finds no matches
        return ""


    # Extract the thumbnail URLs with a regular expression
    def parse_pic_url(html):
        result = re.findall('thumbURL":"(.*?)"', html, re.S)
        return result


    # Get the binary content of an image
    def get_pic_content(url):
        response = requests.get(url)
        # .content holds the raw bytes, so no text encoding is involved
        return response.content


    # Save the image
    def save_pic(fold_name, content, pic_name):
        # The with-block closes the file automatically
        with open(os.path.join(fold_name, str(pic_name) + ".jpg"), "wb") as f:
            f.write(content)


    # Define a function that creates the folder
    def create_fold(fold_name):
        # Handle the case where the folder already exists
        try:
            os.mkdir(fold_name)
        except FileExistsError:
            print("Folder already exists")


    # Define a main function that calls get_html
    def main():
        # Ask for the search keyword; it is also used as the folder name
        fold_name = input("Enter the keyword of the images to scrape: ")
        # Ask how many pages to fetch
        page_num = input("How many pages do you want to scrape? (0, 1, 2, 3, ...) ")
        # Create the folder
        create_fold(fold_name)
        # Running counter used as the image file name
        pic_name = 0
        # Loop over the pages
        for i in range(int(page_num)):
            # Endpoint only: the query string is supplied through params below,
            # so it is not repeated in the URL
            url = "https://image.baidu.com/search/acjson"
            headers = {
                "Accept": "text/plain, */*; q=0.01",
                "Accept-Encoding": "gzip, deflate",
                "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6",
                "Connection": "keep-alive",
                "Cookie": "BDqhfp=%E5%A4%A7%E7%86%8A%E7%8C%AB%E5%9B%BE%E7%89%87%26%26NaN-1undefined%26%261632%26%263; BIDUPSID=D076CA87E4CD25BA082EA0E9B5B9C82F; PSTM=1663428044; MAWEBCUID=web_fMcFGAgtkEbzDpinjKvUtGFDInsruypyhIDrXDSpxBBJoXftlZ; BAIDUID=D076CA87E4CD25BA568D2D9EF1AD5F5C:SL=0:NR=10:FG=1; indexPageSugList=%5B%22%E7%8C%AB%22%2C%22%26cl%3D2%26lm%3D-1%26ie%3Dutf-8%26oe%3Dutf-8%26adpicid%3D%26st%3D%26z%3D%26ic%3D%26hd%3D%26latest%3D%26copyright%3D%26word%3D%E5%A4%A7%E8%B1%A1%26s%3D%26se%3D%26tab%3D%26width%3D%26height%3D%26face%3D%26istype%3D%26qc%3D%26nc%3D%26fr%3D%26expermode%3D%26force%3D%26pn%3D30%26rn%3D30%22%2C%22%E6%80%A7%E6%84%9F%E7%BE%8E%E5%A5%B3%22%5D; ZFY=JujkjWiLPjOsSz:Ag1v0hFWlSBt4qjPC4L6bB4MDS6Jo:C; BAIDUID_BFESS=D076CA87E4CD25BA568D2D9EF1AD5F5C:SL=0:NR=10:FG=1; BDRCVFR[dG2JNJb_ajR]=mk3SLVN4HKm; userFrom=null; BDRCVFR[-pGxjrCMryR]=mk3SLVN4HKm; ab_sr=1.0.1_YTc4N2NiNWIyZWM5NTkzYzQ3MmZlNTI3Y2YyM2RiMTE3YmYwMTBiNzQ0YzhlZmJkZDY4YjJhZWU4NjVmMmQxZmJkYTcxODZkYTgwNjhhZDY5ZWZmYjg4Y2FmMGE5YTBmNjc3M2JhZDEwZTU1MTAyMTA1MjUxN2Y2NDNlMTJiNzhjNTIyYTQwNTg5ODNiMzc1MjRlZDdmNTVkMzdkOGJiOQ==",
                "Host": "image.baidu.com",
                "Referer": "https://image.baidu.com/search/index?tn=baiduimage&ct=201326592&lm=-1&cl=2&ie=gb18030&word=%B4%F3%D0%DC%C3%A8%CD%BC%C6%AC&fr=ala&ala=1&alatpl=normal&pos=0&dyTabStr=MTEsMCwxLDMsNiw1LDQsMiw3LDgsOQ%3D%3D",
                "Sec-Ch-Ua": '"Microsoft Edge";v="117", "Not;A=Brand";v="8", "Chromium";v="117"',
                "Sec-Ch-Ua-Mobile": "?0",
                "Sec-Ch-Ua-Platform": '"Windows"',
                "Sec-Fetch-Dest": "empty",
                "Sec-Fetch-Mode": "cors",
                "Sec-Fetch-Site": "same-origin",
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.43",
                "X-Requested-With": "XMLHttpRequest",
            }
            params = {
                "tn": "resultjson_com",
                "logid": "11637882045647848541",
                "ipn": "rj",
                "ct": "201326592",
                "fp": "result",
                "fr": "ala",
                "word": fold_name,
                "queryWord": fold_name,
                "cl": "2",
                "lm": "-1",
                "ie": "utf-8",
                "oe": "utf-8",
                # pn is the result offset: page 0 starts at 30, page 1 at 60, ...
                "pn": str((i + 1) * 30),
                "rn": "30",
                "gsm": "3c",
            }
            html = get_html(url, headers, params)
            # print(html)
            result = parse_pic_url(html)
            # Iterate over the list of image URLs
            for item in result:
                # print(item)
                # Get the binary content of the image
                pic_content = get_pic_content(item)
                # Save the image
                save_pic(fold_name, pic_content, pic_name)
                pic_name += 1
                # print(pic_content)  # binary content
                print("Saving image " + str(pic_name))


    # Run the main function
    if __name__ == '__main__':
        main()

That is the end of this article. Thanks for reading!
