
[Python Web Scraping] Batch-Downloading Images from Web Pages & Building a Dataset

Batch-downloading images from a website

Python is widely used for web scraping thanks to its powerful, extensive libraries, simple syntax, and high development efficiency; many people first get to know Python through a crawler.

I'm currently working on an object-recognition project and need a large set of training samples. Downloading images from web pages one by one is far too slow, so I wrote a program to batch-download them instead. Feel free to copy the code and run it locally; I've added comments throughout to make it easier to follow, and if you have questions you can leave me a comment.

This post only shares the code, so it won't go into crawler theory or architecture in detail, but the comments should be enough to get a beginner started.

(1) Getting the request headers

First, understand that a crawler simply imitates a person browsing a web page, using code to fetch information in bulk. So before scraping, you need to capture the page's request headers. Think of it as a meal ticket: only when you show it will the canteen lady serve you lunch.

1. Open the site you want to scrape (Baidu Images in my case). Press F12 to open the developer console, click the "Network" tab, filter by "XHR", then click one of the request names on the left to find the User-Agent, Host, and Cookie values. Assemble them into a headers dict, as in the following code:

# These values were captured from the author's browser session; substitute your own.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome'
                  '/112.0.0.0 Safari/537.36 Edg/112.0.1722.58',
    'Host': 'image.baidu.com',
    'Cookie': 'BIDUPSID=6096EFD12C571F1D6231034147921FB8; PSTM=1682383713; BAIDUID=6096EFD12C571F1D5581B126'
              '79EA8E7D:FG=1; BD_UPN=12314753; BDORZ=FFFB88E999055A3F8A630C64834BD6D0; delPer=0; BD_CK_SAM=1'
              '; PSINO=5; BDRCVFR[dG2JNJb_ajR]=mk3SLVN4HKm; BAIDUID_BFESS=6096EFD12C571F1D5581B12679EA8E7D:FG'
              '=1; userFrom=null; BDRCVFR[-pGxjrCMryR]=mk3SLVN4HKm; BDRCVFR[tox4WRQ4-Km]=mk3SLVN4HKm; BDRCVFR'
              '[A24tJn4Wkd_]=mk3SLVN4HKm; shifen[598151295075_76725]=1682411441; BCLID=10450312018116497963; B'
              'CLID_BFESS=10450312018116497963; BDSFRCVID=z9_OJeC62lsu0DJfOkenUsu36Pzw6K3TH6bHQI-qy-1kcJagoI4a'
              'EG0PUx8g0KuMDFkVogKK0eOTHktF_2uxOjjg8UtVJeC6EG0Ptf8g0f5; BDSFRCVID_BFESS=z9_OJeC62lsu0DJfOkenUsu'
              '36Pzw6K3TH6bHQI-qy-1kcJagoI4aEG0PUx8g0KuMDFkVogKK0eOTHktF_2uxOjjg8UtVJeC6EG0Ptf8g0f5; H_BDCLCKID'
              '_SF=tbFqoK8bJKL3qJTph47hqR-8MxrK2JT3KC_X3b7Ef-FB_p7_bf--D4Ay5H3RBt592KTX-4OatKQmJ40CyTbxy5KVybQA'
              'eRo8HR6W3hcq5b7zMbjHQT3m3JvbbN3i-xrR3D3pWb3cWKJq8UbSMnOPBTD02-nBat-OQ6npaJ5nJq5nhMJmb67JD-50exbH5'
              '5uHtb-e3H; H_BDCLCKID_SF_BFESS=tbFqoK8bJKL3qJTph47hqR-8MxrK2JT3KC_X3b7Ef-FB_p7_bf--D4Ay5H3RBt592K'
              'TX-4OatKQmJ40CyTbxy5KVybQAeRo8HR6W3hcq5b7zMbjHQT3m3JvbbN3i-xrR3D3pWb3cWKJq8UbSMnOPBTD02-nBat-OQ6n'
              'paJ5nJq5nhMJmb67JD-50exbH55uHtb-e3H; BDRCVFR[Q5XHKaSBNfR]=mk3SLVN4HKm; BA_HECTOR=0005ah85ah01akak'
              'ah218kdl1i4f4661m; ZFY=2wMwDt78vksPrYmFMrRHpQ0FDKAW:BwWKHieg1S7DwzI:C; Hm_lvt_aec699bb6442ba076c89'
              '81c6dc490771=1682412342; Hm_lpvt_aec699bb6442ba076c8981c6dc490771=1682412342; COOKIE_SESSION=297_0'
              '_8_8_21_12_1_0_8_6_0_1_4899_0_356_0_1682412370_0_1682412014%7C9%230_0_1682412014%7C1; ZD_ENTRY=bai'
              'du; ab_sr=1.0.1_MTQ3MDNkZDUwMWVlMDBiOTUwOTNmZTIyZWYxOTI5MjA5OGY2ZDE3MjZhODhkZTNkMjg0YjY2MDMwYjhiZDI'
              '2YTZhY2Y3MjRkZTQ0ZDVlNjJlNzQyZTg1NTYwMmU4MDg0MWVlOGYxYjljYzAxZmEyZTc1NDc2NTBjYjczMjBhZmY1MTcyYWQyYT'
              'g0YTE1Mzc2NmUxODA3ZWU2YmE5MDM5MQ==; __bid_n=187b632a23dacdbd374207; H_PS_645EC=1b0eaVb4%2FdPDyC6op6N'
              'C0mbno0FhzDP1g9C0LK2F9fx137fXB7h1o3RqkSjaSbV12NqWTbs; BD_HOME=1; H_PS_PSSID=38516_36554_38469_38368_'
              '38468_38485_37928_37709_38356_26350_38546'
}
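
Before moving on, it's worth a quick sanity check that the server accepts these headers. A minimal sketch (assuming the headers dict above is in scope; the keyword "cat" in the URL is just a placeholder):

import requests

# An HTTP 200 with an HTML body suggests Baidu accepted our headers.
test_url = 'https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=cat'
response = requests.get(test_url, headers=headers, timeout=10)
print(response.status_code)                    # expect 200
print(response.headers.get('Content-Type'))    # expect something like text/html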

(2) Getting the image download links

Not much to explain here, so straight to the code (key is the keyword you want to search for; the full code later reads it interactively, or you can hard-code it).

import re

import requests


def Get_image_url(headers, key):
    print("Fetching image download links...")
    url = 'https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word={}'.format(key)
    request = requests.get(url, headers=headers, timeout=10)  # timeout aborts a request that never returns
    # Fetch the page source
    image_urls = re.findall('"objURL":"(.*?)",', request.text, re.S)
    # Extract the image download links as a list
    if not image_urls:
        print("Error: failed to get the image download links!")
    else:
        return image_urls

Note: the page is parsed here with a re regular expression. There are plenty of other ways to parse it, which you can explore if interested; if that sounds like a hassle, just use my code as-is.
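
To illustrate one alternative (my own sketch, not part of the original script): the same idea expressed with BeautifulSoup, which suits pages that serve plain <img> tags. Baidu Images embeds its links in JavaScript, so on this particular site the regex above remains the more reliable route; bs4 needs installing first (pip install beautifulsoup4).

import requests
from bs4 import BeautifulSoup


def get_img_srcs(headers, url):
    # Collect the src attribute of every <img> tag on a conventional HTML page.
    # JS-rendered galleries like Baidu's may expose few or none this way.
    html = requests.get(url, headers=headers, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')
    return [img.get('src') for img in soup.find_all('img') if img.get('src')]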

(3) Saving the images locally

import requests


def Write_image(image_urls, num):
    for i in range(0, min(num, len(image_urls))):  # never index past the links we actually got
        image_data = requests.get(image_urls[i], timeout=10)
        # Fetch the image bytes behind the download link
        print("Downloading image %s:" % (i + 1))
        image_path = "G:/try/%s.jpg" % (i + 1)  # save path; the folder must exist
        with open(image_path, 'wb') as fp:
            fp.write(image_data.content)
        # Write the image bytes to disk

num in this code is the number of images to download; you can edit it directly or read it from the keyboard (see the full code below).
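
One caveat: objURL links scraped from search results are often stale, and a single dead link will crash the loop above with an exception. A hedged variant of the function (my own addition, names assumed) that skips failures and still tries to deliver num images might look like this:

import requests


def Write_image_safe(image_urls, num):
    saved = 0
    for link in image_urls:
        if saved >= num:
            break
        try:
            image_data = requests.get(link, timeout=10)
            image_data.raise_for_status()  # count 4xx/5xx responses as failures too
        except requests.RequestException as err:
            print("Skipping dead link:", err)
            continue
        saved += 1
        with open("G:/try/%s.jpg" % saved, 'wb') as fp:
            fp.write(image_data.content)
        print("Downloaded image %s" % saved)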

(4) Full code

# -*- coding: utf-8 -*-
"""
@Time : 2023/4/25 9:24
@Auth : RS迷途小书童
@File : Get_image_online.py
@IDE  : PyCharm
"""
import os
import re

import requests


def Get_image_url(headers, key):
    print("Fetching image download links...")
    url = 'https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word={}'.format(key)
    request = requests.get(url, headers=headers, timeout=10)  # timeout aborts a request that never returns
    # Fetch the page source
    image_urls = re.findall('"objURL":"(.*?)",', request.text, re.S)
    # Extract the image download links as a list
    if not image_urls:
        print("Error: failed to get the image download links!")
    else:
        return image_urls


def Write_image(image_urls, num):
    for i in range(0, min(num, len(image_urls))):  # never index past the links we actually got
        image_data = requests.get(image_urls[i], timeout=10)
        # Fetch the image bytes behind the download link
        print("Downloading image %s:" % (i + 1))
        image_path = "G:/try/%s.jpg" % (i + 1)
        # Save path for the images; change it to suit
        with open(image_path, 'wb') as fp:
            fp.write(image_data.content)
        # Write the image bytes to disk


if __name__ == '__main__':
    os.makedirs("G:/try", exist_ok=True)  # make sure the save folder exists
    # Header values captured from the author's browser session; replace at least
    # the Cookie with one from your own browser (see section 1).
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome'
                      '/112.0.0.0 Safari/537.36 Edg/112.0.1722.58',
        'Host': 'image.baidu.com',
        'Cookie': 'BIDUPSID=6096EFD12C571F1D6231034147921FB8; PSTM=1682383713; BAIDUID=6096EFD12C571F1D5581B126'
                  '79EA8E7D:FG=1; BD_UPN=12314753; BDORZ=FFFB88E999055A3F8A630C64834BD6D0; delPer=0; BD_CK_SAM=1'
                  '; PSINO=5; BDRCVFR[dG2JNJb_ajR]=mk3SLVN4HKm; BAIDUID_BFESS=6096EFD12C571F1D5581B12679EA8E7D:FG'
                  '=1; userFrom=null; BDRCVFR[-pGxjrCMryR]=mk3SLVN4HKm; BDRCVFR[tox4WRQ4-Km]=mk3SLVN4HKm; BDRCVFR'
                  '[A24tJn4Wkd_]=mk3SLVN4HKm; shifen[598151295075_76725]=1682411441; BCLID=10450312018116497963; B'
                  'CLID_BFESS=10450312018116497963; BDSFRCVID=z9_OJeC62lsu0DJfOkenUsu36Pzw6K3TH6bHQI-qy-1kcJagoI4a'
                  'EG0PUx8g0KuMDFkVogKK0eOTHktF_2uxOjjg8UtVJeC6EG0Ptf8g0f5; BDSFRCVID_BFESS=z9_OJeC62lsu0DJfOkenUsu'
                  '36Pzw6K3TH6bHQI-qy-1kcJagoI4aEG0PUx8g0KuMDFkVogKK0eOTHktF_2uxOjjg8UtVJeC6EG0Ptf8g0f5; H_BDCLCKID'
                  '_SF=tbFqoK8bJKL3qJTph47hqR-8MxrK2JT3KC_X3b7Ef-FB_p7_bf--D4Ay5H3RBt592KTX-4OatKQmJ40CyTbxy5KVybQA'
                  'eRo8HR6W3hcq5b7zMbjHQT3m3JvbbN3i-xrR3D3pWb3cWKJq8UbSMnOPBTD02-nBat-OQ6npaJ5nJq5nhMJmb67JD-50exbH5'
                  '5uHtb-e3H; H_BDCLCKID_SF_BFESS=tbFqoK8bJKL3qJTph47hqR-8MxrK2JT3KC_X3b7Ef-FB_p7_bf--D4Ay5H3RBt592K'
                  'TX-4OatKQmJ40CyTbxy5KVybQAeRo8HR6W3hcq5b7zMbjHQT3m3JvbbN3i-xrR3D3pWb3cWKJq8UbSMnOPBTD02-nBat-OQ6n'
                  'paJ5nJq5nhMJmb67JD-50exbH55uHtb-e3H; BDRCVFR[Q5XHKaSBNfR]=mk3SLVN4HKm; BA_HECTOR=0005ah85ah01akak'
                  'ah218kdl1i4f4661m; ZFY=2wMwDt78vksPrYmFMrRHpQ0FDKAW:BwWKHieg1S7DwzI:C; Hm_lvt_aec699bb6442ba076c89'
                  '81c6dc490771=1682412342; Hm_lpvt_aec699bb6442ba076c8981c6dc490771=1682412342; COOKIE_SESSION=297_0'
                  '_8_8_21_12_1_0_8_6_0_1_4899_0_356_0_1682412370_0_1682412014%7C9%230_0_1682412014%7C1; ZD_ENTRY=bai'
                  'du; ab_sr=1.0.1_MTQ3MDNkZDUwMWVlMDBiOTUwOTNmZTIyZWYxOTI5MjA5OGY2ZDE3MjZhODhkZTNkMjg0YjY2MDMwYjhiZDI'
                  '2YTZhY2Y3MjRkZTQ0ZDVlNjJlNzQyZTg1NTYwMmU4MDg0MWVlOGYxYjljYzAxZmEyZTc1NDc2NTBjYjczMjBhZmY1MTcyYWQyYT'
                  'g0YTE1Mzc2NmUxODA3ZWU2YmE5MDM5MQ==; __bid_n=187b632a23dacdbd374207; H_PS_645EC=1b0eaVb4%2FdPDyC6op6N'
                  'C0mbno0FhzDP1g9C0LK2F9fx137fXB7h1o3RqkSjaSbV12NqWTbs; BD_HOME=1; H_PS_PSSID=38516_36554_38469_38368_'
                  '38468_38485_37928_37709_38356_26350_38546'
    }
    key = input("Enter the keyword for the images you want: ")
    num = int(input("Enter how many images to download: "))
    image_urls = Get_image_url(headers, key)
    if image_urls:
        Write_image(image_urls, num)
    else:
        print("Program ended!")

Result screenshot: (screenshot omitted)

image_path = "G:/try/%s.jpg" % (i + 1) in the code sets the save path, which you can change. The search keyword and image count are read from the keyboard.
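
Since the goal stated at the start is a training set, an optional follow-up step (my own sketch, not part of the original script; it requires Pillow, pip install pillow, and the 224x224 target size is an assumption, so use whatever your model expects) is to normalize the downloaded files:

import os

from PIL import Image

src_dir = "G:/try"  # the same folder Write_image saves into
for name in os.listdir(src_dir):
    if not name.endswith(".jpg"):
        continue
    path = os.path.join(src_dir, name)
    try:
        img = Image.open(path).convert("RGB")  # unify the color mode
    except OSError:
        print("Skipping unreadable file:", name)  # dead links sometimes leave broken files behind
        continue
    img.resize((224, 224)).save(path)  # resize to one uniform shape for training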

That wraps up this batch image-download walkthrough; it's shared for study and reference only. Leave a comment if you have questions!
