
Python Web Scraping: Scraping Weibo Comments (4)

Using web-crawler techniques to scrape the comments under a given Weibo topic.

I. Fetching a single page of comments

Pick a Weibo post at random, for example this one:

【#出操死亡女生家属... - @冷暖视频 - Weibo (weibo.com)

1. Press Fn+F12 to open the developer tools, switch to the Network tab, search for a snippet of a comment, and check the Preview pane; you will see the comment text inside one of the responses.

2. Write code to fetch that data. To find the URL, select the request in the Network tab and open the Headers pane; the Request URL is listed there.

    # requests is a third-party scraping library and must be installed separately (pip install requests)
    import requests
    # url is the address we want to request
    url = 'https://weibo.com/ajax/statuses/buildComments?flow=0&is_reload=1&id=5024086457849590&is_show_bulletin=2&is_mix=0&count=10&uid=6532102014&fetch_level=0&locale=zh-CN'
    # 1. Send the request
    response = requests.get(url=url)
    response.encoding = 'utf-8'  # Weibo serves UTF-8; setting it explicitly keeps Chinese text from garbling
    # 2. Print the response data
    print(response.text)

3. The output, however, is not what we hoped for: the response is not the page text rendered as readable characters. To get the data back as plain text, we need to define request headers. Why? Because the site was never meant to serve crawlers; it only wants to serve ordinary visitors, so we disguise our crawler as a normal browser request. Setting cookie, referer, and user-agent is usually enough (some sites may demand additional parameters).

In the Network tab, open the Headers pane and scroll down to find cookie, referer, and user-agent.

    import requests
    # Request headers
    headers = {
        # User identity information (copy your own cookie from the Headers pane)
        'cookie': 'XSRF-TOKEN=ZcbKzr5C4_40k_yYwTHIEj7k; PC_TOKEN=4496aa7595; login_sid_t=e6e7e18ba091dcafc2013982ea8fa895; cross_origin_proto=SSL; WBStorage=267ec170|undefined; _s_tentry=cn.bing.com; UOR=cn.bing.com,weibo.com,cn.bing.com; Apache=1040794478428.4973.1713353134174; SINAGLOBAL=1040794478428.4973.1713353134174; ULV=1713353134177:1:1:1:1040794478428.4973.1713353134174:; wb_view_log=1287*8051.9891666173934937; SUB=_2A25LG8JKDeRhGeFJ71MX9S3Lzj2IHXVoWVuCrDV8PUNbmtANLVnQkW9Nf_9NjBaZTVOL8AH-RMG38C00YruaYRtp; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9W5kReTqg1GgdNOFZ.hiJO0G5NHD95QNS0BpSo-0S0-pWs4DqcjMi--NiK.Xi-2Ri--ciKnRi-zNS0MXeKqfe0MfeBtt; ALF=02_1715945242; WBPSESS=FH255CAr_cfIbZ29-Y520e5NsKBGpFZni0Bmim3vDfjXHIPxgSSGqvAfC_UQmc3W2RLUzLHkkX4YI-_Pn1KHeHJhkeHw5kFxeJYgMYDr9t5bvBCMRkcG_IvV3Y2XiVRlu9ZS91UwfD5AH5MY7jhkfw==',
        # Anti-hotlinking check
        'referer': 'https://weibo.com/6532102014/Oa6B7wW2i',
        # Basic browser information
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.0.0'
    }
    url = 'https://weibo.com/ajax/statuses/buildComments?flow=0&is_reload=1&id=5024086457849590&is_show_bulletin=2&is_mix=0&count=10&uid=6532102014&fetch_level=0&locale=zh-CN'
    # 1. Send the request
    response = requests.get(url=url, headers=headers)
    # 2. Print the text of the first comment
    print(response.json()['data'][0]['text_raw'])
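Before pulling out more fields, it helps to see what a comment object actually contains. A minimal sketch that pretty-prints the first comment from the response above (it reuses the `response` variable from the previous script; the fields shown are whatever Weibo currently returns):

    import json

    # Pretty-print one comment object so the available fields are easy to read
    first_comment = response.json()['data'][0]
    print(json.dumps(first_comment, ensure_ascii=False, indent=2))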

4. Extract more fields from each comment and save them to a CSV file (which Excel can open).

    import requests
    import csv
    # utf-8-sig adds a BOM so Excel displays the Chinese text correctly
    f = open('评论.csv', mode='a', encoding='utf-8-sig', newline='')
    csv_write = csv.writer(f)
    csv_write.writerow(['id', 'screen_name', 'text_raw', 'like_counts', 'total_number', 'created_at'])
    # Request headers
    headers = {
        # User identity information
        'cookie': 'XSRF-TOKEN=ZcbKzr5C4_40k_yYwTHIEj7k; PC_TOKEN=4496aa7595; login_sid_t=e6e7e18ba091dcafc2013982ea8fa895; cross_origin_proto=SSL; WBStorage=267ec170|undefined; _s_tentry=cn.bing.com; UOR=cn.bing.com,weibo.com,cn.bing.com; Apache=1040794478428.4973.1713353134174; SINAGLOBAL=1040794478428.4973.1713353134174; ULV=1713353134177:1:1:1:1040794478428.4973.1713353134174:; wb_view_log=1287*8051.9891666173934937; SUB=_2A25LG8JKDeRhGeFJ71MX9S3Lzj2IHXVoWVuCrDV8PUNbmtANLVnQkW9Nf_9NjBaZTVOL8AH-RMG38C00YruaYRtp; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9W5kReTqg1GgdNOFZ.hiJO0G5NHD95QNS0BpSo-0S0-pWs4DqcjMi--NiK.Xi-2Ri--ciKnRi-zNS0MXeKqfe0MfeBtt; ALF=02_1715945242; WBPSESS=FH255CAr_cfIbZ29-Y520e5NsKBGpFZni0Bmim3vDfjXHIPxgSSGqvAfC_UQmc3W2RLUzLHkkX4YI-_Pn1KHeHJhkeHw5kFxeJYgMYDr9t5bvBCMRkcG_IvV3Y2XiVRlu9ZS91UwfD5AH5MY7jhkfw==',
        # Anti-hotlinking check
        'referer': 'https://weibo.com/6532102014/Oa6B7wW2i',
        # Basic browser information
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.0.0'
    }
    url = 'https://weibo.com/ajax/statuses/buildComments?flow=0&is_reload=1&id=5024086457849590&is_show_bulletin=2&is_mix=0&count=10&uid=6532102014&fetch_level=0&locale=zh-CN'
    # 1. Send the request
    response = requests.get(url=url, headers=headers)
    # 2. Parse the JSON response
    json_data = response.json()
    # 3. Extract the fields we want from each comment
    data_list = json_data['data']
    for data in data_list:
        text_raw = data['text_raw']
        id = data['id']
        created_at = data['created_at']
        like_counts = data['like_counts']
        total_number = data['total_number']
        screen_name = data['user']['screen_name']
        print(id, screen_name, text_raw, like_counts, total_number, created_at)
        # 4. Save the row
        csv_write.writerow([id, screen_name, text_raw, like_counts, total_number, created_at])
    f.close()
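A side note on the file handling: a more robust pattern (a standard-library sketch, not part of the original script) is to open the CSV in a `with` block, which closes the file automatically even if a request fails midway:

    import csv

    # The with-statement closes the file even if an exception occurs mid-scrape
    with open('评论.csv', mode='a', encoding='utf-8-sig', newline='') as f:
        csv_write = csv.writer(f)
        csv_write.writerow(['id', 'screen_name', 'text_raw', 'like_counts', 'total_number', 'created_at'])
        # ... send the request and write rows exactly as above ...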

II. Fetching multiple pages of comments

Looking closely, every batch of comments should come from a request whose URL contains buildComments.

Search the Network tab for buildComments and open the matches: each URL corresponds to 20 comments. Since we obviously cannot type every URL by hand, we need to work out the pattern behind them.

We find that, apart from the first page, the page URLs are identical except for the value of max_id. Comparing a few pairs confirms the chain: the max_id returned with the second page appears as the max_id parameter in the third page's URL, and so on.

So the code looks like this:

    import requests
    import csv
    f = open('评论.csv', mode='a', encoding='utf-8-sig', newline='')
    csv_write = csv.writer(f)
    csv_write.writerow(['id', 'screen_name', 'text_raw', 'like_counts', 'total_number', 'created_at'])
    # Request headers
    headers = {
        # User identity information
        'cookie': 'XSRF-TOKEN=ZcbKzr5C4_40k_yYwTHIEj7k; PC_TOKEN=4496aa7595; login_sid_t=e6e7e18ba091dcafc2013982ea8fa895; cross_origin_proto=SSL; WBStorage=267ec170|undefined; _s_tentry=cn.bing.com; UOR=cn.bing.com,weibo.com,cn.bing.com; Apache=1040794478428.4973.1713353134174; SINAGLOBAL=1040794478428.4973.1713353134174; ULV=1713353134177:1:1:1:1040794478428.4973.1713353134174:; wb_view_log=1287*8051.9891666173934937; SUB=_2A25LG8JKDeRhGeFJ71MX9S3Lzj2IHXVoWVuCrDV8PUNbmtANLVnQkW9Nf_9NjBaZTVOL8AH-RMG38C00YruaYRtp; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9W5kReTqg1GgdNOFZ.hiJO0G5NHD95QNS0BpSo-0S0-pWs4DqcjMi--NiK.Xi-2Ri--ciKnRi-zNS0MXeKqfe0MfeBtt; ALF=02_1715945242; WBPSESS=FH255CAr_cfIbZ29-Y520e5NsKBGpFZni0Bmim3vDfjXHIPxgSSGqvAfC_UQmc3W2RLUzLHkkX4YI-_Pn1KHeHJhkeHw5kFxeJYgMYDr9t5bvBCMRkcG_IvV3Y2XiVRlu9ZS91UwfD5AH5MY7jhkfw==',
        # Anti-hotlinking check
        'referer': 'https://weibo.com/6532102014/Oa6B7wW2i',
        # Basic browser information
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.0.0'
    }
    def get_next(next='max_id=0'):  # max_id=0 requests the first page
        url = f'https://weibo.com/ajax/statuses/buildComments?flow=0&is_reload=1&id=5024086457849590&is_show_bulletin=2&is_mix=0&{next}&count=20&uid=6532102014&fetch_level=0&locale=zh-CN'
        response = requests.get(url=url, headers=headers)
        json_data = response.json()
        data_list = json_data['data']
        # max_id identifies the next page; it is 0 on the last page
        max_id = json_data['max_id']
        for data in data_list:
            text_raw = data['text_raw']
            id = data['id']
            created_at = data['created_at']
            like_counts = data['like_counts']
            total_number = data['total_number']
            screen_name = data['user']['screen_name']
            print(id, screen_name, text_raw, like_counts, total_number, created_at)
            csv_write.writerow([id, screen_name, text_raw, like_counts, total_number, created_at])
        if max_id != 0:
            get_next('max_id=' + str(max_id))
    get_next()
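One caveat about the recursion: each page adds a stack frame, so a post with more than roughly a thousand pages of comments would hit Python's default recursion limit. A minimal iterative sketch of the same max_id chain (it reuses requests, headers, csv_write, and f from the script above; the one-second time.sleep is an assumed politeness delay, not something Weibo documents as required):

    import time

    # Iterative pagination: follow the max_id chain with a while loop instead of recursion
    max_id = 0
    while True:
        url = f'https://weibo.com/ajax/statuses/buildComments?flow=0&is_reload=1&id=5024086457849590&is_show_bulletin=2&is_mix=0&max_id={max_id}&count=20&uid=6532102014&fetch_level=0&locale=zh-CN'
        json_data = requests.get(url=url, headers=headers).json()
        for data in json_data['data']:
            csv_write.writerow([data['id'], data['user']['screen_name'], data['text_raw'],
                                data['like_counts'], data['total_number'], data['created_at']])
        max_id = json_data['max_id']
        if max_id == 0:  # 0 means there are no more pages
            break
        time.sleep(1)  # assumed delay to avoid hammering the endpoint
    f.close()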
