
[Updated 2024] A Crawler for Bilibili Video/Dynamic Comments

Let's skip the chit-chat and start with the Git repo: GitHub - linyuye/Bilibili_crawler: a Bilibili crawler that uses selenium to obtain the oid and cookie, and requests to fetch the API content

0: Concepts in Brief

oid: the unique identifier of a video/dynamic — Bilibili's universal ID for any piece of published content.

cookie: carries your login state; required in the crawler's request headers.

For everything else, see this document:

bilibili-API-collect/docs/comment/list.md at master · SocialSisterYi/bilibili-API-collect: https://github.com/SocialSisterYi/bilibili-API-collect/blob/master/docs/comment/list.md

1: Preface

First, let's understand the difference between dynamic and static websites:

"A dynamic website, beyond designing the pages themselves, uses databases and server-side programs to give the site more automated and advanced features. Dynamic pages are typically built with technologies such as ASP, JSP, PHP, or ASPX, while static pages usually end in .html (HTML being a subset of SGML). A dynamic site demands a higher (and pricier) server configuration than a static one, but it makes updating content easier and suits corporate sites. 'Dynamic' is defined relative to static websites." — Baidu Baike

In plain terms: the difference is whether the page content is written directly into the site's source code.

As an example, take the #1 video on Bilibili's weekly must-watch list at the time of writing: https://www.bilibili.com/video/BV1ay411h74i/

Look at the first comment: it does not appear anywhere in the page source, so the data must be returned from the backend/database. Inspecting the network requests in the browser's developer tools, we find the data comes back from a request whose URL contains main?oid. Now we know where the comments come from.
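As a minimal sketch of that request in Python (assuming you already have an oid and a logged-in cookie string; the values below are placeholders, and the oid is the example one that appears later in this article):

    import requests

    OID = "544588138"  # placeholder; the example oid from the complete code below
    COOKIE = "SESSDATA=...; bili_jct=..."  # placeholder; copy from your own browser

    params = {"type": 1, "oid": OID, "next": 0, "mode": 3}
    headers = {"User-Agent": "Mozilla/5.0", "Cookie": COOKIE}
    resp = requests.get("https://api.bilibili.com/x/v2/reply/main",
                        params=params, headers=headers)
    print(resp.json()["data"]["replies"][0]["content"]["message"])  # first comment's text

The sections below unpack what each of these parameters means.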

2: API Details

API analysis source: GitHub - SocialSisterYi/bilibili-API-collect (哔哩哔哩-API收集整理): https://github.com/SocialSisterYi/bilibili-API-collect

The comment API specifically: https://github.com/SocialSisterYi/bilibili-API-collect/blob/master/docs/comment/list.md

Reading through the API docs above, we learn that the normal comment API stops returning content once you reach about page 400. Fortunately, Bilibili also provides a lazy-load API that can page through everything.

https://api.bilibili.com/x/v2/reply/main — the lazy-load API; no WBI signature required

https://api.bilibili.com/x/v2/reply/reply — the API for fetching sub-comments (replies to a reply)
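For example, this request (taken from a comment in the complete code below) can be opened directly in a browser: https://api.bilibili.com/x/v2/reply/main?next=1&type=1&oid=544588138&mode=3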

Requests to the comment APIs take the following parameters (the first table is for top-level comments, the second for sub-comments); parameter dicts for both follow the tables.

Main (lazy-load) API parameters:

| Parameter  | Type | Description            | Required            | Notes |
|------------|------|------------------------|---------------------|-------|
| access_key | str  | APP login token        | required (APP auth) |       |
| type       | num  | comment area type code | required            | see the type-code table in the API docs |
| oid        | num  | target comment area id | required            |       |
| mode       | num  | sort order             | optional            | default 3; 0/3: hotness only; 1: hotness + time; 2: time only |
| next       | num  | page selector          | optional            | by hotness: hot-order page number (0 = first page); by time: floor number, descending; default 0 |
| ps         | num  | items per page         | optional            | default 20; range 1-30 |

Sub-comment (reply) API parameters:

| Parameter  | Type | Description            | Required            | Notes |
|------------|------|------------------------|---------------------|-------|
| access_key | str  | APP login token        | required (APP auth) |       |
| type       | num  | comment area type code | required            | see the type-code table in the API docs |
| oid        | num  | target comment area id | required            |       |
| root       | num  | rpid of the root reply | required            |       |
| ps         | num  | items per page         | optional            | default 20; range 1-49, but data_replies holds at most 20 items, so asking for 49 still returns only 20 |
| pn         | num  | page number            | optional            | default 1 |
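For the first table, a parameter dict for the lazy-load API might look like this sketch (placeholder values; oid is captured automatically in section four):

    # Sketch of the main-API parameters per the first table (placeholder values).
    data = {
        'type': 1,    # comment area type code: 1 = video
        'oid': oid,   # target comment area id
        'mode': 3,    # 0/3: hotness only; 1: hotness + time; 2: time only
        'next': 0,    # hot-order page number; 0 is the first page
        'ps': 20,     # items per page (1-30 for this API)
    }

And for the second table, the original code builds the sub-comment parameters like this: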
    data_2 = {
        # parameters for second-level (sub-)comments
        'type': type,        # comment area type code
        'oid': oid,          # oid of the original video/dynamic
        'ps': ps,            # items per page, at most 20
        'pn': str(page_pn),  # sub-comment page number, passed as a string
        'root': rpid         # rpid of the parent first-level comment
    }
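Since each sub-comment page holds at most 20 items, the number of pages follows from rcount by ceiling division. A hypothetical helper (not in the original repo) that computes it:

    # Hypothetical helper: how many sub-comment pages a root reply needs,
    # given its rcount and a page size ps (capped at 20 by the API).
    def sub_comment_pages(rcount: int, ps: int = 20) -> int:
        return (rcount + ps - 1) // ps  # ceiling division

The complete code below instead uses count // 20 + 2, which over-fetches by a page and relies on its empty-replies check to stop.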

3: The Crawler

We use the requests library to call the API and parse the data it returns:

    for comment in json_data['data']['replies']:
        count = comment['rcount']
        rpid = str(comment['rpid'])
        name = comment['member']['uname']
        sex = comment['member']['sex']
        ctime = comment['ctime']
        # convert the Unix timestamp to Beijing time
        # (beijing_tz = pytz.timezone('Asia/Shanghai'), defined in the complete code)
        dt_object = datetime.datetime.fromtimestamp(ctime, beijing_tz)
        formatted_time = dt_object.strftime('%Y-%m-%d %H:%M:%S') + ' (Beijing time)'
        like = comment['like']
        message = comment['content']['message'].replace('\n', ',')
        # the location field may be absent; fall back to 'unknown'
        location = comment['reply_control'].get('location', 'unknown')
        location = location.replace('IP属地:', '') if location else location  # strip the "IP属地:" prefix the API returns
        current_level = comment['member']['level_info']['current_level']
        mid = str(comment['member']['mid'])
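For reference, the fields read above sit in each element of data.replies roughly like this — an abridged sketch reconstructed from the fields the loop uses, not the full schema (see the API docs for that):

    # Abridged shape of one entry of json_data['data']['replies'],
    # reconstructed from the fields used above (illustrative values).
    comment = {
        'rpid': 123456789,              # id of this reply
        'rcount': 2,                    # number of sub-replies under it
        'ctime': 1700000000,            # Unix timestamp of posting
        'like': 42,                     # like count
        'member': {
            'mid': 1234567,             # user id
            'uname': 'example_user',
            'sex': '保密',
            'level_info': {'current_level': 5},
        },
        'content': {'message': 'the comment text'},
        'reply_control': {'location': 'IP属地:广东'},
    }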

4: Making It Beginner-Friendly

Because grabbing the cookie and oid is the hard part for beginners, the script uses the selenium-wire library to capture your own cookie and the video/dynamic oid automatically:

    last_request = None
    # walk all captured requests, keeping the most recent one that matches
    for request in driver.requests:
        if "main?oid=" in request.url and request.response:
            last_request = request
    # check whether a matching request was found
    if last_request:
        print("URL:", last_request.url)
        # extract oid and type from the URL's query string
        parsed_url = urlparse(last_request.url)
        query_params = parse_qs(parsed_url.query)
        oid = query_params.get("oid", [None])[0]
        type = query_params.get("type", [None])[0]
        print("OID:", oid)
        print("type:", type)

5: Complete Code

    from seleniumwire import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    import time
    from urllib.parse import urlparse, parse_qs
    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry
    import csv
    import pytz
    import datetime
    from fake_useragent import UserAgent
    import random

    options = {
        'ignore_http_methods': ['GET', 'POST'],  # NB: selenium-wire *ignores* the methods listed here; this dict is defined but never passed to the driver below
        'custom_headers': {
            'X-Requested-With': 'XMLHttpRequest'  # intended to filter for XHR requests
        }
    }
    # Configure Selenium
    chrome_options = Options()
    chrome_service = Service("<absolute path to this folder>\\venv\\chrome-win64\\chromedriver.exe")
    driver = webdriver.Chrome(service=chrome_service, options=chrome_options)
    # Open the target page; replace with whatever URL you want to crawl
    driver.get("https://www.bilibili.com/video/BV1WM4m1Q75H/")
    login_div = driver.find_element(By.XPATH, "//div[contains(@class, 'right-entry__outside') and contains(@class, 'go-login-btn')]")
    login_div.click()
    time.sleep(5)
    # Adjust the selectors below to match the login form (the Chinese placeholders match Bilibili's live page)
    username_input = driver.find_element(By.XPATH, "//input[@placeholder='请输入账号']")
    password_input = driver.find_element(By.XPATH, "//input[@placeholder='请输入密码']")
    login_button = driver.find_element(By.XPATH, "//div[contains(@class,'btn_primary') and contains(text(),'登录')]")
    # First string is your account, second is your password
    username_input.send_keys("")
    password_input.send_keys("")
    # Click the login button
    time.sleep(5)
    # login_button.click()
    # Wait a few seconds to make sure login succeeded
    driver.implicitly_wait(10)  # replace with whatever wait you need
    # Wait for the page to finish loading (adjust, or use a smarter explicit wait)
    time.sleep(5)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Walk the captured network requests, keeping the last one that matches
    last_request = None
    for request in driver.requests:
        if "main?oid=" in request.url and request.response:
            last_request = request
    # If a matching request was found, pull oid and type out of its URL
    if last_request:
        print("URL:", last_request.url)
        parsed_url = urlparse(last_request.url)
        query_params = parse_qs(parsed_url.query)
        oid = query_params.get("oid", [None])[0]
        type = query_params.get("type", [None])[0]
        print("OID:", oid)
        print("type:", type)
    # Collect all cookies from the WebDriver
    all_cookies = driver.get_cookies()
    cookies_dict = {cookie['name']: cookie['value'] for cookie in all_cookies}
    cookies_str = '; '.join([f"{name}={value}" for name, value in cookies_dict.items()])
    # Pull bili_jct (the CSRF token) and SESSDATA out of the cookies
    bili_jct = cookies_dict.get('bili_jct', '')
    print("bili_jct:", bili_jct)
    sessdata = cookies_dict.get('SESSDATA', '')
    print("SESSDATA:", sessdata)
    # Hold on to the captured response, then shut the browser down
    response = last_request.response
    driver.quit()

    MAX_RETRIES = 5      # retry limit
    RETRY_INTERVAL = 10  # seconds between retries
    file_path_1 = 'comments/主评论_1.1.csv'
    file_path_2 = 'comments/二级评论_1.2.csv'
    beijing_tz = pytz.timezone('Asia/Shanghai')  # for converting timestamps to Beijing time
    ua = UserAgent()  # random User-Agent generator
    ps = 20
    down = 1  # first page to crawl
    up = 30   # last page to crawl
    one_comments = []
    all_comments = []    # buffer for top-level comment rows; if you only want top-level comments, comment out the sub-comment crawl (the `if count != 0` block) below
    all_2_comments = []  # buffer for second-level comment rows
    comments_current = []
    comments_current_2 = []
    # Write the CSV headers
    with open(file_path_1, mode='a', newline='', encoding='utf-8-sig') as file:
        writer = csv.writer(file)
        writer.writerow(['nickname', 'sex', 'time', 'likes', 'comment', 'IP location', 'sub-comment count', 'level', 'uid', 'rpid'])
        writer.writerows(all_comments)
    with open(file_path_2, mode='a', newline='', encoding='utf-8-sig') as file:  # second-level comments
        writer = csv.writer(file)
        writer.writerow(['nickname', 'sex', 'time', 'likes', 'comment', 'IP location', 'sub-comment count (same count = same parent comment)', 'level', 'uid', 'rpid'])
        writer.writerows(all_2_comments)

    with requests.Session() as session:
        retries = Retry(total=3,             # maximum number of retries
                        backoff_factor=0.1,  # each wait is multiplied by this factor
                        status_forcelist=[500, 502, 503, 504])
        session.mount('https://', HTTPAdapter(max_retries=retries))  # the original never mounted this; attach it so the retry policy takes effect
        for page in range(down, up + 1):
            for retry in range(MAX_RETRIES):
                try:
                    headers = {
                        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',  # or ua.random for a random UA (go easy on Bilibili)
                        'Cookie': cookies_str,
                        'SESSDATA': sessdata,
                        'csrf': bili_jct,
                    }
                    url = 'https://api.bilibili.com/x/v2/reply?'  # normal API: caps out around 8k comments
                    url_long = 'https://api.bilibili.com/x/v2/reply/main'  # lazy-load API: no hard limit in theory
                    url_reply = 'https://api.bilibili.com/x/v2/reply/reply'  # sub-comment API
                    # Example (openable in a browser): https://api.bilibili.com/x/v2/reply/main?next=1&type=1&oid=544588138&mode=3
                    data = {
                        'next': str(page),  # page number, as a string; the lazy-load API's analogue of pn
                        'type': type,       # comment area type: 1 = video, 11 = personal dynamic, 17 = shared dynamic
                        'oid': oid,         # target id: the av number for videos, the address-bar id for text dynamics
                        'ps': ps,           # items per page: at most 20 (at most 30 for the lazy-load API)
                        'mode': '3'         # sort order: 0/3 hotness only, 1 hotness + time, 2 time only
                    }
                    proxies = {
                        # "http": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password, "proxy": tunnel},
                        # "https": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password, "proxy": tunnel}
                        # free proxy source: https://www.kuaidaili.com/free/inha/
                    }
                    prep = session.prepare_request(requests.Request('GET', url_long, params=data, headers=headers))
                    print(prep.url)
                    response = session.get(url_long, params=data, headers=headers)
                    # Status code 200 means success
                    if response.status_code == 200:
                        json_data = response.json()  # parse the JSON payload
                        if 'data' in json_data and 'replies' in json_data['data']:  # the core extraction logic
                            for comment in json_data['data']['replies']:
                                # one_comments.clear()
                                count = comment['rcount']
                                rpid = str(comment['rpid'])
                                name = comment['member']['uname']
                                sex = comment['member']['sex']
                                ctime = comment['ctime']
                                dt_object = datetime.datetime.fromtimestamp(ctime, beijing_tz)  # the original used UTC but labeled it Beijing time; beijing_tz fixes that
                                formatted_time = dt_object.strftime('%Y-%m-%d %H:%M:%S') + ' (Beijing time)'
                                like = comment['like']
                                message = comment['content']['message'].replace('\n', ',')
                                # the location field may be absent; fall back to 'unknown'
                                location = comment['reply_control'].get('location', 'unknown')
                                location = location.replace('IP属地:', '') if location else location  # strip the "IP属地:" prefix the API returns
                                current_level = comment['member']['level_info']['current_level']
                                mid = str(comment['member']['mid'])
                                # Append the extracted fields to the buffers
                                all_comments.append([name, sex, formatted_time, like, message, location, count, current_level, mid, rpid])
                                comments_current.append([name, sex, formatted_time, like, message, location, count, current_level, mid, rpid])
                                with open(file_path_1, mode='a', newline='', encoding='utf-8-sig') as file:
                                    writer = csv.writer(file)
                                    writer.writerows(all_comments)
                                    all_comments.clear()  # reset the buffer after every write
                                if count != 0:
                                    print(f"Page {page} has sub-comments: this reply has {count} sub-comments in total")
                                    total_pages = ((count // 20) + 2) if count > 0 else 0
                                    for page_pn in range(total_pages):
                                        data_2 = {
                                            # parameters for second-level comments
                                            'type': type,        # comment area type
                                            'oid': oid,          # target id
                                            'ps': ps,            # items per page, at most 20
                                            'pn': str(page_pn),  # sub-comment page number, as a string
                                            'root': rpid         # rpid of the parent first-level comment
                                        }
                                        if page_pn == 0:
                                            continue  # pn starts at 1
                                        response = session.get(url_reply, params=data_2, headers=headers, proxies=proxies)
                                        prep = session.prepare_request(requests.Request('GET', url_reply, params=data_2, headers=headers))
                                        print(prep.url)
                                        if response.status_code == 200:
                                            json_data = response.json()  # parse the JSON payload
                                            if 'data' in json_data and 'replies' in json_data['data']:
                                                if not json_data['data']['replies']:  # empty replies: skip this page
                                                    print("replies is empty on this page; no comments")
                                                    continue
                                                for comment in json_data['data']['replies']:
                                                    rpid = str(comment['rpid'])
                                                    name = comment['member']['uname']
                                                    sex = comment['member']['sex']
                                                    ctime = comment['ctime']
                                                    dt_object = datetime.datetime.fromtimestamp(ctime, beijing_tz)
                                                    formatted_time = dt_object.strftime('%Y-%m-%d %H:%M:%S') + ' (Beijing time)'
                                                    like = comment['like']
                                                    message = comment['content']['message'].replace('\n', ',')
                                                    # the location field may be absent; fall back to 'unknown'
                                                    location = comment['reply_control'].get('location', 'unknown')
                                                    location = location.replace('IP属地:', '') if location else location
                                                    current_level = comment['member']['level_info']['current_level']
                                                    mid = str(comment['member']['mid'])
                                                    all_2_comments.append([name, sex, formatted_time, like, message, location, count, current_level, mid, rpid])
                                                    comments_current_2.append([name, sex, formatted_time, like, message, location, count, current_level, mid, rpid])
                                                    with open(file_path_2, mode='a', newline='', encoding='utf-8-sig') as file:  # second-level comments
                                                        writer = csv.writer(file)
                                                        writer.writerows(all_2_comments)
                                                        all_2_comments.clear()
                                            else:
                                                # print(f"JSON response for sub-page {page_pn + 1} is missing 'data' or 'replies'; skipping.")
                                                print(f"Comment {page_pn + 1} on page {page} has no sub-comments.")
                                        else:
                                            print(f"Failed to fetch sub-page {page_pn + 1}. Status code: {response.status_code}")
                                        random_number = random.uniform(0.2, 0.3)
                                        time.sleep(random_number)
                            print(f"Crawled page {page}. Status code: {response.status_code}")
                        else:
                            print(f"JSON response for page {page} is missing 'data' or 'replies'; skipping this page.")
                    else:
                        print(f"Failed to fetch page {page}. Status code: {response.status_code}; investigate before retrying.")
                    random_number = random.uniform(0.2, 0.3)
                    print(random_number)
                    time.sleep(random_number)
                    break
                except requests.exceptions.RequestException as e:
                    print(f"Connection failed: {e}")
                    if retry < MAX_RETRIES - 1:
                        print(f"Retrying ({MAX_RETRIES - retry - 1} attempts left)...")
                        time.sleep(RETRY_INTERVAL)  # wait a bit before retrying
                    else:
                        raise  # out of retries; re-raise the original exception
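To run this, you need Chrome plus a chromedriver matching your Chrome version at the path passed to Service, and the Python dependencies. Assuming current PyPI package names (worth double-checking), something like:

    pip install selenium-wire requests pytz fake-useragent

Note that selenium-wire pulls in selenium itself. Fill in your account and password in the send_keys calls (or log in manually during the sleep window), and the script writes its results to comments/主评论_1.1.csv and comments/二级评论_1.2.csv.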
