
Scraping Bilibili Comments and Visualizing Them

Scraping a Bilibili user's comments by uid

The resulting CSV table and the visualizations:

CSV table screenshot:

2D and 3D bar charts:

Libraries to install with pip in advance

A library for capturing asynchronously loaded requests:

pip install selenium (to pin a version: pip install selenium==<version>; note that the driver.requests attribute used later in this article is provided by the selenium-wire extension, so pip install selenium-wire is needed as well)

Two visualization libraries (either one will do):

pip install matplotlib (pin a version the same way as above)

pip install pyecharts (pin a version the same way as above)

Two ways to collect Bilibili comments

Because the Bilibili API parameters have changed, I could not find a workable way to paginate the crawl, so I used the following two approaches.

Saving multiple URLs by hand

Open the developer tools and find the request circled in red in the figure.

Open it and look through its JSON for the replies field, as shown below:

Each URL returns 20 comments.

Each reply contains the comment text (content), the commenter's IP region (ip), the username (uname), and other fields.
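To make that layout concrete, here is a minimal sketch of pulling those fields out of one reply object. The dict below is hand-made to mirror the fields named above, not a real API response.

```python
# Hand-made sample mirroring the reply fields described above (not a real response).
sample_reply = {
    "member": {"uname": "demo_user", "sex": "保密", "sign": ""},
    "reply_control": {"location": "IP属地:广东"},
    "ctime": 1693984070,
    "like": 3,
    "content": {"message": "a test comment"},
}

uname = sample_reply["member"]["uname"]
# the location field carries an "IP属地:" prefix, stripped the same way as later in the article
ip = sample_reply["reply_control"]["location"].replace("IP属地:", "")
content = sample_reply["content"]["message"]
print(uname, ip, content)  # demo_user 广东 a test comment
```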

Keep refreshing to capture several of these URLs and save them into a list, like this:

ulist = ["https://api.bilibili.com/x/v2/reply/wbi/main?oid=320392432&type=1&mode=3&pagination_str=%7B%2"
         "2offset%22:%22%7B%5C%22type%5C%22:1,%5C%22direction%5C%22:1,%5C%22session_id%5C%22:%5C%221734639"
         "697397073%5C%22,%5C%22data%5C%22:%7B%7D%7D%22%7D&plat=1&web_location=1315875&w_rid=0b96518e2f520"
         "2e2b4036fb3d596d4ff&wts=1693984070",
         "https://api.bilibili.com/x/v2/reply/wbi/main?oid=320392432&type=1&mode=3&pagination_str=%7B%22offset"
         "%22:%22%7B%5C%22type%5C%22:1,%5C%22direction%5C%22:1,%5C%22session_id%5C%22:%5C%221734639697397073%5C%2"
         "2,%5C%22data%5C%22:%7B%7D%7D%22%7D&plat=1&web_location=1315875&w_rid=27358d1f64a9e52b91756210beee635d&w"
         "ts=1693984100"]
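These URLs are long mainly because of the percent-encoded pagination_str parameter. As a quick sanity check, the standard library can split out the query parameters; the URL below is shortened to its stable parts for illustration.

```python
from urllib.parse import urlparse, parse_qs

# shortened example URL keeping only the stable query parameters
url = ("https://api.bilibili.com/x/v2/reply/wbi/main"
       "?oid=320392432&type=1&mode=3&plat=1&web_location=1315875")
params = parse_qs(urlparse(url).query)
print(params["oid"])   # ['320392432'] -- the video's oid
print(params["mode"])  # ['3']
```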

The drawback of this approach is that the URLs must be found by hand.

Capturing the video's XHR requests via asynchronous loading:

First, obtain a Bilibili cookies file.

Then read the cookies file:

import json

ListCookies = []
with open('ACookies.txt', 'r') as fw:
    for line in fw:
        cookie = json.loads(line.strip())  # parse one JSON cookie per line
        ListCookies.append(cookie)
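The loop above assumes ACookies.txt holds one JSON cookie object per line. A sketch of producing such a file; in practice the list would come from selenium's driver.get_cookies() after logging in manually, and the cookie names and values below are placeholders.

```python
import json

# placeholder cookies; a real list would come from driver.get_cookies()
cookies_to_save = [
    {"name": "SESSDATA", "value": "xxx", "domain": ".bilibili.com"},
    {"name": "bili_jct", "value": "yyy", "domain": ".bilibili.com"},
]
with open("ACookies.txt", "w") as fw:
    for c in cookies_to_save:
        fw.write(json.dumps(c) + "\n")  # one JSON object per line

# read it back the same way the article does
with open("ACookies.txt") as f:
    loaded = [json.loads(line.strip()) for line in f]
print(len(loaded))  # 2
```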

Use the saved cookies to make one request to Bilibili:

from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver  # driver.requests (used below) comes from selenium-wire
import time

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=chrome_options)
# open the site
driver.get('https://www.bilibili.com/')
print("Resolving the comment URLs...")
# log in to Bilibili with the cookies
for cookie in ListCookies:
    driver.add_cookie(cookie)
driver.get('https://www.bilibili.com/video/BV1fF411k7Vf/?spm_id_from=33'
           '3.1007.tianma.4-2-12.click&vd_source=1f033f5a233d6a47a02edcf7b98db3e8')
# wait a while so the page finishes loading (an explicit wait condition could be used instead)
time.sleep(20)

After that, only three functions are needed.

Matching the target URL with a regular expression:

import re

def target_url():
    url_pattern = r"https?://[^\s/$.?#].[^\s]*\/main\?[^\s]*"
    xhr_requests = driver.requests  # requests captured by selenium-wire
    for request in xhr_requests:
        if re.search(url_pattern, request.url):
            UrlList.append(request.url)
            print("Match the correct URL!")
            print(request.url)
            print(request.response.status_code)
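The pattern can be checked offline against a captured URL. A match requires the literal /main? segment, so unrelated requests are filtered out:

```python
import re

url_pattern = r"https?://[^\s/$.?#].[^\s]*\/main\?[^\s]*"
hit = "https://api.bilibili.com/x/v2/reply/wbi/main?oid=320392432&type=1&mode=3"
miss = "https://www.bilibili.com/video/BV1fF411k7Vf/"
print(bool(re.search(url_pattern, hit)))   # True  -- comment API URL
print(bool(re.search(url_pattern, miss)))  # False -- ordinary video page URL
```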

Timestamp conversion:

def trans_date(v_timestamp):
    timeArray = time.localtime(v_timestamp)
    otherStyleTime = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
    return otherStyleTime
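For example, feeding it a ctime value from the URLs above gives a September 2023 date; the exact hour depends on the local timezone, so only the shape of the string is fixed:

```python
import time

def trans_date(v_timestamp):
    """Convert a 10-digit timestamp to a time string."""
    timeArray = time.localtime(v_timestamp)
    return time.strftime("%Y-%m-%d %H:%M:%S", timeArray)

s = trans_date(1693984070)
print(s)  # e.g. "2023-09-06 HH:MM:SS"; hour varies with the local timezone
```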

Extracting the comment JSON and saving it in CSV format:

def write_to_csv():
    # collect the required data
    print(UrlList)
    comments_data = [
        ['uname', 'sex', 'sign', 'ip', 'time', 'like', 'content']
    ]
    for comment in UrlList:
        comment_response = requests.get(comment, cookies=cookies)
        if comment_response.status_code == 200:
            json1 = comment_response.json()
            replies = json1['data']['replies']
            for reply in replies:
                uname = reply['member']['uname']
                sex = reply['member']['sex']
                sign = reply['member']['sign']
                original_ip = reply['reply_control']['location']
                substring_to_remove = "IP属地:"
                ip = original_ip.replace(substring_to_remove, '')
                time1 = trans_date(reply['ctime'])
                like = reply['like']
                content = reply['content']['message']
                comments_data.append([uname, sex, sign, ip, time1, like, content])
        else:
            print("wrong")
    # path of the CSV file to write
    csv_file_path = "comments.csv"
    # build a DataFrame
    df = pd.DataFrame(comments_data)
    # write the DataFrame to a CSV file
    df.to_csv(csv_file_path, index=False, header=False, encoding="utf-8")
    print(f"Comments have been written to {csv_file_path}")

Note that the GET request here must pass the cookies parameter; without it, the returned JSON does not contain the IP region.

The drawback of this approach is that collecting the URLs is not fast.
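One more defensive note: in some responses the data.replies field can apparently be null (for example on a page with no comments), which would crash the inner for loop above. A hedged guard, shown here on a simulated response body:

```python
# simulated JSON body with a null replies field (assumption: the API can return this)
json1 = {"code": 0, "data": {"replies": None}}

replies = (json1.get("data") or {}).get("replies") or []  # fall back to an empty list
rows = [r["content"]["message"] for r in replies]
print(len(rows))  # 0 rows, and no crash
```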

Reading the CSV file and visualizing it

2D bar chart:

Reading the CSV file:

# read the CSV file into a pandas DataFrame
df = pd.read_csv('comments.csv')
# print(df)
df = df.groupby(['ip', '性别'])['点赞数'].sum().reset_index()
# print(df)

Stacking requires every IP address to have a row for all three genders, so the data needs some pandas preprocessing: find the IPs whose gender rows are incomplete, add the missing genders with a like count of 0, and re-aggregate. pyecharts can then stack one series per gender. The code is as follows:
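The fill-in step can be seen on a toy DataFrame (the values below are invented for illustration): after adding zero rows and re-aggregating, each IP has all three gender rows, and the zero rows are absorbed into the real counts.

```python
import pandas as pd

# invented toy data: each IP starts with only one gender row
df = pd.DataFrame([
    {"ip": "广东", "性别": "男", "点赞数": 5},
    {"ip": "北京", "性别": "女", "点赞数": 2},
])
# add zero rows for every gender under every IP
fill = [{"ip": ip, "性别": g, "点赞数": 0}
        for ip in df["ip"].unique()
        for g in ["男", "女", "保密"]]
df = pd.concat([df, pd.DataFrame(fill)])
# merge duplicates: the real counts absorb the zero rows
df = df.groupby(["ip", "性别"])["点赞数"].sum().reset_index()
print(len(df))  # 6 rows: 2 IPs x 3 genders
```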

# number of distinct genders per IP address
gender_counts = df.groupby('ip')['性别'].nunique().reset_index()
print(gender_counts)
# add the missing genders for each IP, with like counts of 0
missing_genders = gender_counts[gender_counts['性别'] < 3]
for index, row in missing_genders.iterrows():
    ip = row['ip']
    missing_genders_data = [
        {"ip": ip, "性别": "男", "点赞数": 0},
        {"ip": ip, "性别": "女", "点赞数": 0},
        {"ip": ip, "性别": "保密", "点赞数": 0}
    ]
    df = pd.concat([df, pd.DataFrame(missing_genders_data)])
# re-aggregate so rows with the same IP and gender are merged
df = df.groupby(['ip', '性别'])['点赞数'].sum().reset_index()
# convert the DataFrame to a list of dicts
result = df.to_dict(orient='records')
# print the result
print(result)
bar = Bar()
# extract the regions and genders from the data
cities = pd.Series([i['ip'] for i in result]).drop_duplicates().tolist()
genders = pd.Series([i['性别'] for i in result]).drop_duplicates().tolist()
print(cities)
print(genders)
bar.add_xaxis(cities)  # x axis: regions
# one stacked series per gender
for gender in genders:
    likes = [item['点赞数'] for item in result if item['性别'] == gender]
    bar.add_yaxis(gender, likes, stack="stack")  # y axis: like counts, stacked
# global options
bar.set_global_opts(
    title_opts=opts.TitleOpts(title="各城市各性别点赞数堆叠条形图"),
    xaxis_opts=opts.AxisOpts(type_="category"),
    yaxis_opts=opts.AxisOpts(type_="value"),
)
# render the chart to an HTML file
bar.render("stacked_bar_chart.html")

For 3D, pyecharts can turn the same data directly into a 3D bar chart; the code is as follows:

bar3d = (
    Bar3D()
    .add(
        '',
        df[['ip', '性别', '点赞数']].values.tolist(),
        xaxis3d_opts=opts.Axis3DOpts(df['ip'].unique().tolist(), type_="category"),
        yaxis3d_opts=opts.Axis3DOpts(df['性别'].unique().tolist(), type_="category"),
        zaxis3d_opts=opts.Axis3DOpts(type_="value"),
    )
    .set_global_opts(
        visualmap_opts=opts.VisualMapOpts(max_=df['点赞数'].max()),
        title_opts=opts.TitleOpts(title="IP对应的性别和点赞的3D条形图"),
    )
)
# render the chart
bar3d.render("bar3d_chart.html")

Full source code:

Parsing the comments and saving them to CSV

import requests
import time
import pandas as pd

# list of comment URLs to request:
UrlList = [
]
# cookies needed when requesting the comment URLs
cookies = {
}

# convert the comment timestamp
def trans_date(v_timestamp):
    """Convert a 10-digit timestamp to a time string"""
    timeArray = time.localtime(v_timestamp)
    otherStyleTime = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
    return otherStyleTime

# write the comments to a CSV file
def write_to_csv():
    # collect the required data
    comments_data = [
        ['用户名', '性别', '个性签名', 'ip', '评论时间', '点赞数', '内容']
    ]
    for comment in UrlList:
        comment_response = requests.get(comment, cookies=cookies)
        if comment_response.status_code == 200:
            json1 = comment_response.json()
            replies = json1['data']['replies']
            for reply in replies:
                uname = reply['member']['uname']
                sex = reply['member']['sex']
                sign = reply['member']['sign']
                original_ip = reply['reply_control']['location']
                substring_to_remove = "IP属地:"
                ip = original_ip.replace(substring_to_remove, '')
                time1 = trans_date(reply['ctime'])
                like = reply['like']
                content = reply['content']['message']
                # data = {'uname': uname, 'sex': sex, 'sign': sign, 'content': content}
                comments_data.append([uname, sex, sign, ip, time1, like, content])
        else:
            print("wrong")
        time.sleep(5)
    # path of the CSV file to write
    csv_file_path = "comments.csv"
    # build a DataFrame
    df = pd.DataFrame(comments_data)
    # write the DataFrame to a CSV file
    df.to_csv(csv_file_path, index=False, header=False, encoding="utf-8")
    print(f"Comments have been written to {csv_file_path}")

res = requests.get("https://api.bilibili.com/x/v2/reply/wbi/main?oid=320392432&type=1&mode=3&pagination_str=%7B%2"
                   "2offset%22:%22%7B%5C%22type%5C%22:1,%5C%22direction%5C%22:1,%5C%22session_id%5C%22:%5C%221734639"
                   "697397073%5C%22,%5C%22data%5C%22:%7B%7D%7D%22%7D&plat=1&web_location=1315875&w_rid=0b96518e2f520"
                   "2e2b4036fb3d596d4ff&wts=1693984070", cookies=cookies)
print(res.status_code)
print("Initial request finished, writing CSV data...")

if __name__ == '__main__':
    print(UrlList)
    write_to_csv()

Processing and visualizing the CSV data

import pandas as pd
from pyecharts.charts import Bar
from pyecharts import options as opts
from pyecharts.charts import Bar3D

# read the CSV file into a pandas DataFrame
df = pd.read_csv('comments.csv')
# print(df)
df = df.groupby(['ip', '性别'])['点赞数'].sum().reset_index()
# print(df)
# number of distinct genders per IP address
gender_counts = df.groupby('ip')['性别'].nunique().reset_index()
print(gender_counts)
# add the missing genders for each IP, with like counts of 0
missing_genders = gender_counts[gender_counts['性别'] < 3]
for index, row in missing_genders.iterrows():
    ip = row['ip']
    missing_genders_data = [
        {"ip": ip, "性别": "男", "点赞数": 0},
        {"ip": ip, "性别": "女", "点赞数": 0},
        {"ip": ip, "性别": "保密", "点赞数": 0}
    ]
    df = pd.concat([df, pd.DataFrame(missing_genders_data)])
# re-aggregate so rows with the same IP and gender are merged
df = df.groupby(['ip', '性别'])['点赞数'].sum().reset_index()
# convert the DataFrame to a list of dicts
result = df.to_dict(orient='records')
# print the result
print(result)
bar = Bar()
# extract the regions and genders from the data
cities = pd.Series([i['ip'] for i in result]).drop_duplicates().tolist()
genders = pd.Series([i['性别'] for i in result]).drop_duplicates().tolist()
print(cities)
print(genders)
bar.add_xaxis(cities)  # x axis: regions
# one stacked series per gender
for gender in genders:
    likes = [item['点赞数'] for item in result if item['性别'] == gender]
    bar.add_yaxis(gender, likes, stack="stack")  # y axis: like counts, stacked
# global options
bar.set_global_opts(
    title_opts=opts.TitleOpts(title="各城市各性别点赞数堆叠条形图"),
    xaxis_opts=opts.AxisOpts(type_="category"),
    yaxis_opts=opts.AxisOpts(type_="value"),
)
# render the chart to an HTML file
bar.render("stacked_bar_chart.html")

# visualize the same data as a 3D bar chart
bar3d = (
    Bar3D()
    .add(
        '',
        df[['ip', '性别', '点赞数']].values.tolist(),
        xaxis3d_opts=opts.Axis3DOpts(df['ip'].unique().tolist(), type_="category"),
        yaxis3d_opts=opts.Axis3DOpts(df['性别'].unique().tolist(), type_="category"),
        zaxis3d_opts=opts.Axis3DOpts(type_="value"),
    )
    .set_global_opts(
        visualmap_opts=opts.VisualMapOpts(max_=df['点赞数'].max()),
        title_opts=opts.TitleOpts(title="IP对应的性别和点赞的3D条形图"),
    )
)
# render the chart
bar3d.render("bar3d_chart.html")

If you find any mistakes or room for improvement, feel free to point them out!
