
Research on Public-Opinion / Hot-Topic Clustering Algorithms (Part 1): Preparing Public-Opinion / Hot-Topic Data with a Python Crawler

Automated collection of public-opinion information with a Python crawler

Contents

1. Fetching the Page Source with Requests

1.1 Workflow Analysis

1.2 Code Walkthrough

2. Parsing the Page and Storing the Data

2.1 Parsing the Response

2.2 Code Walkthrough

3. Complete Code


Prerequisites: basic Python and basic HTML.

A crawler is a program that automatically harvests information from the Internet, pulling out the data that is valuable to us.

Goal: collect hot-topic data from the Society (社会) section of Weibo.


1. Fetching the Page Source with Requests

1.1 Workflow Analysis

First open the page you want to crawl, press F12 to bring up the developer console, and go through the list of network requests, previewing each one until you find the request that actually carries the data you need. From it, note the URL of the page (Headers tab), the parameter list (Payload tab), and your Cookie and User-Agent (Headers tab).

  • url: the page address. The part after the ? is the query string. By tweaking these parameters (deleting or changing them and re-testing) you can work out which are fixed and which are variable; fixed parameters can be written straight into the url, while variable ones should be passed through the params argument (see the sketch after this list).
  • Parameter list: the parameter settings of the current page.
  • Cookie and User-Agent: browser identification, used in the requests call to imitate normal browser activity.
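
A minimal sketch of this split, using the hottimeline endpoint and the payload values shown later in this article (the Cookie and User-Agent placeholders are assumptions you must fill in from DevTools):

import requests

# Fixed part of the address: everything that never changes can live in the URL itself.
base_url = 'https://weibo.com/ajax/feed/hottimeline'

# Variable parameters are passed separately so they are easy to change per request.
params = {
    'group_id': '1028034188',                       # category id of the Society section
    'containerid': '102803_ctg1_4188_-_ctg1_4188',
    'max_id': 0,                                    # paging cursor, changes from page to page
    'count': 20,
    'extparam': 'discover|new_feed'
}

headers = {
    'Cookie': '...',        # copy from the Headers tab in DevTools
    'User-Agent': '...'     # copy from the Headers tab in DevTools
}

resp = requests.get(base_url, headers=headers, params=params)
print(resp.url)  # requests appends the params dict as the ?query string for you

Because requests encodes the params dict into the query string automatically, paging later only means changing max_id.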

For example, suppose we now want to crawl Weibo's hot-topic data; the request details captured this way are used in the code below.

1.2 Code Walkthrough

Once this information has been collected, we can write the requests call that fetches the page's source data, as follows:

import requests
import csv
import numpy as np
import os
import time
from datetime import datetime

articalUrl = 'https://weibo.com/ajax/feed/hottimeline'
headers = {
    'Cookie': '...',
    'User-Agent': '...'
}
# Parameters for the Society hot-topic page
params = {
    'group_id': '1028034188',
    'containerid': '102803_ctg1_4188_-_ctg1_4188',
    'max_id': 0,
    'count': 20,
    'extparam': 'discover|new_feed'
}
# response holds the raw data returned by the GET request
response = requests.get(articalUrl, headers=headers, params=params)
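
Before parsing anything, it is worth a quick sanity check on what came back. A minimal sketch, continuing from the snippet above and assuming the endpoint returns JSON with a top-level statuses list (which the next section relies on):

if response.status_code == 200:
    data = response.json()
    print(list(data.keys()))              # expect 'statuses' among the top-level keys
    print(len(data.get('statuses', [])))  # number of posts returned for this page
else:
    print('Request failed with status code', response.status_code)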

2. Parsing the Page and Storing the Data

2.1 Parsing the Response

Looking at the returned JSON, every field we need sits inside the statuses group, so we simply keep slicing out the wanted values and writing them to the target file:
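
For orientation, a single element of statuses looks roughly like the sketch below. The field names are the ones used by the parsing code in 2.2; the values are illustrative placeholders only, not real data:

one_status = {
    'id': 0,                                          # numeric post id (placeholder)
    'mblogid': '...',                                 # short id used in the post URL
    'attitudes_count': 0,                             # likes
    'comments_count': 0,                              # comments
    'reposts_count': 0,                               # reposts
    'region_name': '发布于 ...',                       # region, prefixed with "发布于 "
    'text_raw': '...',                                # raw post text
    'textLength': 0,                                  # length of the text
    'created_at': 'Mon Jan 01 12:00:00 +0800 2024',   # English-locale timestamp format
    'user': {
        'id': 0,
        'screen_name': '...',
        'avatar_large': 'https://...'
    }
}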

2.2 Code Walkthrough

def get_data(url, params):
    headers = {
        'Cookie': '...',
        'User-Agent': '...'
    }
    response = requests.get(url, headers=headers, params=params)
    if response.status_code == 200:  # status code 200 means the GET request succeeded
        return response.json()['statuses']  # extract the contents of statuses directly
    else:
        print('出错了')
        return None

# slice the wanted fields out of each post
def parse_json(response):
    for artice in response:
        id = artice['id']
        LikeNum = artice['attitudes_count']
        commentsLen = artice['comments_count']
        reposts_count = artice['reposts_count']
        try:
            region = artice['region_name'].replace('发布于 ', '')
        except:
            region = '无'
        content = artice['text_raw']
        contentLen = artice['textLength']
        created_at = datetime.strptime(artice['created_at'], '%a %b %d %H:%M:%S %z %Y').strftime('%Y-%m-%d %H:%M:%S')
        try:
            detailUrl = 'https://weibo.com/' + str(artice['id']) + '/' + str(artice['mblogid'])
        except:
            detailUrl = '无'
        authorName = artice['user']['screen_name']
        authorAvatar = artice['user']['avatar_large']
        authorDetail = 'https://weibo.com/u/' + str(artice['user']['id'])
        writerRow([
            # id,
            # LikeNum,
            # commentsLen,
            # reposts_count,
            # region,
            # content,
            # contentLen,
            # created_at,
            # detailUrl,
            # authorAvatar,
            # authorDetail
            content  # only the post text is written here; uncomment the fields above as needed
        ])
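
The created_at conversion deserves a note: Weibo returns timestamps in the English-locale format '%a %b %d %H:%M:%S %z %Y', which strptime parses into a datetime that strftime then reformats. A minimal standalone sketch (the timestamp string is an illustrative placeholder):

from datetime import datetime

raw = 'Mon Jan 01 12:00:00 +0800 2024'  # placeholder in the created_at format
dt = datetime.strptime(raw, '%a %b %d %H:%M:%S %z %Y')
print(dt.strftime('%Y-%m-%d %H:%M:%S'))  # 2024-01-01 12:00:00

Note that %a and %b are parsed according to the system locale, so on a non-English locale you may need to set LC_TIME (for example via locale.setlocale) before calling strptime.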

3. Complete Code

import csv
import os
import requests
from datetime import datetime

file_path = './articaleData'  # output CSV file

def init():
    # create the output file with a header row if it does not exist yet
    if not os.path.exists(file_path):
        with open(file_path, 'w', encoding='utf-8', newline='') as csvFile:
            writer = csv.writer(csvFile)
            writer.writerow([
                'id',
                'likeNum',
                'commentsLen',
                'reposts_count',
                'region',
                'content',
                'contentLen',
                'created_at',
                'detailurl',
                'authorAvatar',
                'authorName',
                'authorDetail',
            ])

def writerRow(row):
    # append one row to the output file
    with open(file_path, 'a', encoding='utf-8', newline='') as csvFile:
        writer = csv.writer(csvFile)
        writer.writerow(row)

def get_data(url, params):
    headers = {
        'Cookie': '...',
        'User-Agent': '...'
    }
    response = requests.get(url, headers=headers, params=params)
    if response.status_code == 200:
        return response.json()['statuses']
    else:
        print('出错了')
        return None

def parse_json(response):
    for artice in response:
        id = artice['id']
        LikeNum = artice['attitudes_count']
        commentsLen = artice['comments_count']
        reposts_count = artice['reposts_count']
        try:
            region = artice['region_name'].replace('发布于 ', '')
        except:
            region = '无'
        content = artice['text_raw']
        contentLen = artice['textLength']
        created_at = datetime.strptime(artice['created_at'], '%a %b %d %H:%M:%S %z %Y').strftime('%Y-%m-%d %H:%M:%S')
        try:
            detailUrl = 'https://weibo.com/' + str(artice['id']) + '/' + str(artice['mblogid'])
        except:
            detailUrl = '无'
        authorName = artice['user']['screen_name']
        authorAvatar = artice['user']['avatar_large']
        authorDetail = 'https://weibo.com/u/' + str(artice['user']['id'])
        writerRow([
            # id,
            # LikeNum,
            # commentsLen,
            # reposts_count,
            # region,
            # content,
            # contentLen,
            # created_at,
            # detailUrl,
            # authorAvatar,
            # authorDetail
            content  # only the post text is written here; uncomment the fields above as needed
        ])

def start(pageNum=10):
    articalUrl = 'https://weibo.com/ajax/feed/hottimeline'
    init()
    for page in range(0, pageNum):
        print('正在爬取的类型: %s 中的第%s页文章数据' % ('社会', page + 1))
        params = {
            'group_id': '1028034188',
            'containerid': '102803_ctg1_4188_-_ctg1_4188',
            'max_id': page,
            'count': 20,
            'extparam': 'discover|new_feed'
        }
        response = get_data(articalUrl, params)
        if response:  # skip the page if the request failed
            parse_json(response)
    print('爬取完毕!')

if __name__ == "__main__":
    start()
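
Since this article is only the data-preparation step for the clustering work in the rest of the series, here is a minimal sketch for reading the result back (it assumes the script above has been run with the default settings, so each data row holds only the post text):

import csv

texts = []
with open('./articaleData', encoding='utf-8', newline='') as f:
    reader = csv.reader(f)
    next(reader)                  # skip the header row written by init()
    for row in reader:
        if row:                   # ignore any blank lines
            texts.append(row[0])  # with the default settings each row stores only the post text

print(len(texts), 'posts loaded')

Note that init() writes a twelve-column header while writerRow currently stores a single column; if you uncomment more fields in parse_json, keep the header and the rows in sync.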
