
Web Scraping in Practice | A Hands-on Python Guide to Collecting & Visualizing the Answers to a Zhihu Question (Code Included)

Scraper design workflow

1. Discover the URL pattern
2. Try requesting one page
3. Parse the data of interest
4. Save to CSV
5. Assemble the full code
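Before filling in the details, the five steps above can be sketched as a skeleton. Note that `build_url` and `parse` here are simplified placeholders of my own; the real URL pattern and fields are worked out in the steps below.

```python
# Skeleton of the workflow; build_url and parse are simplified
# placeholders for the logic developed in the steps below.

def build_url(offset):
    # placeholder template; the real URL pattern is found in step 1
    return f"https://www.zhihu.com/api/v4/questions/432119474/answers?offset={offset}&limit=5"

def parse(payload):
    # keep only the fields we care about from one page of JSON
    return [(item["id"], item["excerpt"]) for item in payload.get("data", [])]

# a faked one-answer page, so the flow runs without a network call
fake_page = {"data": [{"id": 1, "excerpt": "first answer"}]}
print(build_url(0))
print(parse(fake_page))
```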

I. Discovering the URL pattern

1. Press F12 to open the developer tools
2. Select the Network panel and click "View all 6,217 answers"
3. Watch the URLs captured by the developer tools
4. Run each URL through operations 4-6 shown in the screenshot
5. Click Preview
6. Check whether its content matches the answers on the current page
7. We finally find the URL in the red box of screenshot 7; the request method is GET

8. Still on the page from step 7, scroll to the bottom and you can see offset and limit.

The URL we found (note the offset near the end):

https://www.zhihu.com/api/v4/questions/432119474/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cbadge%5B*%5D.topics%3Bsettings.table_of_content.enabled%3B&offset=3&limit=5&sort_by=default&platform=desktop

The URL contains an offset parameter.

  • offset — my guess is that this value works like a page number
  • limit — how many answers each URL returns; the default is 5

URL template (note the offset near the end of the template):

https://www.zhihu.com/api/v4/questions/432119474/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cbadge%5B*%5D.topics%3Bsettings.table_of_content.enabled%3B&offset={offset}&limit=5&sort_by=default&platform=desktop

The question currently has more than 6,200 answers; at 5 per page, offset spans roughly 1,240 pages.
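A quick sanity check on that arithmetic. One caveat worth verifying: in many REST APIs offset counts answers rather than pages, in which case the values to request would step by limit instead of 1.

```python
total_answers = 6217   # the count shown on the question page
limit = 5              # answers returned per request

# if offset works like a page number (the guess above):
num_pages = -(-total_answers // limit)   # ceiling division
print(num_pages)                         # 1244 pages for 6217 answers

# if offset instead counts answers, step by `limit`:
offsets = list(range(0, total_answers, limit))
print(offsets[:3], offsets[-1])          # [0, 5, 10] 6215
```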

II. Trying a request against one page

import requests

template = 'https://www.zhihu.com/api/v4/questions/432119474/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cbadge%5B*%5D.topics%3Bsettings.table_of_content.enabled%3B&offset={offset}&limit=5&sort_by=default&platform=desktop'

# later we will loop over pages: for page in range(1, 1240):
url = template.format(offset=1)

# a browser user-agent so the request isn't rejected outright
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36'
}

resp = requests.get(url, headers=headers)

resp

<Response [200]>

Notice that the data shown in step 5 can be expanded, so it is very likely JSON.


So we try resp.json() to get the data as a parsed dictionary.
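One caveat: if Zhihu rate-limits us or redirects to a login page, the body won't be JSON and resp.json() raises an exception. A small guard can help (a sketch; the helper name `safe_json` is my own):

```python
import json

def safe_json(text):
    """Return the parsed object, or None when the body isn't valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

print(safe_json('{"data": []}'))                 # {'data': []}
print(safe_json('<html>login required</html>'))  # None
```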


III. Parsing the data of interest

Since resp.json() returns a dictionary, locating the data we want is straightforward.

Let's keep the requirements simple and collect only three fields: author, id, and excerpt.

The author field contains much richer information; if you're interested, you can flatten it further, but this article does no extra cleaning on author.

for info in resp.json()['data']:
    author = info['author']    # nested dict with richer author info
    answer_id = info['id']
    excerpt = info['excerpt']
    print(answer_id, excerpt)
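Step 4 of the workflow is saving to CSV. A minimal sketch with csv.DictWriter, using made-up rows in place of the live API response (the filename and sample values are arbitrary):

```python
import csv

# stand-in rows shaped like the three fields collected above
rows = [
    {"author": "someone", "id": 101, "excerpt": "an answer excerpt"},
    {"author": "someone else", "id": 102, "excerpt": "another excerpt"},
]

# newline="" avoids blank lines on Windows; utf-8 keeps Chinese text intact
with open("zhihu_answers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["author", "id", "excerpt"])
    writer.writeheader()
    writer.writerows(rows)
```

In the real loop, you would open the file once before paging and call writer.writerow for each answer.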