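This article follows a post by the Zhihu author 马哥python说: 【爬虫案例】用Python爬取百度热搜榜数据!
Scraping target
Baidu hot search board → https://top.baidu.com/board?tab=realtime
For each hot search entry, scrape:
the title, summary, heat change, heat score, heat tag, tag image, entry image, and link URL,
then output everything as JSON.
The reference article has already identified the API endpoint behind the board, so this article uses that endpoint directly instead of rediscovering it.
To learn how the endpoint was found, see the reference article cited at the top.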
// Baidu hot search board API endpoint
$url = 'https://top.baidu.com/api/board?platform=wise&tab=realtime';
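Because we are using PHP and the output is JSON, we first define a function that emits JSON in a standard shape.
The format is borrowed from the output convention of the API platform framework I currently use.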
function msg($code = 0, $msg = '', $data = '', $debug = '') {
    header("Content-Type:application/json; charset=utf-8");
    $end_time = microtime(true);
    $json = [
        "code" => $code,
        "msg" => $msg,
        "data" => $data,
        "debug" => $debug,
        // start_time and user_ip are constants defined during setup
        "exec_time" => round($end_time - start_time, 6),
        "ip" => user_ip,
    ];
    // Omit the debug field when it is empty
    if (!$json['debug']) {
        unset($json['debug']);
    }
    echo json_encode($json, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
}
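To check the response shape without sending headers or echoing anything, the array-building step can be pulled into a pure function. This is only a sketch under assumptions: `build_envelope` and its parameters are hypothetical names that do not appear in the original code, and the start time and IP are passed in as arguments instead of being read from constants.

```php
<?php
// Hypothetical pure variant of msg(): builds the response array and returns it.
function build_envelope(int $code, string $msg, $data, string $debug, float $startTime, string $ip): array
{
    $json = [
        'code' => $code,
        'msg' => $msg,
        'data' => $data,
        'debug' => $debug,
        'exec_time' => round(microtime(true) - $startTime, 6),
        'ip' => $ip,
    ];
    // Same rule as msg(): drop the debug field when it is empty.
    if (!$json['debug']) {
        unset($json['debug']);
    }
    return $json;
}

// Usage with made-up values (203.0.113.7 is a documentation-only address).
$envelope = build_envelope(200, 'ok', ['a' => 1], '', microtime(true), '203.0.113.7');
echo json_encode($envelope, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES), PHP_EOL;
```

Returning the array first and encoding it afterwards keeps the formatting rules testable on their own.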
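The standardized output above includes the execution time and the client IP, so we also need the following code, which must run before msg() is called: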
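// Prefer X-Forwarded-For when it is present; otherwise use the direct peer address
$user_ip = !empty($_SERVER['HTTP_X_FORWARDED_FOR']) ? $_SERVER['HTTP_X_FORWARDED_FOR'] : $_SERVER['REMOTE_ADDR'];
// If the header holds a comma-separated list, keep the last entry
if (stripos($user_ip, ',') !== false) {
    $user_ip = trim(substr($user_ip, strripos($user_ip, ',') + 1));
}
define('user_ip', $user_ip);
define('start_time', microtime(true));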
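The IP logic above can also be written as a function that takes a `$_SERVER`-like array, which makes it easy to try with sample input. The sketch below is an assumption on my part: `resolve_client_ip` is a hypothetical name, and the `filter_var` validation step is an addition that the original code does not perform. It keeps the same policy of taking the last entry in a comma-separated list.

```php
<?php
// Hypothetical helper: resolve the client IP from a $_SERVER-like array.
function resolve_client_ip(array $server): string
{
    $remote = $server['REMOTE_ADDR'] ?? '';
    $forwarded = $server['HTTP_X_FORWARDED_FOR'] ?? '';
    if ($forwarded === '') {
        return $remote;
    }
    // Same policy as the article: keep the last entry of a comma-separated list.
    $parts = explode(',', $forwarded);
    $candidate = trim(end($parts));
    // Added check (not in the original): fall back to REMOTE_ADDR when the value is not a valid IP.
    return filter_var($candidate, FILTER_VALIDATE_IP) !== false ? $candidate : $remote;
}

// Usage with made-up documentation addresses.
$ip = resolve_client_ip([
    'HTTP_X_FORWARDED_FOR' => '198.51.100.1, 203.0.113.7',
    'REMOTE_ADDR' => '192.0.2.10',
]);
```

Note that X-Forwarded-For is set by the client or by proxies, so the value is only as trustworthy as whatever sits in front of the server.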
Finally, add a cURL function to fetch the content:
function curl_get($url, $outime = 10) {
    // Forward the client IP in the request headers
    $header = [
        'X-FORWARDED-FOR:' . user_ip,
        'CLIENT-IP:' . user_ip
    ];
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, $outime);
    // Skip TLS certificate verification
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"); // Spoofed User-Agent
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
    curl_setopt($ch, CURLOPT_REFERER, $url);
    $data = curl_exec($ch); // Response body, or false on failure
    curl_close($ch);
    return $data;
}
That completes the preparation.
Next, send the request to Baidu:
$get_data = curl_get($url,60);
The response is JSON, so decode it:
$data = json_decode($get_data, true);
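The decode step assumes the request succeeded and the body is well-formed. A slightly more defensive version might look like the sketch below. `decode_board_cards` is a hypothetical helper, not part of the original code; the `data.cards` path is the one the article reads from the response.

```php
<?php
// Hypothetical helper: decode the raw response and return the cards array,
// or null when the body is not valid JSON or lacks data.cards.
function decode_board_cards($raw): ?array
{
    // curl_exec() returns false on failure, so reject anything that is not a non-empty string.
    if (!is_string($raw) || $raw === '') {
        return null;
    }
    $data = json_decode($raw, true);
    if (json_last_error() !== JSON_ERROR_NONE || !is_array($data)) {
        return null;
    }
    $cards = $data['data']['cards'] ?? null;
    return is_array($cards) ? $cards : null;
}

// Usage with a tiny made-up body that mimics the structure described in the article.
$cards = decode_board_cards('{"data":{"cards":[{"content":[],"typeName":"realtime"}]}}');
```

Returning null for every failure case lets the caller answer with a single error response instead of emitting PHP warnings.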
Inspecting the response shows that the content sits under the cards array, so extract that array first:
$cards = $data['data']['cards'];
// The realtime board is read from the first card; its 'content' array holds the regular entries
$cardContent = $cards[0]['content'];
$cardItems = array();
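Baidu hot search currently has two kinds of entries: one pinned entry at the top and 30 regular entries below it. The API response separates them as well, so we handle them separately:
1. Following the order of the data in the response, first extract the 30 regular entries: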
foreach ($cardContent as $j => $item) {
    // Extract url, desc, hotChange, hotScore, index, hotTag, hotTagImg, img, and word from each regular entry
    $url = $item['url'];
    $desc = $item['desc'];
    $hotChange = $item['hotChange'];
    $hotScore = $item['hotScore'];
    $index = $item['index'];
    $hotTag = $item['hotTag'];
    $hotTagImg = $item['hotTagImg'];
    $img = $item['img'];
    $word = $item['word'];
    $cardItems[$j]['word'] = $word;
    $cardItems[$j]['desc'] = $desc;
    $cardItems[$j]['hotChange'] = $hotChange;
    $cardItems[$j]['hotScore'] = $hotScore;
    $cardItems[$j]['index'] = $index;
    $cardItems[$j]['hotTag'] = $hotTag;
    $cardItems[$j]['hotTagImg'] = $hotTagImg;
    $cardItems[$j]['img'] = $img;
    $cardItems[$j]['url'] = $url;
}
$output['content'] = $cardItems;
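The extraction loop copies the same nine fields one by one. As a more compact alternative, the field list can be kept in one place and copied in a loop, with a default for fields an entry does not carry. This is a sketch under assumptions: `HOT_FIELDS` and `pick_hot_fields` are hypothetical names, the empty-string default is my choice, and the sample values are made up.

```php
<?php
// Field names taken from the article's extraction loop.
const HOT_FIELDS = ['word', 'desc', 'hotChange', 'hotScore', 'index', 'hotTag', 'hotTagImg', 'img', 'url'];

// Hypothetical helper: copy the listed fields, using '' for any that are missing.
function pick_hot_fields(array $item, array $fields = HOT_FIELDS): array
{
    $picked = [];
    foreach ($fields as $field) {
        $picked[$field] = $item[$field] ?? '';
    }
    return $picked;
}

// Sample entry with made-up values; 'hotTagImg' is deliberately missing and 'extra' is not wanted.
$sample = ['word' => 'example title', 'hotScore' => '12345', 'index' => 0, 'extra' => 'ignored'];
$picked = pick_hot_fields($sample);
```

With such a helper, the regular entries could be built with `array_map('pick_hot_fields', $cardContent)`, and the same function would work for the pinned entry.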
2. Extract the pinned entry:
// Extract url, desc, hotChange, hotScore, index, hotTag, hotTagImg, img, and word from the pinned entry
$url = $cards[0]['topContent'][0]['url'];
$desc = $cards[0]['topContent'][0]['desc'];
$hotChange = $cards[0]['topContent'][0]['hotChange'];
$hotScore = $cards[0]['topContent'][0]['hotScore'];
$index = $cards[0]['topContent'][0]['index'];
$hotTag = $cards[0]['topContent'][0]['hotTag'];
$hotTagImg = $cards[0]['topContent'][0]['hotTagImg'];
$img = $cards[0]['topContent'][0]['img'];
$word = $cards[0]['topContent'][0]['word'];
$output['topContent']['word'] = $word;
$output['topContent']['desc'] = $desc;
$output['topContent']['hotChange'] = $hotChange;
$output['topContent']['hotScore'] = $hotScore;
$output['topContent']['index'] = $index;
$output['topContent']['hotTag'] = $hotTag;
$output['topContent']['hotTagImg'] = $hotTagImg;
$output['topContent']['img'] = $img;
$output['topContent']['url'] = $url;
3. Extract the update time and the board type:
$updateTime = date('Y-m-d H:i:s', $cards[0]['updateTime']);
$typeName = $cards[0]['typeName'];
$output['updateTime'] = $updateTime;
$output['typeName'] = $typeName;
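`date()` formats the timestamp in the server's default time zone, so the same response can print different times on different machines. If a fixed zone is wanted, something like the sketch below could be used. `format_update_time` is a hypothetical helper, and `Asia/Shanghai` as the default is my assumption; the article does not say which zone its sample output uses. It also assumes, as the article's `date()` call does, that `updateTime` is a Unix timestamp in seconds.

```php
<?php
// Hypothetical helper: format a Unix timestamp (seconds) in an explicit time zone.
function format_update_time(int $timestamp, string $tz = 'Asia/Shanghai'): string
{
    // '@<seconds>' creates the instant in UTC; setTimezone() converts it for display.
    $dt = (new DateTimeImmutable('@' . $timestamp))->setTimezone(new DateTimeZone($tz));
    return $dt->format('Y-m-d H:i:s');
}

// Usage: the Unix epoch rendered in two zones.
$utc = format_update_time(0, 'UTC');
$shanghai = format_update_time(0);
```

In the article's code this would replace the `date('Y-m-d H:i:s', ...)` call, leaving the rest of step 3 unchanged.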
4. Pass the assembled $output array to msg() to emit the standardized JSON. A non-empty fourth argument would add the debug field:
msg(200, '请求成功', $output);
Finally, here is the scraped data as it is output:
{
"code": 200,
"msg": "请求成功",
"data": {
"content": [...30],
"topContent": {...9},
"updateTime": "2023-09-04 15:39:00",
"typeName": "realtime"
},
"debug": "代码仅供学习使用,请勿非法使用(包括但不限于商业用途等),一切后果由使用者自行承担!",
"exec_time": 0.046286,
"ip": "197.149.235.178"
}
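That is 31 entries in total (1 pinned entry + 30 regular entries).
The full content is too long to show in the article. In the output:
content holds the 30 regular entries, and topContent holds the pinned entry.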
Download via CSDN → https://download.csdn.net/download/xwteam_0662/88297840