In the previous scraping post, we combined Scrapy and Selenium to crawl video information from the Life section. That approach is fairly hard for beginners to follow, so is there a simpler way? Today I'll continue the series and share another way to scrape a video site's trending videos: hitting its API endpoint directly.
Take the MAD sub-section of the site's Animation area as an example. Open the page that lists the trending videos, open Chrome DevTools (Inspect), refresh the page, and locate the request that returns the video data, as shown in the figure below:
The `result` array holds 20 video entries per page. The fields we want to scrape:
Field | Meaning |
---|---|
author | uploader name |
id | video aid (needed for the follow-up scrape) |
duration | video length |
favorites | favorites count |
play | view count |
pubdate | publish time |
review | comment count |
title | video title |
The main idea: find the endpoint's URL, fetch the raw response with requests, convert it into a dictionary with the json module, and then read each piece of information by its key.
One thing to note: the endpoint URL is very long, but everything after `search?` is just the query string built from the dictionary below. `page` is the current page number, `time_from` and `time_to` bound the videos' upload dates, and `cate_id` is the numeric code of the sub-section (the other Animation sub-sections use 25, 47 and 210).
(Note: you can leave the `callback` parameter out when building the request!)
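To make the flow concrete, here is a minimal single-page sketch (the `API_URL` placeholder stands in for the endpoint the post deliberately omits, and the field names follow the table above):

```python
import requests

# Placeholder -- the post omits the real endpoint; capture it from the
# DevTools Network tab and paste it here.
API_URL = '...'

params = {
    'main_ver': 'v3',
    'search_type': 'video',
    'view_type': 'hot_rank',
    'order': 'click',
    'copy_right': -1,
    'cate_id': 24,          # sub-section code (25 / 47 / 210 for the others)
    'page': 1,              # current page
    'pagesize': 20,
    'jsonp': 'jsonp',
    'time_from': 20201201,  # upload-date window
    'time_to': 20201219,
    # 'callback' is simply left out -- the response is plain JSON anyway
}

resp = requests.get(API_URL, params=params, timeout=10)
data = resp.json()                  # JSON body -> Python dict
for item in data['result']:         # 20 videos per page
    print(item['title'], item['author'], item['play'])
```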
Full code:
```python
import json
import os
import random
import time

import requests
import pandas as pd

os.chdir('C:/users/dell/Desktop/')

# The real endpoint is omitted here (to get past review) -- capture it yourself
# from the DevTools Network tab.
url = 'URL omitted -- obtain it yourself'

# Small pool of user-agent strings to rotate through
head = [
    "Mozilla/5.0 (Windows NT 6.0; rv:2.0) Gecko/20100101 Firefox/4.0 Opera 12.14",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0) Opera 12.14",
    "Opera/12.80 (Windows NT 5.1; U; en) Presto/2.10.289 Version/12.02",
    "Opera/9.80 (Windows NT 6.1; U; es-ES) Presto/2.9.181 Version/12.00",
    "Opera/9.80 (Windows NT 5.1; U; zh-sg) Presto/2.9.181 Version/12.00",
]
headers = {
    'user-agent': random.choice(head),
    'referer': 'omitted -- obtain it yourself',  # header name is "referer"
}


def get_inf(url, headers, num):
    df = []
    for page in range(num):
        params = {
            'main_ver': 'v3',
            'search_type': 'video',
            'view_type': 'hot_rank',
            'order': 'click',
            'copy_right': -1,
            'cate_id': 24,          # sub-section code (25 / 47 / 210 for the others)
            'page': page,           # current page
            'pagesize': 20,
            'jsonp': 'jsonp',
            'time_from': 20201201,  # upload-date window
            'time_to': 20201219,
        }
        try:
            r = requests.get(url, headers=headers, params=params)
            data = json.loads(r.text)
            for item in data['result']:          # 20 videos per page
                author = item['author']
                duration = item['duration']
                collection = item['favorites']
                aid = item['id']
                title = item['title']
                pubdate = item['pubdate']
                review = item['review']
                play = item['play']
                df.append([author, title, pubdate, aid, duration, play, review, collection])
            print('Page {} done'.format(page + 1))
            time.sleep(random.randint(1, 2))     # pause briefly between pages
        except Exception:
            print('No data fetched for page {}'.format(page + 1))
    df = pd.DataFrame(df, columns=['uploader', 'title', 'pubdate', 'aid',
                                   'duration', 'plays', 'reviews', 'favorites'])
    df.to_csv('bilibili.csv', encoding='gb18030', index=False)
    print('{} trending-video records in total'.format(len(df)))


if __name__ == '__main__':
    start = time.time()
    num = 100                                    # pages to fetch (20 videos each)
    get_inf(url, headers, num)
    end = time.time()
    print('Elapsed: {} minutes'.format(round((end - start) / 60, 2)))
```
Because we hit the API directly, scraping is very fast: 2,000 records in 2.76 minutes.
Of course, as the results above show, each video's like count, coin count and danmaku count are still missing. So we can open any trending video and capture its stats endpoint with the same DevTools trick; to speed things up, this part is scraped asynchronously with coroutines.
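The async script below reads the aid list from a plain-text file `aid.txt`, one id per line. A small helper to produce that file from the `bilibili.csv` written by the previous script might look like this (a sketch, assuming the column names used above):

```python
import pandas as pd

# Dump the aid column from the first scrape into aid.txt, one id per line,
# so the async script can read it back with readlines().
df = pd.read_csv('bilibili.csv', encoding='gb18030')
with open('aid.txt', 'w') as f:
    f.write('\n'.join(df['aid'].astype(str)))
```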
```python
import asyncio
import json
import os
import random
import time

import aiohttp
import pandas as pd

os.chdir('C:/users/dell/Desktop/')

# aid.txt holds the aid values collected by the previous script, one per line
with open('aid.txt', 'r') as f:
    aid_list = [line.strip() for line in f.readlines()]

df1 = []
semaphore = asyncio.Semaphore(10)   # cap concurrency at 10 requests

head = [
    "Mozilla/5.0 (Windows NT 6.0; rv:2.0) Gecko/20100101 Firefox/4.0 Opera 12.14",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0) Opera 12.14",
    "Opera/12.80 (Windows NT 5.1; U; en) Presto/2.10.289 Version/12.02",
    "Opera/9.80 (Windows NT 6.1; U; es-ES) Presto/2.9.181 Version/12.00",
    "Opera/9.80 (Windows NT 5.1; U; zh-sg) Presto/2.9.181 Version/12.00",
]
header = {
    'user-agent': random.choice(head),
    'referer': 'omitted -- obtain it yourself',
}


async def get_page(url):
    await asyncio.sleep(2)                       # small delay per request
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=header) as r:
            return await r.text()


async def parse_page(text):
    try:
        data = json.loads(text)
        name = data['data']['owner']['name']
        video_name = data['data']['title']
        # pubdate is a Unix timestamp -- convert it to a readable string
        release_time = time.strftime('%Y-%m-%d %H:%M:%S',
                                     time.localtime(data['data']['pubdate']))
        like = data['data']['stat']['like']
        coin = data['data']['stat']['coin']
        collection = data['data']['stat']['favorite']
        share = data['data']['stat']['share']
        view = data['data']['stat']['view']
        dm = data['data']['stat']['danmaku']
        df1.append([name, video_name, release_time, like, coin,
                    collection, share, view, dm])
    except Exception:
        print('No data fetched')


async def get_data(url):
    async with semaphore:
        text = await get_page(url)
        await parse_page(text)
        print('Done:', url)


def main():
    # Detail-API prefix omitted (to get past review) -- obtain it yourself
    url_list = ['detail API URL omitted -- obtain it yourself' + aid for aid in aid_list]
    tasks = [asyncio.ensure_future(get_data(url)) for url in url_list]
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()


if __name__ == '__main__':
    start = time.time()
    main()
    df1 = pd.DataFrame(df1, columns=['uploader', 'title', 'pubdate', 'likes', 'coins',
                                     'favorites', 'shares', 'views', 'danmaku'])
    df1.to_csv('bilibili_detail.csv', encoding='gb18030', index=False)
    end = time.time()
    print('Elapsed: {} minutes'.format(round((end - start) / 60, 2)))
```
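A side note on the event-loop handling: the manual `get_event_loop()` / `run_until_complete()` / `close()` sequence works, but on Python 3.7+ the same thing can be written with `asyncio.run`. A rough sketch, reusing `get_data` and `aid_list` from the script above (the URL prefix is still the one readers need to obtain themselves):

```python
# Alternative entry point (Python 3.7+): let asyncio manage the loop.
async def main_coro():
    urls = ['detail API URL omitted -- obtain it yourself' + aid for aid in aid_list]
    await asyncio.gather(*(get_data(url) for url in urls))

# asyncio.run(main_coro())   # replaces the manual loop management in main()
```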
Again, this is very fast (had I known about the endpoint earlier, I wouldn't have bothered with Scrapy, haha~), as shown in the figure below.
In the previous post, Scrapy + Selenium took roughly two and a half hours to crawl 13,000 records. This time, a rough estimate says that with a normal connection the same number of videos would take about 65 minutes -- another big speed-up. (So before scraping a site, I recommend first checking whether such an endpoint exists. That said, an endpoint like this one, with no signed parameters and no cap on how much you can fetch, is a lucky find rather than the norm.)
When I find the time I'll follow up with the data analysis for this dataset. That's all for this post~