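However powerful a crawler is, it cannot download an entire paper database, so the first pass of filtering still has to happen in the site's own search engine. Open Advanced search, wrap each keyword in quotation marks, and separate the keywords with OR (remember the spaces around it). In the publication title field, enter IEEE Transactions on Knowledge and Data Engineering.
Click search. This opens a results page, which gives us a URL: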
https://ieeexplore.ieee.org/search/searchresult.jsp?action=search&newsearch=true&matchBoolean=true&queryText=(%22All%20Metadata%22:%22financial%22%20OR%20%22All%20Metadata%22:%22finance%22%20OR%20%22All%20Metadata%22:%22trade%22%EF%BC%8C%22trading%22%20OR%20%22All%20Metadata%22:%22bank%22%20OR%20%22All%20Metadata%22:%22company%22%20OR%20%22All%20Metadata%22:%22enterprise%22%20OR%20%22All%20Metadata%22:%22management%22%20OR%20%22All%20Metadata%22:%22credit%22%20OR%20%22All%20Metadata%22:%22default%22%20OR%20%22All%20Metadata%22:%22risk%22%20OR%20%22All%20Metadata%22:%22asset%22%20OR%20%22All%20Metadata%22:%22bond%22%20OR%20%22All%20Metadata%22:%22stock%22%20OR%20%22All%20Metadata%22:%22equity%22%20OR%20%22All%20Metadata%22:%22volalitity%22%20OR%20%22All%20Metadata%22:%22futures%22%20OR%20%22All%20Metadata%22:%22share%22%20%22option%22%20OR%20%22All%20Metadata%22:%22return%22%20OR%20%22All%20Metadata%22:%22price%22%20OR%20%22All%20Metadata%22:%22pricing%22%20OR%20%22All%20Metadata%22:%22earning%22%20OR%20%22All%20Metadata%22:%22interest%22%20OR%20%22All%20Metadata%22:%22investment%22%20OR%20%22All%20Metadata%22:%22loan%22%20OR%20%22All%20Metadata%22:%22bankruptcy%22%20OR%20%22All%20Metadata%22:%22arbitrary%22)%20AND%20(%22Publication%20Title%22:IEEE%20Transactions%C2%A0on%C2%A0Knowledge%C2%A0and%20Data%20Engineering)&ranges=2020_2024_Year
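But three years of coding experience tell me this URL is incomplete: the search results are paginated, and this URL carries no page number. So I clicked through to the next page and got another URL: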
https://ieeexplore.ieee.org/search/searchresult.jsp?action=search&newsearch=true&matchBoolean=true&queryText=(%22All%20Metadata%22:%22financial%22%20OR%20%22All%20Metadata%22:%22finance%22%20OR%20%22All%20Metadata%22:%22trade%22%EF%BC%8C%22trading%22%20OR%20%22All%20Metadata%22:%22bank%22%20OR%20%22All%20Metadata%22:%22company%22%20OR%20%22All%20Metadata%22:%22enterprise%22%20OR%20%22All%20Metadata%22:%22management%22%20OR%20%22All%20Metadata%22:%22credit%22%20OR%20%22All%20Metadata%22:%22default%22%20OR%20%22All%20Metadata%22:%22risk%22%20OR%20%22All%20Metadata%22:%22asset%22%20OR%20%22All%20Metadata%22:%22bond%22%20OR%20%22All%20Metadata%22:%22stock%22%20OR%20%22All%20Metadata%22:%22equity%22%20OR%20%22All%20Metadata%22:%22volalitity%22%20OR%20%22All%20Metadata%22:%22futures%22%20OR%20%22All%20Metadata%22:%22share%22%20%22option%22%20OR%20%22All%20Metadata%22:%22return%22%20OR%20%22All%20Metadata%22:%22price%22%20OR%20%22All%20Metadata%22:%22pricing%22%20OR%20%22All%20Metadata%22:%22earning%22%20OR%20%22All%20Metadata%22:%22interest%22%20OR%20%22All%20Metadata%22:%22investment%22%20OR%20%22All%20Metadata%22:%22loan%22%20OR%20%22All%20Metadata%22:%22bankruptcy%22%20OR%20%22All%20Metadata%22:%22arbitrary%22)%20AND%20(%22Publication%20Title%22:IEEE%20Transactions%C2%A0on%C2%A0Knowledge%C2%A0and%20Data%20Engineering)&ranges=2020_2024_Year&highlight=true&returnFacets=ALL&returnType=SEARCH&matchPubs=true&pageNumber=2
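Now several parameters appear after Year. They are presumably defaults that can be left alone, but the last one is pageNumber=2.
The page number is the key part of this URL.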
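The pagination rule above can be sketched as a small helper. This is only an illustration: the function name with_page_number is hypothetical, and the base URL in the usage lines is a shortened placeholder, not the full query shown above.

```python
import re


def with_page_number(search_url: str, page: int) -> str:
    """Return search_url with its pageNumber parameter set to `page`."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    if re.search(r"[?&]pageNumber=\d*", search_url):
        # Replace an existing value in place and leave the rest untouched
        return re.sub(r"([?&]pageNumber=)\d*", rf"\g<1>{page}", search_url)
    # No pageNumber yet: append it as one more query parameter
    separator = "&" if "?" in search_url else "?"
    return f"{search_url}{separator}pageNumber={page}"


# Shortened placeholder query, for illustration only
base = "https://ieeexplore.ieee.org/search/searchresult.jsp?queryText=risk&ranges=2020_2024_Year"
page_urls = [with_page_number(base, p) for p in range(1, 4)]
```

Plain string handling is used here on purpose, so the already percent-encoded query text is not re-encoded.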
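We need a dedicated virtual environment for writing the crawler. Writing a crawler involves a lot of debugging, and the VS Code debugger supports Python 3.7 and above, so I chose 3.8, which I know well.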
conda create -n beautifulsoup_py38 python=3.8
conda activate beautifulsoup_py38
conda install requests
conda install pandas
conda install bs4
conda install lxml
conda install selenium
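Unlike static pages, many pages today load their content dynamically with AJAX or similar techniques. The most common bs4 approach, shown in the snippet that follows, only returns a blob of JS code, which was a major challenge. I then found the selenium library. Selenium drives a real browser the way a person would and is used for automation, testing, and so on. Put bluntly, the approach is crude: open the page, wait for it to finish loading, then scrape it... but at least it is fully automatic.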
web = requests.get(url, headers = myHttpheader)
web.encoding = 'utf-8' # important
soup = BeautifulSoup(web.text,'lxml')
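Besides installing the library, selenium also needs a browser driver. I use Chrome; its current version is shown in the browser settings, and mine is version 118. The driver version has to match the browser version.
After downloading the driver, unzip it and locate the exe file.
Set the environment variable: This PC → right-click → Properties → Advanced system settings → Environment Variables → User variables → Path → Edit → New, then add C:\Program Files\Google\Chrome\Application\ and remember to confirm.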
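A quick way to check that the Path change took effect is to ask Python where it finds the driver. This is a minimal sketch, assuming the executable is named chromedriver (chromedriver.exe on Windows); the helper name find_driver is made up for illustration.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable found on PATH, or None."""
    # On Windows, shutil.which also tries the extensions in PATHEXT (e.g. .exe)
    return shutil.which(name)


if __name__ == "__main__":
    driver_path = find_driver()
    print(driver_path if driver_path else "chromedriver not found on PATH")
```

If this prints the not-found message right after editing Path, opening a new terminal usually helps, since running shells keep the old environment.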
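Apply for a Baidu API key:
Baidu Translate Open Platform: http://api.fanyi.baidu.com/api/trans/product/index
Baidu Translate verifies a key on every request, so we first need to register an account. Once registered, the key is available under the developer information page.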
Click any article on the results page; its URL is
https://ieeexplore.ieee.org/document/9942340
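Then open the browser's developer tools and look at the href on the link: it is /document/9942340/, so we only need to collect all the hrefs and join them to the site root to reach each article.
The a tag of each article title has the class fw-bold, so I set up a wait: parsing starts only once an element with class fw-bold appears. Finally, all hrefs go into a list and the browser is closed.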
import os
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait  # wait for the page to load

# Browser options
options = webdriver.ChromeOptions()
# options.add_argument('--headless')  # headless mode: do not open a browser window


def get_urls(url):
    # The article list is loaded by JS, so let the browser render it first
    urls = []
    driver = None
    try:
        # Create the browser object and open the page
        driver = webdriver.Chrome(options=options)
        driver.get(url)
        # Wait up to 20 s for a title link (class fw-bold) to appear
        WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.CLASS_NAME, 'fw-bold')))
        # Parse the rendered page with BeautifulSoup
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        links = soup.find_all('a', attrs={'class': 'fw-bold'})
        # Drop the trailing slash and de-duplicate
        urls = list({a['href'].rstrip('/') for a in links if a.get('href')})
        print(len(urls))
    except Exception as e:
        print(e)
    finally:
        # Close the browser even if something failed
        if driver is not None:
            driver.quit()
    return urls


# Stage check
if __name__ == '__main__':
    get_urls('https://ieeexplore.ieee.org/search/searchresult.jsp?action=search&newsearch=true&matchBoolean=true&queryText=(%22All%20Metadata%22:%22financial%22%20OR%20%22All%20Metadata%22:%22finance%22%20OR%20%22All%20Metadata%22:%22trade%22%EF%BC%8C%22trading%22%20OR%20%22All%20Metadata%22:%22bank%22%20OR%20%22All%20Metadata%22:%22company%22%20OR%20%22All%20Metadata%22:%22enterprise%22%20OR%20%22All%20Metadata%22:%22management%22%20OR%20%22All%20Metadata%22:%22credit%22%20OR%20%22All%20Metadata%22:%22default%22%20OR%20%22All%20Metadata%22:%22risk%22%20OR%20%22All%20Metadata%22:%22asset%22%20OR%20%22All%20Metadata%22:%22bond%22%20OR%20%22All%20Metadata%22:%22stock%22%20OR%20%22All%20Metadata%22:%22equity%22%20OR%20%22All%20Metadata%22:%22volalitity%22%20OR%20%22All%20Metadata%22:%22futures%22%20OR%20%22All%20Metadata%22:%22share%22%20%22option%22%20OR%20%22All%20Metadata%22:%22return%22%20OR%20%22All%20Metadata%22:%22price%22%20OR%20%22All%20Metadata%22:%22pricing%22%20OR%20%22All%20Metadata%22:%22earning%22%20OR%20%22All%20Metadata%22:%22interest%22%20OR%20%22All%20Metadata%22:%22investment%22%20OR%20%22All%20Metadata%22:%22loan%22%20OR%20%22All%20Metadata%22:%22bankruptcy%22%20OR%20%22All%20Metadata%22:%22arbitrary%22)%20AND%20(%22Publication%20Title%22:IEEE%20Transactions%C2%A0on%C2%A0Knowledge%C2%A0and%20Data%20Engineering)&ranges=2020_2024_Year&highlight=true&returnFacets=ALL&returnType=SEARCH&matchPubs=true&pageNumber=1')
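If the output is 25, all is well, since a results page shows 25 articles by default.
The English titles and abstracts need to be translated into Chinese, so first write a translation function.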
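The href handling can also be tried offline, without a browser. The sketch below turns relative hrefs into full article URLs and removes duplicates while keeping page order. The helper name to_article_urls and the BASE_URL constant are illustrative assumptions, not part of the script above.

```python
from urllib.parse import urljoin

BASE_URL = "https://ieeexplore.ieee.org"  # site root taken from the article URL above


def to_article_urls(hrefs):
    """Join relative hrefs to the site root, strip trailing slashes, drop duplicates."""
    seen = set()
    urls = []
    for href in hrefs:
        full = urljoin(BASE_URL, href).rstrip("/")
        # Keep only article links, and keep the first occurrence of each
        if "/document/" in full and full not in seen:
            seen.add(full)
            urls.append(full)
    return urls


sample = ["/document/9942340/", "/document/9942340/", "/document/10154753/"]
article_urls = to_article_urls(sample)
```

Unlike a plain set, this keeps the articles in the order they appear on the results page, which makes it easier to compare the output against the site.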
import hashlib
import http.client
import json
import random
import urllib.parse

appid = 'xxx'  # fill in your appid
secretKey = 'xxx'  # fill in your secret key
url_baidu = '/api/trans/vip/translate'  # path of the general translation API


def translateBaidu(text, f='en', t='zh'):
    salt = random.randint(32768, 65536)
    # sign = MD5(appid + query text + salt + secret key)
    sign = appid + text + str(salt) + secretKey
    sign = hashlib.md5(sign.encode()).hexdigest()
    url = (url_baidu + '?appid=' + appid + '&q=' + urllib.parse.quote(text)
           + '&from=' + f + '&to=' + t + '&salt=' + str(salt) + '&sign=' + sign)
    httpClient = None
    try:
        httpClient = http.client.HTTPConnection('api.fanyi.baidu.com')
        httpClient.request('GET', url)
        # response is an HTTPResponse object
        response = httpClient.getresponse()
        result_all = response.read().decode('utf-8')
        data = json.loads(result_all)
        return str(data['trans_result'][0]['dst'])
    except Exception as e:
        # On any failure, report it and return None
        print(e)
        return None
    finally:
        if httpClient:
            httpClient.close()


# Stage check
if __name__ == '__main__':
    print(translateBaidu('i am happy'))
If it prints 我很高兴, everything is working.
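The signing step inside translateBaidu is easy to get wrong, so it may help to see it in isolation. The sketch below computes the same MD5 signature and builds the request path with no network call. The names make_sign and build_query are hypothetical, and the appid and key values in the usage lines are dummies.

```python
import hashlib
from urllib.parse import quote, urlencode


def make_sign(appid, text, salt, secret_key):
    """MD5 hex digest of appid + text + salt + secret key, as in translateBaidu."""
    raw = f"{appid}{text}{salt}{secret_key}"
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def build_query(appid, secret_key, text, salt, src="en", dst="zh"):
    """Build the GET path and query string for the translation request."""
    params = {
        "appid": appid,
        "q": text,
        "from": src,
        "to": dst,
        "salt": salt,
        "sign": make_sign(appid, text, salt, secret_key),
    }
    # quote_via=quote encodes spaces as %20, matching urllib.parse.quote
    return "/api/trans/vip/translate?" + urlencode(params, quote_via=quote)


# Dummy credentials, for illustration only
example_path = build_query("id", "key", "i am happy", 12345)
```

Note that the signature is computed over the raw text, while only the copy placed in the URL is percent-encoded.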
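Following the URL pattern found earlier, parse the tags that hold each article's title and abstract to get their text, then translate it. Some entries, mainly book-related ones, have no DOI, so that case needs its own handling.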
def get_info(urls):
    url1 = 'https://ieeexplore.ieee.org' + urls
    driver = None
    try:
        # Create the browser object and open the page
        driver = webdriver.Chrome(options=options)
        driver.get(url1)
        # Explicit wait: block until the main document element has loaded
        wait = WebDriverWait(driver, 5)
        wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'document-main')))
        # Read the page source only after the wait, then parse it
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        tle = soup.find('h1', attrs={'class': 'document-title text-2xl-md-lh'}).find('span').text
        title = tle.strip()
        abstract = soup.find_all('div', attrs={'class': 'u-mb-1'})[1].text
        zh_title = translateBaidu(title)
        zh_abstract = translateBaidu(abstract)
        info = {'title': title, '标题': zh_title, 'abstract': abstract, '摘要': zh_abstract}
        # Some entries (mostly book-related) have no DOI link
        doi_div = soup.find('div', attrs={'class': 'u-pb-1 stats-document-abstract-doi'})
        if doi_div is not None and doi_div.find('a') is not None:
            info['doi'] = doi_div.find('a')['href']
        return info
    except Exception as e:
        print(url1, e)
        return None
    finally:
        # Close the browser on every path
        if driver is not None:
            driver.quit()


# Stage check
if __name__ == '__main__':
    url = '/document/10154753'
    print(get_info(urls=url))
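Finally, combine everything into one script. To limit the damage from failures that are hard to avoid, I save once per page. The site is quite slow, though: one page takes about 10 minutes on average.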
if __name__ == '__main__':
    # set the work file directory
    path = r'C:/Users/pdnbplus/Documents/python全系列/网络爬虫/爬取IEEE文章/result'
    if not os.path.exists(path):
        print(path)
        os.makedirs(path)
    os.chdir(path)
    # Get the start urls.
    start_urls = 'https://ieeexplore.ieee.org/search/searchresult.jsp?action=search&newsearch=true&matchBoolean=true&queryText=(%22All%20Metadata%22:%22financial%22%20OR%20%22All%20Metadata%22:%22finance%22%20OR%20%22All%20Metadata%22:%22trade%22%EF%BC%8C%22trading%22%20OR%20%22All%20Metadata%22:%22bank%22%20OR%20%22All%20Metadata%22:%22company%22%20OR%20%22All%20Metadata%22:%22enterprise%22%20OR%20%22All%20Metadata%22:%22management%22%20OR%20%22All%20Metadata%22:%22credit%22%20OR%20%22All%20Metadata%22:%22default%22%20OR%20%22All%20Metadata%22:%22risk%22%20OR%20%22All%20Metadata%22:%22asset%22%20OR%20%22All%20Metadata%22:%22bond%22%20OR%20%22All%20Metadata%22:%22stock%22%20OR%20%22All%20Metadata%22:%22equity%22%20OR%20%22All%20Metadata%22:%22volalitity%22%20OR%20%22All%20Metadata%22:%22futures%22%20OR%20%22All%20Metadata%22:%22share%22%20%22option%22%20OR%20%22All%20Metadata%22:%22return%22%20OR%20%22All%20Metadata%22:%22price%22%20OR%20%22All%20Metadata%22:%22pricing%22%20OR%20%22All%20Metadata%22:%22earning%22%20OR%20%22All%20Metadata%22:%22interest%22%20OR%20%22All%20Metadata%22:%22investment%22%20OR%20%22All%20Metadata%22:%22loan%22%20OR%20%22All%20Metadata%22:%22bankruptcy%22%20OR%20%22All%20Metadata%22:%22arbitrary%22)%20AND%20(%22Publication%20Title%22:IEEE%20Transactions%C2%A0on%C2%A0Knowledge%C2%A0and%20Data%20Engineering)&ranges=2020_2024_Year&highlight=true&returnFacets=ALL&returnType=SEARCH&matchPubs=true&pageNumber='
    # Pages 10 to 20 in this run
    for i in range(10, 21):
        info_all = []
        url_i = start_urls + str(i)
        top_urls = []
        try:
            top_urls = top_urls + get_urls(url_i)
        except Exception as e:
            print(i, e)
        for url in top_urls:
            # Short random pause (0.1 to 0.5 s) between articles
            time.sleep(random.randint(1, 5) / 10)
            info = get_info(url)
            if info and len(info) != 0:
                info_all = info_all + [info]
        # Save one CSV per results page
        info_all = pd.DataFrame(info_all)
        info_all.to_csv(path + f'/article{i}.csv')
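Because each page is saved to its own article{i}.csv, an interrupted run can be resumed by skipping pages that already have a file. A minimal sketch of that bookkeeping follows; the helper name pending_pages is hypothetical, and it only inspects file names, not file contents.

```python
def pending_pages(existing_files, first_page, last_page):
    """Return the page numbers in [first_page, last_page] with no article{n}.csv yet."""
    done = set(existing_files)
    return [
        page
        for page in range(first_page, last_page + 1)
        if f"article{page}.csv" not in done
    ]


# File names as they might appear in the result directory after a partial run
already_saved = ["article10.csv", "article12.csv"]
todo = pending_pages(already_saved, 10, 13)
```

In the script above, existing_files could be os.listdir(path), and the loop over range(10, 21) could iterate over the returned list instead. A page whose CSV was written but came out empty would still count as done, so this is a coarse check.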