NLTK ships quite a few datasets, and clicking through the download page one item at a time quickly becomes painful. Below is a small script I put together that you are welcome to use:
import os
import time
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

# The NLTK data index page is: http://www.nltk.org/nltk_data/
# Just save that page locally ("Save as...") and point this variable at the file.
nltk_page = 'nltk.html'

# Downloading from mainland China is a bit unreliable, so I go through a proxy here.
# If you do not have a proxy, leave a comment and I will update the mirrored data.
proxies = {'http': 'socks5://localhost:1080', 'https': 'socks5://localhost:1080'}


def save_data(data, path):
    # Create the target directory (if any) before writing the file.
    dir_name = os.path.dirname(path)
    if dir_name:
        os.makedirs(dir_name, exist_ok=True)
    with open(path, 'wb') as fw:
        fw.write(data)


req_data = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0",
}

with open(nltk_page, 'r') as fr:
    data = fr.read()

while True:
    total_num, suc_num, err_num = 0, 0, 0
    soup = BeautifulSoup(data, 'html.parser')
    # Every package on the index page is linked through a "download" anchor.
    for link in soup.find_all('a'):
        if link.get_text().strip().lower() == 'download':
            link_url = link.get('href')
            save_path = urlparse(link_url).path.strip('/')
            total_num += 1
            if os.path.exists(save_path):
                continue  # already downloaded on a previous pass
            time.sleep(3)
            try:
                r = requests.get(link_url, headers=req_data, proxies=proxies)
                suc_num += 1
                print(total_num, '[%d]down:' % suc_num, save_path)
            except Exception:
                err_num += 1
                print(total_num, '[%d]error:' % err_num, link_url)
                continue
            save_data(r.content, save_path)
    if err_num == 0:
        print('down finish!!')
        break
    else:
        # Some downloads failed; wait a bit and retry the whole list.
        print('try again')
        print('-----------------------------------------------------')
        time.sleep(10)

print('total_num:', total_num, 'suc_num:', suc_num, 'err_num:', err_num)
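Once the zips have been fetched, NLTK still needs to be told where to find them. The lines below are a minimal sketch, not part of the script above, assuming you have unpacked the packages into a local directory laid out like a standard nltk_data folder (corpora/, tokenizers/, ...); the path '/path/to/nltk_data' is a placeholder to replace with your own location.

import nltk

# Hypothetical local directory; replace with wherever you unpacked the downloaded zips.
# It should contain the usual sub-folders: corpora/, tokenizers/, taggers/, ...
nltk.data.path.append('/path/to/nltk_data')

# Check that a package can be located, e.g. the punkt tokenizer models
# (assuming tokenizers/punkt was among the packages you unpacked).
print(nltk.data.find('tokenizers/punkt'))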
The downloaded NLTK data is also available on Baidu Netdisk:
Link: https://pan.baidu.com/s/1r0qPlhXF3ScAG1I3U1_Qdg  Extraction code: v166
PS: This is for learning purposes only. For commercial use, please contact the data providers directly and support them. If you have any other questions, feel free to leave a comment!