The simplest way to fetch data is with requests and BeautifulSoup; for beginners and small projects they can handle most everyday scraping tasks. The following introduces how to use these two libraries.
Basic usage
import requests
response = requests.get('http://www.baidu.com')
print(response.status_code)
print(response.text)
print(response.cookies)
Various request methods
import requests
requests.post('http://httpbin.org/post')
requests.put('http://httpbin.org/post')
requests.delete('http://httpbin.org/post')
requests.head('http://httpbin.org/post')
requests.options('http://httpbin.org/post')
GET request with parameters
import requests
data = {
    'name': 'germey',
    'age': 22
}
response = requests.get('http://httpbin.org/get', params=data)
print(response.text)
Parsing JSON
import json
print(response.json())
print(json.loads(response.text))  # equivalent to response.json()
Fetching and saving binary data
response = requests.get('http://github.com/favicon.ico')
print(response.content)  # bytes: the raw image data
with open('favicon.ico', 'wb') as f:
    f.write(response.content)
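For larger files, requests' streaming mode avoids holding the whole body in memory; a minimal sketch, reusing the favicon URL above:
import requests
response = requests.get('http://github.com/favicon.ico', stream=True)  # stream=True defers downloading the body
with open('favicon.ico', 'wb') as f:
    for chunk in response.iter_content(chunk_size=1024):  # read the body in 1 KB chunks
        f.write(chunk)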
Adding headers
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}
response = requests.get('https://www.zhihu.com/explore', headers=headers)
print(response.text)
POST request
import requests
data = {'name': 'germey', 'age': '22'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}
response = requests.post('http://httpbin.org/post', data=data, headers=headers)
print(response.json())
Response attributes
import requests
response = requests.get('http://www.jianshu.com')
print(type(response.status_code), response.status_code)
print(type(response.headers), response.headers)
print(type(response.cookies), response.cookies)
print(type(response.url), response.url)
print(type(response.history), response.history)
Checking status codes
Status codes are the server's response to the browser's request, for example:
200: request succeeded
301: resource permanently moved to another URL
404: requested resource not found
500: internal server error
import requests
response = requests.get('http://www.jianshu.com/hello.html')
if response.status_code == requests.codes.ok:
    print('Request Successful')
else:
    exit()
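Alternatively, requests can raise the error itself: raise_for_status() throws HTTPError for 4xx/5xx responses. A small sketch of the same check:
import requests
response = requests.get('http://www.jianshu.com/hello.html')
try:
    response.raise_for_status()  # raises requests.exceptions.HTTPError for 4xx/5xx
    print('Request Successful')
except requests.exceptions.HTTPError as e:
    print('Request failed:', e)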
File upload
import requests
files = {'file': open('favicon.ico', 'rb')}
response = requests.post('http://httpbin.org/post', files=files)  # the keyword argument is files, not file
Getting cookies
response = requests.get('http://www.baidu.com')
print(response.cookies)
for key, value in response.cookies.items():
    print(key, '=', value)
Maintaining a session (simulating login)
s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
response = s.get('http://httpbin.org/cookies')
print(response.text)
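For contrast, a minimal sketch without a Session: the cookie set by the first request is not carried into the second, since each requests.get() is an independent connection:
import requests
requests.get('http://httpbin.org/cookies/set/number/123456789')
response = requests.get('http://httpbin.org/cookies')
print(response.text)  # cookies are empty: the two requests do not share state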
Suppressing warnings
import requests
from requests.packages import urllib3
urllib3.disable_warnings()
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)
Certificate verification
response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))  # client certificate and key
print(response.status_code)
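Rather than disabling verification as above, verify can also point at a trusted CA bundle; a sketch (the .pem path is a placeholder):
import requests
response = requests.get('https://www.12306.cn', verify='/path/ca.pem')  # hypothetical CA bundle path
print(response.status_code)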
Setting proxies
proxies = {
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
}
response = requests.get('http://www.taobao.com', proxies=proxies)
print(response.status_code)
Timeout setting
response = requests.get('https://www.taobao.com', timeout=1)
print(response.status_code)
Authentication
import requests
from requests.auth import HTTPBasicAuth
r = requests.get('http://120.27.34.24:9001', auth=HTTPBasicAuth('user', '123'))
print(r.status_code)
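requests also accepts a plain (user, password) tuple for basic auth, shorthand for HTTPBasicAuth:
import requests
r = requests.get('http://120.27.34.24:9001', auth=('user', '123'))  # tuple shorthand for HTTPBasicAuth
print(r.status_code)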
Exception handling
import requests
from requests.exceptions import ReadTimeout, HTTPError, RequestException
try:
    response = requests.get('http://httpbin.org/get', timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
except HTTPError:
    print('HTTPError')
except RequestException:
    print('RequestException')
Basic usage (fetching a page for BeautifulSoup)
import requests
response = requests.get('https://www.douban.com/tag/%E5%B0%8F%E8%AF%B4/?focus=book')
html = response.text  # HTML source of the whole page
Basic operations
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())  # pretty-print the HTML
print(soup.title.string)  # text of the title tag
Tag selectors
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
# print(soup.head)
print(soup.p)  # selects only the first matching tag
Getting the tag name
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)
Getting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['class'])  # via the attrs dict
print(soup.p['name'])  # shorthand, equivalent to attrs access
Getting the text content
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)
Nested selection
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)
Child and descendant nodes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)  # the tag's direct children, as a list
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)  # unlike contents, children is an iterator
for i, child in enumerate(soup.p.children):
    print(i, child)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)  # an iterator over all descendant nodes
for i, child in enumerate(soup.p.descendants):
    print(i, child)
Parent and ancestor nodes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)  # get the parent node
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.parents)))  # get all ancestor nodes
Sibling nodes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))  # siblings after this tag
print(list(enumerate(soup.a.previous_siblings)))  # siblings before this tag
Standard selectors
find_all(name, attrs, recursive, text, **kwargs)
find(name, attrs, recursive, text, **kwargs)
find returns a single element; find_all returns all matching elements.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))  # returns a list
print(type(soup.find_all('ul')[0]))
# nested find_all
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
attrs
# search by attribute
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'class':'more-items'}))
print(soup.find_all(class_='more-items'))
print(soup.find_all(id='id_name'))
Selecting by text
# select nodes by their text content
print(soup.find_all(text='Foo'))
CSS selectors
Pass a CSS selector directly to select() to make a selection.
print(soup.select('.panel .panel-heading'))  # class selectors start with a dot
print(soup.select('ul li'))  # li tags inside ul tags
print(soup.select('#list2'))  # id selectors start with #
print(soup.select('#list2 .element'))  # elements with class 'element' under id 'list2'
for ul in soup.select('ul'):
    print(ul.select('li'))
Getting attributes
for a in soup.select('a'):
    print(a['href'])
    print(a.attrs['href'])  # equivalent to a['href']
    print(a['id'])
Getting the text
for a in soup.select('a'):
    print(a.get_text())
Reference: https://www.cnblogs.com/xiao-apple36/p/8830547.html#_label2
Suppose we have fetched some data and need to clean it. Here we take data in the format (text \t unique flag) as an example (just a simple illustration; the same methods apply wherever data needs processing):
他所到之处都是人山人海堪比追星现场。大家纷纷拿起手机拍照留念,争取留下这中国互联网史上历史性的一刻。\t flag
他所到之处都是人山人海堪比追星现场。\t flag
First, observe the data: each line contains a passage whose end carries a unique flag. Step one is to split each passage into individual sentences, which we call 分句 (sentence splitting), i.e., into the format below:
他所到之处都是人山人海堪比追星现场。\t flag
大家纷纷拿起手机拍照留念,争取留下这中国互联网史上历史性的一刻。\t flag
他所到之处都是人山人海堪比追星现场。\t flag
The split sentences are quite jumbled and no pattern is visible, so we sort them first, using a Linux command:
LC_ALL=C sort -u file > file_new
After sorting, the patterns are much easier to observe.
Next, clean the sentences with regular expressions.
After cleaning, the data format is (sentence \t flag), two columns. We deduplicate by the first column using the Linux awk command:
awk -F '\t' '{array_tmp[$1]=$0}END{for(i in array_tmp){print array_tmp[i]}}' fromfile > tofile
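For reference, the same dedup step can be sketched in Python (file names are placeholders; as in the awk version, a later line with the same first column overwrites an earlier one):
# deduplicate by the first tab-separated column, keeping the last occurrence of each key
array_tmp = {}
with open('fromfile', encoding='UTF-8') as f:
    for line in f:
        array_tmp[line.split('\t')[0]] = line
with open('tofile', 'w', encoding='UTF-8') as f:
    f.writelines(array_tmp.values())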
The main data-cleaning pipeline, then, is: split, sort, clean, dedup. In the project we create the directory layout below:
[Figure: project directory layout — 01_raw/, 02_split/, 03_sort/, 04_clean/, 05_last/ (reconstructed from the script paths)]
The sentence-splitting function:
def sentence_split(str_sentence):
    pattern = [";", "?", "!", "。"]  # sentence-ending punctuation
    seg_line = []
    index_d = []   # indices where sentences end
    flag = False   # True while inside quotes or parentheses
    line = str_sentence.strip()
    for i in range(len(line)):
        if line[i] == "“" or line[i] == "(":
            flag = True
        elif line[i] == "”" or line[i] == ")":
            flag = False
            # a sentence may end right before the closing quote/parenthesis
            if line[i - 1] in pattern:
                index_d.append(i)
        if not flag:
            if line[i] in pattern and i + 1 < len(line):
                index_d.append(i)
    index_d.insert(0, 0)
    index_d.append(len(line))
    # slice the line into sentences at the recorded indices
    for num in range(len(index_d) - 1):
        if num == 0:
            seg_line.append(line[index_d[num]:index_d[num + 1] + 1])
        else:
            seg_line.append(line[index_d[num] + 1:index_d[num + 1] + 1])
    return seg_line
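Applied to the sample passage above, the function splits at the Chinese full stop:
line = '他所到之处都是人山人海堪比追星现场。大家纷纷拿起手机拍照留念,争取留下这中国互联网史上历史性的一刻。'
for s in sentence_split(line):
    print(s)
# 他所到之处都是人山人海堪比追星现场。
# 大家纷纷拿起手机拍照留念,争取留下这中国互联网史上历史性的一刻。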
# Chinese/English check: returns False if the text is entirely English (contains no Chinese characters)
def exist_ch(input_for_exist):
    for uchar in input_for_exist:
        if (uchar >= u'\u4e00') and (uchar <= u'\u9fa5'):
            return True
    return False
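A quick usage check:
print(exist_ch('hello world'))  # False: no Chinese characters
print(exist_ch('你好 world'))    # True: contains Chinese characters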
Zero-width characters are invisible in program output and in most editors, but vim on Linux can show them. You can inspect the character codes yourself and then delete by code:
for line in open("test"):
    for i in line:
        print(i)
        print(ord(i))
The program above turned up two zero-width characters, with codes 8205 (U+200D, zero-width joiner) and 8203 (U+200B, zero-width space). Remove them by code:
with open('./05_last/test.res', 'a', encoding='UTF-8') as f:
    for line in open('./05_last/test', encoding='UTF-8'):
        for w in line:
            if ord(w) in [8205, 8203]:
                line = line.replace(w, '')
        f.write(line)
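Equivalently, a one-line regex sketch removes both code points (8203 = U+200B zero-width space, 8205 = U+200D zero-width joiner):
import re
line = re.sub('[\u200b\u200d]', '', line)  # strip U+200B and U+200D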
01_split.py
filename_raw = './01_raw/00_filename'
filename_split = './02_split/00_split'

def sentence_split(str_sentence):
    # same splitting function as above
    pattern = [";", "?", "!", "。"]
    seg_line = []
    index_d = []
    flag = False
    line = str_sentence.strip()
    for i in range(len(line)):
        if line[i] == "“" or line[i] == "(":
            flag = True
        elif line[i] == "”" or line[i] == ")":
            flag = False
            if line[i - 1] in pattern:
                index_d.append(i)
        if not flag:
            if line[i] in pattern and i + 1 < len(line):
                index_d.append(i)
    index_d.insert(0, 0)
    index_d.append(len(line))
    for num in range(len(index_d) - 1):
        if num == 0:
            seg_line.append(line[index_d[num]:index_d[num + 1] + 1])
        else:
            seg_line.append(line[index_d[num] + 1:index_d[num + 1] + 1])
    return seg_line

with open(filename_split, 'a', encoding='UTF-8') as f:
    for line in open(filename_raw, encoding='UTF-8'):
        sentence = sentence_split(line)
        if len(sentence) == 1:
            f.write(sentence[0] + '\n')
        else:
            # the last segment holds the trailing flag; append it to every sentence
            for s in sentence[:-1]:
                f.write(s + sentence[-1] + '\n')
01_split.py (revised)
filename_raw = './01_raw/08_data6'
filename_split = './02_split/08_split6'

def sentence_split(line):
    end_w = [';', '。', '!', '?', ';', '?', '!']  # sentence-ending punctuation
    l = len(line)
    flag = ''
    # peel the trailing flag (everything after the last '\t') off the end of the line
    for i in range(l):
        if line[l - 1 - i] == '\n':
            line = line[:-1]
            continue
        if line[l - 1 - i] != '\t':
            flag += line[l - 1 - i]
            line = line[:-1]
        else:
            line = line[:-1]
            break
    flag = flag[::-1]  # the flag was collected in reverse
    sentences = []
    sentence = ''
    for w in line:
        if w in end_w:
            sentence += w
            sentences += [sentence]
            sentence = ''
        else:
            sentence += w
    sentences += [sentence]
    # re-attach the flag to every sentence
    for i in range(len(sentences)):
        sentences[i] = sentences[i] + '\t' + flag
    return sentences

with open(filename_split, 'a', encoding='UTF-8') as f:
    for line in open(filename_raw, encoding='UTF-8'):
        sentences = sentence_split(line)
        for sentence in sentences:
            f.write(sentence + '\n')
02_clean.py
import re

filename_sort = './03_sort/00_filename'
filename_clean = './04_clean/00_clean'

# characters worth keeping
pattern = r'[\u4e00-\u9fa5a-zA-Z0-9,。《》()\(\)\.@\t\n\"\??~!!\'“”【】%]+'
# email addresses, to be deleted
pattern_email = r'[0-9a-zA-Z_]{0,19}@[0-9a-zA-Z]{1,13}\.[com,cn,net]{1,3}'
# junk at the start of a sentence
pattern_start = r'^(([\((][0-9]+[\))])|(,)|(())|([0-9]+[)\).。]\s*)|(\t+?)|([\((][一二三四五六七八九十]{1}[\))]))'
# numbered parentheses such as (1)(2)
pattern_braces_digit = r'[\((][0-9]+[\))]'
# sentences that begin with junk and should be dropped entirely
pattern_del_start = r'^(([a-zA-Z])|(()|(参阅)|([0-9]{2}年)|(原文链接)|(()|(!)|(?))'

# Chinese/English check: returns False if the text contains no Chinese characters
def exist_ch(input_for_exist):
    for uchar in input_for_exist:
        if (uchar >= u'\u4e00') and (uchar <= u'\u9fa5'):
            return True
    return False

with open(filename_clean, 'a', encoding='UTF-8') as f:
    for line in open(filename_sort, encoding='UTF-8'):
        line = ''.join(re.findall(pattern, line))  # keep only characters matching the pattern
        line = re.sub(pattern_email, '', line)     # delete substrings matching the pattern
        line = re.sub(pattern_start, '', line)
        line = re.sub(pattern_braces_digit, '', line)
        if len(line) < 25 or re.match(pattern_del_start, line) or not exist_ch(line):
            continue
        f.write(line)
Common pattern meanings
| Regex | Matches |
| --- | --- |
| [\u4e00-\u9fa5] | Chinese characters |
| [0-9a-zA-Z_]{0,19}@[0-9a-zA-Z]{1,13}\.[com,cn,net]{1,3} | Email addresses |
| [\u4e00-\u9fa5a-zA-Z0-9,。《》()\(\)\.@\t\n\"\??~!!\'“”【】]+ | Useful characters to keep |
Common regex patterns
pattern = r'[\u4e00-\u9fa5a-zA-Z0-9,。《》()\(\)\.@\t\n\"\??~!!\'“”【】]+'
line = ''.join(re.findall(pattern, line))
pattern_email = r'[0-9a-zA-Z_]{0,19}@[0-9a-zA-Z]{1,13}\.[com,cn,net]{1,3}'
line = re.sub(pattern_email, '', line)
pattern_start = r'^(([\((][0-9]+[\))])|(,)|(())|([0-9]+[)\).。]\s*))'
line = re.sub(pattern_start, '', line)
pattern_del_start = r'^(([a-zA-Z])|(()|(参阅)|([0-9]{2}年)|(原文链接)|(()|(!)|(?)|(编译)|(编辑)|(本文))'
if len(line) < 22 or re.match(pattern_del_start, line):
    continue
pattern1 = r'【\w+】'  # bracketed tags such as 【标签】
line = re.sub(pattern1, '', line)
pattern_braces_digit = r'[\((][0-9]+[\))]'  # numbered parentheses such as (1)
line = re.sub(pattern_braces_digit, '', line)
pattern_braces = r'[\((](本文|文章|来源|摘自|作者|原稿|[a-zA-Z\s.]*).*[\))]'  # parenthesized source notes
line = re.sub(pattern_braces, '', line)
pattern_digit = r'[0123456789]+[、\.\))]\s*'  # numbered list markers such as 1、 2.
line = re.sub(pattern_digit, '', line)
pattern_star = r'\*+\s*'  # leading asterisks
line = re.sub(pattern_star, '', line)
pattern_url = r'[a-zA-Z]+://[^\s]*'  # URLs (note: a-zA-Z, not the buggy a-zA-z)
line = re.sub(pattern_url, '', line)
Common Linux commands
ll or ls  # list directory contents
scp filename path  # copy a file
scp -r filename account@ip:/home/zhandong  # copy recursively to a remote host
scp -r * username@00.00.00.000:/home/dirname
rm filename  # delete a file
References:
https://blog.csdn.net/sinat_24648637/article/details/84191373
from selenium import webdriver  # import the selenium module first; install it if you haven't
chrome = webdriver.Chrome(executable_path='../chromedriver.exe')
chrome.get('https://www.toutiao.com/c/user/6466299604/#mid=6466299604')  # Toutiao link
ascp = chrome.execute_script('return ascp.getHoney()')  # get the as and cp values used in the request URL
# user_id and max_behot_time are assumed to be obtained beforehand from the page's requests
signature = chrome.execute_script('return TAC.sign(' + str(user_id) + str(max_behot_time) + ')')  # get the signature value for the request URL
print(ascp)
# print(signature)