- Disclaimer:
- All code and articles in this project are shared solely for technical learning and exchange. Applying these techniques to improper ends is forbidden, and any risk arising from misuse of them is not my responsibility.
- This article comes purely from personal interest and serves no other purpose; if it infringes on any rights, contact me for removal.
Last time I briefly surveyed the history of font-based anti-scraping (https://blog.csdn.net/Owen_goodman/article/details/105490137).
This post puts that analysis into practice.
url: https://maoyan.com/board/1
Page analysis:
It is the same situation as in the previous article; if you don't believe it, take another look at the detail page:
Following the previous article's analysis, we go straight to how the page processes, that is, obfuscates, the font.
We can see that the box-office numbers have been replaced, using a @font-face named stonefont. Searching the page source for stonefont:
From that woff URL we download the woff file and open it with FontCreator, which shows this:
Refresh the page: a dynamic font file combined with dynamic glyph coordinates matches stage three of font anti-scraping described in the previous article, so this time we'll crack it with the KNN idea!
If you're still skeptical, grab several of the font files and use matplotlib to plot the same digit from each of them by its coordinates:
```python
import matplotlib.pyplot as plt
import re

# Digit 4, taken from two different font files: uniF728 and uniED17.
# Outline of the first one, for comparison:
'''
<contour>
<pt x="332" y="-27" on="1"/>
<pt x="331" y="149" on="1"/>
<pt x="13" y="149" on="1"/>
<pt x="13" y="222" on="1"/>
<pt x="338" y="707" on="1"/>
<pt x="421" y="697" on="1"/>
<pt x="421" y="247" on="1"/>
<pt x="521" y="243" on="1"/>
<pt x="520" y="143" on="1"/>
<pt x="425" y="149" on="1"/>
<pt x="423" y="-26" on="1"/>
<pt x="331" y="-26" on="1"/>
</contour>
'''
# Outline of the second one, which we plot below:
contour_xml = """
<contour>
<pt x="319" y="-26" on="1"/>
<pt x="331" y="140" on="1"/>
<pt x="13" y="143" on="1"/>
<pt x="13" y="232" on="1"/>
<pt x="353" y="709" on="1"/>
<pt x="421" y="707" on="1"/>
<pt x="421" y="224" on="1"/>
<pt x="520" y="236" on="1"/>
<pt x="526" y="140" on="1"/>
<pt x="421" y="151" on="1"/>
<pt x="421" y="-22" on="1"/>
<pt x="340" y="-16" on="1"/>
</contour>
"""
x = [int(i) for i in re.findall(r'<pt x="(.*?)" y=', contour_xml)]
y = [int(i) for i in re.findall(r'y="(.*?)" on=', contour_xml)]
plt.plot(x, y)
plt.show()
```
The evidence is conclusive!
One thing I forgot to mention: where did those digit coordinates come from? They come from converting the font file into XML, with the following code:
```python
# Use TTFont to convert .woff files into XML (the goal is to inspect each glyph's coordinates)
from pathlib import Path
from fontTools.ttLib import TTFont

font1_path = Path(__file__).absolute().parent/"fonts/font_1.xml"
font2_path = Path(__file__).absolute().parent/"fonts/font_2.xml"
woff1_path = Path(__file__).absolute().parent/"fonts/c6bf83459074415cf2518fa0597ada382276.woff"
# NOTE: point this at a second, different .woff file when comparing two fonts
woff2_path = Path(__file__).absolute().parent/"fonts/c6bf83459074415cf2518fa0597ada382276.woff"
font_1 = TTFont(woff1_path)
font_2 = TTFont(woff2_path)
font_1.saveXML(font1_path)
font_2.saveXML(font2_path)
```
The converted XML looks like this:
The KNN algorithm is straightforward. We split the samples into a training set and a test set and fit the model on the training set.
At prediction time, for each input (a digit's coordinate vector) we compute its distance to every training sample;
the nearest sample's label is taken as the prediction.
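As a minimal sketch of that nearest-neighbor idea (the coordinate vectors and labels below are invented for illustration, not taken from a real font):

```python
import numpy as np

# Toy nearest-neighbor lookup: each row is a flattened glyph outline,
# labeled with the digit it draws (made-up sample data).
train_x = np.array([[13, 149, 421, 707],
                    [39, 316, 522, 211]], dtype=float)
train_y = np.array([4, 6])

sample = np.array([13, 143, 421, 709], dtype=float)  # glyph to identify
dists = np.linalg.norm(train_x - sample, axis=1)     # Euclidean distance to each training row
predicted = train_y[np.argmin(dists)]                # label of the nearest neighbor
print(predicted)  # 4
```

Because the same digit keeps roughly the same outline across the dynamically generated fonts, the nearest training glyph is almost always the right one.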
We'll use scikit-learn for the KNN part:
First, save the glyph coordinates of every character into one list (taking care to preserve the correspondence between each character and its coordinates):
```python
# -*- encoding: utf-8 -*-
'''
@File : font.py
@Time : 2020/4/13 10:00:00
@Author : xahoo
@PythonVersion : 3.6
@purpose : for every glyph in every sampled font file, collect the digit it represents plus its coordinates (stored in one list)
'''
import random
import re
import time
from pathlib import Path  # worth getting into the habit of using pathlib (the newer module)
import requests
from fontTools.ttLib import TTFont

# default font folder, default url, default headers
_fonts_path = Path(__file__).absolute().parent/"fonts"
_brand_url = "https://maoyan.com/board/1"
_headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'
}
_proxies = random.choice([
    {"HTTP": "your proxy IP"},
])
'''
Only needs to run once, not on every execution:
# 1. grab the .woff font file referenced on the board page
def get_font_content():
    response = requests.get(url=_brand_url, headers=_headers, proxies=_proxies)
    # print(response.text)
    woff_url = re.findall(r"url\('(.*?\.woff)'\)", response.text)[0]
    time.sleep(1)
    font_url = f"http:{woff_url}"
    return requests.get(font_url).content

# 2. save several font files into the fonts folder
def save_font():
    for i in range(5):
        font_content = get_font_content()
        with open(_fonts_path/f'{i+1}.woff', "wb") as f:
            f.write(font_content)

save_font()
'''
# A font's glyph order is a logical mapping between glyph indices and Unicode code points.
# Is that what we are after? Check: pick any number's Unicode representation,
# map it through the table, and you get 6.52, exactly the number shown on the page.
# So we only need the font file's glyph order, and fontTools provides a ready-made
# method for it: getGlyphOrder().
'''
<GlyphOrder>
  <!-- The 'id' attribute is only for humans; it is ignored when parsed. -->
  <GlyphID id="0" name="glyph00000"/>
  <GlyphID id="1" name="x"/>
  <GlyphID id="2" name="uniF7DA"/>
  <GlyphID id="3" name="uniEA22"/>
  <GlyphID id="4" name="uniF10B"/>
  <GlyphID id="5" name="uniE46D"/>
  <GlyphID id="6" name="uniE1FE"/>
  <GlyphID id="7" name="uniF80D"/>
  <GlyphID id="8" name="uniEB29"/>
  <GlyphID id="9" name="uniF878"/>
  <GlyphID id="10" name="uniF783"/>
  <GlyphID id="11" name="uniF27F"/>
</GlyphOrder>
'''
# collect, for every glyph, the digit it represents plus its coordinates
def get_coor_info(font, cli):
    # font: a parsed font file (.woff)
    # cli: the digit order inside that file, e.g. [6, 7, 4, 9, 1, 2, 5, 0, 3, 8]
    glyf_order = font.getGlyphOrder()[2:]  # skip 'glyph00000' and 'x'
    info = list()
    for i, g in enumerate(glyf_order):
        coors = font['glyf'][g].coordinates  # every (x, y) point of the glyph
        # GlyphCoordinates([(420, 521),(408, 574),(386, 597),(349, 635),(302, 635),(248, 635),(220, 612),(177, 580),(154, 531),(141, 492),(137, 449),(128, 407),(128, 352),(161, 402),(254, 449),(306, 449),(395, 449),(522, 316),(522, 211),(522, 143),(493, 83),(463, 36),(412, -7),(360, -39),(284, -39),(180, -39),(39, 124),(39, 317),(39, 530),(117, 619),(185, 710),(301, 710),(388, 710),(443, 661),(498, 614),(510, 528),(420, 518),(142, 214),(143, 166),(154, 122),(182, 79),(254, 35),(292, 29),(347, 41),(386, 81),(430, 127),(430, 206),(428, 287),(381, 326),(349, 370),(228, 370),(142, 282),(142, 211)])
        '''
        <contour>
        <pt x="420" y="521" on="1"/>
        <pt x="408" y="574" on="0"/>
        <pt x="386" y="597" on="1"/>
        <pt x="349" y="635" on="0"/>
        <pt x="298" y="635" on="1"/>
        <pt x="253" y="635" on="0"/>
        <pt x="220" y="612" on="1"/>
        <pt x="177" y="580" on="0"/>
        <pt x="154" y="522" on="1"/>
        <pt x="141" y="492" on="0"/>
        <pt x="144" y="444" on="1"/>
        <pt x="128" y="407" on="0"/>
        <pt x="128" y="352" on="1"/>
        <pt x="161" y="402" on="0"/>
        <pt x="254" y="449" on="0"/>
        <pt x="306" y="457" on="1"/>
        <pt x="395" y="449" on="0"/>
        <pt x="459" y="382" on="1"/>
        <pt x="522" y="300" on="0"/>
        <pt x="522" y="211" on="1"/>
        <pt x="522" y="147" on="0"/>
        <pt x="493" y="83" on="1"/>
        <pt x="463" y="24" on="0"/>
        <pt x="360" y="-39" on="0"/>
        <pt x="293" y="-39" on="1"/>
        <pt x="193" y="-40" on="0"/>
        <pt x="112" y="43" on="1"/>
        <pt x="39" y="136" on="0"/>
        <pt x="39" y="316" on="1"/>
        <pt x="39" y="530" on="0"/>
        <pt x="117" y="626" on="1"/>
        <pt x="185" y="710" on="0"/>
        <pt x="301" y="710" on="1"/>
        <pt x="399" y="710" on="0"/>
        <pt x="443" y="650" on="1"/>
        <pt x="498" y="611" on="0"/>
        <pt x="510" y="528" on="1"/>
        '''
        coors = [_ for c in coors for _ in c]  # flatten to [x1, y1, x2, y2, ...]
        # [420, 521, 408, 574, 386, 597, 349, 635, 302, 635, 248, 635, 220, 612, 177, 580, 154, 531, 141, 492, 137, 449, 128, 407, 128, 352, 161, 402, 254, 449, 306, 449, 395, 449, 522, 316, 522, 211, 522, 143, 493, 83, 463, 36, 412, -7, 360, -39, 284, -39, 180, -39, 39, 124, 39, 317, 39, 530, 117, 619, 185, 710, 301, 710, 388, 710, 443, 661, 498, 614, 510, 528, 420, 518, 142, 214, 143, 166, 154, 122, 182, 79, 254, 35, 292, 29, 347, 41, 386, 81, 430, 127, 430, 206, 428, 287, 381, 326, 349, 370, 228, 370, 142, 282, 142, 211]
        coors.insert(0, cli[i])  # prepend the digit label
        # [6, 420, 521, 408, 574, 386, 597, 349, 635, 302, 635, 248, 635, 220, 612, 177, 580, 154, 531, 141, 492, 137, 449, 128, 407, 128, 352, 161, 402, 254, 449, 306, 449, 395, 449, 522, 316, 522, 211, 522, 143, 493, 83, 463, 36, 412, -7, 360, -39, 284, -39, 180, -39, 39, 124, 39, 317, 39, 530, 117, 619, 185, 710, 301, 710, 388, 710, 443, 661, 498, 614, 510, 528, 420, 518, 142, 214, 143, 166, 154, 122, 182, 79, 254, 35, 292, 29, 347, 41, 386, 81, 430, 127, 430, 206, 428, 287, 381, 326, 349, 370, 228, 370, 142, 282, 142, 211]
        # print(coors)
        info.append(coors)
    return info
# get_coor_info(TTFont(_fonts_path/"1.woff"), [6, 7, 4, 9, 1, 2, 5, 0, 3, 8])

def get_font_data():
    font_1 = TTFont(_fonts_path/"1.woff")
    cli_1 = [6, 7, 4, 9, 1, 2, 5, 0, 3, 8]
    coor_info_1 = get_coor_info(font_1, cli_1)

    font_2 = TTFont(_fonts_path/"2.woff")
    cli_2 = [1, 3, 2, 7, 6, 8, 9, 0, 4, 5]
    coor_info_2 = get_coor_info(font_2, cli_2)

    font_3 = TTFont(_fonts_path/"3.woff")
    cli_3 = [5, 8, 3, 0, 6, 7, 9, 1, 2, 4]
    coor_info_3 = get_coor_info(font_3, cli_3)

    font_4 = TTFont(_fonts_path/"4.woff")
    cli_4 = [9, 3, 4, 8, 7, 5, 2, 1, 6, 0]
    coor_info_4 = get_coor_info(font_4, cli_4)

    font_5 = TTFont(_fonts_path/"5.woff")
    cli_5 = [1, 5, 8, 0, 7, 9, 6, 3, 2, 4]
    coor_info_5 = get_coor_info(font_5, cli_5)

    # concatenate the five per-font lists
    infos = coor_info_1 + coor_info_2 + coor_info_3 + coor_info_4 + coor_info_5
    return infos

print(get_font_data())
```
Let's first test the prediction accuracy.
Normally, given sample data, you handle missing values first, then separate features from labels, then split the samples into a training set and a test set, then standardize them, and finally train and predict.
Here, since we collected only a few font files, a random split could leave some character out of the training set and inflate the test error, so we fix the first 40 samples as the training set and the last 10 as the test set.
Also, repeated tests showed that standardizing at this point hurts the success rate, so it is skipped. The k value is set to 1, meaning a sample is judged to be the same type, that is, the same character, as its single nearest neighbor; what value k should take only becomes clear through experimentation. In the end all 10 test samples, covering the digits 0-9, were predicted correctly: a 100% success rate.
```python
# -*- encoding: utf-8 -*-
'''
@File : knn_test.py
@Time : 2020/4/13 11:00:00
@Author : xahoo
@PythonVersion : 3.6
@purpose : the digit + coordinates of every glyph have been collected; now test the accuracy of a KNN classifier trained on them
'''
import numpy as np
import pandas as pd
from font import get_font_data
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

def main():
    # Handle missing values (basic ML preprocessing; newer sklearn versions use SimpleImputer).
    # Differences between the old and new APIs: http://www.bubuko.com/infodetail-2926071.html
    # missing_values: what counts as missing; strategy: how to fill, mean by default.
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

    # separate features and labels
    data = pd.DataFrame(imputer.fit_transform(pd.DataFrame(get_font_data())))
    # print(data)
    x = data.drop([0], axis=1)
    y = data[0]
    # print(x, y)

    # Split the data set.
    # x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)  # random split
    # With so few font samples, a random split can leave some character out of the
    # training set and inflate the test error, so fix the first 40 samples as the
    # training set and the last 10 as the test set.
    x_train = x.head(40)
    y_train = y.head(40)
    x_test = x.tail(10)
    y_test = y.tail(10)

    # Standardization: repeated tests showed it hurts the success rate here, so skip it.
    # k is set to 1, i.e. a sample gets the label of its single nearest neighbor; the
    # right k only emerges through trial. The 10 test samples cover the digits 0-9.
    # std = StandardScaler()
    # x_train = std.fit_transform(x_train)
    # x_test = std.transform(x_test)

    # set up the algorithm
    knn = KNeighborsClassifier(n_neighbors=1)

    # train
    knn.fit(x_train, y_train)

    # predict
    y_predict = knn.predict(x_test)
    print(y_test)
    print(y_predict)

    # accuracy
    print(knn.score(x_test, y_test))

main()
```
Our test accuracy here is 100%. Now let's tidy up the code.
```python
# -*- encoding: utf-8 -*-
'''
@File : knn_font.py
@Time : 2020/4/13 13:30:00
@Author : xahoo
@PythonVersion : 3.6
@purpose : cleaned-up version of the test file knn_test.py
'''
import numpy as np
import pandas as pd
from font import get_font_data
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier

class Classify(object):
    def __init__(self):
        self.len = None
        self.knn = self.get_knn()

    def process_data(self, data):
        # Handle missing values (newer sklearn versions use SimpleImputer).
        # Differences between the old and new APIs: http://www.bubuko.com/infodetail-2926071.html
        imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
        return pd.DataFrame(imputer.fit_transform(pd.DataFrame(data)))

    def get_knn(self):
        # separate features and labels from get_font_data()
        data = self.process_data(get_font_data())
        x_train = data.drop([0], axis=1)
        y_train = data[0]

        # train a 1-nearest-neighbor classifier
        knn = KNeighborsClassifier(n_neighbors=1)
        knn.fit(x_train, y_train)

        self.len = x_train.shape[1]
        return knn

    def knn_predict(self, data):
        # Glyph outlines have varying point counts, so pad each sample with
        # zero columns up to the feature width seen during training.
        df = pd.DataFrame(data)
        pad = pd.DataFrame(np.zeros((df.shape[0], self.len - df.shape[1])),
                           columns=range(df.shape[1], self.len))
        data = self.process_data(pd.concat([df, pad], axis=1))
        y_predict = self.knn.predict(data)
        return y_predict
```
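A quick self-contained check of the zero-padding step in knn_predict (the training width of 6 and the 4-value sample below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Glyph outlines have different point counts, so a shorter coordinate vector
# must be padded with zero columns to match the training feature width.
train_width = 6
df = pd.DataFrame([[332, -27, 331, 149]])  # a 4-feature sample
pad = pd.DataFrame(np.zeros((df.shape[0], train_width - df.shape[1])),
                   columns=range(df.shape[1], train_width))
padded = pd.concat([df, pad], axis=1)
print(padded.shape)  # (1, 6)
```

Concatenating along axis=1 aligns the two frames by row index, so each sample simply gains trailing zero columns.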
Then the final script, which uses the trained classifier to scrape and decode the board page:
```python
# -*- encoding: utf-8 -*-
'''
@File : knn_test.py
@Time : 2020/4/13 11:00:00
@Author : xahoo
@PythonVersion : 3.6
@purpose : with the classifier trained, scrape the site and verify the model decodes the data correctly; the final code
'''
import random
import time
import re
from io import BytesIO
from pathlib import Path
import requests
from lxml import etree
from fontTools.ttLib import TTFont
from knn_font import Classify

_woff_path = Path(__file__).absolute().parent/"fonts"/"test.woff"
_board_url = "https://maoyan.com/board/1"
_headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'
}
_proxies = random.choice([
    {"HTTP": "your proxy IP"},
])
_classify = Classify()

# Build the mapping: a dict from each glyph's Unicode entity to its predicted digit.
def get_map(text: str):
    woff_url = re.findall(r"url\('(.*?\.woff)'\)", text)[0]
    time.sleep(1)
    font_url = f"http:{woff_url}"
    content = requests.get(font_url).content
    # save the freshly fetched .woff file used for this mapping test
    with open(_woff_path, 'wb') as f:
        f.write(content)
    # BytesIO lets us parse the binary content in memory
    font = TTFont(BytesIO(content))
    glyf_order = font.getGlyphOrder()[2:]
    info = []
    for g in glyf_order:
        coors = font['glyf'][g].coordinates
        coors = [_ for c in coors for _ in c]
        info.append(coors)
    map_li = map(lambda x: str(int(x)), _classify.knn_predict(info))  # predicted digit per glyph
    uni_li = map(lambda x: x.lower().replace('uni', '&#x') + ';', glyf_order)
    return dict(zip(uni_li, map_li))


# scrape the page, substitute the entities, and print the decoded data
def get_info():
    text = requests.get(url=_board_url, headers=_headers, proxies=_proxies).text
    map_dict = get_map(text=text)
    for uni in map_dict.keys():
        text = text.replace(uni, map_dict[uni])
    html = etree.HTML(text)
    post_li = html.xpath('//*[@id="app"]/div/div/div/dl/dd/a/@href')
    # print(post_li)
    for i, post in enumerate(post_li):
        name = html.xpath('//*[@id="app"]/div/div/div/dl/dd/a/@title')[i]
        starring = html.xpath('//*[@id="app"]/div/div/div/dl/dd/div/div/div[1]/p[2]//text()')[i]
        releasetime = html.xpath('//*[@id="app"]/div/div/div/dl/dd/div/div/div[1]/p[3]//text()')[i]
        total_box_office = html.xpath('//*[@id="app"]/div/div/div/dl/dd/div/div/div[2]/p/span/span//text()')[i]
        print("".join(name), " ", "".join(starring), " ", "".join(releasetime),
              "".join(total_box_office).replace(" ", "").replace("\n", ""))

get_info()
```
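To make the substitution step in get_info concrete, here is a toy run of the entity-to-digit replacement (the entity codes, labels, and HTML snippet are made up; the real dict comes from get_map):

```python
# Hypothetical mapping as get_map() would return it: entity -> predicted digit.
map_dict = {'&#xf7da;': '6', '&#xea22;': '5'}
text = '<span class="stonefont">&#xf7da;.&#xea22;</span>'
for uni, digit in map_dict.items():
    text = text.replace(uni, digit)
print(text)  # <span class="stonefont">6.5</span>
```

After this pass the page source contains plain digits, so ordinary XPath extraction reads the true box-office numbers.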
And the box-office figures are captured perfectly. Just tidying this up took two days; even a small crawler demands a lot of learning.
References:
https://cloud.tencent.com/developer/article/1525768
https://cloud.tencent.com/developer/article/1553787
https://blog.csdn.net/weixin_43116910/article/details/103439930