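Visualizing shot data is essentially a scatter plot in which the size of each point varies with its expected-goals value (xG, the predicted probability of scoring), which makes the chart easier to read at a glance.
1. The league data site https://understat.com
The shot data for football players comes from https://understat.com. Open the home page and search for "Mbappe" (see Figure 1).
Figure 1 Searching from the https://understat.com home page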
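Open the Kylian Mbappé page. His player_id is 3423, so the page URL is https://understat.com/player/3423. The site provides league data from the 2014/2015 season to the present. The page to scrape is https://understat.com/player/{player_id}, where player_id is 2371 for Cristiano Ronaldo, 2097 for Messi, 2099 for Neymar, and 3423 for Mbappé. Each shot record includes the shot position (X, Y), expected goals (xG, the probability of scoring), the shot outcome (result), the shot type (shotType), and the season (season).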
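As a small illustration of the URL pattern above, the sketch below builds the page address for each of the four players. The names PLAYER_IDS, BASE_URL and player_url() are chosen for this example only; the IDs are the ones listed in the text.

```python
# Player IDs as listed in the article; the variable and function names
# are illustrative and not part of any library.
PLAYER_IDS = {
    "Cristiano Ronaldo": 2371,
    "Lionel Messi": 2097,
    "Neymar": 2099,
    "Kylian Mbappé": 3423,
}

BASE_URL = "https://understat.com/player/{player_id}"


def player_url(player_id: int) -> str:
    """Return the understat player page URL for a given player_id."""
    return BASE_URL.format(player_id=player_id)


for name, pid in PLAYER_IDS.items():
    print(f"{name}: {player_url(pid)}")
```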
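The shot outcome (result) is one of five kinds: 1) Goal; 2) Shot on post (hit the goalpost); 3) Saved shot (saved by the goalkeeper); 4) Blocked shot (blocked by a player); 5) Missed shot (off target).
The shot type (shotType) is one of: header, left foot, right foot, or another part of the body.
Mbappé's data starts with the 2015/2016 season; the current season is 2022/2023 (see Figure 2).
Figure 2 The Kylian Mbappé page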
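2. Analyzing the page
Right-click the page and view its source. Several very long string variables sit inside <script>...</script> tags.
The fourth <script> tag, in document order, holds the shot data (see Figure 3).
Figure 3 Page source (excerpt)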
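What we want to extract is the text between the quotes in this structure:
<script>
var shotsData = JSON.parse('...')
</script>
That text is JSON. Note that JSON is a string format: it looks much like a dictionary, but it is not a Python dict. To Python it is just a string, which the json module can convert.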
json.loads() ==> converts a JSON string into a dict or a list of dicts
json.dumps() ==> converts a dict or a list of dicts into a JSON string
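A minimal round trip through the two functions might look like this. The JSON text is made up for illustration; only json.loads() and json.dumps() come from the article.

```python
import json

# Made-up JSON text for illustration; note the JSON literal null.
json_text = '{"player": "Kylian Mbappe", "xG": 0.5, "player_assisted": null}'

record = json.loads(json_text)      # JSON string -> Python dict
print(type(record).__name__)        # the parsed value is a dict
print(record["player_assisted"])    # JSON null becomes Python None

round_trip = json.dumps(record)     # Python dict -> JSON string
print(type(round_trip).__name__)    # the serialized value is a str
```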
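JSON has two structural forms: objects and arrays.
An object starts with "{" and ends with "}". In between are key/value pairs separated by ",":
{
    "key1": value1,
    "key2": value2,
    ...
}
Keys must be strings in double quotes. A value can be a string, a number, a boolean, an object, an array, or null.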
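An array starts with "[" and ends with "]". Its elements are separated by ","; here the elements are objects:
[
    {
        "key1": value1,
        "key2": value2
    },
    {
        "key3": value3,
        "key4": value4
    }
]
In Python this maps to a list whose elements are dicts (two-dimensional data).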
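To make the array form concrete, the sketch below parses a JSON array of two objects and treats the result as an ordinary list of dicts. The two records are invented for the example.

```python
import json

# Invented two-element JSON array; each element is a JSON object.
shots_json = '[{"result": "Goal", "X": "0.9"}, {"result": "SavedShot", "X": "0.8"}]'

shots = json.loads(shots_json)      # list of dicts

# Index and iterate as with any Python list.
for i, shot in enumerate(shots):
    print(i, shot["result"], float(shot["X"]))

# Filter with a list comprehension.
goals = [shot for shot in shots if shot["result"] == "Goal"]
print(len(goals))
```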
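3. Extracting and decoding the data
The page scraped here uses the JSON array form, which becomes a Python list of dicts after conversion.
The first and last records of the variable (Cristiano Ronaldo's data) are excerpted below for a preliminary analysis. The data is a string in which punctuation and spaces are written as single-byte hexadecimal escapes (\xNN, ASCII codes from 32 to 127) mixed with plain text. It must first be turned into a Python bytes object, then decoded into a JSON string, and finally converted into a list of dicts with json.loads().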
>>> a = r'\x5B\x7B\x22id\x22\x3A\x2232535\x22,\x22minute\x22\x3A\x2218\x22,\x22result\x22\x3A\x22SavedShot\x22,\x22X\x22\x3A\x220.845\x22,\x22Y\x22\x3A\x220.49900001525878906\x22,\x22xG\x22\x3A\x220.06659495085477829\x22,\x22player\x22\x3A\x22Cristiano\x20Ronaldo\x22,\x22h_a\x22\x3A\x22h\x22,\x22player_id\x22\x3A\x222371\x22,\x22situation\x22\x3A\x22SetPiece\x22,\x22season\x22\x3A\x222014\x22,\x22shotType\x22\x3A\x22RightFoot\x22,\x22match_id\x22\x3A\x225834\x22,\x22h_team\x22\x3A\x22Real\x20Madrid\x22,\x22a_team\x22\x3A\x22Cordoba\x22,\x22h_goals\x22\x3A\x222\x22,\x22a_goals\x22\x3A\x220\x22,\x22date\x22\x3A\x222014\x2D08\x2D25\x2019\x3A00\x3A00\x22,\x22player_assisted\x22\x3A\x22Luka\x20Modric\x22,\x22lastAction\x22\x3A\x22Pass\x22\x7D,\x7B\x22id\x22\x3A\x22422004\x22,\x22minute\x22\x3A\x2223\x22,\x22result\x22\x3A\x22SavedShot\x22,\x22X\x22\x3A\x220.885\x22,\x22Y\x22\x3A\x220.5\x22,\x22xG\x22\x3A\x220.7612988352775574\x22,\x22player\x22\x3A\x22Cristiano\x20Ronaldo\x22,\x22h_a\x22\x3A\x22h\x22,\x22player_id\x22\x3A\x222371\x22,\x22situation\x22\x3A\x22Penalty\x22,\x22season\x22\x3A\x222020\x22,\x22shotType\x22\x3A\x22RightFoot\x22,\x22match_id\x22\x3A\x2215790\x22,\x22h_team\x22\x3A\x22Juventus\x22,\x22a_team\x22\x3A\x22Inter\x22,\x22h_goals\x22\x3A\x223\x22,\x22a_goals\x22\x3A\x222\x22,\x22date\x22\x3A\x222021\x2D05\x2D15\x2016\x3A00\x3A00\x22,\x22player_assisted\x22\x3Anull,\x22lastAction\x22\x3A\x22Standard\x22\x7D\x5D'
>>> b = eval("b'" + a + "'") # wrap the string in b'...' and use eval() to turn it into bytes
>>> b
b'[{"id":"32535","minute":"18","result":"SavedShot","X":"0.845","Y":"0.49900001525878906","xG":"0.06659495085477829","player":"Cristiano Ronaldo","h_a":"h","player_id":"2371","situation":"SetPiece","season":"2014","shotType":"RightFoot","match_id":"5834","h_team":"Real Madrid","a_team":"Cordoba","h_goals":"2","a_goals":"0","date":"2014-08-25 19:00:00","player_assisted":"Luka Modric","lastAction":"Pass"},{"id":"422004","minute":"23","result":"SavedShot","X":"0.885","Y":"0.5","xG":"0.7612988352775574","player":"Cristiano Ronaldo","h_a":"h","player_id":"2371","situation":"Penalty","season":"2020","shotType":"RightFoot","match_id":"15790","h_team":"Juventus","a_team":"Inter","h_goals":"3","a_goals":"2","date":"2021-05-15 16:00:00","player_assisted":null,"lastAction":"Standard"}]'
>>> type(b) # the result is a bytes object
<class 'bytes'>
>>> b.decode() # decode() returns a str; the data is pure ASCII, so any ASCII-compatible codec works
'[{"id":"32535","minute":"18","result":"SavedShot","X":"0.845","Y":"0.49900001525878906","xG":"0.06659495085477829","player":"Cristiano Ronaldo","h_a":"h","player_id":"2371","situation":"SetPiece","season":"2014","shotType":"RightFoot","match_id":"5834","h_team":"Real Madrid","a_team":"Cordoba","h_goals":"2","a_goals":"0","date":"2014-08-25 19:00:00","player_assisted":"Luka Modric","lastAction":"Pass"},{"id":"422004","minute":"23","result":"SavedShot","X":"0.885","Y":"0.5","xG":"0.7612988352775574","player":"Cristiano Ronaldo","h_a":"h","player_id":"2371","situation":"Penalty","season":"2020","shotType":"RightFoot","match_id":"15790","h_team":"Juventus","a_team":"Inter","h_goals":"3","a_goals":"2","date":"2021-05-15 16:00:00","player_assisted":null,"lastAction":"Standard"}]'
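The key fields are the shot position (X, Y), expected goals (xG), the shot outcome (result), and the season (season). xG is the predicted probability of scoring, so xG = 1 means a certain goal. X and Y are relative values between 0 and 1, while the mplsoccer 'opta' pitch used for plotting in the complete code spans 0 to 100, so both are multiplied by 100. result = Goal marks a goal, and season = 2014 denotes the 2014/2015 season.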
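The conversion described above can be sketched in plain Python. The two records are a subset of the fields from the excerpt; the helper to_pitch_record() and its output keys are names invented for this example.

```python
# Subset of the fields from the two excerpted records; all values are strings,
# exactly as they arrive from the page.
shots = [
    {"X": "0.845", "Y": "0.49900001525878906", "xG": "0.06659495085477829",
     "result": "SavedShot", "season": "2014"},
    {"X": "0.885", "Y": "0.5", "xG": "0.7612988352775574",
     "result": "SavedShot", "season": "2020"},
]


def to_pitch_record(shot):
    """Convert one shot's string fields to numbers on a 0-100 pitch scale."""
    start_year = int(shot["season"])
    return {
        "x": float(shot["X"]) * 100,            # relative 0-1 -> 0-100
        "y": float(shot["Y"]) * 100,
        "xg": float(shot["xG"]),                # probability of scoring
        "is_goal": shot["result"] == "Goal",
        "season_label": f"{start_year}/{start_year + 1}",
    }


records = [to_pitch_record(shot) for shot in shots]
for record in records:
    print(record)
```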
>>> import json # import the json module
>>> json.loads(b.decode()) # convert the JSON string into a list of dicts
[{'id': '32535', 'minute': '18', 'result': 'SavedShot', 'X': '0.845', 'Y': '0.49900001525878906', 'xG': '0.06659495085477829', 'player': 'Cristiano Ronaldo', 'h_a': 'h', 'player_id': '2371', 'situation': 'SetPiece', 'season': '2014', 'shotType': 'RightFoot', 'match_id': '5834', 'h_team': 'Real Madrid', 'a_team': 'Cordoba', 'h_goals': '2', 'a_goals': '0', 'date': '2014-08-25 19:00:00', 'player_assisted': 'Luka Modric', 'lastAction': 'Pass'}, {'id': '422004', 'minute': '23', 'result': 'SavedShot', 'X': '0.885', 'Y': '0.5', 'xG': '0.7612988352775574', 'player': 'Cristiano Ronaldo', 'h_a': 'h', 'player_id': '2371', 'situation': 'Penalty', 'season': '2020', 'shotType': 'RightFoot', 'match_id': '15790', 'h_team': 'Juventus', 'a_team': 'Inter', 'h_goals': '3', 'a_goals': '2', 'date': '2021-05-15 16:00:00', 'player_assisted': None, 'lastAction': 'Standard'}]
>>> json.loads(b) # bytes can also be passed directly, without decoding first
[{'id': '32535', 'minute': '18', 'result': 'SavedShot', 'X': '0.845', 'Y': '0.49900001525878906', 'xG': '0.06659495085477829', 'player': 'Cristiano Ronaldo', 'h_a': 'h', 'player_id': '2371', 'situation': 'SetPiece', 'season': '2014', 'shotType': 'RightFoot', 'match_id': '5834', 'h_team': 'Real Madrid', 'a_team': 'Cordoba', 'h_goals': '2', 'a_goals': '0', 'date': '2014-08-25 19:00:00', 'player_assisted': 'Luka Modric', 'lastAction': 'Pass'}, {'id': '422004', 'minute': '23', 'result': 'SavedShot', 'X': '0.885', 'Y': '0.5', 'xG': '0.7612988352775574', 'player': 'Cristiano Ronaldo', 'h_a': 'h', 'player_id': '2371', 'situation': 'Penalty', 'season': '2020', 'shotType': 'RightFoot', 'match_id': '15790', 'h_team': 'Juventus', 'a_team': 'Inter', 'h_goals': '3', 'a_goals': '2', 'date': '2021-05-15 16:00:00', 'player_assisted': None, 'lastAction': 'Standard'}]
>>> type(json.loads(b)) # the result is a list
<class 'list'>
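With that analysis and background in place, we can scrape the page. The page is fetched with the get() function of the requests module, and the contents of the <script>...</script> tags are extracted with the find_all() method of the BeautifulSoup class from the BeautifulSoup4 package.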
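As a hedged alternative to picking a script tag by position and decoding with eval(), the sketch below locates the variable by name with a regular expression and decodes the \xNN escapes with the unicode_escape codec, so no page content is ever executed. It assumes the variable is named shotsData and that the escaped text is pure ASCII, as in the excerpt; the function name extract_shots() and the synthetic HTML are invented for the example, and the real page layout may differ or change.

```python
import json
import re

# Assumption: the page assigns the data as  var shotsData = JSON.parse('...')
SHOTS_PATTERN = re.compile(r"var\s+shotsData\s*=\s*JSON\.parse\('(.*?)'\)", re.S)


def extract_shots(html: str) -> list:
    """Find the shotsData string in page source and return it as a list of dicts."""
    match = SHOTS_PATTERN.search(html)
    if match is None:
        raise ValueError("shotsData variable not found in page source")
    escaped = match.group(1)
    # Interpret \xNN escapes without eval(); valid only for ASCII input.
    decoded = escaped.encode("ascii").decode("unicode_escape")
    return json.loads(decoded)


# Synthetic page source, standing in for the text of a real response.
sample_html = (
    "<html><body><script>var other = 1;</script>"
    "<script>\n\tvar shotsData = JSON.parse('"
    r"\x5B\x7B\x22result\x22\x3A\x22Goal\x22,\x22xG\x22\x3A\x220.5\x22\x7D\x5D"
    "')\n</script></body></html>"
)

shots = extract_shots(sample_html)
print(shots)
```

With the real page, the same function would be called on the text of the response returned by requests.get(url). Searching by variable name avoids depending on the script tag being the fourth one.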
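4. Drawing scatter plots in matplotlib: the scatter() function
The scatter() function in the pyplot module draws a scatter plot. Its signature is:
matplotlib.pyplot.scatter(x, y, s=None, c=None, marker=None, cmap=None,
    norm=None, vmin=None, vmax=None, alpha=None, linewidths=None,
    edgecolors=None, data=None, **kwargs)
(Older releases also listed verts and hold; recent matplotlib versions no longer accept them.)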
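The commonly used parameters are:
x, y: the data for the x and y axes.
s: the marker size, in points squared. A one-dimensional array gives the size of each point.
c: the marker color. A one-dimensional array gives the color of each point.
marker: the marker style (the shape of the points); see Table 1.
alpha: the opacity of the points, a value between 0 and 1. With many points, a smaller alpha combined with an adjusted s lets overlapping points build up, so clusters in the data stand out.
cmap: the colormap used to turn numeric values of c into colors.
Table 1 marker settings, their symbols, and descriptions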
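The parameters above can be tried on a few points before drawing a full pitch. This is a minimal sketch that assumes matplotlib and numpy are installed; the coordinates and xG values are made up, and the Agg backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

# Made-up shots: positions on a 0-100 scale, xG values, and goal flags.
x = np.array([84.5, 88.5, 92.0, 76.0])
y = np.array([49.9, 50.0, 41.0, 62.0])
xg = np.array([0.067, 0.761, 0.35, 0.03])
is_goal = np.array([False, False, True, False])

sizes = np.sqrt(xg) * 100                                  # one size per point
colors = np.where(is_goal, "crimson", "cyan").tolist()     # one color per point

fig, ax = plt.subplots(figsize=(5, 4))
points = ax.scatter(x, y, s=sizes, c=colors, marker="o",
                    alpha=0.6, edgecolors="black")
ax.set_xlabel("X")
ax.set_ylabel("Y")
```

Scaling the size by the square root of xG, as the complete code also does, keeps low-probability shots visible while still making high-probability shots stand out.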
5. Complete code
The complete code is as follows:
#############################################
# Author: Zhang Ruilin   Created 2021-01-10 18:35
# Revised 2022-12-28 10:13
# Shot map of a football player drawn with Matplotlib
#############################################
import requests                          # fetches the web page
from bs4 import BeautifulSoup            # parses the page and extracts information
import json                              # JSON <-> dict conversion
import pandas as pd                      # tabular data handling
import matplotlib.pyplot as plt          # MATLAB-style plotting interface
import numpy as np                       # numerical functions
import matplotlib as mpl
import mplsoccer                         # draws football pitches

# The player_id of Kylian Mbappé is 3423
url = 'https://understat.com/player/3423'
html = requests.get(url)                 # fetch the page
# Parse the page
soup_parse = BeautifulSoup(html.content, 'lxml')
scripts = soup_parse.find_all('script')  # list of all <script> tags
strings = scripts[3].string              # text of the script holding the shotsData variable
_start = strings.index("('")+2           # first character after JSON.parse('
_end = strings.index("')")               # position of the closing ' (excluded)
json_data = strings[_start:_end]         # the JSON data between the quotes
json_data = eval("b'"+json_data+"'")     # turn the \xNN escapes into bytes
data = json.loads(json_data)             # list of dicts
# Keep the shot position (X, Y), expected goals (xG), outcome (result) and season
x, y, xg, result, season = [], [], [], [], []
for _dic in data:
    x.append(_dic['X'])
    y.append(_dic['Y'])
    xg.append(_dic['xG'])
    result.append(_dic['result'])
    season.append(_dic['season'])
columns = ['X', 'Y', 'xG', 'Result', 'Season']
df_data = pd.DataFrame([x, y, xg, result, season], index=columns)
df_data = df_data.T                      # transpose: one row per shot
# Convert the numeric strings to numbers (errors='ignore' is deprecated
# in recent pandas, so only the numeric columns are converted)
num_cols = ['X', 'Y', 'xG', 'Season']
df_data[num_cols] = df_data[num_cols].apply(pd.to_numeric)
df_data['X'] = df_data['X'] * 100        # source values are relative (0 to 1);
df_data['Y'] = df_data['Y'] * 100        # the opta pitch uses 0 to 100
# df_data.to_csv(r'd:/Mbappé_shooting.csv')   # optionally save to a file
background, text_color = 'lightgray', 'black'   # background and text colors
mpl.rcParams['text.color'] = text_color
# mpl.rcParams['font.sans-serif'] = ['simsun']  # SimSun, only needed for Chinese labels
mpl.rcParams['legend.fontsize'] = 15     # legend font size, 15 pt
fig, ax = plt.subplots(figsize=(7, 5.6)) # new 7 x 5.6 inch figure
ax.axis('off')                           # hide the default axes
fig.set_facecolor(background)
pitch = mplsoccer.VerticalPitch(half=True, pitch_type='opta', line_zorder=3,
                                pitch_color='grass')   # vertical half pitch
# Drawing area: lower-left corner at (0.05, 0.06), width and height 90% each
axes = fig.add_axes((0.05, 0.06, 0.9, 0.9))
axes.patch.set_facecolor(background)
pitch.draw(ax=axes)
season = 2021                            # season start year, from 2014 to (current year - 1)
df = df_data.loc[df_data['Season'] == season]   # shots of the chosen season
goals = df[df['Result'] == 'Goal']
misses = df[df['Result'] != 'Goal']
# Shots that did not score: cyan, alpha 0.5
pitch.scatter(misses['X'], misses['Y'],
              s=np.sqrt(misses['xG'])*100, marker='o', alpha=0.5,
              edgecolor='black', facecolor='cyan', ax=axes, label='No goal')
# Shots that scored: crimson, alpha 0.7
pitch.scatter(goals['X'], goals['Y'],
              s=np.sqrt(goals['xG'])*100, marker='o', alpha=0.7,
              edgecolor='black', facecolor='crimson', ax=axes, label='Goal')
axes.legend(loc='lower right')
# Text annotations
axes.text(25, 64, f"Expected goals: {df['xG'].sum():.2f}", weight='bold',
          size=14)                       # sum of xG over the season
axes.text(25, 61, f"Goals: {len(goals)}",
          weight='bold', size=14)        # rows with Result == 'Goal'
axes.text(25, 58, f"Shots: {len(df)}", weight='bold', size=14)   # rows in the season
axes.text(95, 60, f'{season}-{season+1} season', weight='bold', size=18)

plt.show()
The result is shown in Figure 4.
Figure 4 Shot map for Kylian Mbappé
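The three figures printed on the chart (shots, goals, total xG) can also be computed for every season at once. The sketch below does this in plain Python on the list of dicts returned by json.loads(); the function name season_summary() and the sample records are invented for the example.

```python
from collections import defaultdict


def season_summary(shots):
    """Return {season: {"shots": n, "goals": n, "xg": total}} from shot dicts."""
    summary = defaultdict(lambda: {"shots": 0, "goals": 0, "xg": 0.0})
    for shot in shots:
        row = summary[int(shot["season"])]
        row["shots"] += 1
        row["goals"] += int(shot["result"] == "Goal")
        row["xg"] += float(shot["xG"])
    return dict(summary)


# Invented records with the same string-valued fields as the scraped data.
sample = [
    {"season": "2021", "result": "Goal", "xG": "0.40"},
    {"season": "2021", "result": "SavedShot", "xG": "0.10"},
    {"season": "2020", "result": "SavedShot", "xG": "0.05"},
]

for season, row in sorted(season_summary(sample).items()):
    print(f"{season}-{season + 1}: shots={row['shots']} "
          f"goals={row['goals']} xG={row['xg']:.2f}")
```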