This hands-on project series follows tutorials found online, mainly the approach and code from 《跟着迪哥学习机器学习》. The book appears to target Python 2, and some of its code does not work as printed: a few key steps are omitted and there are syntax problems, so you cannot simply type it in and run it end to end. The code and dataset collected here do run (download links break easily; see the pinned comment for the data).
Pipeline: random forest modeling -> feature selection -> efficiency comparison -> parameter tuning
import pandas as pd
features = pd.read_csv("temps.csv")
features.head(5)
| | year | month | day | week | temp_2 | temp_1 | average | actual | friend |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019 | 1 | 1 | Fri | 45 | 45 | 45.6 | 45 | 29 |
| 1 | 2019 | 1 | 2 | Sat | 44 | 45 | 45.7 | 44 | 61 |
| 2 | 2019 | 1 | 3 | Sun | 45 | 44 | 45.8 | 41 | 56 |
| 3 | 2019 | 1 | 4 | Mon | 44 | 41 | 45.9 | 40 | 53 |
| 4 | 2019 | 1 | 5 | Tues | 41 | 40 | 46.0 | 44 | 41 |
Check the size of the dataset.
print("数据维度:",features.shape)
数据维度: (348, 9)
features.describe()
| | year | month | day | temp_2 | temp_1 | average | actual | friend |
|---|---|---|---|---|---|---|---|---|
| count | 348.0 | 348.000000 | 348.000000 | 348.000000 | 348.000000 | 348.000000 | 348.000000 | 348.000000 |
| mean | 2019.0 | 6.477011 | 15.514368 | 62.652299 | 62.701149 | 59.760632 | 62.543103 | 60.034483 |
| std | 0.0 | 3.498380 | 8.772982 | 12.165398 | 12.120542 | 10.527306 | 11.794146 | 15.626179 |
| min | 2019.0 | 1.000000 | 1.000000 | 35.000000 | 35.000000 | 45.100000 | 35.000000 | 28.000000 |
| 25% | 2019.0 | 3.000000 | 8.000000 | 54.000000 | 54.000000 | 49.975000 | 54.000000 | 47.750000 |
| 50% | 2019.0 | 6.000000 | 15.000000 | 62.500000 | 62.500000 | 58.200000 | 62.500000 | 60.000000 |
| 75% | 2019.0 | 10.000000 | 23.000000 | 71.000000 | 71.000000 | 69.025000 | 71.000000 | 71.000000 |
| max | 2019.0 | 12.000000 | 31.000000 | 117.000000 | 117.000000 | 77.400000 | 92.000000 | 95.000000 |
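A quick sanity check that is not in the original walkthrough, but fits here: confirm there are no missing values and see which columns are non-numeric before encoding (this sketch assumes the `features` DataFrame loaded above).
# Not part of the original tutorial: check missing values and dtypes
print(features.isna().sum())   # count of missing values per column
print(features.dtypes)         # 'week' should be object/str, everything else numeric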
Data preprocessing:
# Convert the dates to standard datetime objects for later use
import datetime
years = features['year']
months = features['month']
days = features['day']
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years,months,days)]
dates = [datetime.datetime.strptime(date,'%Y-%m-%d') for date in dates]
dates[:5]
[datetime.datetime(2019, 1, 1, 0, 0),
datetime.datetime(2019, 1, 2, 0, 0),
datetime.datetime(2019, 1, 3, 0, 0),
datetime.datetime(2019, 1, 4, 0, 0),
datetime.datetime(2019, 1, 5, 0, 0)]
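An equivalent, more concise alternative (a sketch, not how the original does it): pandas can assemble datetimes directly from the year/month/day columns.
# Same result in one call; `dates_pd` is just an illustrative name
dates_pd = pd.to_datetime(features[['year', 'month', 'day']])
dates_pd.head()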
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight') # plotting style
Lay out the figure. The initial analysis looks at four series, one subplot each: the actual max temperature (the label), yesterday's max, the max two days prior, and the friend's forecast.
fig,((ax1,ax2),(ax3,ax4)) = plt.subplots(nrows=2,ncols=2,figsize=(10,10))
fig.autofmt_xdate(rotation=45)
# Actual max temperature (the label)
ax1.plot(dates,features['actual'])
ax1.set_xlabel('');ax1.set_ylabel('Temperature');ax1.set_title('Max Temp')
# Yesterday's max temperature
ax2.plot(dates,features['temp_1'])
ax2.set_xlabel('');ax2.set_ylabel('Temperature');ax2.set_title('Yesterday Max Temp')
# Max temperature two days prior
ax3.plot(dates,features['temp_2'])
ax3.set_xlabel('');ax3.set_ylabel('Temperature');ax3.set_title('Two Days Prior Max Temp')
# Friend's forecast of the max temperature
ax4.plot(dates,features['friend'])
ax4.set_xlabel('');ax4.set_ylabel('Temperature');ax4.set_title('Friend Forecast')
plt.tight_layout(pad=2)
The day-of-week column is not numeric, so it has to be encoded before modeling.
# One-hot encoding
features = pd.get_dummies(features) # converts every object/category column automatically, adding suffixes
features.head(5)
| | year | month | day | temp_2 | temp_1 | average | actual | friend | week_Fri | week_Mon | week_Sat | week_Sun | week_Thurs | week_Tues | week_Wed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019 | 1 | 1 | 45 | 45 | 45.6 | 45 | 29 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2019 | 1 | 2 | 44 | 45 | 45.7 | 44 | 61 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 2019 | 1 | 3 | 45 | 44 | 45.8 | 41 | 56 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 2019 | 1 | 4 | 44 | 41 | 45.9 | 40 | 53 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2019 | 1 | 5 | 41 | 40 | 46.0 | 44 | 41 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
print(help(pd.get_dummies))
Help on function get_dummies in module pandas.core.reshape.reshape:

get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None) -> 'DataFrame'
    Convert categorical variable into dummy/indicator variables.

    Parameters
    ----------
    data : array-like, Series, or DataFrame
        Data of which to get dummy indicators.
    prefix : str, list of str, or dict of str, default None
        String to append DataFrame column names.
        Pass a list with length equal to the number of columns
        when calling get_dummies on a DataFrame. Alternatively, `prefix`
        can be a dictionary mapping column names to prefixes.
    prefix_sep : str, default '_'
        If appending prefix, separator/delimiter to use. Or pass a
        list or dictionary as with `prefix`.
    dummy_na : bool, default False
        Add a column to indicate NaNs, if False NaNs are ignored.
    columns : list-like, default None
        Column names in the DataFrame to be encoded.
        If `columns` is None then all the columns with
        `object` or `category` dtype will be converted.
    sparse : bool, default False
        Whether the dummy-encoded columns should be backed by
        a :class:`SparseArray` (True) or a regular NumPy array (False).
    drop_first : bool, default False
        Whether to get k-1 dummies out of k categorical levels by
        removing the first level.
    dtype : dtype, default np.uint8
        Data type for new columns. Only a single dtype is allowed.

        .. versionadded:: 0.23.0

    Returns
    -------
    DataFrame
        Dummy-coded data.

    See Also
    --------
    Series.str.get_dummies : Convert Series to dummy codes.

    Examples
    --------
    >>> s = pd.Series(list('abca'))
    >>> pd.get_dummies(s)
       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0

    >>> s1 = ['a', 'b', np.nan]
    >>> pd.get_dummies(s1)
       a  b
    0  1  0
    1  0  1
    2  0  0

    >>> pd.get_dummies(s1, dummy_na=True)
       a  b  NaN
    0  1  0    0
    1  0  1    0
    2  0  0    1

    >>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
    ...                    'C': [1, 2, 3]})
    >>> pd.get_dummies(df, prefix=['col1', 'col2'])
       C  col1_a  col1_b  col2_a  col2_b  col2_c
    0  1       1       0       0       1       0
    1  2       0       1       1       0       0
    2  3       1       0       0       0       1

    >>> pd.get_dummies(pd.Series(list('abcaa')))
       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0
    4  1  0  0

    >>> pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
       b  c
    0  0  0
    1  1  0
    2  0  1
    3  0  0
    4  0  0

    >>> pd.get_dummies(pd.Series(list('abc')), dtype=float)
         a    b    c
    0  1.0  0.0  0.0
    1  0.0  1.0  0.0
    2  0.0  0.0  1.0

None
# Separate features and label
import numpy as np
# Label
labels = np.array(features['actual'])
# Drop the label column from the features
features = features.drop('actual',axis=1) # drop by column
# Keep the column names for later
feature_list = list(features.columns)
# Convert to the format the model expects
features = np.array(features)
# Train/test split
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(features,labels,test_size=0.25,random_state=42)
print('训练集特征:',train_features.shape)
print('训练集标签:',train_labels.shape)
print('测试集特征:',test_features.shape)
print('测试集标签:',test_labels.shape)
训练集特征: (261, 14)
训练集标签: (261,)
测试集特征: (87, 14)
测试集标签: (87,)
Start by fitting 1,000 trees with default parameters.
Evaluate with MAPE (mean absolute percentage error). The sample is small, so training is fast.
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=1000,random_state=42)
rf.fit(train_features,train_labels)
predictions = rf.predict(test_features)
errors = abs(predictions - test_labels)
mape = 100 * (errors / test_labels)
print('MAPE:',np.mean(mape))
MAPE: 6.016378550202468
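The same two numbers can also be obtained from sklearn.metrics. A minimal sketch, with the assumption that your scikit-learn is at least 0.24 (the version that introduced mean_absolute_percentage_error):
# Equivalent evaluation via sklearn.metrics (assumes scikit-learn >= 0.24)
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
print('MAE :', mean_absolute_error(test_labels, predictions))
print('MAPE:', 100 * mean_absolute_percentage_error(test_labels, predictions))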
from sklearn.tree import export_graphviz
import pydot
import os
# os.environ["PATH"] += os.pathsep + 'S:/Graphviz/bin/'
tree = rf.estimators_[5]
export_graphviz(tree,out_file="tree.dot",feature_names=feature_list,rounded=True,precision=1)
(graph,) = pydot.graph_from_dot_file('./tree.dot')
graph.write_png('./tree.png')
# A smaller, depth-limited forest so the exported tree stays readable
rf_small = RandomForestRegressor(n_estimators=10,max_depth=3,random_state=42)
rf_small.fit(train_features,train_labels)
tree_small = rf_small.estimators_[5]
export_graphviz(tree_small,out_file='small_tree.dot',feature_names=feature_list,rounded=True,precision=1)
(graph,) = pydot.graph_from_dot_file('small_tree.dot')
graph.write_png('small_tree.png')
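If installing Graphviz is a hassle, scikit-learn can draw the tree directly with matplotlib. A sketch, assuming scikit-learn >= 0.21 (when plot_tree was added):
# Alternative visualization without the Graphviz system dependency
from sklearn.tree import plot_tree
fig, ax = plt.subplots(figsize=(12, 8))
plot_tree(tree_small, feature_names=feature_list, rounded=True, precision=1, ax=ax)
plt.show()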
# Feature importances from the forest
importances = list(rf.feature_importances_)
# Pair each feature name with its importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list,importances)]
feature_importances = sorted(feature_importances,key=lambda x:x[1],reverse=True)
# Print
[print('Variable:{:20} importance: {}'.format(*pair)) for pair in feature_importances]
Variable:temp_1               importance: 0.69
Variable:average              importance: 0.2
Variable:day                  importance: 0.03
Variable:friend               importance: 0.03
Variable:temp_2               importance: 0.02
Variable:month                importance: 0.01
Variable:year                 importance: 0.0
Variable:week_Fri             importance: 0.0
Variable:week_Mon             importance: 0.0
Variable:week_Sat             importance: 0.0
Variable:week_Sun             importance: 0.0
Variable:week_Thurs           importance: 0.0
Variable:week_Tues            importance: 0.0
Variable:week_Wed             importance: 0.0
[None, None, None, None, None, None, None, None, None, None, None, None, None, None]
# Plot the importances as a bar chart
x_values = list(range(len(importances)))
plt.bar(x_values,importances,orientation='vertical')
plt.xticks(x_values,feature_list,rotation='vertical')
plt.ylabel('Importance');plt.xlabel('Variable');plt.title('Variable Importances')
Text(0.5, 1.0, 'Variable Importances')
The plot suggests that a model built on only the most important features might do just as well. It is not guaranteed to, but it will certainly be faster.
# Try using only the two most important features
rf_most_important = RandomForestRegressor(n_estimators=1000,random_state=42)
# Column indices of the most important features
important_indices = [feature_list.index('temp_1'),feature_list.index('average')]
train_important = train_features[:,important_indices]
test_important = test_features[:,important_indices]
# Retrain the model
rf_most_important.fit(train_important,train_labels)
# Predict
predictions = rf_most_important.predict(test_important)
errors = abs(predictions-test_labels)
# Evaluate, rounded to two decimals
print('Mean Absolute Error:',round(np.mean(errors),2),'degrees')
mape = np.mean(100*(errors/test_labels))
print('mape:',mape)
Mean Absolute Error: 3.92 degrees
mape: 6.243108595734665
MAPE went up from about 6.0 to 6.2 rather than down, so keeping only the most important features is not enough by itself.
# Dates
months = features[:,feature_list.index('month')]
days = features[:,feature_list.index('day')]
years = features[:,feature_list.index('year')]
# Convert to datetime
dates = [str(int(year))+'-'+str(int(month))+'-'+str(int(day)) for year, month, day in zip(years,months,days)]
dates = [datetime.datetime.strptime(date,'%Y-%m-%d') for date in dates]
# One table with the dates and the true labels
true_data = pd.DataFrame(data={'date':dates,'actual':labels})
# Another table with the test dates and the predictions
months = test_features[:,feature_list.index('month')]
days = test_features[:,feature_list.index('day')]
years = test_features[:,feature_list.index('year')]
test_dates = [str(int(year))+'-'+str(int(month))+'-'+str(int(day)) for year,month,day in zip(years,months,days)]
test_dates = [datetime.datetime.strptime(date,'%Y-%m-%d') for date in test_dates]
predictions_data = pd.DataFrame(data = {'date':test_dates,'prediction':predictions})
# True values
plt.plot(true_data['date'],true_data['actual'],'b-',label='actual')
# Predicted values
plt.plot(predictions_data['date'],predictions_data['prediction'],'ro',label='prediction')
plt.xticks(rotation=60)
(Figure: the actual max temperature as a blue line over the year, with the model's predictions on the test dates overlaid as red dots.)
import pandas as pd
features = pd.read_csv('temps_extended.csv')
features.head(5)
| | year | month | day | weekday | ws_1 | prcp_1 | snwd_1 | temp_2 | temp_1 | average | actual | friend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011 | 1 | 1 | Sat | 4.92 | 0.00 | 0 | 36 | 37 | 45.6 | 40 | 40 |
| 1 | 2011 | 1 | 2 | Sun | 5.37 | 0.00 | 0 | 37 | 40 | 45.7 | 39 | 50 |
| 2 | 2011 | 1 | 3 | Mon | 6.26 | 0.00 | 0 | 40 | 39 | 45.8 | 42 | 42 |
| 3 | 2011 | 1 | 4 | Tues | 5.59 | 0.00 | 0 | 39 | 42 | 45.9 | 38 | 59 |
| 4 | 2011 | 1 | 5 | Wed | 3.80 | 0.03 | 0 | 42 | 38 | 46.0 | 45 | 39 |
print('数据规模',features.shape)
数据规模 (2191, 12)
# Convert the dates to standard datetime objects for later use
import datetime
years = features['year']
months = features['month']
days = features['day']
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years,months,days)]
dates = [datetime.datetime.strptime(date,'%Y-%m-%d') for date in dates]
dates[:5]
[datetime.datetime(2011, 1, 1, 0, 0),
datetime.datetime(2011, 1, 2, 0, 0),
datetime.datetime(2011, 1, 3, 0, 0),
datetime.datetime(2011, 1, 4, 0, 0),
datetime.datetime(2011, 1, 5, 0, 0)]
# Visualize the new features
fig,((ax1,ax2),(ax3,ax4)) = plt.subplots(nrows=2,ncols=2,figsize=(15,10))
fig.autofmt_xdate(rotation=45)
# Historical average max temperature
ax1.plot(dates,features['average'])
ax1.set_xlabel('');ax1.set_ylabel('Temperature (F)');ax1.set_title('Historical Avg Max Temp')
# Prior-day wind speed
ax2.plot(dates,features['ws_1'],'r-')
ax2.set_xlabel('');ax2.set_ylabel('Wind Speed (mph)');ax2.set_title('Prior Wind Speed')
# Prior-day precipitation
ax3.plot(dates,features['prcp_1'],'r-')
ax3.set_xlabel('Date');ax3.set_ylabel('Precipitation (in)');ax3.set_title('Prior Precipitation')
# Prior-day snow depth
ax4.plot(dates,features['snwd_1'],'ro')
ax4.set_xlabel('Date');ax4.set_ylabel('Snow Depth (in)');ax4.set_title('Prior Snow Depth')
plt.tight_layout(pad=2)
About this dataset: besides six years of observations, it adds prior-day wind speed (ws_1), precipitation (prcp_1) and snow depth (snwd_1). Next, derive a season variable from the month to use in the visualization.
# Season variable
seasons = []
for month in features['month']:
    if month in [1,2,12]:
        seasons.append('winter')
    elif month in [3,4,5]:
        seasons.append('spring')
    elif month in [6,7,8]:
        seasons.append('summer')
    elif month in [9,10,11]:
        seasons.append('fall')
# Keep a few columns for plotting; .copy() avoids pandas' SettingWithCopyWarning
reduced_features = features[['temp_1','prcp_1','average','actual']].copy()
reduced_features['season'] = seasons
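An equivalent, more vectorized way to build the season column (a sketch, not how the book does it; month_to_season is an illustrative name):
# Map month number -> season with a dict instead of a loop
month_to_season = {12: 'winter', 1: 'winter', 2: 'winter',
                   3: 'spring', 4: 'spring', 5: 'spring',
                   6: 'summer', 7: 'summer', 8: 'summer',
                   9: 'fall', 10: 'fall', 11: 'fall'}
reduced_features['season'] = features['month'].map(month_to_season)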
import seaborn as sns
sns.set(style='ticks',color_codes=True)
# Color palette
palette = sns.xkcd_palette(['dark blue','dark green','gold','orange'])
# pairplot
sns.pairplot(reduced_features,hue='season',diag_kind='kde',palette=palette,plot_kws=dict(alpha=0.7),diag_kws=dict(shade=True))
<seaborn.axisgrid.PairGrid at 0x232067fa190>
How to read it: colors encode the season (hue); the diagonal shows each variable's distribution broken down by season; every off-diagonal panel is a scatter plot showing the relationship between a pair of features.
# One-hot encoding
features = pd.get_dummies(features)
# Separate features and label
labels = features['actual']
features = features.drop('actual',axis=1)
# Keep the feature names for later
feature_list = list(features.columns)
# Convert to the required format
import numpy as np
features = np.array(features)
labels = np.array(labels)
# Train/test split
from sklearn.model_selection import train_test_split
train_features,test_features,train_labels,test_labels = train_test_split(features,labels,test_size=0.25,random_state=0)
print("训练集特征:",train_features.shape)
print("训练集标签:",train_labels.shape)
print("测试集特征:",test_features.shape)
print("测试集标签:",test_labels.shape)
训练集特征: (1643, 17)
训练集标签: (1643,)
测试集特征: (548, 17)
测试集标签: (548,)
Although this is a new, larger sample, all comparisons have to be made on the same test set.
# Process the old dataset the same way so the two can be compared
import pandas as pd
import numpy as np
# Indices of the features shared with the old dataset
original_feature_indices = [feature_list.index(feature) for feature in feature_list if feature not in ['ws_1','prcp_1','snwd_1']]
# Re-read the old data
original_features = pd.read_csv('temps.csv')
original_features = pd.get_dummies(original_features)
# Split into features and label
original_labels = np.array(original_features['actual'])
original_features = original_features.drop('actual',axis=1)
original_feature_list = list(original_features.columns)
original_features = np.array(original_features)
# Train/test split
from sklearn.model_selection import train_test_split
original_train_features,original_test_features,original_train_labels,original_test_labels = train_test_split(original_features,original_labels,test_size=0.25,random_state=42)
# Modeling
from sklearn.ensemble import RandomForestRegressor
# Same parameters and random seed
rf = RandomForestRegressor(n_estimators=100,random_state=0)
# Fit on the old dataset
rf.fit(original_train_features,original_train_labels)
# Evaluate on the shared test set, for a fair comparison
predictions = rf.predict(test_features[:,original_feature_indices])
errors = abs(predictions-test_labels)
print('平均温度误差:',round(np.mean(errors),2),'°')
mape = 100 *(errors/test_labels)
# Define an "accuracy" just to make comparison easier
accuracy = 100 - np.mean(mape)
print('Accuracy:',round(accuracy,2),'%')
平均温度误差: 4.68 °
Accuracy: 92.19 %
The result above comes from the smaller dataset; now check whether more samples help.
from sklearn.ensemble import RandomForestRegressor
# Keep the feature set consistent: restrict the new data to the original columns
original_train_changed_features = train_features[:,original_feature_indices]
original_test_changed_features = test_features[:,original_feature_indices]
rf = RandomForestRegressor(n_estimators=100,random_state=0)
rf.fit(original_train_changed_features,train_labels)
# Predict
baseline_predictions = rf.predict(original_test_changed_features)
# Evaluate
baseline_errors = abs(baseline_predictions-test_labels)
print('平均温度误差:',round(np.mean(baseline_errors),2),'°')
baseline_mape = 100 * np.mean(baseline_errors/test_labels)
# Accuracy
baseline_accuracy = 100 - baseline_mape
print('Accuracy:',round(baseline_accuracy,2),'%')
平均温度误差: 4.2 °
Accuracy: 93.12 %
Now add the new weather features that neither of the previous two comparisons used.
from sklearn.ensemble import RandomForestRegressor
rf_exp = RandomForestRegressor(n_estimators=100,random_state=0)
rf_exp.fit(train_features,train_labels)
# Same test set
predictions = rf_exp.predict(test_features)
# Evaluate
errors = abs(predictions - test_labels)
print('平均温度误差:',round(np.mean(errors),2),'°')
mape = np.mean(100*(errors/test_labels))
improvement_baseline = 100 * abs(mape-baseline_mape) / baseline_mape
print('特征增多以后模型效果变化:',round(improvement_baseline,2),'%')
# Accuracy
accuracy = 100 - mape
print('Accuracy:',round(accuracy,2),'%')
平均温度误差: 4.05 °
特征增多以后模型效果变化: 3.34 %
Accuracy: 93.35 %
importances = list(rf_exp.feature_importances_)
# Pair names with importances
feature_importances = [(feature,round(importance,2)) for feature,importance in zip(feature_list,importances)]
# Sort
feature_importances = sorted(feature_importances,key=lambda x:x[1],reverse=True)
# Print
[print('Variable:{:20} Importance: {}'.format(*pair)) for pair in feature_importances]
Variable:temp_1               Importance: 0.85
Variable:average              Importance: 0.05
Variable:ws_1                 Importance: 0.02
Variable:friend               Importance: 0.02
Variable:year                 Importance: 0.01
Variable:month                Importance: 0.01
Variable:day                  Importance: 0.01
Variable:prcp_1               Importance: 0.01
Variable:temp_2               Importance: 0.01
Variable:snwd_1               Importance: 0.0
Variable:weekday_Fri          Importance: 0.0
Variable:weekday_Mon          Importance: 0.0
Variable:weekday_Sat          Importance: 0.0
Variable:weekday_Sun          Importance: 0.0
Variable:weekday_Thurs        Importance: 0.0
Variable:weekday_Tues         Importance: 0.0
Variable:weekday_Wed          Importance: 0.0
[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
# Visualize the importances
plt.style.use('fivethirtyeight')
x_values = list(range(len(importances)))
plt.bar(x_values,importances,orientation="vertical",color="r",edgecolor="k",linewidth=1.2)
plt.xticks(x_values,feature_list,rotation='vertical')
plt.ylabel('Importance')
plt.xlabel('Variable')
plt.title('Variable Importances')
Text(0.5, 1.0, 'Variable Importances')
How many features to keep is still fuzzy. One approach: sort the features by importance, compute the cumulative sum, set a threshold, and see how many features must be accumulated before the cumulative importance crosses it.
sorted_importances = [importance[1] for importance in feature_importances]
sorted_features = [importance[0] for importance in feature_importances]
# Cumulative importances
cumulative_importances = np.cumsum(sorted_importances)
# Plot as a line, with a dashed line at the 0.95 threshold
plt.plot(x_values,cumulative_importances,'g-')
plt.hlines(y=0.95,xmin=0,xmax=len(sorted_importances),color='r',linestyles='dashed')
plt.xticks(x_values,sorted_features,rotation='vertical')
plt.xlabel('Variable');plt.ylabel('Cumulative Importance')
plt.title('Cumulative Importances')
Text(0.5, 1.0, 'Cumulative Importances')
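Instead of reading the cutoff off the plot, it can be computed directly. A minimal sketch using the cumulative_importances and sorted_features arrays defined above (num_features is an illustrative name):
# First index where cumulative importance reaches 0.95, plus one for the count
num_features = int(np.where(cumulative_importances >= 0.95)[0][0]) + 1
print('Features needed for 95% cumulative importance:', num_features)
print('They are:', sorted_features[:num_features])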
With the top five features, the cumulative importance exceeds 0.95. Rebuild the model using only these five features and see what happens.
# Names of the top five features
important_feature_names = [feature[0] for feature in feature_importances[0:5]]
# Their column indices
important_indices = [feature_list.index(feature) for feature in important_feature_names]
# Training and test sets restricted to those features
important_train_features = train_features[:,important_indices]
important_test_features = test_features[:,important_indices]
# Dimensions
print("important train features shape:",important_train_features.shape)
print("important test features shape:",important_test_features.shape)
# Train the model
rf_exp.fit(important_train_features,train_labels)
# Same test set
predictions = rf_exp.predict(important_test_features)
# Evaluate
errors = abs(predictions-test_labels)
print('平均温度误差:',round(np.mean(errors),2),"°")
mape = 100*(errors/test_labels)
accuracy = 100 - np.mean(mape)
print('Accuracy:',round(accuracy,2),"%")
important train features shape: (1643, 5)
important test features shape: (548, 5)
平均温度误差: 4.11 °
Accuracy: 93.28 %
Accuracy did not improve, so check whether the run time does.
import time
all_features_time = []
for _ in range(10):
    start_time = time.time()
    rf_exp.fit(train_features,train_labels)
    all_features_predictions = rf_exp.predict(test_features)
    end_time = time.time()
    all_features_time.append(end_time-start_time)
all_features_time = np.mean(all_features_time)
print("使用所有特征与测试的平均时间消耗:",round(all_features_time,2),'s')
使用所有特征与测试的平均时间消耗: 0.69 s
# Timing when training only on the important features
reduced_features_time = []
for _ in range(10):
    start_time = time.time()
    rf_exp.fit(important_train_features,train_labels)
    reduced_features_predictions = rf_exp.predict(important_test_features)
    end_time = time.time()
    reduced_features_time.append(end_time-start_time)
reduced_features_time = np.mean(reduced_features_time)
print("使用重要特征与测试的平均时间消耗:",round(reduced_features_time,2),'s')
使用重要特征与测试的平均时间消耗: 0.42 s
# Timing for the original model (trained on the old dataset)
original_features_time = []
for _ in range(10):
    start_time = time.time()
    rf.fit(original_train_features,original_train_labels)
    original_features_predictions = rf.predict(test_features[:,original_feature_indices])
    end_time = time.time()
    original_features_time.append(end_time - start_time)
original_features_time = np.mean(original_features_time)
print("使用原始模型测试的平均时间消耗:",round(original_features_time,2),'s')
使用原始模型测试的平均时间消耗: 0.18 s
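A side note, not from the book: time.time() has limited resolution for short runs; time.perf_counter() is a drop-in, higher-resolution alternative. A minimal sketch reusing the reduced-feature setup above (timings is an illustrative name):
# Higher-resolution timing of fit + predict on the reduced feature set
timings = []
for _ in range(10):
    start = time.perf_counter()
    rf_exp.fit(important_train_features, train_labels)
    rf_exp.predict(important_test_features)
    timings.append(time.perf_counter() - start)
print('mean:', round(np.mean(timings), 3), 's  std:', round(np.std(timings), 3), 's')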
# Side-by-side comparison
all_accuracy = 100 * (1-np.mean(abs(all_features_predictions-test_labels)/test_labels))
reduced_accuracy = 100 * (1-np.mean(abs(reduced_features_predictions-test_labels)/test_labels))
# Collect the results into a table
comparison = pd.DataFrame({'features':['all(17)','reduced(5)'],
                           'runtime':[round(all_features_time,2),round(reduced_features_time,2)],
                           'accuracy':[round(all_accuracy,2),round(reduced_accuracy,2)]})
comparison[['features','accuracy','runtime']]
| | features | accuracy | runtime |
|---|---|---|---|
| 0 | all(17) | 93.35 | 0.69 |
| 1 | reduced(5) | 93.28 | 0.42 |
The "accuracy" here is just a convenience metric defined for comparison. As the table shows, accuracy barely changes, while run time improves substantially.
# Run time may matter more than a tiny accuracy difference
relative_accuracy_decrease = 100 * (all_accuracy - reduced_accuracy) / all_accuracy
print('相对accuracy下降:',round(relative_accuracy_decrease,3),"%")
relative_runtime_decrease = 100 * (all_features_time - reduced_features_time) / all_features_time
print("相对时间效率提升:",round(relative_runtime_decrease,3),"%")
相对accuracy下降: 0.071 %
相对时间效率提升: 39.17 %
# MAE of the original model
original_mae = np.mean(abs(original_features_predictions - test_labels))
# MAE using all features
exp_all_mae = np.mean(abs(all_features_predictions - test_labels))
# MAE using only the important features
exp_reduced_mae = np.mean(abs(reduced_features_predictions - test_labels))
# Accuracy of the original model
original_accuracy = 100 * (1 - np.mean(abs(original_features_predictions - test_labels) / test_labels))
model_comparison = pd.DataFrame({'model': ['original', 'exp_all', 'exp_reduced'],
'error (degrees)': [original_mae, exp_all_mae, exp_reduced_mae],
'accuracy': [original_accuracy, all_accuracy, reduced_accuracy],
'run_time (s)': [original_features_time, all_features_time, reduced_features_time]})
# Put all the experiment results into one figure
fig, (ax1,ax2,ax3) = plt.subplots(nrows=1,ncols=3,figsize=(16,5),sharex=True)
# X axis
x_values = [0,1,2]
labels = list(model_comparison['model'])
plt.xticks(x_values,labels)
# Font sizes
fontdict = {'fontsize':18}
fontdict_yaxis = {'fontsize':14}
# Prediction error vs. the true temperature
ax1.bar(x_values,model_comparison['error (degrees)'], color=['b','r','g'],edgecolor='k',linewidth=1.5)
ax1.set_ylim(bottom=3.5, top=4.5)
ax1.set_ylabel('Error (degree) (F)',fontdict=fontdict_yaxis)
ax1.set_title('Model Error Comparison',fontdict=fontdict)
# Accuracy comparison
ax2.bar(x_values,model_comparison['accuracy'],color=['b','r','g'],edgecolor='k',linewidth=1.5)
ax2.set_ylim(bottom=92, top=94)
ax2.set_ylabel('Accuracy (%)',fontdict=fontdict_yaxis)
ax2.set_title('Model Accuracy Comparison',fontdict=fontdict)
# Run-time comparison
ax3.bar(x_values,model_comparison['run_time (s)'], color=['b','r','g'],edgecolor='k',linewidth=1.5)
ax3.set_ylim(bottom=0,top=1)
ax3.set_ylabel('run_time (s)',fontdict=fontdict_yaxis)
ax3.set_title('Model Run-Time Comparison',fontdict=fontdict)
plt.show()
from sklearn.ensemble import RandomForestRegressor
from pprint import pprint
rf = RandomForestRegressor(random_state=42)
pprint(rf.get_params())
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}
Pick the candidate values for each parameter according to the problem at hand; the search range of every parameter needs to be set with care.
from sklearn.model_selection import RandomizedSearchCV
# Number of trees
n_estimators = [int(x) for x in np.linspace(start=200,stop=2000,num=10)]
# How to choose the maximum number of features per split
max_features = ['auto','sqrt']
# Maximum tree depth
max_depth = [int(x) for x in np.linspace(10,20,num=2)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2,5,10]
# Minimum number of samples required at a leaf node
min_samples_leaf = [1,2,4]
# Whether to bootstrap the samples
bootstrap = [True,False]
# Random search space
random_grid = {'n_estimators':n_estimators,
               'max_features':max_features,
               'max_depth':max_depth,
               'min_samples_split':min_samples_split,
               'min_samples_leaf':min_samples_leaf,
               'bootstrap':bootstrap}
rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator=rf, # model to tune
                               param_distributions=random_grid, # candidate parameter space
                               n_iter=100, # number of random combinations to try; keep the best of these 100
                               scoring='neg_mean_absolute_error', # evaluation metric
                               cv=3, # 3-fold cross-validation
                               verbose=2, # how much progress information to print
                               random_state=42, # random seed
                               n_jobs=-1) # parallel jobs; -1 uses all cores
# Start the search
rf_random.fit(train_features,train_labels)
rf_random.best_params_
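Not in the original, but often useful: besides best_params_, the fitted search object exposes cv_results_ with all 100 sampled combinations, which can be inspected as a DataFrame. A minimal sketch:
# Inspect every sampled combination, best first
cv_results = pd.DataFrame(rf_random.cv_results_)
cv_results = cv_results.sort_values('rank_test_score')
cv_results[['params', 'mean_test_score', 'std_test_score']].head()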
# Evaluation helper
def evaluate(model,test_features,test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('平均气温误差:',np.mean(errors))
    print('Accuracy = {:0.2f}%'.format(accuracy))
# Results with default parameters
base_model = RandomForestRegressor(random_state=42)
base_model.fit(train_features,train_labels)
evaluate(base_model,test_features,test_labels)
# Results with the best parameters found by the random search
best_random = rf_random.best_estimator_
evaluate(best_random,test_features,test_labels)
Next, an exhaustive ("carpet") grid search for the best parameters. Note that the grid here is built around the rough range suggested by the random search above, whose best_params_ were:
{'n_estimators': 1800,
'min_samples_split': 10,
'min_samples_leaf': 4,
'max_features': 'auto',
'max_depth': None,
'bootstrap': True}
from sklearn.model_selection import GridSearchCV
# Candidate parameter grid
param_grid = {
    'n_estimators':[1600,1700,1800,1900,2000],
    'max_features':['auto'],
    'max_depth':[8,10,12],
    'min_samples_split':[3,5,7],
    'min_samples_leaf':[2,3,4,5,6],
    'bootstrap':[True]
}
# Base model
rf = RandomForestRegressor()
# Grid search
grid_search = GridSearchCV(estimator=rf,
                           param_grid=param_grid,
                           scoring='neg_mean_absolute_error',
                           cv=3,
                           n_jobs=-1,
                           verbose=2)
# Start the search
grid_search.fit(train_features,train_labels)
When the grid is too large and there are too many combinations to enumerate at once, split the candidates into several smaller groups, run a search on each group, and compare the best result from each, as sketched below.
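A minimal sketch of what "searching in groups" might look like: run a second, separate GridSearchCV over a different slice of the parameter space and compare the two groups' best scores. The values and the names param_grid_2 / grid_search_2 are illustrative, not from the book.
from sklearn.model_selection import GridSearchCV
# A second, smaller group of candidates (illustrative values)
param_grid_2 = {
    'n_estimators': [1000, 1200, 1400, 1600],
    'max_features': ['sqrt'],
    'max_depth': [12, 15, None],
    'min_samples_split': [3, 5, 7],
    'min_samples_leaf': [2, 4, 6],
    'bootstrap': [True],
}
grid_search_2 = GridSearchCV(estimator=RandomForestRegressor(),
                             param_grid=param_grid_2,
                             scoring='neg_mean_absolute_error',
                             cv=3, n_jobs=-1, verbose=2)
grid_search_2.fit(train_features, train_labels)
# Compare the two groups on the same scoring metric (higher is better here)
print(grid_search.best_score_, grid_search_2.best_score_)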
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.model_selection import cross_val_score

# Cross-validated score for one parameter combination
def hyperopt_train_test(params):
    clf = RandomForestRegressor(**params)
    return cross_val_score(clf,train_features,train_labels).mean()

max_depth = [i for i in range(10,20)]
# max_depth.append(None)
# Search space for hyperopt
space4rf = {
    'max_depth': hp.choice('max_depth', max_depth),
    'max_features': hp.choice('max_features', ['auto','sqrt']),
    'min_samples_split': hp.choice('min_samples_split',range(5,20)),
    'min_samples_leaf': hp.choice('min_samples_leaf',range(2,10)),
    'n_estimators': hp.choice('n_estimators', range(1000,2000)),
    'bootstrap': hp.choice('bootstrap',[True,False])
}

best = 0
# Objective: hyperopt minimizes the loss, so return the negative score
def f(params):
    global best
    acc = hyperopt_train_test(params)
    if acc > best:
        best = acc
        print('new best:', best, params)
    return {'loss': -acc, 'status': STATUS_OK}

trials = Trials()
best = fmin(f, space4rf, algo=tpe.suggest, max_evals=100, trials=trials)
print("best:",best)
new best: 0.8670547390932652 {'bootstrap': True, 'max_depth': 18, 'max_features': 'auto', 'min_samples_leaf': 4, 'min_samples_split': 11, 'n_estimators': 1942}
new best: 0.8679298658104889 {'bootstrap': True, 'max_depth': 16, 'max_features': 'auto', 'min_samples_leaf': 7, 'min_samples_split': 10, 'n_estimators': 1734}
new best: 0.8684034111523674 {'bootstrap': True, 'max_depth': 14, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 19, 'n_estimators': 1766}
new best: 0.8685636302610934 {'bootstrap': True, 'max_depth': 16, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 5, 'n_estimators': 1439}
new best: 0.8685767383919801 {'bootstrap': True, 'max_depth': 16, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 5, 'n_estimators': 1404}
new best: 0.8685919671759731 {'bootstrap': True, 'max_depth': 19, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 15, 'n_estimators': 1830}
new best: 0.8686049353034605 {'bootstrap': True, 'max_depth': 19, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 16, 'n_estimators': 1099}
new best: 0.8686240725941452 {'bootstrap': True, 'max_depth': 19, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 16, 'n_estimators': 1088}
100%|█████████████████████████████████████████████| 100/100 [39:13<00:00, 26.47s/trial, best loss: -0.8686240725941452]
best: {'bootstrap': 0, 'max_depth': 9, 'max_features': 0, 'min_samples_leaf': 7, 'min_samples_split': 11, 'n_estimators': 88}
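Note that the final `best` dict printed by fmin contains indices into each hp.choice list, not the parameter values themselves (for example 'max_depth': 9 means the 9th entry of the max_depth list, i.e. 19). hyperopt's space_eval maps the indices back to actual values; a minimal sketch:
from hyperopt import space_eval
# Convert the index-valued result of fmin back into real parameter settings
best_params = space_eval(space4rf, best)
print(best_params)
# Should correspond to the last "new best" line above (max_depth 19, n_estimators 1088, ...)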