
[Python 3] Temperature Prediction with Random Forests

Preface

This project walkthrough follows an online tutorial, mainly the approach and code from the book 《跟着迪哥学习机器学习》. The book's code appears to target Python 2, and parts of it do not work as printed: some key steps are omitted and there are syntax problems, so it cannot simply be typed in and run end to end. This post collects code that actually runs, together with the dataset (the download link tends to break, so check the pinned comment for the data).


Approach

Random forest modeling -> feature selection -> efficiency comparison -> hyperparameter tuning

import pandas as pd

features = pd.read_csv("temps.csv")
features.head(5)
   year  month  day  week  temp_2  temp_1  average  actual  friend
0  2019      1    1   Fri      45      45     45.6      45      29
1  2019      1    2   Sat      44      45     45.7      44      61
2  2019      1    3   Sun      45      44     45.8      41      56
3  2019      1    4   Mon      44      41     45.9      40      53
4  2019      1    5  Tues      41      40     46.0      44      41

Tasks

  1. Basic modeling:
    • Data preprocessing
    • Feature display
    • Modeling + visualization
  2. Effect of the number of samples on the results: add more samples, observe how the results change, then revisit the feature engineering, add new features, and observe the results again
  3. Tune the algorithm to find the best parameters, using two classic tuning approaches

Feature visualization and preprocessing

Checking the data size

print("数据维度:",features.shape)
  • 1
数据维度: (348, 9)
  • 1
features.describe()
         year       month         day      temp_2      temp_1     average      actual      friend
count   348.0  348.000000  348.000000  348.000000  348.000000  348.000000  348.000000  348.000000
mean   2019.0    6.477011   15.514368   62.652299   62.701149   59.760632   62.543103   60.034483
std       0.0    3.498380    8.772982   12.165398   12.120542   10.527306   11.794146   15.626179
min    2019.0    1.000000    1.000000   35.000000   35.000000   45.100000   35.000000   28.000000
25%    2019.0    3.000000    8.000000   54.000000   54.000000   49.975000   54.000000   47.750000
50%    2019.0    6.000000   15.000000   62.500000   62.500000   58.200000   62.500000   60.000000
75%    2019.0   10.000000   23.000000   71.000000   71.000000   69.025000   71.000000   71.000000
max    2019.0   12.000000   31.000000  117.000000  117.000000   77.400000   92.000000   95.000000

Reading the summary statistics:

  1. The count row shows whether any column has missing values (a more direct check is sketched below)
  2. mean and std describe how each column is distributed
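A more direct missing-value check than reading the count row off describe() is to count nulls per column; a small sketch that only assumes the features DataFrame loaded above:

# Number of missing values per column; all zeros means there is nothing to impute
print(features.isnull().sum())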
# Convert to a standard datetime format for later use
import datetime
years = features['year']
months = features['month']
days = features['day']
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years,months,days)]
dates = [datetime.datetime.strptime(date,'%Y-%m-%d') for date in dates]
dates[:5]
[datetime.datetime(2019, 1, 1, 0, 0),
 datetime.datetime(2019, 1, 2, 0, 0),
 datetime.datetime(2019, 1, 3, 0, 0),
 datetime.datetime(2019, 1, 4, 0, 0),
 datetime.datetime(2019, 1, 5, 0, 0)]
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight') # plotting style

Visualization

Lay out the figure: the initial analysis looks at four series, one panel each: the actual max temperature (the label), yesterday's max temperature, the max temperature two days ago, and the friend's forecast of the max temperature.

fig,((ax1,ax2),(ax3,ax4)) = plt.subplots(nrows=2,ncols=2,figsize=(10,10))
fig.autofmt_xdate(rotation=45)
# Actual max temperature (the label)
ax1.plot(dates,features['actual'])
ax1.set_xlabel('');ax1.set_ylabel('Temperature');ax1.set_title('Max Temp')
# Yesterday's max temperature
ax2.plot(dates,features['temp_1'])
ax2.set_xlabel('');ax2.set_ylabel('Temperature');ax2.set_title('Yesterday Max Temp')
# Max temperature two days ago
ax3.plot(dates,features['temp_2'])
ax3.set_xlabel('');ax3.set_ylabel('Temperature');ax3.set_title('Two Days Prior Max Temp')
# The friend's forecast of the max temperature
ax4.plot(dates,features['friend'])
ax4.set_xlabel('');ax4.set_ylabel('Temperature');ax4.set_title('Friend Forecast')
plt.tight_layout(pad=2)

[Figure: four-panel plot of the actual, yesterday's, two-days-prior, and friend-forecast max temperatures over time]

Preprocessing

The day-of-week column is categorical rather than numeric, so it has to be converted into encoded features.

# One-hot encoding
features = pd.get_dummies(features) # converts the categorical columns automatically and adds suffixes
features.head(5)
   year  month  day  temp_2  temp_1  average  actual  friend  week_Fri  week_Mon  week_Sat  week_Sun  week_Thurs  week_Tues  week_Wed
0  2019      1    1      45      45     45.6      45      29         1         0         0         0           0          0         0
1  2019      1    2      44      45     45.7      44      61         0         0         1         0           0          0         0
2  2019      1    3      45      44     45.8      41      56         0         0         0         1           0          0         0
3  2019      1    4      44      41     45.9      40      53         0         1         0         0           0          0         0
4  2019      1    5      41      40     46.0      44      41         0         0         0         0           0          1         0
print(help(pd.get_dummies))
Help on function get_dummies in module pandas.core.reshape.reshape:

get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None) -> 'DataFrame'
    Convert categorical variable into dummy/indicator variables.
    
    Parameters
    ----------
    data : array-like, Series, or DataFrame
        Data of which to get dummy indicators.
    prefix : str, list of str, or dict of str, default None
        String to append DataFrame column names.
        Pass a list with length equal to the number of columns
        when calling get_dummies on a DataFrame. Alternatively, `prefix`
        can be a dictionary mapping column names to prefixes.
    prefix_sep : str, default '_'
        If appending prefix, separator/delimiter to use. Or pass a
        list or dictionary as with `prefix`.
    dummy_na : bool, default False
        Add a column to indicate NaNs, if False NaNs are ignored.
    columns : list-like, default None
        Column names in the DataFrame to be encoded.
        If `columns` is None then all the columns with
        `object` or `category` dtype will be converted.
    sparse : bool, default False
        Whether the dummy-encoded columns should be backed by
        a :class:`SparseArray` (True) or a regular NumPy array (False).
    drop_first : bool, default False
        Whether to get k-1 dummies out of k categorical levels by removing the
        first level.
    dtype : dtype, default np.uint8
        Data type for new columns. Only a single dtype is allowed.
    
        .. versionadded:: 0.23.0
    
    Returns
    -------
    DataFrame
        Dummy-coded data.
    
    See Also
    --------
    Series.str.get_dummies : Convert Series to dummy codes.
    
    Examples
    --------
    >>> s = pd.Series(list('abca'))
    
    >>> pd.get_dummies(s)
       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0
    
    >>> s1 = ['a', 'b', np.nan]
    
    >>> pd.get_dummies(s1)
       a  b
    0  1  0
    1  0  1
    2  0  0
    
    >>> pd.get_dummies(s1, dummy_na=True)
       a  b  NaN
    0  1  0    0
    1  0  1    0
    2  0  0    1
    
    >>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
    ...                    'C': [1, 2, 3]})
    
    >>> pd.get_dummies(df, prefix=['col1', 'col2'])
       C  col1_a  col1_b  col2_a  col2_b  col2_c
    0  1       1       0       0       1       0
    1  2       0       1       1       0       0
    2  3       1       0       0       0       1
    
    >>> pd.get_dummies(pd.Series(list('abcaa')))
       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0
    4  1  0  0
    
    >>> pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
       b  c
    0  0  0
    1  1  0
    2  0  1
    3  0  0
    4  0  0
    
    >>> pd.get_dummies(pd.Series(list('abc')), dtype=float)
         a    b    c
    0  1.0  0.0  0.0
    1  0.0  1.0  0.0
    2  0.0  0.0  1.0

None
# Features and labels
import numpy as np
# Labels
labels = np.array(features['actual'])
# Drop the label column from the features
features = features.drop('actual',axis=1) # drop by column
# Keep the feature names
feature_list = list(features.columns)
# Convert to numpy arrays
features = np.array(features)
# Train/test split
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(features,labels,test_size=0.25,random_state=42)
print('Training features:',train_features.shape)
print('Training labels:',train_labels.shape)
print('Testing features:',test_features.shape)
print('Testing labels:',test_labels.shape)
Training features: (261, 14)
Training labels: (261,)
Testing features: (87, 14)
Testing labels: (87,)

Random forest regression model

Start with 1000 trees and otherwise default parameters.

Evaluate first with MAPE (mean absolute percentage error). The sample is small, so training is fast.
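For reference, the MAPE used below is simply the mean of 100 * |prediction - actual| / actual. A small helper makes that explicit (the name mape_score is our own; scikit-learn 0.24+ also provides sklearn.metrics.mean_absolute_percentage_error, which returns a fraction rather than a percentage):

import numpy as np

def mape_score(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(100 * np.abs(y_pred - y_true) / y_true)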

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=1000,random_state=42)
rf.fit(train_features,train_labels)
predictions = rf.predict(test_features)
errors = abs(predictions - test_labels)
mape = 100 * (errors / test_labels)
print('MAPE:',np.mean(mape))
MAPE: 6.016378550202468

Tree visualization

from sklearn.tree import export_graphviz
import pydot
import os
# os.environ["PATH"] += os.pathsep + 'S:/Graphviz/bin/'
tree = rf.estimators_[5]
export_graphviz(tree,out_file="tree.dot",feature_names=feature_list,rounded=True,precision=1)
(graph,) = pydot.graph_from_dot_file('./tree.dot')
graph.write_png('./tree.png')
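Exporting through export_graphviz and pydot requires a local Graphviz installation (that is what the commented-out PATH line above is for). If Graphviz is not available, a rough alternative sketch is scikit-learn's built-in plot_tree (available since 0.21), which only needs matplotlib; the file name tree_plot.png is arbitrary:

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(20, 10))
# Draw the same estimator directly with matplotlib, no Graphviz needed
plot_tree(tree, feature_names=feature_list, rounded=True, precision=1, ax=ax)
fig.savefig('tree_plot.png')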

[Figure: tree.png — one full tree exported from the forest]
# Limit the size of the trees
rf_small = RandomForestRegressor(n_estimators=10,max_depth=3,random_state=42)
rf_small.fit(train_features,train_labels)
tree_small = rf_small.estimators_[5]
export_graphviz(tree_small,out_file='small_tree.dot',feature_names=feature_list,rounded=True,precision=1)
(graph,) = pydot.graph_from_dot_file('small_tree.dot')
graph.write_png('small_tree.png')

[Figure: small_tree.png — a depth-limited tree (max_depth=3)]
# Feature importances from the random forest
importances = list(rf.feature_importances_)
# Pair each feature name with its rounded importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list,importances)]
feature_importances = sorted(feature_importances,key=lambda x:x[1],reverse=True)
# Print
for pair in feature_importances:
    print('Variable:{:20} importance: {}'.format(*pair))
Variable:temp_1               importance: 0.69
Variable:average              importance: 0.2
Variable:day                  importance: 0.03
Variable:friend               importance: 0.03
Variable:temp_2               importance: 0.02
Variable:month                importance: 0.01
Variable:year                 importance: 0.0
Variable:week_Fri             importance: 0.0
Variable:week_Mon             importance: 0.0
Variable:week_Sat             importance: 0.0
Variable:week_Sun             importance: 0.0
Variable:week_Thurs           importance: 0.0
Variable:week_Tues            importance: 0.0
Variable:week_Wed             importance: 0.0





# Plot the importances as a bar chart
x_values = list(range(len(importances)))
plt.bar(x_values,importances,orientation='vertical')
plt.xticks(x_values,feature_list,rotation='vertical')
plt.ylabel('Importance');plt.xlabel('Variable');plt.title('Variable Importances')
Text(0.5, 1.0, 'Variable Importances')

[Figure: feature importance bar chart]

The importances suggest that a model built on only the most important features might do nearly as well; it is not guaranteed to be more accurate, but it will certainly be faster.

# Try using only the two most important features
rf_most_important = RandomForestRegressor(n_estimators=1000,random_state=42)
# Indices of the most important features
important_indices = [feature_list.index('temp_1'),feature_list.index('average')]
train_important = train_features[:,important_indices]
test_important = test_features[:,important_indices]
# Retrain the model
rf_most_important.fit(train_important,train_labels)
# Predict
predictions = rf_most_important.predict(test_important)
errors = abs(predictions-test_labels)
# Evaluate, rounded to two decimals
print('Mean Absolute Error:',round(np.mean(errors),2),'degrees')
mape = np.mean(100*(errors/test_labels))
print('mape:',mape)
Mean Absolute Error: 3.92 degrees
mape: 6.243108595734665

MAPE rises from about 6.0 to 6.2 instead of falling, so keeping only the most important features is not enough here.

First prediction plot

# Dates for the full dataset
months = features[:,feature_list.index('month')]
days = features[:,feature_list.index('day')]
years = features[:,feature_list.index('year')]
# Convert to datetime
dates = [str(int(year))+'-'+str(int(month))+'-'+str(int(day)) for year, month, day in zip(years,months,days)]
dates = [datetime.datetime.strptime(date,'%Y-%m-%d') for date in dates]
# One table holds the dates and the true labels
true_data = pd.DataFrame(data={'date':dates,'actual':labels})
# Another table holds the test dates and the predictions
months = test_features[:,feature_list.index('month')]
days = test_features[:,feature_list.index('day')]
years = test_features[:,feature_list.index('year')]
test_dates = [str(int(year))+'-'+str(int(month))+'-'+str(int(day)) for year,month,day in zip(years,months,days)]
test_dates = [datetime.datetime.strptime(date,'%Y-%m-%d') for date in test_dates]
predictions_data = pd.DataFrame(data = {'date':test_dates,'prediction':predictions})
# True values
plt.plot(true_data['date'],true_data['actual'],'b-',label='actual')
# Predictions
plt.plot(predictions_data['date'],predictions_data['prediction'],'ro',label='prediction')
plt.xticks(rotation=60)
(array([17897., 17956., 18017., 18078., 18140., 18201., 18262.]),
 [Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, '')])

[Figure: actual temperatures (blue line) and test-set predictions (red points) over time]

Digging deeper into the data

  1. What happens to the results if more data becomes available?
  2. Do newly added features improve the model, and what happens to the time cost?

Prediction on the extended dataset

import pandas as pd
features = pd.read_csv('temps_extended.csv')
features.head(5)
   year  month  day  weekday  ws_1  prcp_1  snwd_1  temp_2  temp_1  average  actual  friend
0  2011      1    1      Sat  4.92    0.00       0      36      37     45.6      40      40
1  2011      1    2      Sun  5.37    0.00       0      37      40     45.7      39      50
2  2011      1    3      Mon  6.26    0.00       0      40      39     45.8      42      42
3  2011      1    4     Tues  5.59    0.00       0      39      42     45.9      38      59
4  2011      1    5      Wed  3.80    0.03       0      42      38     46.0      45      39
print('Data size:', features.shape)

Data size: (2191, 12)

Date handling

# Convert to a standard datetime format for later use
import datetime
years = features['year']
months = features['month']
days = features['day']
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years,months,days)]
dates = [datetime.datetime.strptime(date,'%Y-%m-%d') for date in dates]
dates[:5]
[datetime.datetime(2011, 1, 1, 0, 0),
 datetime.datetime(2011, 1, 2, 0, 0),
 datetime.datetime(2011, 1, 3, 0, 0),
 datetime.datetime(2011, 1, 4, 0, 0),
 datetime.datetime(2011, 1, 5, 0, 0)]

Feature visualization

# Visualize the new features
fig,((ax1,ax2),(ax3,ax4)) = plt.subplots(nrows=2,ncols=2,figsize=(15,10))
fig.autofmt_xdate(rotation=45)
# Historical average max temperature
ax1.plot(dates,features['average'])
ax1.set_xlabel('');ax1.set_ylabel('Temperature (F)');ax1.set_title('Historical Avg Max Temp')
# Wind speed
ax2.plot(dates,features['ws_1'],'r-')
ax2.set_xlabel('');ax2.set_ylabel('Wind Speed (mph)');ax2.set_title('Prior Wind Speed')
# Precipitation
ax3.plot(dates,features['prcp_1'],'r-')
ax3.set_xlabel('Date');ax3.set_ylabel('Precipitation (in)');ax3.set_title('Prior Precipitation')
# Snow depth
ax4.plot(dates,features['snwd_1'],'ro')
ax4.set_xlabel('Date');ax4.set_ylabel('Snow Depth (in)');ax4.set_title('Prior Snow Depth')

plt.tight_layout(pad=2)

[Figure: historical average max temperature, prior wind speed, prior precipitation, and prior snow depth]

Feature engineering

  1. Select as many potentially valuable features as possible: the more information available at the start, the more the model has to work with
  2. New usable features often only show up during the analysis; going back to add them later is tedious and requires fresh comparison experiments
  3. So it pays to make the preprocessing and feature extraction as complete as possible from the very beginning

About this dataset:

  1. The weather clearly depends on the season, but the dataset contains no explicit season feature, so we create one ourselves (see the code below)
# Season variable
seasons = []
for month in features['month']:
    if month in [1,2,12]:
        seasons.append('winter')
    elif month in [3,4,5]:
        seasons.append('spring')
    elif month in [6,7,8]:
        seasons.append('summer')
    elif month in [9,10,11]:
        seasons.append('fall')
reduced_features = features[['temp_1','prcp_1','average','actual']].copy()  # .copy() avoids SettingWithCopyWarning
reduced_features['season'] = seasons
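An equivalent, more compact way to derive the season is to map month numbers to season names directly; a sketch (the season_map dictionary spells out the same rule as the loop above):

# Map month number -> season name, then apply it to the whole column at once
season_map = {12: 'winter', 1: 'winter', 2: 'winter',
              3: 'spring', 4: 'spring', 5: 'spring',
              6: 'summer', 7: 'summer', 8: 'summer',
              9: 'fall', 10: 'fall', 11: 'fall'}
reduced_features['season'] = features['month'].map(season_map)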
import seaborn as sns
sns.set(style='ticks',color_codes=True)
# Color palette
palette = sns.xkcd_palette(['dark blue','dark green','gold','orange'])
# Pairplot
sns.pairplot(reduced_features,hue='season',diag_kind='kde',palette=palette,plot_kws=dict(alpha=0.7),diag_kws=dict(shade=True))
<seaborn.axisgrid.PairGrid at 0x232067fa190>

[Figure: seaborn pairplot of temp_1, prcp_1, average, and actual, colored by season]

How to read the pairplot: the colors encode the season (the hue argument); the diagonal shows each variable's distribution by season; the off-diagonal panels are scatter plots that show the pairwise relationships between features.
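To put numbers behind those scatter plots, a quick complementary check is the correlation matrix of the numeric columns; a sketch using the reduced_features table defined above:

# Pearson correlations between the numeric columns shown in the pairplot
print(reduced_features[['temp_1', 'prcp_1', 'average', 'actual']].corr())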

How the amount of data affects the results

# One-hot encoding
features = pd.get_dummies(features)
# Separate features and labels
labels = features['actual']
features = features.drop('actual',axis=1)
# Keep the feature names for later
feature_list = list(features.columns)
# Convert to numpy arrays
import numpy as np
features = np.array(features)
labels = np.array(labels)
# Train/test split
from sklearn.model_selection import train_test_split
train_features,test_features,train_labels,test_labels = train_test_split(features,labels,test_size=0.25,random_state=0)
print("Training features:",train_features.shape)
print("Training labels:",train_labels.shape)
print("Testing features:",test_features.shape)
print("Testing labels:",test_labels.shape)
Training features: (1643, 17)
Training labels: (1643,)
Testing features: (548, 17)
Testing labels: (548,)

Although these are new samples, the comparison has to be made on the same test set to be fair.

# Re-process the original dataset for comparison
import pandas as pd
import numpy as np
# Keep the feature set consistent (drop the columns the old data does not have)
original_feature_indices = [feature_list.index(feature) for feature in feature_list if feature not in ['ws_1','prcp_1','snwd_1']]
# Reload the old data
original_features = pd.read_csv('temps.csv')
original_features = pd.get_dummies(original_features)
# Convert features and labels
original_labels = np.array(original_features['actual'])
original_features = original_features.drop('actual',axis=1)
original_feature_list = list(original_features.columns)
original_features = np.array(original_features)

Splitting the original data

# Train/test split
from sklearn.model_selection import train_test_split
original_train_features,original_test_features,original_train_labels,original_test_labels = train_test_split(original_features,original_labels,test_size=0.25,random_state=42)
# Modeling
from sklearn.ensemble import RandomForestRegressor
# Same parameters and random seed
rf = RandomForestRegressor(n_estimators=100,random_state=0)
# Fit on the old (small) dataset
rf.fit(original_train_features,original_train_labels)
# Evaluate on the shared test set, for a fair comparison
predictions = rf.predict(test_features[:,original_feature_indices])
errors = abs(predictions-test_labels)
print('Average temperature error:',round(np.mean(errors),2),'degrees')
mape = 100 *(errors/test_labels)
# Define an accuracy figure to make the comparison easier to read
accuracy = 100 -np.mean(mape)
print('Accuracy:',round(accuracy,2),'%')
Average temperature error: 4.68 degrees
Accuracy: 92.19 %

That is the result with the smaller sample. Does training on more samples do better?

from sklearn.ensemble import RandomForestRegressor
# Keep the feature set consistent: restrict the new data to the original feature columns
original_train_changed_features = train_features[:,original_feature_indices]
original_test_changed_features = test_features[:,original_feature_indices]
rf = RandomForestRegressor(n_estimators=100,random_state=0)
rf.fit(original_train_changed_features,train_labels)
# Predict
baseline_predictions = rf.predict(original_test_changed_features)
# Evaluate
baseline_errors = abs(baseline_predictions-test_labels)
print('Average temperature error:',round(np.mean(baseline_errors),2),'degrees')
baseline_mape = 100 * np.mean(baseline_errors/test_labels)
# Accuracy
baseline_accuracy = 100 - baseline_mape
print('Accuracy:',round(baseline_accuracy,2),'%')
Average temperature error: 4.2 degrees
Accuracy: 93.12 %

How the number of features affects the results

Now add the new weather features (wind speed, precipitation, snow depth) that the previous two comparisons left out.

from sklearn.ensemble import RandomForestRegressor
rf_exp = RandomForestRegressor(n_estimators=100,random_state=0)
rf_exp.fit(train_features,train_labels)
# Same test set
predictions = rf_exp.predict(test_features)
# Evaluate
errors = abs(predictions - test_labels)
print('Average temperature error:',round(np.mean(errors),2),"degrees")
mape = np.mean(100*(errors/test_labels))
improvement_baseline = 100 * abs(mape-baseline_mape) / baseline_mape
print('Change in model performance after adding features:',round(improvement_baseline,2),'%')
# Accuracy
accuracy = 100 - mape
print('Accuracy:',round(accuracy,2),'%')
Average temperature error: 4.05 degrees
Change in model performance after adding features: 3.34 %
Accuracy: 93.35 %
importances = list(rf_exp.feature_importances_)
# Pair feature names with importances
feature_importances = [(feature,round(importance,2)) for feature,importance in zip(feature_list,importances)]
# Sort
feature_importances = sorted(feature_importances,key=lambda x:x[1],reverse=True)
# Print
for pair in feature_importances:
    print('Variable:{:20} Importance: {}'.format(*pair))
Variable:temp_1               Importance: 0.85
Variable:average              Importance: 0.05
Variable:ws_1                 Importance: 0.02
Variable:friend               Importance: 0.02
Variable:year                 Importance: 0.01
Variable:month                Importance: 0.01
Variable:day                  Importance: 0.01
Variable:prcp_1               Importance: 0.01
Variable:temp_2               Importance: 0.01
Variable:snwd_1               Importance: 0.0
Variable:weekday_Fri          Importance: 0.0
Variable:weekday_Mon          Importance: 0.0
Variable:weekday_Sat          Importance: 0.0
Variable:weekday_Sun          Importance: 0.0
Variable:weekday_Thurs        Importance: 0.0
Variable:weekday_Tues         Importance: 0.0
Variable:weekday_Wed          Importance: 0.0





# Plot the feature importances
plt.style.use('fivethirtyeight')
x_values = list(range(len(importances)))
plt.bar(x_values,importances,orientation="vertical",color="r",edgecolor="k",linewidth=1.2)
plt.xticks(x_values,feature_list,rotation='vertical')
plt.ylabel('Importance')
plt.xlabel('Variable')
plt.title('Variable Importances')
Text(0.5, 1.0, 'Variable Importances')

[Figure: feature importance bar chart for the extended dataset]

How many features to keep is not obvious. One approach: sort the features by importance, compute the cumulative sum, set a threshold, and see how many features it takes for the cumulative importance to cross it.

sorted_importances = [importance[1] for importance in feature_importances]
sorted_features = [importance[0] for importance in feature_importances]
# Cumulative importance
cumulative_importances = np.cumsum(sorted_importances)
# Line plot with a 0.95 threshold line
plt.plot(x_values,cumulative_importances,'g-')
plt.hlines(y=0.95,xmin=0,xmax=len(sorted_importances),color='r',linestyles='dashed')
plt.xticks(x_values,sorted_features,rotation='vertical')
plt.xlabel('Variable');plt.ylabel('Cumulative Importance')
plt.title('Cumulative Importances')
Text(0.5, 1.0, 'Cumulative Importances')

[Figure: cumulative feature importance with a 0.95 threshold line]

With the top five features, the cumulative importance exceeds 0.95. Build a model using only these five features and see how it performs.
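Rather than reading the count off the plot, the number of features needed can be computed directly; a small sketch using the arrays defined above:

# Index of the first point where cumulative importance crosses 0.95; plus one gives the count
n_needed = int(np.argmax(cumulative_importances > 0.95)) + 1
print(n_needed, sorted_features[:n_needed])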

important_feature_names = [feature[0] for feature in feature_importances[0:5]]
# Column indices of those features
important_indices = [feature_list.index(feature) for feature in important_feature_names]
# Restrict the training and test sets to the important features
important_train_features = train_features[:,important_indices]
important_test_features = test_features[:,important_indices]
# Shapes
print("important train features shape:",important_train_features.shape)
print("important test features shape:",important_test_features.shape)
# Retrain
rf_exp.fit(important_train_features,train_labels)
# Same test set
predictions = rf_exp.predict(important_test_features)
# Evaluate
errors = abs(predictions-test_labels)
print('Average temperature error:',round(np.mean(errors),2),"degrees")
mape = 100*(errors/test_labels)
accuracy = 100 - np.mean(mape)
print('Accuracy:',round(accuracy,2),"%")
important train features shape: (1643, 5)
important test features shape: (548, 5)
Average temperature error: 4.11 degrees
Accuracy: 93.28 %

Accuracy did not improve, so check whether the reduced model at least saves time.

import time
all_features_time = []
# Average the time of 10 training + prediction runs with all features
for _ in range(10):
    start_time = time.time()
    rf_exp.fit(train_features,train_labels)
    all_features_predictions = rf_exp.predict(test_features)
    end_time = time.time()
    all_features_time.append(end_time-start_time)

all_features_time = np.mean(all_features_time)
print("Average train + test time with all features:",round(all_features_time,2),'s')
Average train + test time with all features: 0.69 s
# Timing when training with only the important features
reduced_features_time = []
for _ in range(10):
    start_time = time.time()
    rf_exp.fit(important_train_features,train_labels)
    reduced_features_predictions = rf_exp.predict(important_test_features)
    end_time = time.time()
    reduced_features_time.append(end_time-start_time)

reduced_features_time = np.mean(reduced_features_time)
print("Average train + test time with important features:",round(reduced_features_time,2),'s')
Average train + test time with important features: 0.42 s
# Timing for the original (small-data) model
original_features_time = []
for _ in range(10):
    start_time = time.time()
    rf.fit(original_train_features,original_train_labels)
    original_features_predictions = rf.predict(test_features[:,original_feature_indices])
    end_time = time.time()
    original_features_time.append(end_time - start_time)
original_features_time = np.mean(original_features_time)

print("Average train + test time for the original model:",round(original_features_time,2),'s')
Average train + test time for the original model: 0.18 s
# Compare accuracy and runtime
all_accuracy = 100 * (1-np.mean(abs(all_features_predictions-test_labels)/test_labels))
reduced_accuracy = 100 * (1-np.mean(abs(reduced_features_predictions-test_labels)/test_labels))

# Save and display the results
comparison = pd.DataFrame({'features':['all(17)','reduced(5)'],
                           'runtime':[round(all_features_time,2),round(reduced_features_time,2)],
                           'accuracy':[round(all_accuracy,2),round(reduced_accuracy,2)]})
comparison[['features','accuracy','runtime']]
     features  accuracy  runtime
0     all(17)     93.35     0.69
1  reduced(5)     93.28     0.42

Comparing the two models

The accuracy figure is only a convenience metric we defined ourselves for comparison. As the table shows, accuracy barely changes, while the runtime improves substantially.

# Runtime may matter more than a tiny accuracy difference
relative_accuracy_decrease =  100 * (all_accuracy - reduced_accuracy) / all_accuracy
print('Relative accuracy drop:',round(relative_accuracy_decrease,3),"%")
relative_runtime_decrease = 100 * (all_features_time - reduced_features_time) / all_features_time
print("Relative runtime reduction:",round(relative_runtime_decrease,3),"%")
Relative accuracy drop: 0.071 %
Relative runtime reduction: 39.17 %

Summary of all the experiments

# MAE of the original (small-data) model
original_mae = np.mean(abs(original_features_predictions - test_labels))
# MAE with all features on the extended data
exp_all_mae = np.mean(abs(all_features_predictions - test_labels))
# MAE with only the important features
exp_reduced_mae = np.mean(abs(reduced_features_predictions - test_labels))
# Accuracy of the original model
original_accuracy = 100 * (1 - np.mean(abs(original_features_predictions - test_labels) / test_labels))
model_comparison = pd.DataFrame({'model': ['original', 'exp_all', 'exp_reduced'],
                                'error (degrees)': [original_mae, exp_all_mae, exp_reduced_mae],
                                'accuracy': [original_accuracy, all_accuracy, reduced_accuracy],
                                'run_time (s)': [original_features_time, all_features_time, reduced_features_time]})
# Gather all the experiment results into one figure
fig, (ax1,ax2,ax3) = plt.subplots(nrows=1,ncols=3,figsize=(16,5),sharex=True)
# X axis
x_values = [0,1,2]
labels = list(model_comparison['model'])
plt.xticks(x_values,labels)
# Font sizes
fontdict = {'fontsize':18}
fontdict_yaxis = {'fontsize':14}
# Error between predicted and true temperatures
ax1.bar(x_values,model_comparison['error (degrees)'], color=['b','r','g'],edgecolor='k',linewidth=1.5)
ax1.set_ylim(bottom=3.5, top=4.5)
ax1.set_ylabel('Error (degree) (F)',fontdict=fontdict_yaxis)
ax1.set_title('Model Error Comparison',fontdict=fontdict)
# Accuracy comparison
ax2.bar(x_values,model_comparison['accuracy'],color=['b','r','g'],edgecolor='k',linewidth=1.5)
ax2.set_ylim(bottom=92, top=94)
ax2.set_ylabel('Accuracy (%)',fontdict=fontdict_yaxis)
ax2.set_title('Model Accuracy Comparison',fontdict=fontdict)
# Runtime comparison
ax3.bar(x_values,model_comparison['run_time (s)'], color=['b','r','g'],edgecolor='k',linewidth=1.5)
ax3.set_ylim(bottom=0,top=1)
ax3.set_ylabel('run_time (s)',fontdict=fontdict_yaxis)
ax3.set_title('Model Run-Time Comparison',fontdict=fontdict)
plt.show()

[Figure: bar charts comparing model error, accuracy, and run time across the three experiments]

Hyperparameter tuning

from sklearn.ensemble import RandomForestRegressor
from pprint import pprint
rf = RandomForestRegressor(random_state=42)
pprint(rf.get_params())
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

Randomized search

The candidate values for each parameter should reflect the problem at hand; each parameter's range needs to be chosen with care.

from sklearn.model_selection import RandomizedSearchCV
# Number of trees
n_estimators = [int(x) for x in np.linspace(start=200,stop=2000,num=10)]
# How the number of features per split is chosen
max_features = ['auto','sqrt']
# Maximum tree depth
max_depth = [int(x) for x in np.linspace(10,20,num=2)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2,5,10]
# Minimum number of samples required at a leaf node
min_samples_leaf = [1,2,4]
# Whether to bootstrap the samples
bootstrap = [True,False]
# Random parameter space
random_grid ={'n_estimators':n_estimators,
             'max_features':max_features,
             'max_depth':max_depth,
             'min_samples_split':min_samples_split,
             'min_samples_leaf':min_samples_leaf,
             'bootstrap':bootstrap}
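For context, this parameter space is fairly large, which is why the randomized search below only samples 100 combinations from it; a quick sketch to count the full grid:

# Total number of parameter combinations in random_grid
n_combinations = int(np.prod([len(v) for v in random_grid.values()]))
print(n_combinations)  # 10 * 2 * 3 * 3 * 3 * 2 = 1080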

Sampling random parameter combinations

rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator=rf, # model to tune
                            param_distributions=random_grid, # candidate parameter space
                            n_iter=100, # number of random parameter combinations to try; the best of these 100 is kept
                            scoring='neg_mean_absolute_error', # evaluation metric
                            cv=3, # cross-validation folds
                            verbose=2, # verbosity of the progress output
                            random_state=42, # random seed
                            n_jobs=-1) # number of parallel jobs; -1 uses all cores
# Start the search
rf_random.fit(train_features,train_labels)
rf_random.best_params_
# Evaluation helper
def evaluate(model,test_features,test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('Average temperature error:',np.mean(errors))
    print('Accuracy = {:0.2f}%'.format(accuracy))
# Results with default parameters
base_model = RandomForestRegressor(random_state=42)
base_model.fit(train_features,train_labels)
evaluate(base_model,test_features,test_labels)
# Best estimator found by the randomized search
best_random = rf_random.best_estimator_
evaluate(best_random,test_features,test_labels)

Grid search

A grid search sweeps exhaustively over a grid of parameter values; here the grid is centred on the rough ranges suggested by the randomized search, whose best parameters were:

{'n_estimators': 1800,
 'min_samples_split': 10,
 'min_samples_leaf': 4,
 'max_features': 'auto',
 'max_depth': None,
 'bootstrap': True}
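If the search itself is too slow to rerun, the dictionary above can be plugged straight back into a fresh model; a sketch that reuses the evaluate helper and the train/test arrays defined earlier (up to random-seed effects it should behave like rf_random.best_estimator_):

best_params = {'n_estimators': 1800, 'min_samples_split': 10, 'min_samples_leaf': 4,
               'max_features': 'auto', 'max_depth': None, 'bootstrap': True}
rf_best_random = RandomForestRegressor(random_state=42, **best_params)
rf_best_random.fit(train_features, train_labels)
evaluate(rf_best_random, test_features, test_labels)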
from sklearn.model_selection import GridSearchCV
# Candidate parameter grid
param_grid = {
    'n_estimators':[1600,1700,1800,1900,2000],
    'max_features':['auto'],
    'max_depth':[8,10,12],
    'min_samples_split':[3,5,7],
    'min_samples_leaf':[2,3,4,5,6],
    'bootstrap':[True]
}
# Base model
rf = RandomForestRegressor()
# Grid search
grid_search = GridSearchCV(estimator=rf,
                          param_grid=param_grid,
                          scoring='neg_mean_absolute_error',
                          cv=3,
                          n_jobs=-1,
                          verbose=2)
# Start the search
grid_search.fit(train_features,train_labels)

When the grid is too large and there are too many combinations to sweep, split the candidates into smaller groups, search each group separately, and compare the best result from each group (a sketch follows).
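A sketch of that idea, splitting the grid by n_estimators (the group boundaries are arbitrary) and keeping the best of the group winners:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Two smaller sub-grids instead of one large grid
sub_grids = [
    {'n_estimators': [1600, 1700], 'max_depth': [8, 10, 12], 'min_samples_split': [3, 5, 7]},
    {'n_estimators': [1800, 1900, 2000], 'max_depth': [8, 10, 12], 'min_samples_split': [3, 5, 7]},
]

best_score, best_params = None, None
for grid in sub_grids:
    search = GridSearchCV(RandomForestRegressor(random_state=0), grid,
                          scoring='neg_mean_absolute_error', cv=3, n_jobs=-1)
    search.fit(train_features, train_labels)
    # Keep the winner across all sub-grids
    if best_score is None or search.best_score_ > best_score:
        best_score, best_params = search.best_score_, search.best_params_

print(best_score, best_params)

Note that GridSearchCV also accepts a list of dicts as param_grid directly, which searches the union of the sub-grids in a single call.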

Bayesian optimization

from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.model_selection import cross_val_score
def hyperopt_train_test(params):
    clf = RandomForestRegressor(**params)
    # cross_val_score uses the regressor's default scorer here, i.e. R^2
    return cross_val_score(clf,train_features,train_labels).mean()
max_depth = [i for i in range(10,20)]
# max_depth.append(None)
space4rf = {
    'max_depth': hp.choice('max_depth', max_depth),
    'max_features': hp.choice('max_features', ['auto','sqrt']),
    'min_samples_split':hp.choice('min_samples_split',range(5,20)),
    'min_samples_leaf':hp.choice('min_samples_leaf',range(2,10)),
    'n_estimators': hp.choice('n_estimators', range(1000,2000)),
    'bootstrap':hp.choice('bootstrap',[True,False])
}

best = 0
def f(params):
    global best
    acc = hyperopt_train_test(params)
    if acc > best:
        best = acc
        print('new best:', best, params)
    return {'loss': -acc, 'status': STATUS_OK}

trials = Trials()
best = fmin(f, space4rf, algo=tpe.suggest, max_evals=100, trials=trials)
print("best:",best)
new best:                                                                                                              
0.8670547390932652                                                                                                     
{'bootstrap': True, 'max_depth': 18, 'max_features': 'auto', 'min_samples_leaf': 4, 'min_samples_split': 11, 'n_estimators': 1942}
new best:                                                                                                              
0.8679298658104889                                                                                                     
{'bootstrap': True, 'max_depth': 16, 'max_features': 'auto', 'min_samples_leaf': 7, 'min_samples_split': 10, 'n_estimators': 1734}
new best:                                                                                                              
0.8684034111523674                                                                                                     
{'bootstrap': True, 'max_depth': 14, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 19, 'n_estimators': 1766}
new best:                                                                                                              
0.8685636302610934                                                                                                     
{'bootstrap': True, 'max_depth': 16, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 5, 'n_estimators': 1439}
new best:                                                                                                              
0.8685767383919801                                                                                                     
{'bootstrap': True, 'max_depth': 16, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 5, 'n_estimators': 1404}
new best:                                                                                                              
0.8685919671759731                                                                                                     
{'bootstrap': True, 'max_depth': 19, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 15, 'n_estimators': 1830}
new best:                                                                                                              
0.8686049353034605                                                                                                     
{'bootstrap': True, 'max_depth': 19, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 16, 'n_estimators': 1099}
new best:                                                                                                              
0.8686240725941452                                                                                                     
{'bootstrap': True, 'max_depth': 19, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 16, 'n_estimators': 1088}
100%|█████████████████████████████████████████████| 100/100 [39:13<00:00, 26.47s/trial, best loss: -0.8686240725941452]
best: {'bootstrap': 0, 'max_depth': 9, 'max_features': 0, 'min_samples_leaf': 7, 'min_samples_split': 11, 'n_estimators': 88}
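Note that fmin reports index positions for hp.choice parameters, not the values themselves: 'max_depth': 9 above is index 9 of the max_depth list (i.e. 19), and 'n_estimators': 88 is an offset into range(1000, 2000) (i.e. 1088), matching the last "new best" trial. hyperopt's space_eval converts the result back into actual parameter values:

from hyperopt import space_eval
# Translate the choice indices returned by fmin back into real parameter values
print(space_eval(space4rf, best))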