You are welcome to read the original post on my personal blog: https://xkw168.github.io/2019/05/20/时间序列预测-三-Xgboost模型.html
XGBoost (Extreme Gradient Boosting) is a particular flavor of gradient boosted decision trees (GBDT, Gradient Boosting Decision Tree) that pushes speed and efficiency to the extreme, hence the "X" (Extreme) in the name. At its core, XGBoost is still a tree-based ensemble learning method whose base learner is the classification and regression tree (CART, Classification and Regression Tree). Like locally weighted linear regression, tree-based regression is a family of local regression methods: the data set is partitioned into several regions and a separate model is fit on each region. The difference is that tree-based regression is a parameter-based learning method; once the model has been trained on the training data, its parameters are fixed and do not need to change. CART is a decision-tree structure that can be used for both classification and regression, and it was selected by the authoritative academic organization The IEEE International Conference on Data Mining (ICDM) as one of the top ten classic algorithms in data mining.
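As a minimal illustration of this "partition the input space and model each region separately" behavior, a CART regression tree can be fit with scikit-learn's `DecisionTreeRegressor`; the toy data and tree depth below are assumptions of this sketch, not part of the original post. The prediction is piecewise constant, one constant per leaf region:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy 1-D regression data: a noisy sine wave
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

# a CART regression tree: each leaf stores the mean of its region,
# so the fitted function is piecewise constant over the input space
tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X, y)

print(tree.predict([[1.0], [2.5], [4.0]]))
```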
The core of the CART algorithm covers the following three aspects:
Generating a decision tree is the process of recursively constructing a binary decision tree: starting from the input space spanned by the training data, each region is recursively split into two sub-regions and the output value of each sub-region is determined. The splitting criterion depends on the type of tree: regression trees use the squared-error minimization criterion, while classification trees use the Gini-index minimization criterion. A regression tree is generated by the following steps:
1. Select the optimal splitting variable $j$ and split point $s$ by solving
   $$\min_{j,s}\Big[\min_{c_1}\sum_{x_i \in R_1(j,s)} (y_i-c_1)^2 + \min_{c_2}\sum_{x_j \in R_2(j,s)} (y_j-c_2)^2\Big]$$
   Traverse the splitting variable $j$; for each fixed $j$, scan the split point $s$ and choose the pair $(j, s)$ that minimizes the expression above. Here $R_m$ is a region of the partitioned input space, and $c_m$ is the output value associated with region $R_m$ (a brute-force sketch of this search is given after the list).
2. With the chosen $(j, s)$, partition the input space and compute the output value of each region:
   $$R_1(j,s)=\{x \mid x^{(j)} \leq s\}, \qquad R_2(j,s)=\{x \mid x^{(j)} > s\}$$
   $$\hat{c}_m = \frac{1}{N_m} \sum_{x_i \in R_m(j,s)} y_i, \qquad x \in R_m,\ m = 1,2$$
3. Repeat the two steps above on each sub-region until a stopping condition is met. The input space is then divided into $M$ regions $R_1, R_2, \ldots, R_M$, and the resulting regression tree is
   $$f(x) = \sum_{m=1}^{M} \hat{c}_m \, I(x \in R_m)$$
4. The prediction error of the tree on the training data is measured by the squared-error loss
   $$L(y, f(x)) = \sum_{x_i \in R_m} (y_i - f(x_i))^2$$
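To make step 1 concrete, here is a minimal brute-force sketch of the split search using NumPy only; the function name, the toy data and the exhaustive scan over unique feature values are assumptions of this illustration, not code from the original post:

```python
import numpy as np

def best_split(X, y):
    """Exhaustively search for the (j, s) that minimizes the total
    squared error of the two child regions R1(j, s) and R2(j, s)."""
    best = (None, None, np.inf)
    n, d = X.shape
    for j in range(d):                      # traverse splitting variables j
        for s in np.unique(X[:, j]):        # scan candidate split points s
            left = y[X[:, j] <= s]          # targets falling in R1(j, s)
            right = y[X[:, j] > s]          # targets falling in R2(j, s)
            if len(left) == 0 or len(right) == 0:
                continue
            # the region means are the optimal c1, c2 for squared error
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best[2]:
                best = (j, s, err)
    return best

# toy example: two clearly separated clusters along the only feature
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.2, 0.9, 5.0, 5.1, 4.9])
print(best_split(X, y))   # expects a split around x <= 3
```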
A single classification and regression tree has limited accuracy and a limited range of applications, so XGBoost introduces ensemble learning (boosting) on top of CART and uses techniques such as parallel computation to greatly accelerate training. The core idea of boosting is that the prediction equals the sum of the outputs of all weak learners: each new weak learner is fit to the gradient/residual of the loss function with respect to the current prediction (this gradient/residual is the error between the prediction and the true value), so the residual keeps shrinking until the model meets the required error tolerance (as illustrated in the figure).
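The residual-fitting loop itself is short. Below is a minimal sketch of boosting under the squared loss, using scikit-learn regression trees as the weak learners; the learning rate, tree depth, number of rounds and toy data are all assumptions of this illustration, not XGBoost's actual internals (XGBoost additionally uses second-order gradients, regularization and parallelized split finding):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(100, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(100)

n_rounds, lr = 50, 0.1
pred = np.zeros_like(y)            # the ensemble prediction starts at 0
trees = []
for _ in range(n_rounds):
    residual = y - pred            # for squared loss the negative gradient is the residual
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += lr * tree.predict(X)   # add the weak learner's (scaled) contribution
    trees.append(tree)

print("final training MSE:", np.mean((y - pred) ** 2))
```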
pip install xgboost
```python
import pandas as pd
from xgboost import XGBRegressor, plot_importance

# now(), generate_time_series() and delta_dict are small helper utilities
# defined elsewhere in this series (timestamp naming, date-range generation,
# and a freq -> timedelta mapping respectively).


def xgboost_predict(train_data, evaluation_data, forecast_cnt=365, freq="D",
                    importance_fig=False, model_dir=""):
    """
    predict time series with XGBoost library which is based on
    Gradient Boost and CART(classification and regression tree)
    :param train_data: data use to train the model
    :param evaluation_data: data use to evaluate the model
    :param forecast_cnt: how many point needed to be predicted
    :param freq: the interval between time index
    :param importance_fig: whether plot importance of each feature
    :param model_dir: directory of pre-trained model(checkpoint, params)
    :return:
    """
    def create_features(df, label=None):
        """
        Creates time series features from datetime index
        """
        df['date'] = df.index
        df['hour'] = df['date'].dt.hour
        df['dayofweek'] = df['date'].dt.dayofweek
        df['quarter'] = df['date'].dt.quarter
        df['month'] = df['date'].dt.month
        df['year'] = df['date'].dt.year
        df['dayofyear'] = df['date'].dt.dayofyear
        df['dayofmonth'] = df['date'].dt.day
        # weekofyear is deprecated in newer pandas; dt.isocalendar().week is the modern equivalent
        df['weekofyear'] = df['date'].dt.weekofyear
        X = df[['hour', 'dayofweek', 'quarter', 'month', 'year',
                'dayofyear', 'dayofmonth', 'weekofyear']]
        if label:
            y = df[label]
            return X, y
        return X

    model_directory = "./model/XGBoost_%s" % now()
    params = {
    }
    # if there is a pre-trained model, use parameters from it
    if model_dir:
        model_directory = model_dir

    latest_date = evaluation_data["ds"].tolist()[-1]
    # set index with datetime
    train_data = train_data.set_index(pd.DatetimeIndex(train_data["ds"]))
    evaluation_data = evaluation_data.set_index(pd.DatetimeIndex(evaluation_data["ds"]))
    forecast_data = pd.DataFrame.from_dict({
        "ds": generate_time_series(start_date=latest_date, cnt=forecast_cnt, delta=delta_dict[freq])
    })
    forecast_data = forecast_data.set_index(pd.DatetimeIndex(forecast_data["ds"]))

    x_train, y_train = create_features(train_data, label='y')
    x_eval, y_eval = create_features(evaluation_data, label="y")
    x_forecast = create_features(forecast_data)

    reg = XGBRegressor(n_estimators=1000)
    if model_dir:
        reg.load_model(model_directory)
    else:
        # newer xgboost versions take early_stopping_rounds in the constructor instead
        reg.fit(x_train, y_train,
                eval_set=[(x_train, y_train), (x_eval, y_eval)],
                early_stopping_rounds=50,
                verbose=False)  # Change verbose to True if you want to see it train
        reg.save_model(model_directory)

    if importance_fig:
        plot_importance(reg, height=0.9)

    evaluation_data["y"] = reg.predict(x_eval)
    forecast_data["y"] = reg.predict(x_forecast)

    return evaluation_data, forecast_data
```
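As a hedged usage sketch, the function expects DataFrames containing a datetime column `ds` and a target column `y` (the convention used by the other models in this series); the CSV file name and the 90/10 split below are placeholders:

```python
import pandas as pd

# hypothetical input: a CSV with a date column "ds" and a value column "y"
df = pd.read_csv("series.csv", parse_dates=["ds"])
split = int(len(df) * 0.9)
train_df, eval_df = df.iloc[:split].copy(), df.iloc[split:].copy()

eval_result, forecast_result = xgboost_predict(
    train_df, eval_df, forecast_cnt=90, freq="D", importance_fig=True
)
print(forecast_result[["ds", "y"]].head())
```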
Note that this example assumes there is only one time series. In practice, several other series may be correlated with the series being analyzed; by following the pattern above, the code can be extended to multi-feature analysis and forecasting (here only time-derived features are used).
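As a hedged illustration of that extension, the sketch below adds arbitrary exogenous columns to the feature matrix; `create_features_multi` and the `temperature` column are hypothetical names, not part of the original code:

```python
def create_features_multi(df, extra_cols=None, label=None):
    """Calendar features as in create_features, plus optional
    exogenous columns (e.g. other related series) named in extra_cols."""
    df['date'] = df.index
    df['dayofweek'] = df['date'].dt.dayofweek
    df['month'] = df['date'].dt.month
    df['year'] = df['date'].dt.year
    df['dayofyear'] = df['date'].dt.dayofyear
    feature_cols = ['dayofweek', 'month', 'year', 'dayofyear'] + list(extra_cols or [])
    X = df[feature_cols]
    if label:
        return X, df[label]
    return X

# hypothetical usage: "temperature" is another series aligned on the same datetime index
# x_train, y_train = create_features_multi(train_data, extra_cols=["temperature"], label="y")
```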