After a model has been selected and trained, how do we judge its strengths and improve its performance? Optimization is usually approached from the following aspects:
Learning curves fall into three typical cases: underfitting (high bias), overfitting (high variance), and a good fit.
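Which of these cases a model falls into can be checked by plotting learning curves. The snippet below is a minimal sketch, not part of the original walkthrough; the estimator and the synthetic data are assumptions used only for illustration.
# Minimal learning-curve sketch (the estimator and synthetic data are illustrative assumptions)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
sizes, train_scores, valid_scores = learning_curve(
    Ridge(alpha=1.0), X, y, cv=5, scoring='neg_mean_squared_error',
    train_sizes=np.linspace(0.1, 1.0, 5))

plt.plot(sizes, -train_scores.mean(axis=1), 'o-', label='train MSE')
plt.plot(sizes, -valid_scores.mean(axis=1), 'o-', label='validation MSE')
# a large, persistent gap between the two curves suggests overfitting (high variance);
# two curves that converge at a high error suggest underfitting (high bias)
plt.xlabel('training set size')
plt.ylabel('MSE')
plt.legend()
plt.show()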
Model fusion (ensembling) first produces a group of individual learners and then combines them with some strategy to strengthen the overall result. As the number of individual classifiers in the ensemble grows, the error rate of the ensemble drops exponentially and, in the idealized case where the individual learners make independent errors, tends toward zero.
Combining individual learners reduces prediction error and improves the overall model. The more accurate and the more diverse the individual learners are, the greater the improvement obtained from fusion.
By the relationship between the individual learners, model fusion techniques fall into two categories:
(1) Parallel methods, where the individual learners have no strong dependencies and can be generated simultaneously, such as Bagging and random forests.
(2) Sequential methods, where the individual learners have strong dependencies and must be generated one after another, such as Boosting.
Well-known Boosting algorithms include AdaBoost and the boosting tree (Boosting Tree) family. Within the boosting tree family, the most widely used is the gradient boosting tree (Gradient Boosting Tree). Each is briefly introduced below:
(1) AdaBoost: the binary-classification algorithm obtained when the model is an additive model, the loss function is the exponential loss, and the learning algorithm is the forward stagewise algorithm.
(2) Boosting tree: an additive model trained with the forward stagewise algorithm, with the base learner restricted to decision trees. For binary classification with the exponential loss, it is simply AdaBoost with the base learner restricted to binary decision trees; for regression with the squared-error loss, each new tree fits the residuals of the current model.
(3) Gradient boosting tree: an improvement on the boosting tree. The boosting tree is only convenient when the loss function is the exponential or squared-error loss; for a general loss function, the gradient boosting tree uses the negative gradient of the loss, evaluated at the current model, as an approximation to the residuals.
Note: GBDT can be described from two viewpoints. One is the residual-based version: when the loss is the squared-error loss, the negative gradient with respect to the prediction is $2(y-\hat{y})$, so the quantity being fitted is simply the residual between the true value and the prediction. The other is the gradient-based version, used when the loss is any other function, such as the cross-entropy loss. In short, both viewpoints are gradient-based; with the squared-error loss the gradient just happens to reduce to the residual.
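This equivalence is easy to check numerically; the following minimal sketch (not part of the original example) shows that the negative gradient of the squared-error loss is just a scaled residual.
# For the squared-error loss L = (y - y_hat)**2, the negative gradient with respect to
# y_hat is 2*(y - y_hat), i.e. proportional to the residual y - y_hat.
import numpy as np

y = np.array([3.0, -1.0, 2.5])       # true values (illustrative)
y_hat = np.array([2.0, 0.0, 2.0])    # current model predictions (illustrative)

residual = y - y_hat                  # residual view
neg_gradient = 2 * (y - y_hat)        # gradient view
print(residual)                       # [ 1.  -1.   0.5]
print(neg_gradient)                   # [ 2.  -2.   1. ] -- same direction, so the fitted trees are the same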
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import itertools
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import EnsembleVoteClassifier
from mlxtend.data import iris_data
from mlxtend.plotting import plot_decision_regions

clf1 = LogisticRegression(random_state=0, solver='lbfgs', multi_class='auto')
clf2 = RandomForestClassifier(random_state=0, n_estimators=100)
clf3 = SVC(random_state=0, probability=True, gamma='auto')
eclf = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3], weights=[2, 1, 1], voting='soft')

x, y = iris_data()
x = x[:, [0, 2]]

gs = gridspec.GridSpec(1, 4)
fig = plt.figure(figsize=(16, 4))
for clf, lab, grd in zip(
        [clf1, clf2, clf3, eclf],
        ['Logistic Regression', 'RandomForest', 'RBF kernel SVM', 'Ensemble'],
        itertools.product([0, 1], repeat=2)):
    clf.fit(x, y)
    ax = plt.subplot(gs[0, grd[0] * 2 + grd[1]])
    fig = plot_decision_regions(X=x, y=y, clf=clf, legend=2)
    plt.title(lab)
plt.show()
The output visualizes how the data set is classified by the logistic regression, random forest, and SVM models, and by the fused ensemble:
Ranking follows the same idea as Averaging. Because simple averaging of raw scores has certain problems, the ranks are averaged instead. If weights are used, the final result is the weighted sum of the n models' ranks.
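A minimal sketch of weighted rank averaging follows; the scores and weights are made-up illustration values, not taken from the original text.
# Weighted rank averaging sketch (illustrative values only)
import numpy as np
import pandas as pd

# hypothetical scores from three models for five samples
preds = pd.DataFrame({'model_a': [0.2, 0.8, 0.5, 0.9, 0.1],
                      'model_b': [0.3, 0.7, 0.6, 0.8, 0.2],
                      'model_c': [0.1, 0.9, 0.4, 0.7, 0.3]})
weights = np.array([2.0, 1.0, 1.0])
weights = weights / weights.sum()     # normalize the model weights

ranks = preds.rank(axis=0)            # replace each model's scores by their ranks
rank_avg = (ranks * weights).sum(axis=1)
print(rank_avg)                       # the weighted average rank is the fused score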
Advantages of Blending: Blending is simpler than Stacking (no k-fold cross-validation is needed to obtain the stacker features) and avoids some information leakage, because the generalizers and the stacker use different data sets.
Disadvantages of Blending:
(1) Very little data is used (the second-stage blender uses only 10% of the training set).
(2) The blender may overfit.
Note: in practice, Stacking and Blending produce roughly comparable results.
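A minimal Blending sketch is given below; the models, the hold-out fraction, and the synthetic data are assumptions chosen only to illustrate the two-stage split described above.
# Blending sketch: first-level generalizers and the second-level blender see different data
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=20, noise=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# hold out 10% of the training data for the blender
X_base, X_hold, y_base, y_hold = train_test_split(X_train, y_train, test_size=0.1, random_state=0)

base_models = [RandomForestRegressor(n_estimators=100, random_state=0),
               GradientBoostingRegressor(random_state=0)]
hold_feats, test_feats = [], []
for m in base_models:
    m.fit(X_base, y_base)
    hold_feats.append(m.predict(X_hold))   # features used to train the blender
    test_feats.append(m.predict(X_test))   # the same models predict the test set

blender = LinearRegression().fit(np.column_stack(hold_feats), y_hold)
pred = blender.predict(np.column_stack(test_feats))
print('blending mse:', mean_squared_error(y_test, pred))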
Stacking, in contrast, uses the full training set in both layers.
The following example explains it further:
(1) There are two data sets, a training set and a test set, and the training set is split into 5 folds: train1, train2, train3, train4, train5.
(2) Choose the base models. Suppose xgboost, lightgbm, and randomforest are chosen. Taking the xgboost part as an example, train1, train2, train3, train4, train5 are used in turn as the validation fold, with the remaining 4 folds as training data; the model is trained with this 5-fold cross-validation and then predicts the test set. This yields 5 sets of predictions on the training set from the xgboost model and one set of predictions B1 on the test set; the 5 sets of training predictions are stacked vertically to form A1. The lightgbm and randomforest parts are handled in the same way.
(3) After the three base models have been trained, their predictions on the training set are used as three "features" A1, A2, A3, and an LR model is trained on them.
(4) The trained LR model then makes the final class or probability prediction on the "features" (B1, B2, B3) obtained by the three base models on the test set.
Note: during Stacking, concatenating the first-layer predictions with the original features before training the second-layer model can improve the result and also help guard against overfitting.
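For reference, the same scheme can be sketched with scikit-learn's built-in StackingRegressor; this is only a minimal illustration with assumed estimators (the steam-prediction example later in this section uses a hand-rolled implementation instead).
# Stacking sketch with scikit-learn (illustrative estimators and synthetic data)
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=20, noise=0.5, random_state=0)
stack = StackingRegressor(
    estimators=[('rf', RandomForestRegressor(n_estimators=100, random_state=0)),
                ('gb', GradientBoostingRegressor(random_state=0))],
    final_estimator=LinearRegression(),
    cv=5,                # 5-fold out-of-fold predictions become the second-layer features
    passthrough=True)    # also feed the original features to the second layer, as noted above
print(cross_val_score(stack, X, y, scoring='neg_mean_squared_error', cv=5).mean())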
Bad-case analysis is an effective way to locate the samples that are predicted poorly; tracing back to the underlying data and looking for the causes then points the way to improving model accuracy.
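A minimal bad-case sketch follows: rank the samples by absolute residual and inspect the worst ones. The model and data here are stand-ins, not the steam data used in the walkthrough below.
# Bad-case analysis sketch (illustrative model and data)
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=10, noise=20.0, random_state=0)
X, y = pd.DataFrame(X), pd.Series(y)
model = Ridge().fit(X, y)

pred = pd.Series(model.predict(X), index=y.index)
resid = (y - pred).abs().sort_values(ascending=False)
worst = resid.index[:10]                       # the ten worst-predicted samples
bad_cases = pd.DataFrame({'y_true': y.loc[worst],
                          'y_pred': pred.loc[worst],
                          'abs_resid': resid.loc[worst]})
print(bad_cases)                               # trace these rows back to the raw data for inspection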
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning':0})
import seaborn as sns
(1) Import the data:
import pandas as pd
# load the training and test data
train_data_file = '../data/zhengqi_train.txt'
test_data_file = '../data/zhengqi_test.txt'
train_data = pd.read_csv(train_data_file, sep='\t', encoding='utf-8')
test_data = pd.read_csv(test_data_file, sep='\t', encoding='utf-8')
(2) Merge the data:
# merge the training data and the test data
train_data["oringin"] = "train"  # marks rows that come from the training set
test_data["oringin"] = "test"
data_all = pd.concat([train_data, test_data], axis=0, ignore_index=True)
# inspect the columns after merging, including the newly added column
print(data_all.columns)
The output is:
Index(['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30',
'V31', 'V32', 'V33', 'V34', 'V35', 'V36', 'V37', 'target', 'oringin'],
dtype='object')
(3) Explore the data distributions
# explore the distribution of every feature in the training and test sets
dist_cols = 6
dist_rows = len(data_all.columns)
plt.figure(figsize=(2 * dist_cols, 2 * dist_rows))
i = 1
for col in data_all.columns[0:-2]:
    ax = plt.subplot(dist_rows, dist_cols, i)
    g = sns.kdeplot(data_all[col][(data_all["oringin"] == "train")], color="Red", shade=True)
    g = sns.kdeplot(data_all[col][(data_all["oringin"] == "test")], color="Blue", shade=True)
    ax.set_xlabel(col)
    ax.set_ylabel("Frequency")
    ax = ax.legend(["train", "test"])
    i += 1
plt.show()
(4) Data cleaning
# data cleaning: drop the features whose train/test distributions differ markedly
data_all.drop(["V5", "V9", "V11", "V17", "V22", "V28"], axis=1, inplace=True)
print(data_all.columns)
Index(['V0', 'V1', 'V2', 'V3', 'V4', 'V6', 'V7', 'V8', 'V10', 'V12', 'V13',
'V14', 'V15', 'V16', 'V18', 'V19', 'V20', 'V21', 'V23', 'V24', 'V25',
'V26', 'V27', 'V29', 'V30', 'V31', 'V32', 'V33', 'V34', 'V35', 'V36',
'V37', 'target', 'oringin'],
dtype='object')
(5) Feature visualization
from scipy import stats

# feature visualization: regression plot and distribution plot for each feature against the target
data_train1 = data_all[data_all["oringin"] == "train"].drop("oringin", axis=1)
fcols = 6
frows = len(train_data.columns)
plt.figure(figsize=(2 * fcols, 2 * frows))
i = 0
for col in data_train1.columns:
    i += 1
    ax = plt.subplot(frows, fcols, i)
    sns.regplot(x=col, y='target', data=train_data, ax=ax,
                scatter_kws={'marker': '.', 's': 3, 'alpha': 0.3},
                line_kws={'color': 'k'})
    plt.xlabel(col)
    plt.ylabel('target')

    i += 1
    ax = plt.subplot(frows, fcols, i)
    sns.distplot(train_data[col].dropna(), fit=stats.norm)
    plt.xlabel(col)
plt.show()
(6) Compute the correlation coefficients
import numpy as np
# examine how strongly the features are correlated
plt.figure(figsize=(20, 16))  # set the figure width and height
colnm = data_train1.columns.tolist()  # column names
mcorr = data_train1[colnm].corr(method="spearman")  # correlation matrix: pairwise correlation between variables
mask = np.zeros_like(mcorr, dtype=bool)  # boolean matrix with the same shape as mcorr
mask[np.triu_indices_from(mask)] = True  # mask the upper triangle (above the diagonal)
cmap = sns.diverging_palette(220, 10, as_cmap=True)  # a matplotlib colormap object
g = sns.heatmap(mcorr, mask=mask, cmap=cmap, square=True, annot=True, fmt='0.2f')  # heat map of pairwise correlations
plt.show()
# drop features whose correlation with the target is below 0.2
threshold = 0.2
corr_matrix = data_train1.corr().abs()
drop_col = corr_matrix[corr_matrix["target"] < threshold].index
data_all.drop(drop_col, axis=1, inplace=True)
print(data_all.columns)
Index(['V0', 'V1', 'V2', 'V3', 'V4', 'V6', 'V7', 'V8', 'V10', 'V12', 'V13',
'V16', 'V20', 'V23', 'V24', 'V27', 'V31', 'V36', 'V37', 'target',
'oringin'],
dtype='object')
(7) Normalization
# min-max normalization
cols_numeric = list(data_all.columns)
cols_numeric.remove("oringin")
def scale_minmax(col):
    return (col - col.min()) / (col.max() - col.min())
scale_cols = [col for col in cols_numeric if col != 'target']
data_all[scale_cols] = data_all[scale_cols].apply(scale_minmax, axis=0)
print(data_all[scale_cols].describe())
(8) Box-Cox transformation
# Box-Cox transformation
fcols = 6
frows = len(cols_numeric) - 1
plt.figure(figsize=(2 * fcols, 2 * frows))
i = 0
for var in cols_numeric:
    if var != 'target':
        dat = data_all[[var, 'target']].dropna()

        i += 1
        plt.subplot(frows, fcols, i)
        sns.distplot(dat[var], fit=stats.norm)
        plt.title(var + ' Original')
        plt.xlabel('')

        i += 1
        plt.subplot(frows, fcols, i)
        _ = stats.probplot(dat[var], plot=plt)
        plt.title('skew=' + '{:.4f}'.format(stats.skew(dat[var])))
        plt.xlabel('')
        plt.ylabel('')

        i += 1
        plt.subplot(frows, fcols, i)
        plt.plot(dat[var], dat['target'], '.', alpha=0.5)
        plt.title('corr=' + '{:.2f}'.format(np.corrcoef(dat[var], dat['target'])[0][1]))

        i += 1
        plt.subplot(frows, fcols, i)
        trans_var, lambda_var = stats.boxcox(dat[var].dropna() + 1)
        trans_var = scale_minmax(trans_var)
        sns.distplot(trans_var, fit=stats.norm)
        plt.title(var + ' Transformed')
        plt.xlabel('')

        i += 1
        plt.subplot(frows, fcols, i)
        stats.probplot(trans_var, plot=plt)
        plt.title('skew=' + '{:.4f}'.format(stats.skew(trans_var)))
        plt.xlabel('')
        plt.ylabel('')

        i += 1
        plt.subplot(frows, fcols, i)
        plt.plot(trans_var, dat['target'], '.', alpha=0.5)
        plt.title('corr=' + '{:.2f}'.format(np.corrcoef(trans_var, dat['target'])[0][1]))
plt.show()

# apply the Box-Cox transformation to all feature columns
cols_transform = data_all.columns[0:-2]
for col in cols_transform:
    data_all.loc[:, col], _ = stats.boxcox(data_all.loc[:, col] + 1)
(9) Transform the target values so that they are closer to a normal distribution
# inspect the target values
print(data_all.target.describe())
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
sns.distplot(data_all.target.dropna(), fit=stats.norm)
plt.subplot(1, 2, 2)
_ = stats.probplot(data_all.target.dropna(), plot=plt)
plt.show()
# transform the target values (via an exponential mapping) to improve normality
sp = train_data.target
train_data['target1'] = np.power(1.5, sp)
print(train_data.target1.describe())
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
sns.distplot(train_data.target1.dropna(), fit=stats.norm);
plt.subplot(1, 2, 2)
_ = stats.probplot(train_data.target1.dropna(), plot=plt)
plt.show()
A simple hold-out validation is used: 70% of the data for training and 30% for validation.
from sklearn.model_selection import train_test_split

# simple hold-out validation: 70% of the data for training, 30% for validation
def get_training_data():
    df_train = data_all[data_all["oringin"] == "train"]
    df_train["label"] = train_data.target1
    y = df_train.target
    X = df_train.drop(["oringin", "target", "label"], axis=1)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=100)
    return X_train, X_valid, y_train, y_valid

def get_test_data():
    df_test = data_all[data_all["oringin"] == "test"].reset_index(drop=True)
    return df_test.drop(["oringin", "target"], axis=1)
RMSE and MSE are used as the evaluation metrics of model performance.
from sklearn.metrics import mean_squared_error, make_scorer

def rmse(y_true, y_pred):
    diff = y_pred - y_true
    sum_sq = sum(diff ** 2)
    n = len(y_pred)
    return np.sqrt(sum_sq / n)

def mse(y_true, y_pred):
    return mean_squared_error(y_true, y_pred)

rmse_scorer = make_scorer(rmse, greater_is_better=False)
mse_scorer = make_scorer(mse, greater_is_better=False)
Plot and inspect the distribution of the outlier data:
# outlier filtering
# function to detect outliers based on the predictions of a model
def find_outliers(model, X, y, sigma=3):
    # predict y values using the model
    model.fit(X, y)
    y_pred = pd.Series(model.predict(X), index=y.index)

    # calculate residuals between the model prediction and the true y values
    resid = y - y_pred
    mean_resid = resid.mean()
    std_resid = resid.std()

    # calculate the z statistic; define outliers as |z| > sigma
    z = (resid - mean_resid) / std_resid
    outliers = z[abs(z) > sigma].index

    # print and plot the results
    print('R2=', model.score(X, y))
    print('rmse=', rmse(y, y_pred))
    print("mse=", mean_squared_error(y, y_pred))
    print('---------------------------------------')
    print('mean of residuals:', mean_resid)
    print('std of residuals:', std_resid)
    print('---------------------------------------')
    print(len(outliers), 'outliers:')
    print(outliers.tolist())

    plt.figure(figsize=(15, 5))
    ax_131 = plt.subplot(1, 3, 1)
    plt.plot(y, y_pred, '.')
    plt.plot(y.loc[outliers], y_pred.loc[outliers], 'ro')
    plt.legend(['Accepted', 'Outlier'])
    plt.xlabel('y')
    plt.ylabel('y_pred')

    ax_132 = plt.subplot(1, 3, 2)
    plt.plot(y, y - y_pred, '.')
    plt.plot(y.loc[outliers], y.loc[outliers] - y_pred.loc[outliers], 'ro')
    plt.legend(['Accepted', 'Outlier'])
    plt.xlabel('y')
    plt.ylabel('y - y_pred')

    ax_133 = plt.subplot(1, 3, 3)
    z.plot.hist(bins=50, ax=ax_133)
    z.loc[outliers].plot.hist(color='r', bins=50, ax=ax_133)
    plt.legend(['Accepted', 'Outlier'])
    plt.xlabel('z')
    plt.savefig('outliers.png')

    return outliers
Extract the data:
from sklearn.linear_model import Ridge

# get the training data
X_train, X_valid, y_train, y_valid = get_training_data()
test = get_test_data()

# find and remove outliers using a Ridge model
outliers = find_outliers(Ridge(), X_train, y_train)
X_outliers = X_train.loc[outliers]
y_outliers = y_train.loc[outliers]
X_t = X_train.drop(outliers)
y_t = y_train.drop(outliers)
The output is:
R2= 0.8727162081535336
rmse= 0.3545578094049105
mse= 0.1257112402100088
---------------------------------------
mean of residuals: 1.5601352745845842e-17
std of residuals: 0.35464556037999617
---------------------------------------
22 outliers:
[843, 2159, 1164, 2863, 1145, 2697, 2528, 2645, 691, 1905, 1874, 2647, 884, 2696, 2668, 1310, 1901, 2769, 2263, 2002, 2669, 1040]
Define a function that returns the training data with outliers removed; a deep copy is used here:
def get_trainning_data_omitoutliers():
    y = y_t.copy()
    X = X_t.copy()
    return X, y
from sklearn.model_selection import RepeatedKFold, GridSearchCV, cross_val_score

def train_model(model, param_grid=[], X=[], y=[], splits=5, repeats=5):
    # get the data
    if len(y) == 0:
        X, y = get_trainning_data_omitoutliers()

    # cross-validation
    rkfold = RepeatedKFold(n_splits=splits, n_repeats=repeats)

    # grid search for the best parameters
    if len(param_grid) > 0:
        gsearch = GridSearchCV(model, param_grid, cv=rkfold,
                               scoring="neg_mean_squared_error",
                               verbose=1, return_train_score=True)
        # fit
        gsearch.fit(X, y)

        # keep the best model
        model = gsearch.best_estimator_
        best_idx = gsearch.best_index_

        # cross-validation scores
        grid_results = pd.DataFrame(gsearch.cv_results_)
        cv_mean = abs(grid_results.loc[best_idx, 'mean_test_score'])
        cv_std = grid_results.loc[best_idx, 'std_test_score']
    # no grid search
    else:
        grid_results = []
        cv_results = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=rkfold)
        cv_mean = abs(np.mean(cv_results))
        cv_std = np.std(cv_results)

    # collect the scores
    cv_score = pd.Series({'mean': cv_mean, 'std': cv_std})

    # predictions
    y_pred = model.predict(X)

    # model performance statistics
    print('----------------------')
    print(model)
    print('----------------------')
    print('score=', model.score(X, y))
    print('rmse=', rmse(y, y_pred))
    print('mse=', mse(y, y_pred))
    print('cross_val: mean=', cv_mean, ', std=', cv_std)

    # residual analysis and visualization
    y_pred = pd.Series(y_pred, index=y.index)
    resid = y - y_pred
    mean_resid = resid.mean()
    std_resid = resid.std()
    z = (resid - mean_resid) / std_resid
    n_outliers = sum(abs(z) > 3)
    outliers = z[abs(z) > 3].index

    plt.figure(figsize=(15, 5))
    ax_131 = plt.subplot(1, 3, 1)
    plt.plot(y, y_pred, '.')
    plt.plot(y.loc[outliers], y_pred.loc[outliers], 'ro')
    plt.xlabel('y')
    plt.ylabel('y_pred')
    plt.title('corr = {:.3f}'.format(np.corrcoef(y, y_pred)[0][1]))

    ax_132 = plt.subplot(1, 3, 2)
    plt.plot(y, y - y_pred, '.')
    plt.plot(y.loc[outliers], y.loc[outliers] - y_pred.loc[outliers], 'ro')
    plt.xlabel('y')
    plt.ylabel('y - y_pred')
    plt.title('std resid = {:.3f}'.format(std_resid))

    ax_133 = plt.subplot(1, 3, 3)
    z.plot.hist(bins=50, ax=ax_133)
    z.loc[outliers].plot.hist(color='r', bins=50, ax=ax_133)
    plt.xlabel('z')
    plt.title('{:.0f} samples with z>3'.format(n_outliers))
    plt.show()

    return model, cv_score, grid_results
Define variables to store the trained models and their scores:
# containers for the trained models and their cross-validation scores
opt_models = dict()
score_models = pd.DataFrame(columns=['mean', 'std'])
# no. k-fold splits
splits = 5
# no. k-fold iterations
repeats = 5
# 1. Ridge regression
def ridge(score_models):
    from sklearn.linear_model import Ridge
    model = 'Ridge'
    opt_models[model] = Ridge()
    alph_range = np.arange(0.25, 6, 0.25)
    param_grid = {'alpha': alph_range}
    opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                            splits=splits, repeats=repeats)
    cv_score.name = model
    score_models = score_models.append(cv_score)
    plt.figure()
    plt.errorbar(alph_range, abs(grid_results['mean_test_score']),
                 abs(grid_results['std_test_score']) / np.sqrt(splits * repeats))
    plt.xlabel('alpha')
    plt.ylabel('score')

ridge(score_models)  # run the Ridge experiment
The output is:
Fitting 25 folds for each of 23 candidates, totalling 575 fits
----------------------
Ridge(alpha=0.25)
----------------------
score= 0.888027818108409
rmse= 0.3298566876186673
mse= 0.10880543436675914
cross_val: mean= 0.11168708630202531 , std= 0.008426980362394657
Plot (1): a scatter plot of the true values (x-axis: y) against the model predictions (y-axis: y_pred); the correlation coefficient is shown above the plot, and the closer it is to 1, the better. For the Ridge model the correlation is 0.942, so the predictions agree fairly well with the true values.
Plot (2): a scatter plot, produced while the model is trained with cross-validation, of the true values (x-axis: y) against the residuals between true values and predictions (y-axis: y - y_pred); the standard deviation of the residuals is shown above the plot, and the smaller it is, the more stable the model. For the Ridge model the predictions near y = -3 show a relatively large deviation, while the residual standard deviation is 0.330, so the model is fairly stable.
Plot (3): a histogram of the standardized residuals (x-axis: z = (resid - mean_resid) / std_resid) against their frequencies (y-axis: counts); the number of samples whose residuals exceed three standard deviations is shown above the plot, and the smaller it is, the better, since a large value means some samples are predicted with very large errors. For the Ridge model, 7 samples have residuals larger than three standard deviations, so the model handles large deviations reasonably well.
Plot (4): an error-bar plot of the Ridge regularization parameter (x-axis: alpha) against the evaluation metric MSE (y-axis: score).
It can be seen that, for the Ridge model, the MSE grows as alpha increases, and the spread of the scores also gradually increases.
# 2. Lasso regression
def lasso_model(score_models):
    from sklearn.linear_model import Lasso
    model = 'Lasso'
    opt_models[model] = Lasso()
    alph_range = np.arange(1e-4, 1e-3, 4e-5)
    param_grid = {'alpha': alph_range}
    opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                            splits=splits, repeats=repeats)
    cv_score.name = model
    score_models = score_models.append(cv_score)
    plt.figure()
    plt.errorbar(alph_range, abs(grid_results['mean_test_score']),
                 abs(grid_results['std_test_score']) / np.sqrt(splits * repeats))
    plt.xlabel('alpha')
    plt.ylabel('score')
    plt.show()

lasso_model(score_models)
The output is:
Fitting 25 folds for each of 23 candidates, totalling 575 fits
----------------------
Lasso(alpha=0.0001)
----------------------
score= 0.8880419469176887
rmse= 0.3298358760689765
mse= 0.10879170514218935
cross_val: mean= 0.11165307128910605 , std= 0.008472101997779818
# 3. ElasticNet regression
def elasticNet(score_models):
    from sklearn.linear_model import ElasticNet
    model = 'ElasticNet'
    opt_models[model] = ElasticNet()
    alpha_range = np.arange(1e-4, 1e-3, 1e-4)
    param_grid = {'alpha': alpha_range,
                  'l1_ratio': np.arange(0.1, 1.0, 0.1)}
    opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                            splits=splits, repeats=1)
    cv_score.name = model
    score_models = score_models.append(cv_score)

elasticNet(score_models)
The output is:
Fitting 5 folds for each of 81 candidates, totalling 405 fits
----------------------
ElasticNet(alpha=0.0001, l1_ratio=0.9)
----------------------
score= 0.8880421739306151
rmse= 0.3298355416712118
mse= 0.10879148454954166
cross_val: mean= 0.11111176609224302 , std= 0.010389628288261041
# 4. Linear SVR regression
def LinearSVR(score_models):
    from sklearn.svm import LinearSVR
    model = 'LinearSVR'
    opt_models[model] = LinearSVR()
    crange = np.arange(0.1, 1.0, 0.1)
    param_grid = {'C': crange, 'max_iter': [1000]}
    opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                            splits=splits, repeats=repeats)
    cv_score.name = model
    score_models = score_models.append(cv_score)
    plt.figure()
    plt.errorbar(crange, abs(grid_results['mean_test_score']),
                 abs(grid_results['std_test_score']) / np.sqrt(splits * repeats))
    plt.xlabel('C')
    plt.ylabel('score')
    plt.show()

LinearSVR(score_models)
The output is:
Fitting 25 folds for each of 9 candidates, totalling 225 fits
----------------------
LinearSVR(C=0.8)
----------------------
score= 0.12652926221937533
rmse= 0.9212858548009902
mse= 0.8487676262563905
cross_val: mean= 1.5528003977036449 , std= 1.8840817058123782
# 5. K-nearest neighbours
def KNeighbors(score_models):
    from sklearn.neighbors import KNeighborsRegressor
    model = 'KNeighbors'
    opt_models[model] = KNeighborsRegressor()
    param_grid = {'n_neighbors': np.arange(3, 11, 1)}
    opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                            splits=splits, repeats=1)
    cv_score.name = model
    score_models = score_models.append(cv_score)
    plt.figure()
    plt.errorbar(np.arange(3, 11, 1), abs(grid_results['mean_test_score']),
                 abs(grid_results['std_test_score']) / np.sqrt(splits * 1))
    plt.xlabel('n_neighbors')
    plt.ylabel('score')
    plt.show()

KNeighbors(score_models)
The output is:
Fitting 5 folds for each of 8 candidates, totalling 40 fits
----------------------
KNeighborsRegressor(n_neighbors=9)
----------------------
score= 0.7228187413258744
rmse= 0.5189818314777838
mse= 0.2693421414040353
cross_val: mean= 0.34772582431407223 , std= 0.026852830164121027
# 1. GBDT model
def GradientBoosting(score_models):
    from sklearn.ensemble import GradientBoostingRegressor
    model = 'GradientBoosting'
    opt_models[model] = GradientBoostingRegressor()
    param_grid = {'n_estimators': [150, 250, 300], 'max_depth': [1, 2, 3], 'min_samples_split': [5, 6, 7]}
    opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                            splits=splits, repeats=1)
    cv_score.name = model
    score_models = score_models.append(cv_score)

GradientBoosting(score_models)
The output is:
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------------------
GradientBoostingRegressor(min_samples_split=6, n_estimators=250)
----------------------
score= 0.9623464203351643
rmse= 0.19128167148359917
mse= 0.03658867784555958
cross_val: mean= 0.09987086194346159 , std= 0.008186545060144924
# 2. XGBoost model
def XGB(score_models):
    from xgboost.sklearn import XGBRegressor
    model = 'XGB'
    opt_models[model] = XGBRegressor(objective='reg:squarederror')
    param_grid = {'n_estimators': [100, 200, 300, 400, 500], 'max_depth': [1, 2, 3]}
    opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                            splits=splits, repeats=1)
    cv_score.name = model
    score_models = score_models.append(cv_score)

XGB(score_models)
The output is:
Fitting 5 folds for each of 15 candidates, totalling 75 fits
----------------------
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.300000012,
             max_delta_step=0, max_depth=3, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=100, n_jobs=8,
             num_parallel_tree=1, predictor='auto', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)
----------------------
score= 0.9647156186164645
rmse= 0.18516609888951752
mse= 0.03428648417796254
cross_val: mean= 0.10613089925381643 , std= 0.012394637138140418
# 3. Random forest model
def RandomForest(score_models):
    from sklearn.ensemble import RandomForestRegressor
    model = 'RandomForest'
    opt_models[model] = RandomForestRegressor()
    param_grid = {'n_estimators': [100, 150, 200], 'max_features': [8, 12, 16, 20, 24], 'min_samples_split': [2, 4, 6]}
    opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                            splits=5, repeats=1)
    cv_score.name = model
    score_models = score_models.append(cv_score)

RandomForest(score_models)
The output is:
Fitting 5 folds for each of 45 candidates, totalling 225 fits
----------------------
RandomForestRegressor(max_features=8, min_samples_split=6)
----------------------
score= 0.9782408712001981
rmse= 0.14540889512560917
mse= 0.021143746781650423
cross_val: mean= 0.10382773796901439 , std= 0.012036764974250679
def model_predict(test_data, test_y=[], stack=False):
    i = 0
    y_predict_total = np.zeros((test_data.shape[0],))
    for model in opt_models.keys():
        if model != "LinearSVR" and model != "KNeighbors":
            y_predict = opt_models[model].predict(test_data)
            y_predict_total += y_predict
            i += 1
        if len(test_y) > 0:
            print("{}_mse:".format(model), mean_squared_error(y_predict, test_y))
    y_predict_mean = np.round(y_predict_total / i, 6)
    if len(test_y) > 0:
        print("mean_mse:", mean_squared_error(y_predict_mean, test_y))
    else:
        y_predict_mean = pd.Series(y_predict_mean)
        return y_predict_mean

model_predict(X_valid, y_valid)
The output is:
Ridge_mse: 0.14302516401000598
Lasso_mse: 0.14297751247011622
ElasticNet_mse: 0.1429785691505049
LinearSVR_mse: 0.1429785691505049
KNeighbors_mse: 0.1429785691505049
GradientBoosting_mse: 0.13272499136615068
XGB_mse: 0.13895790991821866
RandomForest_mse: 0.1377255460562989
mean_mse: 0.12727329185670244
As the results show, the fused (averaged) predictions achieve the smallest MSE, i.e. the best predictive performance.
from sklearn.model_selection import KFold
import pandas as pd
import numpy as np
from scipy import sparse
import xgboost
import lightgbm
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

folds = 5
seed = 1
kf = KFold(n_splits=5, shuffle=True, random_state=0)

def stacking_reg(clf, train_x, train_y, test_x, clf_name, kf, label_split=None):
    train = np.zeros((train_x.shape[0], 1))
    test = np.zeros((test_x.shape[0], 1))
    test_pre = np.empty((folds, test_x.shape[0], 1))
    cv_scores = []
    for i, (train_index, test_index) in enumerate(kf.split(train_x, label_split)):
        tr_x = train_x[train_index]
        tr_y = train_y[train_index]
        te_x = train_x[test_index]
        te_y = train_y[test_index]
        if clf_name in ['rf', 'ada', 'gb', 'et', 'lr', 'lsvc', 'knn', 'en']:
            clf.fit(tr_x, tr_y)
            pre = clf.predict(te_x).reshape(-1, 1)
            train[test_index] = pre
            test_pre[i, :] = clf.predict(test_x).reshape(-1, 1)
            cv_scores.append(mean_squared_error(te_y, pre))
        elif clf_name in ['xgb']:
            train_matrix = clf.DMatrix(tr_x, label=tr_y, missing=-1)
            test_matrix = clf.DMatrix(te_x, label=te_y, missing=-1)
            z = clf.DMatrix(test_x, label=None, missing=-1)
            params = {
                'booster': 'gbtree',        # tree booster
                'eval_metric': 'rmse',
                'gamma': 0,                 # minimum loss reduction required to split; larger is more conservative
                'min_child_weight': 1,      # minimum sum of instance weight in a child; larger is more conservative
                'max_depth': 3,
                'lambda': 1,                # L2 regularization coefficient
                'subsample': 1,             # row subsampling ratio per tree; smaller is more conservative
                'colsample_bytree': 1,      # column subsampling ratio per tree
                'colsample_bylevel': 1,     # column subsampling ratio per split level
                'eta': 0.03,                # learning rate (step-size shrinkage of feature weights)
                'tree_method': 'exact',     # tree construction algorithm
                'seed': 2022,
                'nthread': 12
            }
            num_round = 10000
            early_stopping_rounds = 100
            watchlist = [(train_matrix, 'train'), (test_matrix, 'eval')]
            if test_matrix:
                model = clf.train(params, train_matrix, num_boost_round=num_round, evals=watchlist,
                                  early_stopping_rounds=early_stopping_rounds)
                pre = model.predict(test_matrix, ntree_limit=model.best_ntree_limit).reshape(-1, 1)
                train[test_index] = pre
                test_pre[i, :] = model.predict(z, ntree_limit=model.best_ntree_limit).reshape(-1, 1)
                cv_scores.append(mean_squared_error(te_y, pre))
        elif clf_name in ['lgb']:
            train_matrix = clf.Dataset(tr_x, label=tr_y)
            test_matrix = clf.Dataset(te_x, label=te_y)
            params = {
                'boosting_type': 'gbdt',
                'objective': 'regression_l2',
                'metric': 'mse',
                'min_child_weight': 1.5,
                'num_leaves': 2 ** 5,
                'lambda_l2': 10,
                'subsample': 0.7,
                'colsample_bytree': 0.7,
                'colsample_bylevel': 0.7,
                'learning_rate': 0.03,
                'tree_method': 'exact',
                'seed': 2022,
                'nthread': 12,
                'silent': True,             # suppress verbose output
            }
            num_round = 10000
            early_stopping_rounds = 100
            if test_matrix:
                model = clf.train(params, train_matrix, num_round, valid_sets=test_matrix,
                                  early_stopping_rounds=early_stopping_rounds)
                pre = model.predict(te_x, num_iteration=model.best_iteration).reshape(-1, 1)
                train[test_index] = pre
                test_pre[i, :] = model.predict(test_x, num_iteration=model.best_iteration).reshape(-1, 1)
                cv_scores.append(mean_squared_error(te_y, pre))
        else:
            raise IOError("please add new clf")
        print("%s now score is:" % clf_name, cv_scores)
    test[:] = test_pre.mean(axis=0)
    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    # print('{}_mean_squared_error: {}'.format(clf_name, mean_squared_error(y_valid, test)))
    # opt_models_new[clf_name] = mean_squared_error(y_valid, test)
    return train.reshape(-1, 1), test.reshape(-1, 1)
# base learners for Stacking model fusion
def rf_reg(x_train, y_train, x_valid, kf, label_split=None):
    randomforest = RandomForestRegressor(n_estimators=600, max_depth=20, n_jobs=-1, random_state=2022,
                                         max_features=8, verbose=1, min_samples_split=6)
    rf_train, rf_test = stacking_reg(randomforest, x_train, y_train, x_valid, 'rf', kf, label_split=label_split)
    return rf_train, rf_test, 'rf_reg'

def ada_reg(x_train, y_train, x_valid, kf, label_split=None):
    adaboost = AdaBoostRegressor(n_estimators=30, random_state=2022, learning_rate=0.01)
    ada_train, ada_test = stacking_reg(adaboost, x_train, y_train, x_valid, 'ada', kf, label_split=label_split)
    return ada_train, ada_test, 'ada_reg'

def gb_reg(x_train, y_train, x_valid, kf, label_split=None):
    gbdt = GradientBoostingRegressor(learning_rate=0.01, n_estimators=250, subsample=0.8, random_state=2022,
                                     max_depth=5, verbose=1, min_samples_split=6)
    gbdt_train, gbdt_test = stacking_reg(gbdt, x_train, y_train, x_valid, 'gb', kf, label_split=label_split)
    return gbdt_train, gbdt_test, 'gb_reg'

def et_reg(x_train, y_train, x_valid, kf, label_split=None):
    extratree = ExtraTreesRegressor(n_estimators=600, max_depth=35, max_features='auto', n_jobs=-1,
                                    random_state=2022, verbose=1)
    et_train, et_test = stacking_reg(extratree, x_train, y_train, x_valid, 'et', kf, label_split=label_split)
    return et_train, et_test, 'et_reg'

def lr_reg(x_train, y_train, x_valid, kf, label_split=None):
    lr_reg = LinearRegression(n_jobs=-1)
    lr_train, lr_test = stacking_reg(lr_reg, x_train, y_train, x_valid, 'lr', kf, label_split=label_split)
    return lr_train, lr_test, 'lr_reg'

def en_reg(x_train, y_train, x_valid, kf, label_split=None):
    en_reg = ElasticNet(alpha=0.0001, max_iter=1000, l1_ratio=0.9)
    en_train, en_test = stacking_reg(en_reg, x_train, y_train, x_valid, 'en', kf, label_split=label_split)
    return en_train, en_test, 'en_reg'

def xgb_reg(x_train, y_train, x_valid, kf, label_split=None):
    xgb_train, xgb_test = stacking_reg(xgboost, x_train, y_train, x_valid, "xgb", kf, label_split=label_split)
    return xgb_train, xgb_test, "xgb_reg"

def lgb_reg(x_train, y_train, x_valid, kf, label_split=None):
    lgb_train, lgb_test = stacking_reg(lightgbm, x_train, y_train, x_valid, "lgb", kf, label_split=label_split)
    return lgb_train, lgb_test, "lgb_reg"
# define the prediction function for Stacking model fusion
def stacking_pred(x_train, y_train, x_valid, kf, clf_list, label_split=None,
                  clf_fin='lgb', if_concat_origin=True):
    for k, clf_list in enumerate(clf_list):
        clf_list = [clf_list]
        column_list = []
        train_data_list = []
        test_data_list = []
        for clf in clf_list:
            train_data, test_data, clf_name = clf(x_train, y_train, x_valid, kf, label_split=label_split)
            train_data_list.append(train_data)
            test_data_list.append(test_data)
            column_list.append(test_data)
    train = np.concatenate(train_data_list, axis=1)
    test = np.concatenate(test_data_list, axis=1)
    if if_concat_origin:
        train = np.concatenate([x_train, train], axis=1)
        test = np.concatenate([x_valid, test], axis=1)
    print(x_train.shape)
    print(train.shape)
    print(clf_name)
    print(clf_name in ['lgb'])
    if clf_fin in ['rf', 'ada', 'gb', 'et', 'lr', 'lsvc', 'knn']:
        if clf_fin in ['rf']:
            clf = RandomForestRegressor(n_estimators=600, max_depth=20, n_jobs=-1, random_state=2022,
                                        max_features='auto', verbose=1)
        elif clf_fin in ['ada']:
            clf = AdaBoostRegressor(n_estimators=30, random_state=2022, learning_rate=0.01)
        elif clf_fin in ['gb']:
            clf = GradientBoostingRegressor(learning_rate=0.01, n_estimators=100, subsample=0.8,
                                            random_state=2022, max_depth=5, verbose=1)
        elif clf_fin in ['et']:
            clf = ExtraTreesRegressor(n_estimators=600, max_depth=35, max_features='auto', n_jobs=-1,
                                      random_state=2022, verbose=1)
        elif clf_fin in ['lr']:
            clf = LinearRegression(n_jobs=-1)
        clf.fit(train, y_train)
        pre = clf.predict(test).reshape(-1, 1)
        return pre
    elif clf_fin in ['xgb']:
        clf = xgboost
        train_matrix = clf.DMatrix(train, label=y_train, missing=-1)
        test_matrix = clf.DMatrix(train, label=y_train, missing=-1)
        params = {
            'booster': 'gbtree',        # tree booster
            'eval_metric': 'rmse',
            'gamma': 1,                 # minimum loss reduction required to split; larger is more conservative
            'min_child_weight': 1.5,    # minimum sum of instance weight in a child; larger is more conservative
            'max_depth': 5,
            'lambda': 10,               # L2 regularization coefficient
            'subsample': 0.7,           # row subsampling ratio per tree; smaller is more conservative
            'colsample_bytree': 0.7,    # column subsampling ratio per tree
            'colsample_bylevel': 0.7,   # column subsampling ratio per split level
            'eta': 0.03,                # learning rate (step-size shrinkage of feature weights)
            'tree_method': 'exact',     # tree construction algorithm
            'seed': 2022,
            'nthread': 12
        }
        num_round = 10000
        early_stopping_rounds = 100
        watchlist = [(train_matrix, 'train'), (test_matrix, 'eval')]
        model = clf.train(params, train_matrix, num_boost_round=num_round, evals=watchlist,
                          early_stopping_rounds=early_stopping_rounds)
        pre = model.predict(test, ntree_limit=model.best_ntree_limit).reshape(-1, 1)
        return pre
    elif clf_fin in ['lgb']:
        print(clf_name)
        clf = lightgbm
        train_matrix = clf.Dataset(train, label=y_train)
        test_matrix = clf.Dataset(train, label=y_train)
        params = {
            'boosting_type': 'gbdt',
            'objective': 'regression_l2',
            'metric': 'mse',
            'min_child_weight': 1.5,
            'num_leaves': 2 ** 5,
            'lambda_l2': 10,
            'subsample': 0.7,
            'colsample_bytree': 0.7,
            'colsample_bylevel': 0.7,
            'learning_rate': 0.03,
            'tree_method': 'exact',
            'seed': 2022,
            'nthread': 12,
            'silent': True,             # suppress verbose output
        }
        num_round = 10000
        early_stopping_rounds = 100
        model = clf.train(params, train_matrix, num_round, valid_sets=test_matrix,
                          early_stopping_rounds=early_stopping_rounds)
        print('pred')
        pre = model.predict(test, num_iteration=model.best_iteration).reshape(-1, 1)
        print(pre)
        return pre
Read the data and split the training set with 5-fold cross-validation.
# training data and test data
x_train = X_t.values
x_valid = test.values
y_train = y_t.values

# fuse the base learners; a smaller list such as clf_list = [rf_reg, gb_reg] can also be used
clf_list = [rf_reg, xgb_reg, gb_reg, et_reg, lr_reg, en_reg, lgb_reg]

opt_models_new = dict()
test_pre_k = np.empty((len(clf_list), x_valid.shape[0], 1))
for k, clf in enumerate(clf_list):
    tmp_train, tmp_test, name = clf(x_train, y_train, x_valid, kf)
    test_pre_k[k, :] = tmp_test
pred = test_pre_k.mean(axis=0)
# print('bagging_mean_squared_error: %s' % mean_squared_error(y_valid, pred))
Evaluating the predictions gives an MSE of about 0.1220.