Translation: Complete Guide to Parameter Tuning in XGBoost
When a model fails to deliver the expected results, XGBoost is the data scientist's ultimate weapon. XGBoost is a highly sophisticated algorithm, powerful enough to learn all kinds of irregularities in the data.
Building a model with XGBoost is easy, but improving its performance takes real effort, because the algorithm exposes a large number of parameters. Tuning is therefore unavoidable, yet knowing which parameters to adjust, and which values lead to better output, is hard.
This article is aimed at XGBoost beginners and walks through the information needed to tune its parameters.
XGBoost (eXtreme Gradient Boosting) is an advanced gradient boosting algorithm. To deepen your understanding of XGBoost, I recommend the following articles by its author (one of which I translated previously):
1.XGBoost Guide – Introduction to Boosted Trees
2.Words from the Author of XGBoost
XGBoost's parameters fall into three categories:
1. General Parameters: govern the overall functioning
booster [default=gbtree]: which booster to use, gbtree (tree-based models) or gblinear (linear models)
silent [default=0]: set to 1 to suppress running messages
2. Booster Parameters
Although XGBoost provides two boosters, the author only discusses the tree booster's parameters, because the tree booster consistently outperforms the linear booster.
3. Learning Task Parameters
These parameters define the optimization objective and the metric to be calculated at each step.
Some parameters have different names in the Python sklearn interface:
1. eta -> learning_rate
2. lambda -> reg_lambda
3. alpha -> reg_alpha
You may wonder why n_estimators, familiar from GBM, is not mentioned here. It does exist as a parameter of XGBClassifier; in the native xgboost API it corresponds to num_boost_round, which takes effect when we call the fit function.
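As an illustration of this naming correspondence, here is a minimal sketch (the parameter values are placeholders, not recommendations) of the same settings expressed through the native API and through the sklearn wrapper:
- #Native API: eta / lambda / alpha as dict keys, num_boost_round passed to train/cv
- import xgboost as xgb
- from xgboost.sklearn import XGBClassifier
- params = {'eta': 0.1, 'lambda': 1, 'alpha': 0, 'objective': 'binary:logistic'}
- #bst = xgb.train(params, dtrain, num_boost_round=100)  #dtrain would be an xgb.DMatrix
-
- #sklearn wrapper: the same settings under their sklearn-style names
- clf = XGBClassifier(learning_rate=0.1, reg_lambda=1, reg_alpha=0,
-                     n_estimators=100, objective='binary:logistic')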
The author recommends the following links to deepen your understanding of XGBoost further:
1.XGBoost Parameters (official guide)
2.XGBoost Demo Codes (xgboost GitHub repository)
3.Python API Reference (official guide)
- #Import the required data and libraries
- #Import libraries:
- import pandas as pd
- import numpy as np
- import xgboost as xgb
- from xgboost.sklearn import XGBClassifier
- from sklearn import cross_validation, metrics #Additional sklearn functions
- from sklearn.grid_search import GridSearchCV #Performing grid search
-
- import matplotlib.pylab as plt
- %matplotlib inline
- from matplotlib.pylab import rcParams
- rcParams['figure.figsize'] = 12, 4
-
- train = pd.read_csv('train_modified.csv')
- target = 'Disbursed'
- IDcol = 'ID'

The author uses XGBoost through two interfaces here:
1. xgb: the native xgboost library, which provides the cv function
2. XGBClassifier: the sklearn wrapper for XGBoost, which makes it possible to use sklearn's grid search with parallel computation
- #Define a helper function that builds an xgboost model and reports its performance
- def modelfit(alg, dtrain, predictors,useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
-
- if useTrainCV:
- xgb_param = alg.get_xgb_params()
- xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
- cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
- metrics='auc', early_stopping_rounds=early_stopping_rounds, show_progress=False)
- alg.set_params(n_estimators=cvresult.shape[0])
-
- #Fit the algorithm on the data
- alg.fit(dtrain[predictors], dtrain['Disbursed'],eval_metric='auc')
-
- #Predict training set:
- dtrain_predictions = alg.predict(dtrain[predictors])
- dtrain_predprob = alg.predict_proba(dtrain[predictors])[:,1]
-
- #Print model report:
- print("\nModel Report")
- print("Accuracy : %.4g" % metrics.accuracy_score(dtrain['Disbursed'].values, dtrain_predictions))
- print("AUC Score (Train): %f" % metrics.roc_auc_score(dtrain['Disbursed'], dtrain_predprob))
-
- feat_imp = pd.Series(alg.booster().get_fscore()).sort_values(ascending=False)
- feat_imp.plot(kind='bar', title='Feature Importances')
- plt.ylabel('Feature Importance Score')
-
- #The xgboost sklearn wrapper (at the time of writing) does not expose feature_importances_, but get_fscore() provides the same functionality
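Note: in more recent releases of the xgboost sklearn wrapper, a feature_importances_ attribute is available directly. A minimal equivalent sketch, assuming the same fitted estimator alg and predictors list used above:
- #Equivalent plot using the feature_importances_ attribute of newer xgboost versions
- feat_imp = pd.Series(alg.feature_importances_, index=predictors).sort_values(ascending=False)
- feat_imp.plot(kind='bar', title='Feature Importances')
- plt.ylabel('Feature Importance Score')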

General Approach for Parameter Tuning
The general approach is as follows:
1. Choose a relatively high learning rate: 0.1 usually works, but depending on the problem, values between 0.05 and 0.3 are reasonable. For the chosen learning rate, determine the optimal number of trees; xgboost provides a very useful cv function that runs cross-validation and returns the optimal tree count.
2. Tune the tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree).
3. Tune the regularization parameters (lambda, alpha).
4. Lower the learning rate and decide on the optimal parameters.
Step 1: Fix learning rate and number of estimators for tuning tree-based parameters
Set the initial parameter values:
- #Choose a suitable number of trees using a fixed learning rate of 0.1 and cv
- #Choose all predictors except target & IDcols
- predictors = [x for x in train.columns if x not in [target, IDcol]]
- xgb1 = XGBClassifier(
- learning_rate =0.1,
- n_estimators=1000,
- max_depth=5,
- min_child_weight=1,
- gamma=0,
- subsample=0.8,
- colsample_bytree=0.8,
- objective= 'binary:logistic',
- nthread=4,
- scale_pos_weight=1,
- seed=27)
- modelfit(xgb1, train, predictors)
- #The author's tuned tree count came out to 140; if that is too heavy for your system, increase the learning rate and rerun
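For example, a minimal sketch of such a rerun (the value 0.2 is purely an illustration, not a value from the author):
- #Hypothetical rerun: raise the learning rate so cv settles on fewer trees
- xgb1.set_params(learning_rate=0.2)
- modelfit(xgb1, train, predictors)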

Step 2: Tune max_depth and min_child_weight
These two parameters are tuned first because they have the largest impact on the model.
- param_test1 = {
- 'max_depth':range(3,10,2),
- 'min_child_weight':range(1,6,2)
- }
- gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=5,
- min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
- objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
- param_grid = param_test1, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
- gsearch1.fit(train[predictors],train[target])
- gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_
The optimal values found here are max_depth=5 and min_child_weight=5.
Since the previous search used a step of 2, search one step above and below each optimal value to see whether better parameters can be found:
- param_test2 = {
- 'max_depth':[4,5,6],
- 'min_child_weight':[4,5,6]
- }
- gsearch2 = GridSearchCV(estimator = XGBClassifier( learning_rate=0.1, n_estimators=140, max_depth=5,
- min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
- objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),
- param_grid = param_test2, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
- gsearch2.fit(train[predictors],train[target])
- gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_
The optimal parameters from this run are max_depth=4 and min_child_weight=6. The author's cv results also suggest that further gains are hard to come by, so we can try pushing min_child_weight a bit further to see the effect:
- param_test2b = {
- 'min_child_weight':[6,8,10,12]
- }
- gsearch2b = GridSearchCV(estimator = XGBClassifier( learning_rate=0.1, n_estimators=140, max_depth=4,
- min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
- objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),
- param_grid = param_test2b, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
- gsearch2b.fit(train[predictors],train[target])
- gsearch2b.grid_scores_, gsearch2b.best_params_, gsearch2b.best_score_
Step 3: Tune gamma
- param_test3 = {
- 'gamma':[i/10.0 for i in range(0,5)]
- }
- gsearch3 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=4,
- min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
- objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),
- param_grid = param_test3, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
- gsearch3.fit(train[predictors],train[target])
- gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_
With the parameters tuned so far, we recalibrate the number of trees (modelfit's cv step resets n_estimators, which is where the value 177 used below comes from) and look at the model's feature importances:
- xgb2 = XGBClassifier(
- learning_rate =0.1,
- n_estimators=1000,
- max_depth=4,
- min_child_weight=6,
- gamma=0,
- subsample=0.8,
- colsample_bytree=0.8,
- objective= 'binary:logistic',
- nthread=4,
- scale_pos_weight=1,
- seed=27)
- modelfit(xgb2, train, predictors)
Step 4: Tune subsample and colsample_bytree
- param_test4 = {
- 'subsample':[i/10.0 for i in range(6,10)],
- 'colsample_bytree':[i/10.0 for i in range(6,10)]
- }
- gsearch4 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
- min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
- objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),
- param_grid = param_test4, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
- gsearch4.fit(train[predictors],train[target])
- gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_
- #The previous step found the optimum for both parameters to be 0.8; this step searches around those values with a step of 0.05
- param_test5 = {
- 'subsample':[i/100.0 for i in range(75,90,5)],
- 'colsample_bytree':[i/100.0 for i in range(75,90,5)]
- }
- gsearch5 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
- min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
- objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),
- param_grid = param_test5, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
- gsearch5.fit(train[predictors],train[target])
Step 5: Tuning Regularization Parameters
The purpose of this step is to reduce overfitting through regularization. Most people skip these parameters, because gamma provides similar functionality.
- param_test6 = {
- 'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100]
- }
- gsearch6 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
- min_child_weight=6, gamma=0.1, subsample=0.8, colsample_bytree=0.8,
- objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),
- param_grid = param_test6, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
- gsearch6.fit(train[predictors],train[target])
- gsearch6.grid_scores_, gsearch6.best_params_, gsearch6.best_score_
The result after this step may actually be slightly worse; the approach is to search more finely around the best value obtained (0.01) to see whether a better result can be found:
- param_test7 = {
- 'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05]
- }
- gsearch7 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
- min_child_weight=6, gamma=0.1, subsample=0.8, colsample_bytree=0.8,
- objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),
- param_grid = param_test7, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
- gsearch7.fit(train[predictors],train[target])
- gsearch7.grid_scores_, gsearch7.best_params_, gsearch7.best_score_
Then, using the better value obtained (reg_alpha=0.005), we look at the model's overall performance again:
- xgb3 = XGBClassifier(
- learning_rate =0.1,
- n_estimators=1000,
- max_depth=4,
- min_child_weight=6,
- gamma=0,
- subsample=0.8,
- colsample_bytree=0.8,
- reg_alpha=0.005,
- objective= 'binary:logistic',
- nthread=4,
- scale_pos_weight=1,
- seed=27)
- modelfit(xgb3, train, predictors)
Step 6: Reducing Learning Rate
The final step is to lower the learning rate and add more trees:
- xgb4 = XGBClassifier(
- learning_rate =0.01,
- n_estimators=5000,
- max_depth=4,
- min_child_weight=6,
- gamma=0,
- subsample=0.8,
- colsample_bytree=0.8,
- reg_alpha=0.005,
- objective= 'binary:logistic',
- nthread=4,
- scale_pos_weight=1,
- seed=27)
- modelfit(xgb4, train, predictors)
Finally, the author shares two pieces of advice:
1. It is difficult to achieve a large jump in model performance through parameter tuning alone.
2. To improve the model further, turn to feature engineering, model ensembling, and stacking.