Reposted from: https://mp.weixin.qq.com/s/9gEfkiZyZkoIgwRCYISQgQ
LightGBM is a fast, parallelizable gradient-boosted tree framework that builds on ideas from XGBoost. It integrates several ensemble-learning strategies and improves on XGBoost's node-splitting implementation, giving lower memory usage and faster training.
LightGBM documentation: https://lightgbm.readthedocs.io/en/latest/
Parameter reference: https://lightgbm.readthedocs.io/en/latest/Parameters.html
This article covers the following topics; the full runnable code is available from the source post linked above.
1 Installation
2 Usage
2.1 Defining the dataset
2.2 Training a model
2.3 Saving and loading models
2.4 Inspecting feature importance
2.5 Continued training
2.6 Adjusting hyperparameters during training
2.7 Custom loss functions
2.8 Tuning methods
Manual tuning
Grid search
Bayesian optimization
Installing LightGBM is straightforward, and on Linux it is easy to enable GPU training. Try pip first, and fall back to building from source if that fails.
Installation: building from source
git clone --recursive https://github.com/microsoft/LightGBM
cd LightGBM
mkdir build ; cd build
cmake ..

# enable the MPI communication layer for faster distributed training
# cmake -DUSE_MPI=ON ..

# GPU build for faster training
# cmake -DUSE_GPU=1 ..

make -j4
Installation: install with pip
# default CPU build
pip install lightgbm

# MPI build
pip install lightgbm --install-option=--mpi

# GPU build
pip install lightgbm --install-option=--gpu

# Note: pip 23.1+ removed --install-option; on recent pip versions,
# follow the current LightGBM installation guide for MPI/GPU builds.
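To confirm the installation succeeded, a quick sanity check from Python:

import lightgbm as lgb

# a clean import plus a version string means the package is usable
print(lgb.__version__)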
In Python, LightGBM can be called in two ways: through the native API or through the scikit-learn API. Both support training and validation; the native API is more flexible, so the choice comes down to personal preference.
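The walkthrough below uses the native API. For comparison, here is a minimal sketch of the scikit-learn interface; it assumes the X_train/X_test matrices and y_train/y_test labels that the next snippet defines:

from lightgbm import LGBMClassifier

# scikit-learn style estimator: fit/predict like any sklearn classifier
clf = LGBMClassifier(num_leaves=31, learning_rate=0.05, n_estimators=10)
clf.fit(X_train, y_train, eval_set=[(X_test, y_test)])
pred = clf.predict_proba(X_test)[:, 1]  # probability of the positive class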
import json

import numpy as np
import pandas as pd
import lightgbm as lgb

df_train = pd.read_csv('https://cdn.coggle.club/LightGBM/examples/binary_classification/binary.train', header=None, sep='\t')
df_test = pd.read_csv('https://cdn.coggle.club/LightGBM/examples/binary_classification/binary.test', header=None, sep='\t')
W_train = pd.read_csv('https://cdn.coggle.club/LightGBM/examples/binary_classification/binary.train.weight', header=None)[0]
W_test = pd.read_csv('https://cdn.coggle.club/LightGBM/examples/binary_classification/binary.test.weight', header=None)[0]

y_train = df_train[0]
y_test = df_test[0]
X_train = df_train.drop(0, axis=1)
X_test = df_test.drop(0, axis=1)
num_train, num_feature = X_train.shape

# create dataset for lightgbm
# if you want to re-use data, remember to set free_raw_data=False
lgb_train = lgb.Dataset(X_train, y_train,
                        weight=W_train, free_raw_data=False)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train,
                       weight=W_test, free_raw_data=False)

params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}
# generate feature names
feature_name = ['feature_' + str(col) for col in range(num_feature)]

gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                valid_sets=lgb_train,  # eval training data
                feature_name=feature_name,
                categorical_feature=[21])

# save model to file
gbm.save_model('model.txt')

print('Dumping model to JSON...')
model_json = gbm.dump_model()
with open('model.json', 'w+') as f:
    json.dump(model_json, f, indent=4)
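The snippet above saves the model; loading it back and predicting with it completes the save/load round trip:

# load the saved model into a fresh Booster and predict with it
bst = lgb.Booster(model_file='model.txt')
y_pred = bst.predict(X_test)
print('First 5 predictions:', y_pred[:5])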
# feature names
print('Feature names:', gbm.feature_name())

# feature importances
print('Feature importances:', list(gbm.feature_importance()))
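The importances can also be visualized with LightGBM's built-in plotting helper (requires matplotlib):

import matplotlib.pyplot as plt

# bar chart of per-feature split counts for the top features
lgb.plot_importance(gbm, max_num_features=10)
plt.show()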
# continue training
# init_model accepts:
# 1. model file name
# 2. Booster()
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model='model.txt',
                valid_sets=lgb_eval)
print('Finished 10 - 20 rounds with model file...')

# decay learning rates
# learning_rates accepts:
# 1. list/tuple with length = num_boost_round
# 2. function(curr_iter)
# (newer LightGBM versions drop this argument in favor of
#  callbacks=[lgb.reset_parameter(learning_rate=...)])
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                learning_rates=lambda iter: 0.05 * (0.99 ** iter),
                valid_sets=lgb_eval)
print('Finished 20 - 30 rounds with decay learning rates...')

# change other parameters during training
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                valid_sets=lgb_eval,
                callbacks=[lgb.reset_parameter(bagging_fraction=[0.7] * 5 + [0.6] * 5)])
print('Finished 30 - 40 rounds with changing bagging_fraction...')
# self-defined objective function
# f(preds: array, train_data: Dataset) -> grad: array, hess: array
# log likelihood loss
def loglikelihood(preds, train_data):
    labels = train_data.get_label()
    preds = 1. / (1. + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1. - preds)
    return grad, hess

# self-defined eval metric
# f(preds: array, train_data: Dataset) -> name: string, eval_result: float, is_higher_better: bool
# binary error
# NOTE: with a customized objective, the default prediction value is the raw margin.
# This may make the built-in evaluation metrics compute wrong results;
# e.g. for log likelihood loss the prediction is the score before the logistic transformation.
# Keep this in mind when you use the customization.
def binary_error(preds, train_data):
    labels = train_data.get_label()
    preds = 1. / (1. + np.exp(-preds))
    return 'error', np.mean(labels != (preds > 0.5)), False

gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                fobj=loglikelihood,
                feval=binary_error,
                valid_sets=lgb_eval)
print('Finished 40 - 50 rounds with self-defined objective function and eval metric...')
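One practical consequence of the custom objective: gbm.predict now returns raw margin scores rather than probabilities, so the logistic transformation has to be applied by hand:

# with a custom objective the model outputs raw scores,
# so apply the sigmoid manually to get probabilities
raw_pred = gbm.predict(X_test)
prob_pred = 1. / (1. + np.exp(-raw_pred))
print('First 5 probabilities:', prob_pred[:5])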
For manual tuning, the LightGBM documentation gives the following rules of thumb.

For faster speed:
- Use bagging by setting bagging_fraction and bagging_freq
- Use feature sub-sampling by setting feature_fraction
- Use small max_bin
- Use save_binary to speed up data loading in future learning
- Use parallel learning (see the Parallel Learning Guide in the LightGBM docs)

For better accuracy:
- Use large max_bin (may be slower)
- Use small learning_rate with large num_iterations
- Use large num_leaves (may cause over-fitting)
- Use bigger training data
- Try dart

To deal with over-fitting (an illustrative parameter set follows this list):
- Use small max_bin
- Use small num_leaves
- Use min_data_in_leaf and min_sum_hessian_in_leaf
- Use bagging by setting bagging_fraction and bagging_freq
- Use feature sub-sampling by setting feature_fraction
- Use bigger training data
- Try lambda_l1, lambda_l2 and min_gain_to_split for regularization
- Try max_depth to avoid growing deep trees
- Try extra_trees
- Try increasing path_smooth
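As referenced above, a minimal sketch of an over-fitting-oriented configuration; the concrete values are illustrative assumptions, not tuned recommendations:

# illustrative parameter set biased against over-fitting;
# every concrete value here is an assumption for demonstration only
conservative_params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 15,          # fewer leaves -> simpler trees
    'max_depth': 5,            # cap tree depth
    'min_data_in_leaf': 50,    # require more samples per leaf
    'feature_fraction': 0.8,   # feature sub-sampling
    'bagging_fraction': 0.8,   # row sub-sampling
    'bagging_freq': 5,
    'lambda_l1': 0.1,          # L1 regularization
    'lambda_l2': 0.1,          # L2 regularization
}
gbm_reg = lgb.train(conservative_params, lgb_train,
                    num_boost_round=100, valid_sets=lgb_eval)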
Grid search over a small hyperparameter grid with scikit-learn's GridSearchCV:

from sklearn.model_selection import GridSearchCV

lg = lgb.LGBMClassifier(silent=False)
param_dist = {"max_depth": [4, 5, 7],
              "learning_rate": [0.01, 0.05, 0.1],
              "num_leaves": [300, 900, 1200],
              "n_estimators": [50, 100, 150]
              }

grid_search = GridSearchCV(lg, n_jobs=-1, param_grid=param_dist, cv=5, scoring="roc_auc", verbose=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_estimator_, grid_search.best_score_)
Bayesian optimization of the same hyperparameters with the bayes_opt package:

import warnings
warnings.filterwarnings("ignore")

from bayes_opt import BayesianOptimization

def lgb_eval(max_depth, learning_rate, num_leaves, n_estimators):
    params = {
        "metric": 'auc'
    }
    params['max_depth'] = int(max(max_depth, 1))
    params['learning_rate'] = np.clip(learning_rate, 0, 1)
    params['num_leaves'] = int(max(num_leaves, 1))
    params['n_estimators'] = int(max(n_estimators, 1))
    cv_result = lgb.cv(params, lgb_train, nfold=5, seed=0, verbose_eval=200, stratified=False)
    return 1.0 * np.array(cv_result['auc-mean']).max()

lgbBO = BayesianOptimization(lgb_eval, {'max_depth': (4, 8),
                                        'learning_rate': (0.05, 0.2),
                                        'num_leaves': (20, 1500),
                                        'n_estimators': (5, 200)}, random_state=0)

lgbBO.maximize(init_points=5, n_iter=50, acq='ei')
print(lgbBO.max)
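lgbBO.max holds the best target value and parameters found; a minimal sketch of feeding them back into a final model (reusing lgb_train and lgb_eval from above):

# extract the best hyperparameters and cast the integer-valued
# ones back from the floats that the optimizer proposes
best = lgbBO.max['params']
final_params = {
    'objective': 'binary',
    'metric': 'auc',
    'max_depth': int(best['max_depth']),
    'learning_rate': best['learning_rate'],
    'num_leaves': int(best['num_leaves']),
}
final_model = lgb.train(final_params, lgb_train,
                        num_boost_round=int(best['n_estimators']),
                        valid_sets=lgb_eval)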