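scikit-learn ships a boosting implementation of its own, but there is another, more powerful library: xgboost. To install it, just run
pip3 install xgboost
and it is installed, although in practice the installation process turned out to be full of twists and turns.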
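Next, we need to understand the general workflow for using xgboost: load the data, set the parameters, train, then predict. The examples below all stay within this framework.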
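This example uses the agaricus data. Agaricus is a genus of mushrooms (the best-known Chinese gloss is the "Brazilian mushroom"). There are many kinds of these mushrooms, some poisonous and some not. Can we predict whether a given mushroom is poisonous?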
import xgboost as xgb
import numpy as np

# 1. Basic use of xgboost
# 2. Custom loss function: gradient and second derivative

train_data = 'xgboost_data/agaricus_train.txt'
test_data = 'xgboost_data/agaricus_test.txt'


# Custom loss: returns the gradient g and second derivative h of the logistic loss
def log_reg(y_hat, y):
    p = 1.0 / (1.0 + np.exp(-y_hat))
    g = p - y.get_label()
    h = p * (1.0 - p)
    return g, h


# Error rate; in this example a prediction below 0.5 means "not poisonous"
def error_rate(y_hat, y):
    return 'error', float(sum(y.get_label() != (y_hat > 0.5))) / len(y_hat)


if __name__ == "__main__":
    # Load the data
    data_train = xgb.DMatrix(train_data)
    data_test = xgb.DMatrix(test_data)

    # Set the parameters ('silent' is replaced by 'verbosity' in newer xgboost releases)
    param = {'max_depth': 3, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}  # logitraw
    watchlist = [(data_test, 'eval'), (data_train, 'train')]
    n_round = 7
    # 'feval' is deprecated in newer xgboost releases in favour of 'custom_metric'
    bst = xgb.train(param, data_train, num_boost_round=n_round, evals=watchlist,
                    obj=log_reg, feval=error_rate)

    # Compute the error rate on the test set
    y_hat = bst.predict(data_test)
    y = data_test.get_label()
    print('y_hat', y_hat)
    print('y', y)
    error = sum(y != (y_hat > 0.5))
    err_rate = float(error) / len(y_hat)  # renamed so it no longer shadows error_rate()
    print('Total samples:\t', len(y_hat))
    print('Errors:\t%4d' % error)
    print('Error rate:\t%.5f%%' % (100 * err_rate))
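Notes:
The log_reg and error_rate functions defined at the top are later passed to the train method as its obj and feval arguments. In other words, boosting is driven by the user-defined loss log_reg (through the gradient and second derivative it returns), and the user-defined error_rate is used as the evaluation metric reported during training.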
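To see where the g and h returned by log_reg come from, the following sketch compares the same formulas against finite differences of the logistic loss. It uses plain NumPy arrays instead of a DMatrix, and the names logistic_loss and grad_hess as well as the sample values are assumptions introduced for this check only.

```python
import numpy as np


def logistic_loss(margin, label):
    """Per-sample log loss as a function of the raw score (margin)."""
    p = 1.0 / (1.0 + np.exp(-margin))
    return -(label * np.log(p) + (1.0 - label) * np.log(1.0 - p))


def grad_hess(margin, label):
    """Same formulas as log_reg: first and second derivative w.r.t. the margin."""
    p = 1.0 / (1.0 + np.exp(-margin))
    return p - label, p * (1.0 - p)


margin = np.array([-2.0, -0.5, 0.0, 1.5])
label = np.array([0.0, 1.0, 1.0, 0.0])

# Central finite differences of the loss around each margin value.
eps = 1e-4
up = logistic_loss(margin + eps, label)
mid = logistic_loss(margin, label)
down = logistic_loss(margin - eps, label)
num_grad = (up - down) / (2 * eps)
num_hess = (up - 2 * mid + down) / eps ** 2

g, h = grad_hess(margin, label)
grad_ok = bool(np.allclose(g, num_grad, atol=1e-6))
hess_ok = bool(np.allclose(h, num_hess, atol=1e-5))
print(grad_ok, hess_ok)
```

The agreement shows that g = p - y and h = p(1 - p) are the derivatives of the log loss with respect to the raw score, not with respect to the probability.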
About the train function:
def train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None,
          maximize=None, early_stopping_rounds=None, evals_result=None,
          verbose_eval=True, xgb_model=None, callbacks=None)
"""
dtrain: the training data
num_boost_round: the number of boosting iterations
evals: validation data; a list of (DMatrix, name) pairs that names the training set and the test sets to evaluate on
"""
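train takes a params argument, which brings us to the Booster parameters:
max_depth: the maximum depth of each decision tree
eta: the learning rate; the default is 0.3
silent: silent mode. If set to 1, no messages are printed while the model runs (newer xgboost releases use verbosity instead)
objective: the loss function. The default is reg:linear (renamed reg:squarederror in newer releases); binary:logistic is the usual choice for binary classification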
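xgboost stores its data in a DMatrix. Conceptually this is a two-dimensional matrix, but it is an internal structure that xgboost has optimized for memory efficiency and training speed.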
The get_label method keeps appearing in the code above, so what exactly is a label?
One explanation in English makes it quite clear:
The label is the name of some category. If you’re building a machine learning system to distinguish fruits coming down a conveyor belt, labels for training samples might be “apple”, “orange”, “banana”. The features are any kind of information you can extract about each sample. In our example, you might have one feature for colour, another for weight, another for length, and another for width. Maybe you would have some measure of concavity or linearity or ball-ness.
In practice, then: the label says what a sample actually is, while the features are its individual attributes.
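This example uses the iris dataset. There are many species of iris; this dataset contains three of them: Setosa, Versicolor, and Virginica. The species differ in attributes such as sepal and petal length and width. Let us train XGBoost on the data and see whether it can predict the species effectively.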
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation


# Manual mapping from species name to class id (unused: pd.Categorical does the encoding below)
def iris_type(s):
    it = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
    return it[s]


if __name__ == "__main__":
    path = 'xgboost_data/iris.data'  # path to the data file
    data = pd.read_csv(path, header=None)
    # Columns 0-3 are the features, column 4 is the species name
    x, y = data[list(range(4))], data[4]
    y = pd.Categorical(y).codes  # encode the species names as integer codes

    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, test_size=50)

    data_train = xgb.DMatrix(x_train, label=y_train)
    data_test = xgb.DMatrix(x_test, label=y_test)
    watch_list = [(data_test, 'eval'), (data_train, 'train')]

    # Tree depth 2, learning rate 0.3
    # ('silent' is replaced by 'verbosity' in newer xgboost releases)
    param = {'max_depth': 2, 'eta': 0.3, 'silent': 1,
             'objective': 'multi:softmax', 'num_class': 3}

    bst = xgb.train(param, data_train, num_boost_round=6, evals=watch_list)
    y_hat = bst.predict(data_test)

    # Compare predictions with the true classes and report the accuracy
    result = y_test.reshape(1, -1) == y_hat
    print('Accuracy:\t', float(np.sum(result)) / len(y_hat))
    print('END.....\n')
Notes:
pandas.Categorical(values, categories=None, ordered=None, dtype=None)
"""
values     : [list-like] The values of the categorical.
categories : [index-like] The unique categories for this categorical.
ordered    : [boolean] If False, the categorical is treated as unordered.
dtype      : [CategoricalDtype] An instance of CategoricalDtype to use.
Raises:
ValueError : If the categories do not validate.
TypeError  : If ordered=True is given explicitly but the values cannot be sorted.
Returns: a Categorical variable
"""
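The iris example relies on the .codes attribute of the resulting Categorical to turn species names into integers. A minimal sketch, using a short hand-written list in place of the real data column:

```python
import pandas as pd

# A few species names in arbitrary order (stand-in for column 4 of iris.data).
species = ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa']

cat = pd.Categorical(species)

# The categories are inferred from the unique values, in sorted order.
categories = list(cat.categories)
# Each value is replaced by the position of its category.
codes = cat.codes.tolist()

print(categories)
print(codes)
```

Because the categories are sorted alphabetically, the resulting codes match the 0/1/2 mapping written by hand in iris_type.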
reshape(1, -1): reshape into 1 row
reshape(2, -1): reshape into 2 rows
reshape(-1, 1): reshape into 1 column
reshape(-1, 2): reshape into 2 columns
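The four cases, and the way the iris example uses reshape(1, -1) before comparing with the predictions, can be checked with a small sketch. The arrays below are made-up values, not output from the model.

```python
import numpy as np

y_test = np.array([0, 2, 1, 1, 0, 2])              # true classes (made up)
y_hat = np.array([0.0, 2.0, 1.0, 2.0, 0.0, 2.0])   # predicted classes (made up)

# -1 tells NumPy to infer that dimension from the array size.
one_row = y_test.reshape(1, -1)    # shape (1, 6)
two_rows = y_test.reshape(2, -1)   # shape (2, 3)
one_col = y_test.reshape(-1, 1)    # shape (6, 1)
two_cols = y_test.reshape(-1, 2)   # shape (3, 2)

# A (1, 6) array compared with a (6,) array broadcasts to (1, 6):
# one True/False per sample.
result = y_test.reshape(1, -1) == y_hat
accuracy = float(np.sum(result)) / len(y_hat)
print(result.shape, accuracy)
```

Using reshape(-1, 1) in the comparison instead would broadcast to a 6 x 6 matrix and give a wrong count, so the row form (or simply y_test == y_hat) is the one to use here.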
We again take the classic iris dataset, this time using an SVM to make the predictions.
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Feature names, in column order
iris_feature = 'sepal length', 'sepal width', 'petal length', 'petal width'

if __name__ == "__main__":
    path = "./iris.data"  # path to the data file
    data = pd.read_csv(path, header=None)
    x, y = data[list(range(4))], data[4]
    y = pd.Categorical(y).codes  # encode the species names as integer codes
    x = x[[0, 1]]  # keep only columns 0 and 1
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, train_size=0.6)

    # The classifier
    clf = svm.SVC(C=0.1, kernel='linear', decision_function_shape='ovr')
    clf.fit(
Copyright © 2003-2013 www.wpsshop.cn. All rights reserved.