
Python Machine Learning 08: Decision Tree Algorithms

All of the code and data for this series can be downloaded from Professor Chen Qiang (陈强)'s personal homepage: Python数据程序

Reference: 陈强. 机器学习及Python应用 (Machine Learning and Python Applications). Beijing: Higher Education Press, 2021.

This series largely skips the mathematical theory and instead shows, from the code side, how readers can implement machine-learning methods with the most concise Python possible.


This chapter continues with nonparametric methods, turning to decision trees. Decision trees matured early: they are intuitive and convenient, resemble some of the low-level branching logic of computers, and have long seen wide use. The earliest algorithms include ID3, C4.5, C5.0, and CART; they are broadly similar, differing mainly in their loss (impurity) functions and in how many branches each split produces. The CART algorithm builds a binary tree; mathematically, it recursively cuts up the sample's feature space, so a decision tree's decision boundaries are always rectangular regions (staircase-like shapes), as the plots below will show. Decision trees handle both classification and regression: a tree for a classification problem is a classification tree, and one for a regression problem is a regression tree. Another advantage is that a tree can measure variable importance, telling you which of x1, x2, x3, x4, ... has the biggest influence on y. Linear regression cannot do this, and it also rests on many assumptions that decision trees largely avoid, so trees apply more broadly. We start with regression trees.
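As a quick formalization (my addition, using standard CART notation rather than anything from the book): a fitted regression tree partitions the feature space into $M$ rectangular regions $R_1, \dots, R_M$ and predicts a constant, the mean response, inside each region:

$$\hat f(x) = \sum_{m=1}^{M} \hat c_m \,\mathbf{1}(x \in R_m), \qquad \hat c_m = \operatorname{mean}\{\, y_i : x_i \in R_m \,\}$$

This piecewise-constant form is exactly why the decision boundaries look like staircases.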

A Regression Tree Example in Python

As before, import the packages and the data; we use the Boston housing data, a classic regression dataset:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import KFold, StratifiedKFold
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeRegressor, export_text
    from sklearn.tree import DecisionTreeClassifier, plot_tree
    from sklearn.datasets import load_boston
    from sklearn.metrics import cohen_kappa_score

    Boston = load_boston()
    Boston.feature_names
    # load_boston may be removed from later sklearn versions (the dataset has an
    # ethically problematic race-related variable); the lines below rebuild the
    # same data and target arrays from the original source instead.
    data_url = "http://lib.stat.cmu.edu/datasets/boston"
    raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
    data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
    target = raw_df.values[1::2, 2]
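Note that if your scikit-learn no longer ships load_boston (it was removed in version 1.2), the later references to Boston.feature_names will fail as well. A minimal workaround sketch, assuming the standard column ordering of the 13 Boston variables:

    # Hypothetical stand-in for Boston.feature_names when load_boston is gone;
    # this is the conventional column order of the Boston housing data.
    feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                     'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']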

The data were already described in the linear-regression chapter, so we go straight to splitting the training and test sets, fitting and scoring the model, and printing the tree:

    # Data Preparation
    X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.3, random_state=0)

    # Regression Tree
    model = DecisionTreeRegressor(max_depth=2, random_state=123)
    model.fit(X_train, y_train)
    model.score(X_test, y_test)

    # Print the fitted tree as text rules
    print(export_text(model, feature_names=list(Boston.feature_names)))
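As a side note (my addition, not in the original): for a regressor, model.score returns the test-set R². A small sketch computing the test RMSE as an extra check, using sklearn.metrics.mean_squared_error:

    from sklearn.metrics import mean_squared_error
    pred = model.predict(X_test)                    # tree predictions on the test set
    rmse = mean_squared_error(y_test, pred) ** 0.5  # root mean squared error
    print(rmse)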

sklearn can also plot the decision tree directly:

    plot_tree(model, feature_names=Boston.feature_names, node_ids=True, rounded=True, precision=2)
    plt.tight_layout()
    plt.savefig('tree.png', dpi=200)

Decision trees can also be regularized by pruning, which reduces model complexity and guards against overfitting. First, visualize the penalty parameter against the tree's total loss:
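Formally (a standard definition, my addition): cost-complexity pruning trades fit against tree size by minimizing

$$R_\alpha(T) = R(T) + \alpha\,|\tilde T|,$$

where $R(T)$ is the total leaf impurity (leaf MSE for a regression tree), $|\tilde T|$ is the number of leaves, and $\alpha$ is the penalty exposed in sklearn as ccp_alpha; larger $\alpha$ prunes the tree harder.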

    # Graph total impurities versus ccp_alphas
    model = DecisionTreeRegressor(random_state=123)
    path = model.cost_complexity_pruning_path(X_train, y_train)
    plt.plot(path.ccp_alphas, path.impurities, marker='o', drawstyle='steps-post')
    plt.xlabel('alpha (cost-complexity parameter)')
    plt.ylabel('Total Leaf MSE')
    plt.title('Total Leaf MSE vs alpha for Training Set')
    max(path.ccp_alphas), max(path.impurities)

Grid search for the optimal hyperparameter, the cost-complexity penalty:

    param_grid = {'ccp_alpha': path.ccp_alphas}
    kfold = KFold(n_splits=10, shuffle=True, random_state=1)
    model = GridSearchCV(DecisionTreeRegressor(random_state=123), param_grid, cv=kfold)
    model.fit(X_train, y_train)
    model.best_params_
    model = model.best_estimator_
    model.score(X_test, y_test)
    plot_tree(model, feature_names=Boston.feature_names, node_ids=True, rounded=True, precision=2)
    plt.tight_layout()
    plt.savefig('tree2.png', dpi=900)

This tree has quite a few branches.

Next, inspect the model's parameters and visualize variable importance:

    # Depth of the tree
    model.get_depth()
    # Number of leaf nodes
    model.get_n_leaves()
    # All parameters
    model.get_params()

    # Visualize Feature Importance
    model.feature_importances_
    sorted_index = model.feature_importances_.argsort()
    X = pd.DataFrame(Boston.data, columns=Boston.feature_names)
    plt.barh(range(X.shape[1]), model.feature_importances_[sorted_index])
    plt.yticks(np.arange(X.shape[1]), X.columns[sorted_index])
    plt.xlabel('Feature Importance')
    plt.ylabel('Feature')
    plt.title('Decision Tree')
    plt.tight_layout()

 

The plot shows that the number of rooms, RM, has the largest influence on house prices.

Compare the fitted predictions with the true values:

    pred = model.predict(X_test)
    plt.scatter(pred, y_test, alpha=0.6)
    w = np.linspace(min(pred), max(pred), 100)
    plt.plot(w, w)   # 45-degree reference line
    plt.xlabel('pred')
    plt.ylabel('y_test')
    plt.title('Tree Prediction')

Not bad.


 

A Classification Tree Example in Python

Next we use bank direct-marketing data, where the response variable y records whether the client takes up the bank's offer (yes or no). Import the packages and read the data:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import KFold, StratifiedKFold
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeRegressor, export_text
    from sklearn.tree import DecisionTreeClassifier, plot_tree
    from sklearn.metrics import cohen_kappa_score

    bank = pd.read_csv('bank-additional.csv', sep=';')
    bank.shape
    pd.options.display.max_columns = 30
    bank.head()

The data look as shown above, with quite a few variables. Clean them next: generate dummy variables for the categorical columns (a toy illustration of get_dummies follows the code block), split the training and test sets, and fit a classification tree.

    # Drop the 'duration' variable
    bank = bank.drop('duration', axis=1)
    X_raw = bank.iloc[:, :-1]
    X = pd.get_dummies(X_raw)   # one-hot encode the categorical columns
    X.head(2)
    # Extract y
    y = bank.iloc[:, -1]
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=1000, random_state=1)

    # Classification Tree
    model = DecisionTreeClassifier(max_depth=2, random_state=123)
    model.fit(X_train, y_train)
    # Evaluate
    model.score(X_test, y_test)
    # Plot the tree
    plot_tree(model, feature_names=X.columns, node_ids=True, rounded=True, precision=2)
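For intuition, here is a toy sketch (my addition; the column name is hypothetical) of what pd.get_dummies does to one categorical column:

    import pandas as pd
    toy = pd.DataFrame({'job': ['admin', 'blue-collar', 'admin']})
    # Each level of 'job' becomes its own indicator column
    # (job_admin, job_blue-collar), holding 0/1 or boolean values
    # depending on your pandas version.
    print(pd.get_dummies(toy))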

The relationship between the penalty parameter and the loss:

    # Graph total impurities versus ccp_alphas
    model = DecisionTreeClassifier(random_state=123)
    path = model.cost_complexity_pruning_path(X_train, y_train)
    plt.plot(path.ccp_alphas, path.impurities, marker='o', drawstyle='steps-post')
    plt.xlabel('alpha (cost-complexity parameter)')
    plt.ylabel('Total Leaf Impurities')
    plt.title('Total Leaf Impurities vs alpha for Training Set')
    max(path.ccp_alphas), max(path.impurities)

Grid search for the optimal hyperparameter:

    param_grid = {'ccp_alpha': path.ccp_alphas}
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    model = GridSearchCV(DecisionTreeClassifier(random_state=123), param_grid, cv=kfold)
    model.fit(X_train, y_train)
    model.best_params_
    model = model.best_estimator_
    model.score(X_test, y_test)
    plot_tree(model, feature_names=X.columns, node_ids=True, impurity=True, proportion=True, rounded=True, precision=2)

Visualize variable importance:

    model.feature_importances_
    sorted_index = model.feature_importances_.argsort()
    plt.barh(range(X_train.shape[1]), model.feature_importances_[sorted_index])
    plt.yticks(np.arange(X_train.shape[1]), X_train.columns[sorted_index])
    plt.xlabel('Feature Importance')
    plt.ylabel('Feature')
    plt.title('Decision Tree')
    plt.tight_layout()

 

    # Prediction Performance: compute the confusion matrix
    pred = model.predict(X_test)
    table = pd.crosstab(y_test, pred, rownames=['Actual'], colnames=['Predicted'])
    table

 

Compute metrics from the confusion matrix (an equivalent sklearn.metrics sketch follows the code):

    table = np.array(table)
    Accuracy = (table[0, 0] + table[1, 1]) / np.sum(table)   # share of correct predictions
    Accuracy
    Sensitivity = table[1, 1] / (table[1, 0] + table[1, 1])  # true-positive rate for 'yes'
    Sensitivity
    cohen_kappa_score(y_test, pred)                          # agreement beyond chance
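For comparison, the same numbers can be obtained from sklearn.metrics directly; a minimal sketch, assuming 'yes' is the positive class in this dataset:

    from sklearn.metrics import confusion_matrix, accuracy_score, recall_score
    print(confusion_matrix(y_test, pred))               # same counts as the crosstab
    print(accuracy_score(y_test, pred))                 # Accuracy
    print(recall_score(y_test, pred, pos_label='yes'))  # Sensitivity (recall of 'yes')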

Now take 0.1 as the classification threshold and recompute the confusion matrix and metrics (an ROC sketch follows the code):

    prob = model.predict_proba(X_test)
    prob
    model.classes_
    prob_yes = prob[:, 1]          # predicted probability of the 'yes' class
    pred_new = (prob_yes >= 0.1)   # classify as positive above the 0.1 threshold
    pred_new
    table = pd.crosstab(y_test, pred_new, rownames=['Actual'], colnames=['Predicted'])
    table
    table = np.array(table)
    Accuracy = (table[0, 0] + table[1, 1]) / np.sum(table)
    Accuracy
    Sensitivity = table[1, 1] / (table[1, 0] + table[1, 1])
    Sensitivity
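Rather than trying thresholds one at a time, the whole trade-off can be seen at once with an ROC curve; a sketch (my addition), reusing prob_yes from above:

    from sklearn.metrics import roc_curve, roc_auc_score
    fpr, tpr, thresholds = roc_curve(y_test, prob_yes, pos_label='yes')
    plt.plot(fpr, tpr)                        # ROC curve of the tree
    plt.plot([0, 1], [0, 1], linestyle='--')  # chance diagonal
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate (Sensitivity)')
    plt.title('ROC Curve for the Classification Tree')
    print(roc_auc_score(y_test == 'yes', prob_yes))  # area under the curve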

Switch the loss function to entropy and grid-search again for the best tree:
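For reference (standard definitions, my addition): with class proportions $\hat p_k$ in a node, sklearn's default Gini criterion and the entropy criterion used here are

$$\text{Gini} = 1 - \sum_k \hat p_k^2, \qquad \text{Entropy} = -\sum_k \hat p_k \log \hat p_k.$$

Both reach their minimum of zero when a node is pure.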

    param_grid = {'ccp_alpha': path.ccp_alphas}
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    model = GridSearchCV(DecisionTreeClassifier(criterion='entropy', random_state=123), param_grid, cv=kfold)
    model.fit(X_train, y_train)
    model.score(X_test, y_test)
    pred = model.predict(X_test)
    pd.crosstab(y_test, pred, rownames=['Actual'], colnames=['Predicted'])


Finally, visualize the decision boundary using two features of the iris dataset (plot_decision_regions comes from the third-party mlxtend package, installable with pip install mlxtend):

    ## Decision boundary for iris data
    from sklearn.datasets import load_iris
    from mlxtend.plotting import plot_decision_regions

    X, y = load_iris(return_X_y=True)
    X2 = X[:, 2:4]   # keep only petal length and petal width
    model = DecisionTreeClassifier(random_state=123)
    path = model.cost_complexity_pruning_path(X2, y)
    param_grid = {'ccp_alpha': path.ccp_alphas}
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    model = GridSearchCV(DecisionTreeClassifier(random_state=123), param_grid, cv=kfold)
    model.fit(X2, y)
    model.score(X2, y)
    plot_decision_regions(X2, y, model)
    plt.xlabel('petal_length')
    plt.ylabel('petal_width')
    plt.title('Decision Boundary for Decision Tree')

As claimed at the start, the decision boundaries are all rectangular regions.

 
