Chapter 6. Decision Tree
Generating a CART decision tree is the process of recursively building a binary decision tree. CART selects features with the Gini index minimization criterion and always produces binary splits.
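For a node whose samples fall into classes with proportions p_k, the Gini index is 1 - Σ p_k², and a candidate binary split is scored by how much it reduces the weighted Gini of the two child nodes relative to the parent (the "Gini gain" used in the steps below). A minimal helper sketch of these two quantities:

def gini(labels):
    # Gini index of a node: 1 minus the sum of squared class proportions
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_gain(parent, left, right):
    # Drop in Gini from the parent node to the weighted child nodes
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)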
①. Gini index of the root node:
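As a hedged reconstruction, assuming the commonly used 10-record loan-default table that the numbers in step ⑤ are consistent with (3 defaulters, 7 non-defaulters; this table is an assumption, not shown above), the root Gini would be:

# Assumed class counts at the root: 3 "yes" (default) and 7 "no"
p_yes, p_no = 3 / 10, 7 / 10
gini_root = 1 - p_yes ** 2 - p_no ** 2   # = 0.42
print(gini_root)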
②. Splitting on home ownership:
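Under the same assumed table, the 3 home owners all repay and the 7 non-owners include the 3 defaulters, so the split would be scored roughly as follows:

# Assumed split: 3 owners (0 defaults) vs. 7 non-owners (3 defaults); root Gini assumed to be 0.42
gini_owner = 0.0                                     # pure node
gini_non_owner = 1 - (3 / 7) ** 2 - (4 / 7) ** 2     # about 0.49
weighted = 0.3 * gini_owner + 0.7 * gini_non_owner   # about 0.343
gain = 0.42 - weighted                               # about 0.077
print(gain)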
③. Splitting on marital status:
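CART tests a multi-valued attribute through binary groupings. Assuming the grouping {married} vs. {single, divorced}, with 4 pure non-defaulters on one side and a 3/3 mix on the other (again an assumed table), the gain would be:

# Assumed grouping: 4 married records (0 defaults) vs. 6 single/divorced records (3 defaults)
gini_married = 0.0                                   # pure node
gini_other = 1 - 0.5 ** 2 - 0.5 ** 2                 # 0.5
weighted = 0.4 * gini_married + 0.6 * gini_other     # 0.30
gain = 0.42 - weighted                               # 0.12, matching step ⑤
print(gain)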
④. Splitting on annual income:
For example, for the adjacent annual-income values 60 and 70, the midpoint is 65. Taking 65 as the split point, the Gini gain is:
Gini gains for all candidate annual-income split points:
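For a continuous attribute, CART sorts the distinct values, takes each midpoint of adjacent values (such as 65 between 60 and 70) as a candidate threshold, and keeps the threshold with the largest gain. A hedged sketch of that scan, assuming the same 10-record table (the income values and labels below are assumptions, not taken from the text above):

# Assumed (annual income in thousands, defaulted) pairs
records = [(60, 0), (70, 0), (75, 0), (85, 1), (90, 1), (95, 1),
           (100, 0), (120, 0), (125, 0), (220, 0)]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

n = len(records)
incomes = sorted(v for v, _ in records)
root_gini = gini([y for _, y in records])             # 0.42
for lo, hi in zip(incomes, incomes[1:]):
    t = (lo + hi) / 2                                 # candidate midpoint, e.g. 65
    left = [y for v, y in records if v <= t]
    right = [y for v, y in records if v > t]
    gain = root_gini - len(left) / n * gini(left) - len(right) / n * gini(right)
    print(t, round(gain, 3))                          # largest gain is 0.12 under this assumed table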
⑤. From the calculations above, two of the three attributes tie for the largest gain when splitting the root node: marital status and annual income, each with a gain of 0.12, so either can be chosen as the root. (Here marital status is taken as the root node.)
Gini index with marital status as the root node:
Splitting on home ownership:
Splitting on annual income:
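Under the same assumed table, the "married" child of the root is pure, so only the single/divorced child (6 records, 3 defaults) needs further splitting. A hedged sketch of rescoring the home-ownership split inside that child:

# Assumed single/divorced child: 6 records, 3 defaults; 2 of them own a home and both repay
gini_child = 1 - 0.5 ** 2 - 0.5 ** 2                       # 0.5
gini_owner = 0.0                                           # 2 owners, no defaults
gini_non_owner = 1 - (3 / 4) ** 2 - (1 / 4) ** 2           # 0.375 (4 non-owners, 3 defaults)
weighted = (2 / 6) * gini_owner + (4 / 6) * gini_non_owner # 0.25
gain = gini_child - weighted                               # 0.25
print(gain)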
⑥. The final CART tree:
Pruning can mitigate overfitting to some extent.
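One concrete way to apply pruning with scikit-learn (not necessarily what the text above had in mind) is minimal cost-complexity pruning via the ccp_alpha parameter. A minimal sketch on synthetic data (the data here is made up purely for illustration):

import numpy as np
from sklearn import tree

# Synthetic two-feature data, for illustration only
rng = np.random.RandomState(0)
x_data = rng.rand(200, 2)
y_data = (x_data[:, 0] + 0.3 * rng.rand(200) > 0.5).astype(int)

# Enumerate the effective pruning strengths, then refit with a moderate one
path = tree.DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(x_data, y_data)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
model_pruned = tree.DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
model_pruned.fit(x_data, y_data)
print(model_pruned.get_n_leaves())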
Advantages:
①. Works well on small-scale datasets
Disadvantages:
①. Handles continuous variables poorly
②. When there are many classes, the error rate grows relatively quickly
③. Cannot handle large amounts of data
# Build a CART decision tree on the cart.csv data and export it with Graphviz
import numpy as np
from sklearn import tree
import graphviz
import csv

# Load the data and read the header row for the feature names
with open(r'D:\Data\cart.csv', 'r') as DTree:
    data_show = csv.reader(DTree)
    feature_names = next(data_show)
feature_names = np.array(feature_names)[1:-1]
print(feature_names)

# Class (label) names
class_names = np.array(['no', 'yes'])
print(class_names)

data = np.genfromtxt(r'D:\Data\cart.csv', delimiter=',')
# Feature data (drop the header row, the ID column, and the label column)
x_data = data[1:, 1:-1]
print(x_data)
# Labels
y_data = data[1:, -1]
print(y_data)

# Build the decision tree model
model_DecisionTree = tree.DecisionTreeClassifier()
model_DecisionTree.fit(x_data, y_data)

# Export the tree to Graphviz and render it
dot_data = tree.export_graphviz(model_DecisionTree, out_file=None,
                                feature_names=feature_names,
                                class_names=class_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph.render('cart')
# Fit a decision tree on the LR-testSet.csv data, report training accuracy, export the tree, and plot the decision regions
import numpy as np
from sklearn import tree
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
import graphviz

# Load the data
data = np.genfromtxt(r'D:\Data\LR-testSet.csv', delimiter=',')
x_data = data[:, :-1]
y_data = data[:, -1]

# Build the model
model_DecisionTree = tree.DecisionTreeClassifier()
model_DecisionTree.fit(x_data, y_data)
predictions = model_DecisionTree.predict(x_data)
print(classification_report(y_data, predictions))

# Export the tree to Graphviz
feature_names = np.array(['x', 'y'])
class_names = np.array(['label0', 'label1'])
dot_data = tree.export_graphviz(model_DecisionTree, out_file=None,
                                feature_names=feature_names,
                                class_names=class_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph.render('Linear_classification')

# Range covered by the data
x_min, x_max = x_data[:, 0].min() - 1, x_data[:, 0].max() + 1
y_min, y_max = x_data[:, 1].min() - 1, x_data[:, 1].max() + 1
# Build the grid of points to evaluate
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
z = model_DecisionTree.predict(np.c_[xx.ravel(), yy.ravel()])
z = z.reshape(xx.shape)

# Plot the decision regions and the samples
cs = plt.contourf(xx, yy, z)
plt.scatter(x_data[:, 0], x_data[:, 1], c=y_data)
plt.show()
# Same task with a train/test split and a depth-limited tree to reduce overfitting
import numpy as np
from sklearn import tree
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import graphviz

# Load the data
data = np.genfromtxt(r'D:\Data\LR-testSet2.txt', delimiter=',')
x_data = data[:, :-1]
y_data = data[:, -1]

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data)

# Build the decision tree model
# max_depth: maximum depth of the tree
# min_samples_split: minimum number of samples required to split an internal node
model_DecisionTree = tree.DecisionTreeClassifier(max_depth=7, min_samples_split=4)
model_DecisionTree.fit(x_train, y_train)
predictions = model_DecisionTree.predict(x_train)
print(classification_report(y_train, predictions))
predictions = model_DecisionTree.predict(x_test)
print(classification_report(y_test, predictions))

# Export the tree to Graphviz
feature_names = np.array(['x', 'y'])
class_names = np.array(['label0', 'label1'])
dot_data = tree.export_graphviz(model_DecisionTree, out_file=None,
                                feature_names=feature_names,
                                class_names=class_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph.render('Linear_classification')

# Range covered by the data
x_min, x_max = x_data[:, 0].min() - 1, x_data[:, 0].max() + 1
y_min, y_max = x_data[:, 1].min() - 1, x_data[:, 1].max() + 1
# Build the grid of points to evaluate
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
z = model_DecisionTree.predict(np.c_[xx.ravel(), yy.ravel()])
z = z.reshape(xx.shape)

# Plot the decision regions and the samples
cs = plt.contourf(xx, yy, z)
plt.scatter(x_data[:, 0], x_data[:, 1], c=y_data)
plt.show()