This chapter introduces decision trees, a very important supervised learning algorithm in machine learning. Decision trees come in several variants, such as CART (based on the Gini index; usable for classification or regression), ID3 (selects features by information gain), and C4.5 (similar to ID3, but selects features by information gain ratio). This chapter covers ID3.
Parts of this chapter are adapted from Machine Learning in Action.
Problem:
Suppose a data set S contains M elements; each element has attributes (features) X1, X2, ..., Xk and belongs to one of the classes C1, C2, ..., Cn. Given the attribute values (x1, x2, ..., xk) of some element outside S, determine which class that element belongs to.
Approach:
The idea behind decision trees is to first convert the data set, by some procedure, into a tree (the decision tree), then test the target element's attribute values x1, x2, ..., xk in the order the tree prescribes; the class at the leaf node we finally reach is the target element's class. Building the tree is relatively expensive, but classifying with it is very efficient.
Example:
Suppose that by some means we have obtained the following decision tree for deciding whether to go on a blind date:
Now, given a person A with attributes (fair, rich, beautiful), the tree yields the result "go", so the date is worth going on. Given another person B with attributes (not rich, fair, not beautiful), the tree yields "hesitate", meaning we should think it over. As you can see, a decision tree can determine an element's class very quickly.
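To make this lookup concrete, here is a minimal sketch (the tree shape and the attribute names `rich` and `pretty` are made up for illustration; this is not the construction code developed below):

```python
# A hand-written decision tree for the blind-date example, as a nested dict.
# Outer keys name the attribute tested at a node; inner keys are its values;
# anything that is not a dict is a leaf holding the final decision.
dating_tree = {'rich': {'yes': 'go',
                        'no': {'pretty': {'yes': 'go', 'no': 'hesitate'}}}}


def decide(tree, person):
    """Walk the tree along the person's attribute values until a leaf is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))               # attribute tested at this node
        tree = tree[attribute][person[attribute]]  # follow the branch for its value
    return tree


print(decide(dating_tree, {'rich': 'no', 'pretty': 'no'}))  # -> 'hesitate'
```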
In the ID3 algorithm, to build a reasonable decision tree we need the concepts of information, entropy, and information gain.
Information:
Suppose the data set S has n classes C1, C2, ..., Cn. The information of the i-th class is defined as:

$$l(C_i) = -\log_2 p(C_i)$$

where $p(C_i)$ is the probability with which class $C_i$ occurs among all the classes. From this formula, the larger the probability, the closer the information is to 0; the smaller the probability, the further above 0 it is. Note that information is defined on classes, not on the elements of the set.
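A quick illustration of the formula (the probabilities below are made-up values, not drawn from any data set):

```python
import math

# l(Ci) = -log2(p): the more probable a class, the closer its information is to 0
for p in (0.9, 0.5, 0.1):
    print("p = %.1f  information = %.3f" % (p, -math.log2(p)))
# p = 0.9  information = 0.152
# p = 0.5  information = 1.000
# p = 0.1  information = 3.322
```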
Entropy:
Entropy here is Shannon entropy, introduced by Claude Shannon in 1948. It measures how mixed the elements of a data set S are: the larger the entropy, the more mixed the data; the smaller, the more uniform. Entropy is defined as the expected value of information:

$$H(S) = \sum_{i=1}^{n} p(C_i)\, l(C_i) = -\sum_{i=1}^{n} p(C_i) \log_2 p(C_i)$$

Note that entropy is built on information, so it too is defined over the classes in a data set, not over its elements. For example, suppose S has 100 elements whose attribute values are all different, even wildly so, but which all belong to the same class Ci. The entropy of that data set is 0, meaning the elements are perfectly uniform with no disorder at all (they all share one class).
Information gain:
Information gain is defined as a reduction in entropy. Suppose the data set S initially has entropy $H_0$, and after some operation its entropy becomes $H_1$. The information gain of that operation is:

$$Gain = H_0 - H_1$$

If $Gain$ is positive, the operation reduces the disorder in S and makes its elements more uniform; if $Gain$ is negative, the operation makes the elements of S more mixed. Let us now compute the entropy of a data set in practice.
Create a Python module entropy.py with the following code:
```python
import math


def cal_entropy(data_set):
    """
    Calculate the entropy of a data set. The entropy depends only on the last
    value of each element, which is the element's category (label), no matter
    how many attributes the elements have.
    :param data_set: data set
    :return: entropy of the data set
    """
    labels = {}
    for data in data_set:
        label = data[-1]
        if label not in labels:
            labels[label] = 0
        labels[label] += 1
    entropy = 0.0
    for label in labels:
        prob = float(labels[label]) / len(data_set)
        entropy += -prob * math.log2(prob)
    return entropy


if __name__ == '__main__':
    data_set = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
    entropy = cal_entropy(data_set)
    print(entropy)

    data_set = [[1, 1, 'yes'], [1, 1, 'not sure'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
    entropy = cal_entropy(data_set)
    print(entropy)
```
Output:
```
D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/decision_tree/entropy.py
0.9709505944546686
1.3709505944546687

Process finished with exit code 0
```
As you can see, the elements of data set 1 are more uniform than those of data set 2. Intuitively, data set 1 has only two classes, yes and no, while data set 2 adds a third class, not sure, so data set 2 is more mixed.
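We can also verify these numbers by hand. Data set 1 has 2 'yes' and 3 'no' labels out of 5 elements, so

$$H_1 = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.5288 + 0.4422 = 0.9710$$

Data set 2 has classes with probabilities 1/5 (yes), 1/5 (not sure), and 3/5 (no), so

$$H_2 = -2\cdot\tfrac{1}{5}\log_2\tfrac{1}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.9288 + 0.4422 = 1.3710$$

Both match the program output.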
The elements of a data set have many attributes (features), and we must decide the order in which to split on them. Once that order is known, building the final decision tree is easy. To get there, we first need a way to split a data set.
Let the original data set be S with attributes X1, X2, ..., Xk. Splitting S on value v of the i-th attribute produces the subset Si: the set of all elements of S whose i-th attribute equals v, with the i-th attribute itself removed from each element (dropping an attribute at each step helps the recursion converge).
Example:
Create a module split_data_set.py with the following code:
```python
def split_data_set(data_set, axis, value):
    """
    Split the data set on the given axis and value. Return only the elements
    whose value at the axis index equals the given value, with the value at
    the axis index removed from each returned element.
    :param data_set: data set
    :param axis: axis index
    :param value: the value to match
    :return: sub data set
    """
    return_data_set = []
    for data in data_set:
        if data[axis] == value:
            reduced_data_set = data[:axis]
            reduced_data_set.extend(data[axis + 1:])
            return_data_set.append(reduced_data_set)
    return return_data_set


if __name__ == '__main__':
    data_set = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
    print("Original data set: %r" % data_set)
    result = split_data_set(data_set, 0, 1)
    print("Split index 0, value 1: %r" % result)
    result = split_data_set(data_set, 1, 0)
    print("Split index 1, value 0: %r" % result)
```
Output:
```
D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/decision_tree/split_data_set.py
Original data set: [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
Split index 0, value 1: [[1, 'yes'], [1, 'yes'], [0, 'no']]
Split index 1, value 0: [[1, 'no']]

Process finished with exit code 0
```
Now that we know how to split a data set, which attribute gives the best split? We should prefer the attribute that yields the largest information gain, because the resulting subsets will be as uniform as possible, which keeps the decision tree shallow and simple.
Suppose the data set S has n classes C1, C2, ..., Cn, and that splitting on the i-th attribute, whose values are $v_{i1}, v_{i2}, ..., v_{im}$, produces m subsets $S_{i1}, S_{i2}, ..., S_{im}$. Using the entropy formula above, these subsets have entropies $H_{i1}, H_{i2}, ..., H_{im}$, and the combined entropy of the m subsets is:

$$H_i = \sum_{j=1}^{m} p(S_{ij})\, H_{ij}$$

where $p(S_{ij}) = |S_{ij}| / |S|$ is the proportion of the subset $S_{ij}$ relative to the original set S, i.e., the number of elements in $S_{ij}$ divided by the number of elements in S.
Because the elements of each $S_{ij}$ no longer carry the i-th attribute, this entropy tells us how uniform the data is with respect to the remaining attributes. We compute the post-split entropy $H_i$ for each attribute in turn and choose the attribute with the largest information gain $H(S) - H_i$ as the best split.
Example:
Create a module best_feature.py with the following code:
```python
import decision_tree.entropy as entropy
import decision_tree.split_data_set as sp


def choose_best_feature_to_split(data_set):
    feature_size = len(data_set[0]) - 1
    base_entropy = entropy.cal_entropy(data_set)
    best_info_gain = 0.0
    best_feature = -1
    for feature_index in range(feature_size):
        # All feature values on this axis, collected into a set so duplicates are removed
        feature_set = set([element[feature_index] for element in data_set])
        feature_entropy = 0.0
        for feature_value in feature_set:
            sub_data_set = sp.split_data_set(data_set, feature_index, feature_value)
            prob = len(sub_data_set) / float(len(data_set))
            feature_entropy += prob * entropy.cal_entropy(sub_data_set)
        info_gain = base_entropy - feature_entropy
        # print("Info gain on feature %r is %r" % (feature_index, info_gain))
        if info_gain > best_info_gain:
            best_info_gain = info_gain
            best_feature = feature_index
    return best_feature


if __name__ == "__main__":
    data_set = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
    best_feature = choose_best_feature_to_split(data_set)
    print("The best feature is: %r" % best_feature)
```

Output:
```
D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/decision_tree/best_feature.py
The best feature is: 0

Process finished with exit code 0
```
The output is 0, meaning that for this data set, splitting on the first attribute is the best split.
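As a quick hand check (not part of the module): splitting on the first attribute produces the subsets [[1, 'yes'], [1, 'yes'], [0, 'no']] with entropy ≈ 0.9183 and weight 3/5, and [[1, 'no'], [1, 'no']] with entropy 0 and weight 2/5, so its information gain is $0.9710 - \tfrac{3}{5} \cdot 0.9183 \approx 0.420$. Splitting on the second attribute produces subsets with entropies 1.0 (weight 4/5) and 0 (weight 1/5), for a gain of only $0.9710 - 0.8 \approx 0.171$, so the first attribute wins.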
We can recursively find the best split attribute for a given set, split the set into subsets on that attribute, then find the best split attribute within each subset, and so on. The recursion stops when either of two conditions holds: 1) all elements in the subset belong to the same class, or 2) all attributes have been used up, in which case, if several classes remain, the majority class is taken as the final class.
Example:
Create a module create_tree.py with the following code:
```python
import operator

import decision_tree.best_feature as bf
import decision_tree.split_data_set as sp


def majority_count(class_list):
    class_count = {}
    for value in class_list:
        if value not in class_count:
            class_count[value] = 0
        class_count[value] += 1
    sorted_class_count = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_class_count[0][0]


def create_decision_tree(data_set, labels):
    class_list = [element[-1] for element in data_set]
    if class_list.count(class_list[0]) == len(class_list):
        return class_list[0]
    if len(data_set[0]) == 1:
        return majority_count(class_list)
    best_feature = bf.choose_best_feature_to_split(data_set)
    best_feature_label = labels[best_feature]
    tree = {best_feature_label: {}}
    del labels[best_feature]
    feature_set = set([element[best_feature] for element in data_set])
    for feature in feature_set:
        sub_labels = labels[:]
        sub_data_set = sp.split_data_set(data_set, best_feature, feature)
        tree[best_feature_label][feature] = create_decision_tree(sub_data_set, sub_labels)
    return tree


if __name__ == "__main__":
    data_set = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
    labels = ['A', 'B']
    tree = create_decision_tree(data_set, labels)
    print(tree)
```

Note: A is the name of the first attribute and B is the name of the second.
Output:
```
D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/decision_tree/create_tree.py
{'A': {0: 'no', 1: {'B': {0: 'no', 1: 'yes'}}}}

Process finished with exit code 0
```
The decision tree is output as a nested dict whose keys alternate between attribute labels and the values of those attributes. Note that this is an ordinary tree, not a binary tree, because any attribute may take more than two values.
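Because the tree is an ordinary nested dict, you can also follow it by hand, for example in a Python shell:

```python
tree = {'A': {0: 'no', 1: {'B': {0: 'no', 1: 'yes'}}}}
print(tree['A'][0])          # 'no'  -- when A is 0 we hit a leaf immediately
print(tree['A'][1]['B'][1])  # 'yes' -- when A is 1 we go on to test B
```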
We can draw the decision tree with the matplotlib library to make it easier to understand at a glance.
Example:
For the decision tree above, {'A': {0: 'no', 1: {'B': {0: 'no', 1: 'yes'}}}}, let us draw its graph. Create a module plot_tree.py with the following code:
```python
import matplotlib.pyplot as plt

decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")


def get_num_leafs(tree):
    num_leafs = 0
    first_str = list(tree.keys())[0]
    second_dict = tree[first_str]
    for key in second_dict.keys():
        # If the node is a dictionary it is an internal node, otherwise a leaf
        if type(second_dict[key]).__name__ == 'dict':
            num_leafs += get_num_leafs(second_dict[key])
        else:
            num_leafs += 1
    return num_leafs


def get_tree_depth(tree):
    max_depth = 0
    first_str = list(tree.keys())[0]
    second_dict = tree[first_str]
    for key in second_dict.keys():
        # If the node is a dictionary it is an internal node, otherwise a leaf
        if type(second_dict[key]).__name__ == 'dict':
            this_depth = 1 + get_tree_depth(second_dict[key])
        else:
            this_depth = 1
        if this_depth > max_depth:
            max_depth = this_depth
    return max_depth


def plot_node(node_txt, center_pt, parent_pt, node_type):
    create_plot.ax1.annotate(node_txt, xy=parent_pt, xycoords='axes fraction',
                             xytext=center_pt, textcoords='axes fraction',
                             va="center", ha="center", bbox=node_type, arrowprops=arrow_args)


def plot_mid_text(center_pt, parent_pt, txt_str):
    # Put the branch label halfway between the parent and child nodes
    x_mid = (parent_pt[0] - center_pt[0]) / 2.0 + center_pt[0]
    y_mid = (parent_pt[1] - center_pt[1]) / 2.0 + center_pt[1]
    create_plot.ax1.text(x_mid, y_mid, txt_str, va="center", ha="center", rotation=30)


def plot_tree(tree, parent_pt, node_txt):
    num_leafs = get_num_leafs(tree)  # determines the x width of this tree
    depth = get_tree_depth(tree)
    first_str = list(tree.keys())[0]  # the text label of this node
    cntr_pt = (plot_tree.xOff + (1.0 + float(num_leafs)) / 2.0 / plot_tree.totalW, plot_tree.yOff)
    plot_mid_text(cntr_pt, parent_pt, node_txt)
    plot_node(first_str, cntr_pt, parent_pt, decisionNode)
    second_dict = tree[first_str]
    plot_tree.yOff = plot_tree.yOff - 1.0 / plot_tree.totalD
    for key in second_dict.keys():
        if type(second_dict[key]).__name__ == 'dict':
            plot_tree(second_dict[key], cntr_pt, str(key))  # recurse into the subtree
        else:  # it's a leaf node, plot it
            plot_tree.xOff = plot_tree.xOff + 1.0 / plot_tree.totalW
            plot_node(second_dict[key], (plot_tree.xOff, plot_tree.yOff), cntr_pt, leafNode)
            plot_mid_text((plot_tree.xOff, plot_tree.yOff), cntr_pt, str(key))
    plot_tree.yOff = plot_tree.yOff + 1.0 / plot_tree.totalD


def create_plot(tree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    create_plot.ax1 = plt.subplot(111, frameon=False, **axprops)  # no ticks
    plot_tree.totalW = float(get_num_leafs(tree))
    plot_tree.totalD = float(get_tree_depth(tree))
    plot_tree.xOff = -0.5 / plot_tree.totalW
    plot_tree.yOff = 1.0
    plot_tree(tree, (0.5, 1.0), '')
    plt.show()


def retrieve_tree(i):
    list_of_tree = [{'A': {0: 'no', 1: {'B': {0: 'no', 1: 'yes'}}}},
                    {'A': {0: 'no', 1: {'B': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}
                    ]
    return list_of_tree[i]


if __name__ == '__main__':
    tree = retrieve_tree(0)
    print(tree)
    create_plot(tree)
```
Output:
```
D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/decision_tree/plot_tree.py
{'A': {0: 'no', 1: {'B': {0: 'no', 1: 'yes'}}}}
```
In addition to the console output, a matplotlib window opens showing the rendered tree (the figure is not reproduced here).
Once the decision tree is built, we want to use it to classify new data. The method is to walk the tree's nodes according to the given data's attribute values until a class is reached.
Example:
Create a module classify.py with the following code:
```python
import decision_tree.create_tree as ct


def classify(tree, feature_labels, test_vec):
    first_str = list(tree.keys())[0]
    second_dict = tree[first_str]
    feat_index = feature_labels.index(first_str)
    class_label = None  # stays None if no branch matches the test value
    for value in second_dict.keys():
        if test_vec[feat_index] == value:
            if type(second_dict[value]).__name__ == 'dict':
                class_label = classify(second_dict[value], feature_labels, test_vec)
            else:
                class_label = second_dict[value]
    return class_label


if __name__ == '__main__':
    data_set = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
    labels = ['A', 'B']
    tree = ct.create_decision_tree(data_set, ['A', 'B'])
    print(tree)
    obj_label = classify(tree, labels, [1, 0])
    print("The label of [1, 0] is %r" % obj_label)
    obj_label = classify(tree, labels, [1, 1])
    print("The label of [1, 1] is %r" % obj_label)
    obj_label = classify(tree, labels, [0, 1])
    print("The label of [0, 1] is %r" % obj_label)
```

Output:
```
D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/decision_tree/classify.py
{'A': {0: 'no', 1: {'B': {0: 'no', 1: 'yes'}}}}
The label of [1, 0] is 'no'
The label of [1, 1] is 'yes'
The label of [0, 1] is 'no'

Process finished with exit code 0
```
Building a decision tree often takes quite a while, so we can save a built tree to disk and load it when needed, avoiding rebuilding it every time. Python's built-in pickle module makes this easy.
Example:
Create a module store.py with the following code:
```python
import pickle

import decision_tree.classify as cl
import decision_tree.create_tree as ct


def store_tree(tree, file_path):
    with open(file_path, 'wb') as f:
        pickle.dump(tree, f)


def restore_tree(file_path):
    with open(file_path, 'rb') as f:
        return pickle.load(f)


if __name__ == '__main__':
    data_set = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
    tree = ct.create_decision_tree(data_set, ['A', 'B'])
    print(tree)
    store_tree(tree, './my_tree.data')

    tree = restore_tree('./my_tree.data')
    print(tree)

    obj_label = cl.classify(tree, ['A', 'B'], [1, 1])
    print("The label of [1, 1] is %r" % obj_label)
```

Output:
```
D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/decision_tree/store.py
{'A': {0: 'no', 1: {'B': {0: 'no', 1: 'yes'}}}}
{'A': {0: 'no', 1: {'B': {0: 'no', 1: 'yes'}}}}
The label of [1, 1] is 'yes'

Process finished with exit code 0
```
As you can see, the built decision tree is stored in the file my_tree.data and can then be loaded back from that file.
The decision trees discussed so far are all based on ID3. Its strength is that it is relatively simple and easy to implement. Its weaknesses are that it tends to overfit when attributes take many values and it has no way to prune unnecessary leaf nodes. In addition, ID3 can only split nominal (categorical) data; it cannot handle continuous numeric data.
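One common workaround for the last limitation is to discretize continuous values into nominal bins before building the tree; C4.5 automates this idea by choosing split thresholds from the data. A minimal sketch (the thresholds and labels below are made up for illustration):

```python
def discretize(value, thresholds, labels):
    """Map a continuous value to a nominal label; thresholds must be sorted
    ascending and labels must contain exactly one more entry than thresholds."""
    for threshold, label in zip(thresholds, labels):
        if value < threshold:
            return label
    return labels[-1]


# e.g. turn a monthly income figure into 'low' / 'medium' / 'high'
print(discretize(4200.0, [3000, 8000], ['low', 'medium', 'high']))  # 'medium'
```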
Finally, let us pull everything together with a practical case: using a decision tree to predict whether a patient should be fitted with contact lenses, and if so, of what material.
Attributes and values of the data set:

| Attribute | Values |
| --- | --- |
| age | young, pre (pre-presbyopic), presbyopic |
| prescript (prescription) | myope (nearsighted), hyper (farsighted) |
| astigmatic (astigmatism) | yes, no |
| tearRate (tear production rate) | normal, reduced |

Note: attribute names and values are written exactly as they appear in the data file below; the parenthesized text explains them.
Classes of the data set:

| Class | Meaning |
| --- | --- |
| no lenses | the patient should not wear contact lenses |
| soft | fit soft-material contact lenses |
| hard | fit hard-material contact lenses |
The data set is saved to the file lenses.txt; the tab-separated columns are age, prescription, astigmatism, tear rate, and class:
```
young myope no reduced no lenses
young myope no normal soft
young myope yes reduced no lenses
young myope yes normal hard
young hyper no reduced no lenses
young hyper no normal soft
young hyper yes reduced no lenses
young hyper yes normal hard
pre myope no reduced no lenses
pre myope no normal soft
pre myope yes reduced no lenses
pre myope yes normal hard
pre hyper no reduced no lenses
pre hyper no normal soft
pre hyper yes reduced no lenses
pre hyper yes normal no lenses
presbyopic myope no reduced no lenses
presbyopic myope no normal no lenses
presbyopic myope yes reduced no lenses
presbyopic myope yes normal hard
presbyopic hyper no reduced no lenses
presbyopic hyper no normal soft
presbyopic hyper yes reduced no lenses
presbyopic hyper yes normal no lenses
```

Create a module lenses.py with the following code:
```python
import decision_tree.create_tree as ct
import decision_tree.plot_tree as pt
import decision_tree.classify as cf


def get_tree():
    with open('./lenses.txt') as f:
        lenses = [inst.strip().split('\t') for inst in f.readlines()]
    tree = ct.create_decision_tree(lenses, get_lense_labels())
    return tree


def get_lense_labels():
    labels = ['age', 'prescript', 'astigmatic', 'tearRate']
    return labels


def plot_tree():
    tree = get_tree()
    pt.create_plot(tree)


def test_classify():
    tree = get_tree()
    print(tree)
    labels = get_lense_labels()
    test_data = ['young', 'hyper', 'yes', 'normal']
    result = cf.classify(tree, labels, test_data)
    print("The label of data %r is %r" % (test_data, result))


if __name__ == '__main__':
    test_classify()
    plot_tree()
```

Output:
```
D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/decision_tree/lenses.py
{'tearRate': {'normal': {'astigmatic': {'no': {'age': {'pre': 'soft', 'presbyopic': {'prescript': {'myope': 'no lenses', 'hyper': 'soft'}}, 'young': 'soft'}}, 'yes': {'prescript': {'myope': 'hard', 'hyper': {'age': {'pre': 'no lenses', 'presbyopic': 'no lenses', 'young': 'hard'}}}}}}, 'reduced': 'no lenses'}}
The label of data ['young', 'hyper', 'yes', 'normal'] is 'hard'
```
As you can see, the tree correctly decides whether a patient should be fitted with contact lenses, and which lens material to use.