Decision trees are among the most widely used data mining algorithms. At their core is a greedy algorithm that builds the tree recursively, top down.
Commonly used decision tree algorithms include ID3, its improved successors C4.5 and C5.0, and CART.
The core idea of ID3 is to use information gain as the criterion for choosing which attribute to test at each node of the tree, so that every test at a non-leaf node extracts the maximum amount of class information about the records being split.
Characteristics of ID3
Advantages: the theory is clear, the method is simple, and the learning ability is fairly strong.
Disadvantages:
(1) Information gain is biased toward attributes with many distinct values (illustrated by the sketch below).
(2) ID3 is a non-incremental algorithm.
(3) ID3 builds univariate decision trees.
(4) It is not robust to noise.
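To see drawback (1) concretely: an attribute that takes a distinct value for every sample (e.g. a record ID) splits the set into one-sample subsets with zero entropy, so its information gain is maximal even though it has no predictive value. A minimal sketch, using small helper functions of my own (`entropy`, `info_gain`) rather than the implementation given later:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Information gain of splitting rows on the attribute at attr_index."""
    n = len(rows)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Attribute 0 is a useless record ID (unique per sample); attribute 1 is a weak but real signal.
rows   = [[0, 'a'], [1, 'a'], [2, 'a'], [3, 'b'], [4, 'b'], [5, 'b']]
labels = ['yes', 'yes', 'no', 'no', 'no', 'yes']

print(info_gain(rows, labels, 0))   # 1.0   -- maximal gain for the meaningless ID attribute
print(info_gain(rows, labels, 1))   # ~0.082 -- the genuinely informative attribute scores lower
```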
Entropy and Information Gain
Let S be the training set, containing samples from n classes denoted $C_i$. Entropy and information gain are then defined by the following formulas.
Information entropy:

$$E(S) = - \sum\limits_{i = 1}^{n} p_i \log_2 p_i$$

where $p_i$ is the probability (proportion) of class $C_i$ in S.
Entropy after splitting on attribute A (sample entropy):

$$E_A(S) = \sum\limits_{j = 1}^{m} \frac{|S_j|}{|S|} E(S_j)$$

where $S_j$ is the j-th subset obtained by partitioning S on attribute A, and $|S|$ and $|S_j|$ denote the numbers of samples in S and $S_j$.
Information gain:

$$Gain(S,A) = E(S) - E_A(S)$$
In ID3, the more evenly the samples are distributed over the classes, the larger the information entropy. The selection rule is therefore that the entropy after the split should be as small as possible, i.e. the information gain should be as large as possible.
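For a two-class set this is easy to check numerically: the entropy is 1 bit at a 50/50 split and drops toward 0 as one class dominates. A quick illustrative snippet (the helper below is written only for this check):

```python
from math import log2

def two_class_entropy(p):
    """Entropy of a two-class distribution (p, 1 - p)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.5, 0.7, 0.9, 1.0):
    print(p, round(two_class_entropy(p), 4))   # 1.0, 0.8813, 0.469, 0.0
```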
Worked Example of the Algorithm
outlook | tem | hum | windy | play |
---|---|---|---|---|
overcast | hot | high | not | no |
overcast | hot | high | very | no |
overcast | hot | high | medium | no |
sunny | hot | high | not | yes |
sunny | hot | high | medium | yes |
rain | mild | high | not | no |
rain | mild | high | medium | no |
rain | hot | normal | not | yes |
rain | cool | normal | medium | no |
rain | hot | normal | very | no |
sunny | cool | normal | very | yes |
sunny | cool | normal | medium | yes |
overcast | mild | high | not | no |
overcast | mild | high | medium | no |
overcast | cool | normal | not | yes |
overcast | cool | normal | medium | yes |
rain | mild | normal | not | no |
rain | mild | normal | medium | no |
overcast | mild | normal | medium | yes |
overcast | hot | normal | very | yes |
sunny | mild | high | very | yes |
sunny | mild | high | medium | yes |
sunny | hot | normal | not | yes |
rain | mild | high | very | no |
In the table above, 12 records have play = yes and 12 have play = no, so by the formula above the entropy of the training set is:
$$E(S) = - \frac{12}{24}\log_2\frac{12}{24} - \frac{12}{24}\log_2\frac{12}{24} = 1$$
Next, compute the entropy after splitting and the corresponding information gain for each of the attributes outlook, tem, hum, and windy.
$$E(S,outlook) = \frac{7}{24}\left( - \frac{7}{7}\log_2\frac{7}{7} - 0\right) + \frac{9}{24}\left( - \frac{4}{9}\log_2\frac{4}{9} - \frac{5}{9}\log_2\frac{5}{9}\right) + \frac{8}{24}\left( - \frac{1}{8}\log_2\frac{1}{8} - \frac{7}{8}\log_2\frac{7}{8}\right) = 0.5528$$
$$Gain(S,outlook) = 1 - 0.5528 = 0.4472$$
Similarly:
$$E(S,tem) = 0.8893,\qquad Gain(S,tem) = 0.1107$$

$$E(S,hum) = 0.9183,\qquad Gain(S,hum) = 0.0817$$

$$E(S,windy) = 1,\qquad Gain(S,windy) = 0$$
Outlook has the largest information gain, so it is chosen as the test attribute of the root node, while windy has an information gain of 0 and therefore contributes no classification information. This yields the first split of the tree; applying the same procedure recursively to each branch produces the final decision tree.
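These hand-computed values can be double-checked with a short script that recomputes E(S, A) and Gain(S, A) for every attribute straight from the table (the entropy helper below is written just for this check; the complete from-scratch ID3 implementation then follows):

```python
from collections import Counter
from math import log2

# the 24 rows of the table: (outlook, tem, hum, windy, play)
rows = [
    ('overcast', 'hot',  'high',   'not',    'no'),  ('overcast', 'hot',  'high',   'very',   'no'),
    ('overcast', 'hot',  'high',   'medium', 'no'),  ('sunny',    'hot',  'high',   'not',    'yes'),
    ('sunny',    'hot',  'high',   'medium', 'yes'), ('rain',     'mild', 'high',   'not',    'no'),
    ('rain',     'mild', 'high',   'medium', 'no'),  ('rain',     'hot',  'normal', 'not',    'yes'),
    ('rain',     'cool', 'normal', 'medium', 'no'),  ('rain',     'hot',  'normal', 'very',   'no'),
    ('sunny',    'cool', 'normal', 'very',   'yes'), ('sunny',    'cool', 'normal', 'medium', 'yes'),
    ('overcast', 'mild', 'high',   'not',    'no'),  ('overcast', 'mild', 'high',   'medium', 'no'),
    ('overcast', 'cool', 'normal', 'not',    'yes'), ('overcast', 'cool', 'normal', 'medium', 'yes'),
    ('rain',     'mild', 'normal', 'not',    'no'),  ('rain',     'mild', 'normal', 'medium', 'no'),
    ('overcast', 'mild', 'normal', 'medium', 'yes'), ('overcast', 'hot',  'normal', 'very',   'yes'),
    ('sunny',    'mild', 'high',   'very',   'yes'), ('sunny',    'mild', 'high',   'medium', 'yes'),
    ('sunny',    'hot',  'normal', 'not',    'yes'), ('rain',     'mild', 'high',   'very',   'no'),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

labels = [r[-1] for r in rows]
print('E(S) =', entropy(labels))                      # 1.0

for i, name in enumerate(['outlook', 'tem', 'hum', 'windy']):
    subsets = {}
    for r in rows:
        subsets.setdefault(r[i], []).append(r[-1])    # class labels grouped by the attribute value
    e_a = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    print(name, round(e_a, 4), round(entropy(labels) - e_a, 4))
# outlook 0.5528 0.4472, tem 0.8893 0.1107, hum 0.9183 0.0817, windy 1.0 0.0
```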
```python
from math import log     # needed by calcShannonEnt below
import operator          # needed by majorityCnt below

def createDataSet():
    # outlook: 0 rain, 1 overcast, 2 sunny
    # tem:     0 cool, 1 mild, 2 hot
    # hum:     0 normal, 1 high
    # windy:   0 not, 1 medium, 2 very
    dataSet = [[1, 2, 1, 0, 'no'],  [1, 2, 1, 2, 'no'],  [1, 2, 1, 1, 'no'],
               [2, 2, 1, 0, 'yes'], [2, 2, 1, 1, 'yes'], [0, 1, 1, 0, 'no'],
               [0, 1, 1, 1, 'no'],  [0, 2, 0, 0, 'yes'], [0, 0, 0, 1, 'no'],
               [0, 2, 0, 2, 'no'],  [2, 0, 0, 2, 'yes'], [2, 0, 0, 1, 'yes'],
               [1, 1, 1, 0, 'no'],  [1, 1, 1, 1, 'no'],  [1, 0, 0, 0, 'yes'],
               [1, 0, 0, 1, 'yes'], [0, 1, 0, 0, 'no'],  [0, 1, 0, 1, 'no'],
               [1, 1, 0, 1, 'yes'], [1, 2, 0, 2, 'yes'], [2, 1, 1, 2, 'yes'],
               [2, 1, 1, 1, 'yes'], [2, 2, 0, 0, 'yes'], [0, 1, 1, 2, 'no']]
    return dataSet
```
```python
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)                 # number of samples
    labelCounts = {}
    for featVec in dataSet:
        currentLable = featVec[-1]            # the last column is the class label
        if currentLable not in labelCounts.keys():
            labelCounts[currentLable] = 0     # count how often each label occurs
        labelCounts[currentLable] += 1
    # compute the entropy
    Ent = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        Ent -= prob * log(prob, 2)
    # print("entropy:", Ent)
    return Ent
```
Splitting the data set on a given feature
```python
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:               # keep rows whose axis-th feature equals value
            reducedFeatVec = featVec[:axis]      # ... and drop that feature column
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet                            # the reduced rows of the matching subset
```
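As a quick usage check (assuming the createDataSet defined above): splitting the full data set on feature 0 (outlook) with value 2 (sunny) should return the seven sunny rows with the outlook column removed.

```python
sunny = splitDataSet(createDataSet(), 0, 2)   # rows with outlook == 2 (sunny)
print(len(sunny))    # 7
print(sunny[0])      # [2, 1, 0, 'yes'] -- tem, hum, windy, play; the outlook column is gone
```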
Choosing the best way to split the data set
```python
# Choose the best split according to Shannon entropy: the lower the entropy of the
# class labels after splitting on an attribute, the better that attribute is.
def chooseBestFeatureToSplit(dataSet):
    baseEntropy = calcShannonEnt(dataSet)        # entropy of the whole data set
    numFeatures = len(dataSet[0]) - 1
    bestInfoGain = 0.0                           # largest information gain so far
    bestFeature = 0                              # best feature so far
    for i in range(0, numFeatures):
        featList = [example[i] for example in dataSet]   # the i-th feature of every row
        uniqueVals = set(featList)
        newEntorpy = 0.0
        for value in uniqueVals:                 # split on feature i and compute the weighted entropy
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntorpy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntorpy      # smaller entropy after the split means larger gain
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
```
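A quick sanity check against the hand calculation above (again assuming the functions defined so far): on the full data set the best feature should be index 0, i.e. outlook.

```python
print(chooseBestFeatureToSplit(createDataSet()))   # 0, i.e. the outlook attribute
```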
```python
# If every attribute has been used but the class labels in a leaf are still not unique,
# decide the label of that leaf by majority vote.
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet, labels):                      # build the tree
    classList = [example[-1] for example in dataSet]  # class labels of the samples
    if classList.count(classList[0]) == len(classList):
        return classList[0]                           # all samples agree: this leaf is done
    if len(dataSet[0]) == 1:
        # only the class-label column is left, so every attribute has been used;
        # label this leaf by majority vote
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)      # best split according to Shannon entropy
    bestFeatLabel = labels[bestFeat]                  # remember the attribute name
    myTree = {bestFeatLabel: {}}                      # the tree, stored as nested dicts
    del(labels[bestFeat])                             # remove the attribute from the label list
    # grow the tree on the best attribute
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        subDataSet = splitDataSet(dataSet, bestFeat, value)
        myTree[bestFeatLabel][value] = createTree(subDataSet, subLabels)
    print("myTree:", myTree)
    return myTree
```
```python
# Testing the algorithm: classify a new sample with the decision tree
def classify(inputTree, featLabels, testVec):   # arguments: tree, attribute labels, sample to classify
    firstStr = list(inputTree.keys())[0]        # attribute tested at the root of this (sub)tree
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)      # position of that attribute in the label list
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':
                classLabel = classify(secondDict[key], featLabels, testVec)   # descend into the subtree
            else:
                classLabel = secondDict[key]    # reached a leaf: this is the predicted class
    return classLabel
```
```python
if __name__ == '__main__':
    dataSet = createDataSet()
    labels = ['outlook', 'tem', 'hum', 'windy']
    labelsForCreateTree = labels[:]   # createTree deletes entries from its label list, so pass a copy
    Tree = createTree(dataSet, labelsForCreateTree)
    testvec = [2, 2, 1, 0]
    print(classify(Tree, labels, testvec))
```
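With the feature encoding given in createDataSet, the test vector [2, 2, 1, 0] corresponds to (sunny, hot, high, not); that combination appears in the table with play = yes, so the script should print yes.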
This produces the final decision tree, which createTree also prints as it is built.
Decision Trees with sklearn
A decision tree can be used for both classification and regression; the two are provided by the two library classes imported below:
```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier   # classification
from sklearn.tree import DecisionTreeRegressor    # regression

iris = datasets.load_iris()                       # load the iris data set
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
ans = clf.predict(X_test)

# compute the accuracy
cnt = 0
for i in range(len(y_test)):
    if abs(ans[i] - y_test[i]) < 1e-1:            # labels are integers, so this just tests equality
        cnt += 1
print("accuracy: ", (cnt * 100.0 / len(y_test)), "%")
```
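The snippet above only exercises the classifier; DecisionTreeRegressor is used the same way. As a minimal regression sketch (using sklearn's diabetes data set and mean squared error, chosen here purely for illustration):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

diabetes = datasets.load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.3, random_state=42)

reg = DecisionTreeRegressor(max_depth=3)   # limit depth to reduce overfitting
reg.fit(X_train, y_train)
pred = reg.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
```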