Information gain is the splitting criterion at each node of the decision tree. Since the training set involves no continuous-valued features, this article uses the ID3 algorithm.
With information gain, the measure is how much information a feature brings to the classification system: the more information it brings, the more important the feature. For any feature, the amount of information in the system changes between having and not having that feature, and the difference between the two amounts is the information the feature contributes. This amount of information is the entropy. Suppose a variable X has n possible values, the i-th taken with probability $p_i$; then the entropy of X is defined as

$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$
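As a quick worked example (not from the original text): a fair coin with $p_1 = p_2 = 0.5$ has entropy $H = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1$ bit, while a biased coin with $p_1 = 0.9$, $p_2 = 0.1$ has $H \approx 0.469$ bits. The more predictable the variable, the lower its entropy.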
(1) Computing the information entropy of the root node
from math import log

def CalRootEntropy(dataset):  # compute the entropy of the root node
    Num = len(dataset)
    labels = {}                        # counts of each class label
    for i in dataset:
        currentlabel = i[-1]           # the class label is the last column
        if currentlabel not in labels.keys():
            labels[currentlabel] = 0
        labels[currentlabel] += 1
    RootEntropy = 0.0
    for i in labels:
        Cali = float(labels[i]) / Num  # empirical probability of class i
        RootEntropy -= Cali * log(Cali, 2)
    return RootEntropy
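As a sanity check, here is a minimal sketch on a hypothetical toy dataset (the rows and labels below are invented for illustration; the last column is the class). With 2 'yes' against 3 'no', the entropy should come out to about 0.971 bits:

toy = [[1, 1, 'yes'],
       [1, 1, 'yes'],
       [1, 0, 'no'],
       [0, 1, 'no'],
       [0, 1, 'no']]
print(CalRootEntropy(toy))  # ~0.971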
(2) Splitting the dataset on a feature value, with the result stored in a list
def splitDataset(dataset, axis, value):
    # collect the rows whose feature `axis` equals `value`,
    # with that feature column removed
    ReturnList = []
    for i in dataset:
        if i[axis] == value:
            ReturnSingle = i[:axis]
            ReturnSingle.extend(i[axis+1:])
            ReturnList.append(ReturnSingle)  # append the reduced row
    return ReturnList
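A hedged usage sketch reusing the hypothetical toy dataset above: splitting on feature 0 with value 1 keeps the three matching rows and drops that column.

print(splitDataset(toy, 0, 1))
# [[1, 'yes'], [1, 'yes'], [0, 'no']]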
(3) Choosing the feature for a node's branches
def choose(dataset):
    # pick the feature with the largest information gain
    NumOfFeature = len(dataset[0]) - 1    # the last column is the class label
    RootEntropy = CalRootEntropy(dataset)
    BestP = 0.0                           # best information gain so far
    bestfeature = -1
    for i in range(NumOfFeature):
        ListLabel = [L[i] for L in dataset]
        UniqLabel = set(ListLabel)
        newEntropy = 0.0
        for j in UniqLabel:
            SD = splitDataset(dataset, i, j)
            prob = len(SD) / float(len(dataset))
            newEntropy += prob * CalRootEntropy(SD)  # weighted entropy after splitting on i
        BestHA = RootEntropy - newEntropy            # information gain of feature i
        if BestHA > BestP:
            BestP = BestHA
            bestfeature = i
    return bestfeature
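Continuing on the same hypothetical toy data: by the entropy formula above, feature 0 has a gain of about 0.420 against about 0.171 for feature 1, so choose should return index 0.

print(choose(toy))  # 0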
(4) Majority vote at leaf nodes
import operator

def major(classlist):
    # return the class that appears most often in classlist
    classcount = {}
    for i in classlist:
        if i not in classcount.keys():
            classcount[i] = 0
        classcount[i] += 1
    sortedClass = sorted(classcount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClass[0][0]
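A quick check on a made-up class list:

print(major(['yes', 'no', 'no', 'no', 'yes']))  # 'no'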
(5) Building the final decision tree
This code is the most involved part, as it uses nested dictionaries and recursion.
def createTree(dataset, labels):
    classList = [i[-1] for i in dataset]
    if classList.count(classList[0]) == len(classList):  # all samples share one class
        return classList[0]
    if len(dataset[0]) == 1:                             # no features left: majority vote
        return major(classList)
    bestFeat = choose(dataset)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [E[bestFeat] for E in dataset]
    Uniq = set(featValues)
    for value in Uniq:
        subLabels = labels[:]  # copy, so recursive calls do not clobber labels
        myTree[bestFeatLabel][value] = createTree(splitDataset(dataset, bestFeat, value), subLabels)
    return myTree
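Putting it all together on the hypothetical toy dataset; the feature names below are invented for illustration. Since createTree deletes entries from labels, a copy is passed in:

featNames = ['no surfacing', 'flippers']  # hypothetical feature names
print(createTree(toy, featNames[:]))      # pass a copy: createTree mutates the list
# expected (key order may vary):
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}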
This completes the construction of the decision tree. The ID3 algorithm still has many problems, the most common being overfitting; the C4.5 and CART algorithms improve on it.