Tid | Refund | Marital Status | Taxable Income | Cheat |
---|---|---|---|---|
1 | yes | single | 125k | no |
2 | no | married | 100k | no |
3 | no | single | 70k | no |
4 | yes | married | 120k | no |
5 | no | divorced | 95k | yes |
6 | no | married | 60k | no |
7 | yes | divorced | 220k | no |
8 | no | single | 85k | yes |
9 | no | married | 75k | no |
10 | no | single | 90k | yes |
At first I had no idea where to start, so I felt my way through it step by step. I began by saving the data the teacher gave us into an Excel spreadsheet, read it in with pandas' read_excel() function, and pulled out each attribute column, but after that I got stuck (I blame myself for not connecting this lab with what was covered in class). So I searched Baidu for how to implement a decision tree from scratch and found it could be done with the ID3 algorithm. After reading up on what ID3 actually is, I realized it is exactly what the teacher covered: compute information gain from information entropy, split on the attribute with the largest information gain, and repeat. The blog post "ID3算法数值分析过程" walks through the numbers very clearly, so I started writing the entropy and information gain calculations.
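For reference, these are the standard ID3 quantities the code below works toward, written for a dataset $D$ with the binary yes/no label and a candidate attribute $A$ (standard textbook definitions, nothing specific to this assignment):

$$
H(D) = -\sum_{k} p_k \log_2 p_k,
\qquad
\operatorname{Gain}(D, A) = H(D) - \sum_{v \in \operatorname{values}(A)} \frac{|D_v|}{|D|}\, H(D_v)
$$

Here $D_v$ is the subset of records whose attribute $A$ takes the value $v$; the attribute with the largest gain is chosen as the split, and the process repeats on each subset.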
```python
import pandas as pd
import numpy as np


def get_entropy(data, name):
    """Weighted (conditional) entropy of the 'Cheat' label after splitting on attribute `name`."""
    total = data.shape[0]
    # Unique values of this attribute column
    data_items = data[name].unique().tolist()
    entropy_items = 0
    for item in data_items:
        # Subset of records that take this attribute value
        data_item = data[data[name] == item]
        sums_item_no = data_item[data_item['Cheat'] == 'no'].shape[0]
        sums_item_yes = data_item[data_item['Cheat'] == 'yes'].shape[0]
        sums_item_no_p = sums_item_no / (sums_item_no + sums_item_yes)
        sums_item_yes_p = sums_item_yes / (sums_item_no + sums_item_yes)
        # Entropy of this subset; a pure subset (all no or all yes) has entropy 0
        if sums_item_no_p == 0 or sums_item_yes_p == 0:
            entropy_item = 0
        else:
            entropy_item = (-np.log2(sums_item_no_p) * sums_item_no_p
                            - np.log2(sums_item_yes_p) * sums_item_yes_p)
        # Proportion of records taking this attribute value
        item_p = data_item.shape[0] / total
        # Accumulate the weighted entropy over all attribute values
        entropy_items += item_p * entropy_item
    return entropy_items


if __name__ == '__main__':
    inputfile = r'D:\shujuwajue\data.xls'
    data = pd.read_excel(inputfile, index_col='Tid')
    # Unique values of each attribute column
    refunds = data['Refund'].unique().tolist()
    print(refunds)
    marital_status = data['Marital Status'].unique().tolist()
    print(marital_status)
    taxable_income = data['Taxable Income'].unique().tolist()
    print(taxable_income)
    cheat = data['Cheat'].unique().tolist()
    print(cheat)
    # Total number of records
    sums = data.shape[0]
    print(sums)
    # Number of records whose Cheat label is no / yes
    sums_no = data[data['Cheat'] == 'no'].shape[0]
    sums_yes = data[data['Cheat'] == 'yes'].shape[0]
    print(sums_no, sums_yes)
```
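Since get_entropy() returns the weighted entropy after a split, the information gain of an attribute is just the label entropy minus that value. Here is a rough sketch of how the next step could look; get_label_entropy is a hypothetical helper I added for illustration, and the attribute names are the column names from the table above:

```python
def get_label_entropy(data):
    # Entropy of the Cheat label itself, H(D)
    p_no = (data['Cheat'] == 'no').mean()
    p_yes = 1 - p_no
    if p_no == 0 or p_yes == 0:
        return 0
    return -p_no * np.log2(p_no) - p_yes * np.log2(p_yes)


# Information gain of each candidate attribute: H(D) minus the weighted
# entropy returned by get_entropy(); split on the attribute with the largest gain.
base_entropy = get_label_entropy(data)
gains = {name: base_entropy - get_entropy(data, name)
         for name in ['Refund', 'Marital Status', 'Taxable Income']}
best_attribute = max(gains, key=gains.get)
print(gains, best_attribute)
```

Note that treating Taxable Income as a plain categorical attribute makes every value unique in this small table, so in a fuller implementation it would need to be discretized (e.g. by a threshold) before computing its gain.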