Random Forest and Its Python Implementation (Python random forest regression, leave-one-out cross-validation)

Author: 小丑西瓜9 | 2024-02-22 20:59:35


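Introduction

Random forests can be used to identify the most important features of a dataset and to perform classification and regression tasks.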

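1. Random forests and their characteristics

Depending on how the individual learners are generated, current ensemble learning methods fall roughly into two classes: sequential methods, in which the individual learners depend strongly on one another and must be generated serially, and parallel methods, in which there is no strong dependence and the learners can be generated at the same time. Boosting is representative of the former, Bagging of the latter.

A random forest builds a Bagging ensemble with decision trees as the base learners and additionally introduces random attribute selection (that is, random feature selection) into the training of each tree.

Put simply, a random forest is a Bagging ensemble of decision trees with randomized feature selection.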

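Characteristics:
1. Randomly sample the training examples (sampling with replacement);
2. Randomly select features;
3. Build the decision trees;
4. Combine the trees by voting (classification) or averaging (regression).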
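The four steps above can be sketched by hand on top of scikit-learn's `DecisionTreeClassifier`. This is only an illustrative sketch, not how `RandomForestClassifier` is implemented internally; the helper names `fit_tiny_forest` and `predict_tiny_forest` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def fit_tiny_forest(X, y, n_trees=25, seed=0):
    """Hypothetical helper: one bootstrap sample and one randomized tree per iteration."""
    rng = np.random.RandomState(seed)
    trees = []
    for _ in range(n_trees):
        # Step 1: sample rows with replacement (bootstrap)
        rows = rng.randint(0, len(X), size=len(X))
        # Step 2: each split only looks at a random subset of sqrt(n_features) features
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=rng.randint(1_000_000))
        # Step 3: build the tree on the bootstrap sample
        trees.append(tree.fit(X[rows], y[rows]))
    return trees

def predict_tiny_forest(trees, X):
    """Step 4: majority vote over the predictions of all trees."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)  # (n_trees, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])

X, y = load_iris(return_X_y=True)
trees = fit_tiny_forest(X, y)
pred = predict_tiny_forest(trees, X)
train_accuracy = (pred == y).mean()
print(train_accuracy)
```

For regression, step 4 would average the tree outputs instead of voting.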

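Example:

Suppose we want to predict salary. We build several decision trees on features such as job, age and house. For the sample to be predicted (job = teacher, age = 39, house = suburb), each tree yields a probability for the target, for example $P(\text{salary} < 5000 \mid \text{job} = \text{teacher})$, and these are combined into the final predicted probability (for example $P(\text{salary} < 5000) = 0.3$).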

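Random forest parameters:

The two most important parameters are n_estimators and max_features.

1. n_estimators: the number of trees in the forest.

In theory, more trees are better, but computation time grows accordingly, and the results stop improving noticeably beyond a certain number of trees. In practice, pick a reasonable number of trees that balances accuracy against cost.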
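A quick way to see this trade-off is to cross-validate the same forest with different numbers of trees. The sketch below uses a synthetic dataset from `make_classification`; the dataset and the values 1, 10 and 50 are arbitrary choices for illustration, and the exact scores depend on the scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)

results = {}
for n in (1, 10, 50):
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    # Mean accuracy over 5 folds for this number of trees
    results[n] = cross_val_score(clf, X, y, cv=5).mean()
    print(n, round(results[n], 3))
```

Timing each run (for example with `time.perf_counter`) shows the cost side of the same trade-off.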

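2. max_features: the number of randomly selected features considered when splitting a node.

At each split, the tree searches only these max_features randomly chosen features for the "best" one, that is, the feature and threshold whose split gives the largest gain. The smaller max_features is, the more the variance is reduced, but the more the bias increases.
Empirically good starting values are max_features = n_features for regression and max_features = sqrt(n_features) for classification, where n_features is the number of input features.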
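The effect of these two settings can be checked on a fitted forest: each tree exposes the resolved feature count as `max_features_`. The sketch below passes `max_features` explicitly (the library defaults have changed between scikit-learn versions) and uses synthetic data with 16 features.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic data with 16 input features, used only for illustration
Xc, yc = make_classification(n_samples=200, n_features=16, random_state=0)
Xr, yr = make_regression(n_samples=200, n_features=16, random_state=0)

# Classification: sqrt(n_features) candidate features per split
clf = RandomForestClassifier(n_estimators=5, max_features="sqrt",
                             random_state=0).fit(Xc, yc)
# Regression: all features (a float is read as a fraction of n_features)
reg = RandomForestRegressor(n_estimators=5, max_features=1.0,
                            random_state=0).fit(Xr, yr)

clf_k = clf.estimators_[0].max_features_
reg_k = reg.estimators_[0].max_features_
print(clf_k, reg_k)  # 4 16
```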

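Other parameters:

3. max_depth: the maximum depth of a tree.

If max_depth=None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. Combined with min_samples_split=2 (the smallest integer value scikit-learn accepts), the trees are fully grown and can become very deep, possibly overfitting.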
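To make the depth behaviour concrete, the sketch below grows one unrestricted tree and one tree capped at depth 3 on noisy synthetic data (`flip_y=0.1` flips 10% of the labels). The dataset and the cap of 3 are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data, used only for illustration
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.1,
                           random_state=0)

# Fully grown tree: no depth limit, split down to 2 samples
full = DecisionTreeClassifier(max_depth=None, min_samples_split=2,
                              random_state=0).fit(X, y)
# Shallow tree: at most 3 levels of splits
capped = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

full_depth, capped_depth = full.get_depth(), capped.get_depth()
full_train, capped_train = full.score(X, y), capped.score(X, y)
print(full_depth, full_train)
print(capped_depth, capped_train)
```

The fully grown tree fits the training data (including the flipped labels) at least as closely as the capped one, which is the overfitting risk described above; a held-out set is needed to judge which generalizes better.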

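4. bootstrap: whether to use bootstrap sampling; the default is True for random forests.

If bootstrap=True, each tree is trained on samples drawn at random with replacement.
For extra-trees, the default is bootstrap=False.

Extra trees (Extremely Randomized Trees, ET) differ from random forests in two ways:

1. A random forest uses Bagging, whereas ET by default trains every tree on all the training samples, so each tree sees the same full training set;

2. A random forest searches a random subset of features for the best split, whereas ET draws the split threshold for each candidate feature at random (keeping the best of these random splits) instead of searching for the optimal threshold.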
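Both differences can be read off the scikit-learn estimators themselves. The following sketch inspects the default `bootstrap` flag and the `splitter` of the fitted base trees; the synthetic data is only there so the forests can be fitted.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=8, random_state=0)

rf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)
et = ExtraTreesClassifier(n_estimators=5, random_state=0).fit(X, y)

# Difference 1: bootstrap sampling is on for random forests, off for extra-trees
print(rf.bootstrap, et.bootstrap)

# Difference 2: the base trees search for the best threshold vs. draw it at random
rf_splitter = rf.estimators_[0].splitter
et_splitter = et.estimators_[0].splitter
print(rf_splitter, et_splitter)
```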

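When training a random forest, cross-validation is recommended: split the data into n equal parts, use each part in turn as the validation set, train the forest on the remaining data, and score it on the held-out part. This gives n results, which are averaged to obtain the final score.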
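As a sketch of this procedure, the block below runs 5-fold and leave-one-out cross-validation for a random forest regressor. The synthetic dataset and the choice of mean absolute error are assumptions; MAE is used because R² is not well defined on a single held-out sample, which is what leave-one-out produces.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Small synthetic regression problem so leave-one-out stays fast
X, y = make_regression(n_samples=40, n_features=5, noise=10.0, random_state=0)
rf = RandomForestRegressor(n_estimators=20, random_state=0)

# n-fold cross-validation with n = 5
kfold_scores = cross_val_score(
    rf, X, y, scoring="neg_mean_absolute_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Leave-one-out: every sample is the validation set exactly once
loo_scores = cross_val_score(
    rf, X, y, scoring="neg_mean_absolute_error", cv=LeaveOneOut())

# scikit-learn reports errors as negative scores, so flip the sign
print(len(kfold_scores), -kfold_scores.mean())
print(len(loo_scores), -loo_scores.mean())
```

Leave-one-out fits the model once per sample, so it is only practical for small datasets.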

2. Random forests in Python

2.1 Demo 1: using the random forest regressor

Basic random forest usage.

# Random forest

from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_iris

iris = load_iris()
# The 4 iris attributes are sepal length, sepal width, petal length and petal width;
# the label is the species: setosa, versicolor, virginica
print(iris['target'].shape)
rf = RandomForestRegressor()  # default parameters
rf.fit(iris.data[:150], iris.target[:150])  # train the model

# Pick two samples and predict them
instance = iris.data[[100, 109]]
print(instance)
print('instance 0 prediction；', rf.predict(instance[[0]]))
print('instance 1 prediction；', rf.predict(instance[[1]]))
print(iris.target[100], iris.target[109])

Output:

(150,)
[[ 6.3  3.3  6.   2.5]
 [ 7.2  3.6  6.1  2.5]]
instance 0 prediction； [ 2.]
instance 1 prediction； [ 2.]
2 2
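Because the iris target is a class label rather than a continuous quantity, the same demo can also be written with `RandomForestClassifier`, which additionally returns class probabilities. This is a sketch of an alternative, not the demo the output above came from; `n_estimators=100` and `random_state=0` are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(iris.data, iris.target)

# Same two samples as in the regression demo
instance = iris.data[[100, 109]]
labels = clf.predict(instance)        # predicted class indices
proba = clf.predict_proba(instance)   # one probability per class and sample

print(iris.target_names[labels])
print(proba)
```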

2.2 Demo 2: comparing the random forest classifier, a decision tree and the extra-trees classifier

Comparison of the three methods

#random forest test

from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_blobs(n_samples=10000, n_features=10, centers=100,random_state=0)

clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2,random_state=0)
scores = cross_val_score(clf, X, y)
print(scores.mean())                             


clf = RandomForestClassifier(n_estimators=10, max_depth=None,min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y)
print(scores.mean())                             

clf = ExtraTreesClassifier(n_estimators=10, max_depth=None,min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y)
print(scores.mean())


Output:

0.979408793821 #DecisionTreeClassifier
0.999607843137 #RandomForestClassifier
0.999898989899 #ExtraTreesClassifier

2.3 Random forest regressor: feature selection

# Random forest 2
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, ShuffleSplit

iris = load_iris()
X = iris["data"]
Y = iris["target"]
names = iris["feature_names"]

rf = RandomForestRegressor()
scores = []
# Score each feature on its own: 3 random splits, 30% of the data held out each time
for i in range(X.shape[1]):
    score = cross_val_score(rf, X[:, i:i+1], Y, scoring="r2",
                            cv=ShuffleSplit(n_splits=3, test_size=.3))
    scores.append((round(float(np.mean(score)), 3), names[i]))

print(sorted(scores, reverse=True))

Output:

[(0.89300000000000002, 'petal width (cm)'), (0.82099999999999995, 'petal length (cm)'), (0.13, 'sepal length (cm)'), (-0.79100000000000004, 'sepal width (cm)')]
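A complementary way to rank features is the `feature_importances_` attribute of a forest fitted on all features at once. The sketch below is an alternative to the per-feature cross-validation above, not a reproduction of it; the importances are impurity-based and sum to 1.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor

iris = load_iris()
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(iris.data, iris.target)

# Pair each importance with its feature name and sort, most important first
ranking = sorted(
    zip((round(float(v), 3) for v in rf.feature_importances_), iris.feature_names),
    reverse=True)
print(ranking)
```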

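2.4 Demo 4: random forest from scratch
I originally wanted to build the random forest's decision trees with the code below, but the program kept running without ever responding; it still needs debugging.

# Random forest 4
# coding:utf-8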
import csv  
from random import seed  
from random import randrange  
from math import sqrt  
  
def loadCSV(filename):  # load the data row by row into a list
    dataSet = []  
    with open(filename, 'r') as file:  
        csvReader = csv.reader(file)  
        for line in csvReader:  
            dataSet.append(line)  
    return dataSet  
  
# Convert every column except the label column to float
def column_to_float(dataSet):  
    featLen = len(dataSet[0]) - 1  
    for data in dataSet:  
        for column in range(featLen):  
            data[column] = float(data[column].strip())  
  
# Randomly split the data set into n_folds parts for cross-validation: one part is the test set, the others form the training set
def spiltDataSet(dataSet, n_folds):  
    fold_size = int(len(dataSet) / n_folds)  
    dataSet_copy = list(dataSet)  
    dataSet_spilt = []  
    for i in range(n_folds):  
        fold = []  
        while len(fold) < fold_size:  # must be while, not if: if would test only once, while loops until the condition fails
            index = randrange(len(dataSet_copy))
            fold.append(dataSet_copy.pop(index))  # pop(index) removes the element at index from the list and returns it
        dataSet_spilt.append(fold)  
    return dataSet_spilt  
  
# Build a bootstrap subsample (sampling with replacement)
def get_subsample(dataSet, ratio):
    subdataSet = []
    lenSubdata = round(len(dataSet) * ratio)  # round() returns an int in Python 3
    while len(subdataSet) < lenSubdata:
        index = randrange(len(dataSet))  # randrange already excludes the upper bound
        subdataSet.append(dataSet[index])
    # print(len(subdataSet))
    return subdataSet
  
# Split the data set on one feature
def data_spilt(dataSet, index, value):  
    left = []  
    right = []  
    for row in dataSet:  
        if row[index] < value:  
            left.append(row)  
        else:  
            right.append(row)  
    return left, right  
  
# Compute the cost of a split
def spilt_loss(left, right, class_values):
    loss = 0.0
    for class_value in class_values:
        left_size = len(left)
        if left_size != 0:  # avoid division by zero
            prop = [row[-1] for row in left].count(class_value) / float(left_size)  
            loss += (prop * (1.0 - prop))  
        right_size = len(right)  
        if right_size != 0:  
            prop = [row[-1] for row in right].count(class_value) / float(right_size)  
            loss += (prop * (1.0 - prop))  
    return loss  
  
# Pick n_features features at random and choose the best split among them
def get_best_spilt(dataSet, n_features):
    features = []
    class_values = list(set(row[-1] for row in dataSet))
    b_index, b_value, b_loss, b_left, b_right = 999, 999, 999, None, None
    # Guard: asking for more features than exist would make the loop below run forever
    n_features = min(int(n_features), len(dataSet[0]) - 1)
    while len(features) < n_features:
        index = randrange(len(dataSet[0]) - 1)
        if index not in features:
            features.append(index)
    # print('features:', features)
    for index in features:  # find the feature and threshold with the smallest loss
        for row in dataSet:
            left, right = data_spilt(dataSet, index, row[index])  # left and right branches for this split
            loss = spilt_loss(left, right, class_values)
            if loss < b_loss:  # keep the split with the smallest cost
                b_index, b_value, b_loss, b_left, b_right = index, row[index], loss, left, right
    # print(b_loss)
    # print(type(b_index))
    return {'index': b_index, 'value': b_value, 'left': b_left, 'right': b_right}
  
# Decide the output label (majority class)
def decide_label(data):  
    output = [row[-1] for row in data]  
    return max(set(output), key=output.count)  
  
  
# Recursive splitting: keep growing child nodes until a leaf is reached
def sub_spilt(root, n_features, max_depth, min_size, depth):  
    left = root['left']  
    # print left  
    right = root['right']  
    del (root['left'])  
    del (root['right'])  
    # print depth  
    if not left or not right:  
        root['left'] = root['right'] = decide_label(left + right)  
        # print 'testing'  
        return  
    if depth > max_depth:  
        root['left'] = decide_label(left)  
        root['right'] = decide_label(right)  
        return  
    if len(left) < min_size:  
        root['left'] = decide_label(left)  
    else:  
        root['left'] = get_best_spilt(left, n_features)  
        # print 'testing_left'  
        sub_spilt(root['left'], n_features, max_depth, min_size, depth + 1)  
    if len(right) < min_size:  
        root['right'] = decide_label(right)  
    else:  
        root['right'] = get_best_spilt(right, n_features)  
        # print 'testing_right'  
        sub_spilt(root['right'], n_features, max_depth, min_size, depth + 1)  
  
# Build a decision tree
def build_tree(dataSet, n_features, max_depth, min_size):  
    root = get_best_spilt(dataSet, n_features)  
    sub_spilt(root, n_features, max_depth, min_size, 1)  
    return root  
# Predict the label of one row
def predict(tree, row):
    if row[tree['index']] < tree['value']:  
        if isinstance(tree['left'], dict):  
            return predict(tree['left'], row)  
        else:  
            return tree['left']  
    else:  
        if isinstance(tree['right'], dict):  
            return predict(tree['right'], row)  
        else:  
            return tree['right']  
            # predictions=set(predictions)  
def bagging_predict(trees, row):  
    predictions = [predict(tree, row) for tree in trees]  
    return max(set(predictions), key=predictions.count)  
# Build the random forest
def random_forest(train, test, ratio, n_features, max_depth, min_size, n_trees):
    trees = []
    for i in range(n_trees):
        subsample = get_subsample(train, ratio)  # draw a fresh subsample of the training set for each tree
        tree = build_tree(subsample, n_features, max_depth, min_size)
        # print('tree %d: ' % i, tree)
        trees.append(tree)
    # predict_values = [predict(trees,row) for row in test]
    predict_values = [bagging_predict(trees, row) for row in test]
    return predict_values
# Compute the accuracy
def accuracy(predict_values, actual):  
    correct = 0  
    for i in range(len(actual)):  
        if actual[i] == predict_values[i]:  
            correct += 1  
    return correct / float(len(actual))  
  
  
if __name__ == '__main__':  
    seed(1)  
    dataSet = loadCSV(r'G:\训练小样本2.csv')  
    column_to_float(dataSet)  
    n_folds = 5  
    max_depth = 15  
    min_size = 1  
    ratio = 1.0  
    # n_features = int(sqrt(len(dataSet[0]) - 1))
    n_features = 15
    n_trees = 10
    folds = spiltDataSet(dataSet, n_folds)  # first split the data set into folds
    scores = []  
    for fold in folds:  
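        # Do not simply write train_set = folds: that only creates a reference, so changing train_set would also change folds; make a copy instead.
        # L[:] copies a sequence, D.copy() copies a dict, and list(L) also creates a copy
        train_set = folds[:]
        train_set.remove(fold)  # everything except the current fold is the training set
        train_set = sum(train_set, [])  # flatten the list of folds into one train_set list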
        test_set = []  
        for row in fold:  
            row_copy = list(row)  
            row_copy[-1] = None  
            test_set.append(row_copy)  
            # for row in test_set:  
            # print row[-1]  
        actual = [row[-1] for row in fold]  
        predict_values = random_forest(train_set, test_set, ratio, n_features, max_depth, min_size, n_trees)  
        accur = accuracy(predict_values, actual)  
        scores.append(accur)  
    print ('Trees is %d' % n_trees)  
    print ('scores:%s' % scores)  
    print ('mean score:%s' % (sum(scores) / float(len(scores))))  
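One possible reason for a run that never finishes: a feature-sampling loop of the form `while len(features) < n_features` cannot terminate when `n_features` (15 in the script) exceeds the number of feature columns in the CSV. Whether that is what happened with the author's data is an assumption, since the file is not shown. A sketch of a sampling helper that cannot loop forever (the name `pick_features` is hypothetical):

```python
from random import sample, seed

def pick_features(n_columns, n_features):
    """Hypothetical helper: choose distinct feature indices without a retry loop."""
    n_available = n_columns - 1           # the last column holds the label
    k = min(n_features, n_available)      # never ask for more features than exist
    return sample(range(n_available), k)  # k distinct indices, drawn at random

seed(1)
row = [0.1, 0.2, 0.3, 0.4, 'M']           # 4 features plus a label
features = pick_features(len(row), 15)    # request 15, only 4 are available
print(sorted(features))                   # [0, 1, 2, 3]
```

Independently of this, the exhaustive split search tries every row of every selected feature at every node, so the run time grows roughly quadratically with the number of rows and can be long even when the loop terminates.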

2.5 Classifying the sonar data with a CART decision tree

# CART on the sonar dataset
from random import seed
from random import randrange
from csv import reader

# Load a CSV file
def load_csv(filename):
	# the with-block closes the file once all rows have been read
	with open(filename, "r") as file:
		dataset = list(reader(file))
	return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
	for row in dataset:
		row[column] = float(row[column].strip())

# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
	dataset_split = list()
	dataset_copy = list(dataset)
	fold_size = int(len(dataset) / n_folds)
	for i in range(n_folds):
		fold = list()
		while len(fold) < fold_size:
			index = randrange(len(dataset_copy))
			fold.append(dataset_copy.pop(index))
		dataset_split.append(fold)
	return dataset_split

# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
	correct = 0
	for i in range(len(actual)):
		if actual[i] == predicted[i]:
			correct += 1
	return correct / float(len(actual)) * 100.0

# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
	folds = cross_validation_split(dataset, n_folds)
	scores = list()
	for fold in folds:
		train_set = list(folds)
		train_set.remove(fold)
		train_set = sum(train_set, [])
		test_set = list()
		for row in fold:
			row_copy = list(row)
			test_set.append(row_copy)
			row_copy[-1] = None
		predicted = algorithm(train_set, test_set, *args)
		actual = [row[-1] for row in fold]
		accuracy = accuracy_metric(actual, predicted)
		scores.append(accuracy)
	return scores

# Split a data set based on an attribute and an attribute value
def test_split(index, value, dataset):
	left, right = list(), list()
	for row in dataset:
		if row[index] < value:
			left.append(row)
		else:
			right.append(row)
	return left, right

# Calculate the Gini index for a split dataset
def gini_index(groups, class_values):
	gini = 0.0
	for class_value in class_values:
		for group in groups:
			size = len(group)
			if size == 0:
				continue
			proportion = [row[-1] for row in group].count(class_value) / float(size)
			gini += (proportion * (1.0 - proportion))
	return gini

# Select the best split point for a dataset
def get_split(dataset):
	class_values = list(set(row[-1] for row in dataset))
	b_index, b_value, b_score, b_groups = 999, 999, 999, None
	for index in range(len(dataset[0])-1):
		for row in dataset:
			groups = test_split(index, row[index], dataset)
			gini = gini_index(groups, class_values)
			if gini < b_score:
				b_index, b_value, b_score, b_groups = index, row[index], gini, groups
	print ({'index':b_index, 'value':b_value})
	return {'index':b_index, 'value':b_value, 'groups':b_groups}

# Create a terminal node value
def to_terminal(group):
	outcomes = [row[-1] for row in group]
	return max(set(outcomes), key=outcomes.count)

# Create child splits for a node or make terminal
def split(node, max_depth, min_size, depth):
	left, right = node['groups']
	del(node['groups'])
	# check for a no split
	if not left or not right:
		node['left'] = node['right'] = to_terminal(left + right)
		return
	# check for max depth
	if depth >= max_depth:
		node['left'], node['right'] = to_terminal(left), to_terminal(right)
		return
	# process left child
	if len(left) <= min_size:
		node['left'] = to_terminal(left)
	else:
		node['left'] = get_split(left)
		split(node['left'], max_depth, min_size, depth+1)
	# process right child
	if len(right) <= min_size:
		node['right'] = to_terminal(right)
	else:
		node['right'] = get_split(right)
		split(node['right'], max_depth, min_size, depth+1)

# Build a decision tree
def build_tree(train, max_depth, min_size):
	root = get_split(train)
	split(root, max_depth, min_size, 1)
	return root

# Make a prediction with a decision tree
def predict(node, row):
	if row[node['index']] < node['value']:
		if isinstance(node['left'], dict):
			return predict(node['left'], row)
		else:
			return node['left']
	else:
		if isinstance(node['right'], dict):
			return predict(node['right'], row)
		else:
			return node['right']

# Classification and Regression Tree Algorithm
def decision_tree(train, test, max_depth, min_size):
	tree = build_tree(train, max_depth, min_size)
	predictions = list()
	for row in test:
		prediction = predict(tree, row)
		predictions.append(prediction)
	return(predictions)

# Test CART on the sonar dataset
seed(1)
# load and prepare data
filename = r'G:\0pythonstudy\决策树\sonar.all-data.csv'
dataset = load_csv(filename)
# convert string attributes to floats
for i in range(len(dataset[0])-1):
	str_column_to_float(dataset, i)
# evaluate algorithm
n_folds = 5
max_depth = 5
min_size = 10
scores = evaluate_algorithm(dataset, decision_tree, n_folds, max_depth, min_size)
print('Scores: %s' % scores)
print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))

Output:

{'index': 38, 'value': 0.0894}
{'index': 36, 'value': 0.8459}
{'index': 50, 'value': 0.0024}
{'index': 15, 'value': 0.0906}
{'index': 16, 'value': 0.9819}
{'index': 10, 'value': 0.0785}
{'index': 16, 'value': 0.0886}
{'index': 38, 'value': 0.0621}
{'index': 5, 'value': 0.0226}
{'index': 8, 'value': 0.0368}
{'index': 11, 'value': 0.0754}
{'index': 0, 'value': 0.0239}
{'index': 8, 'value': 0.0368}
{'index': 29, 'value': 0.1671}
{'index': 46, 'value': 0.0237}
{'index': 38, 'value': 0.0621}
{'index': 14, 'value': 0.0668}
{'index': 4, 'value': 0.0167}
{'index': 37, 'value': 0.0836}
{'index': 12, 'value': 0.0616}
{'index': 7, 'value': 0.0333}
{'index': 33, 'value': 0.8741}
{'index': 16, 'value': 0.0886}
{'index': 8, 'value': 0.0368}
{'index': 33, 'value': 0.0798}
{'index': 44, 'value': 0.0298}
Scores: [48.78048780487805, 70.73170731707317, 58.536585365853654, 51.21951219512195, 39.02439024390244]
Mean Accuracy: 53.659%

Key points:
1. Load the CSV file

from csv import reader
# Load a CSV file
def load_csv(filename):
	# the with-block closes the file once all rows have been read
	with open(filename, "r") as file:
		dataset = list(reader(file))
	return dataset

filename = r'G:\0pythonstudy\决策树\sonar.all-data.csv'
dataset=load_csv(filename)
print(dataset)

2. Convert the data to float

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())
    # print(row[column])

# convert string attributes to floats
for i in range(len(dataset[0])-1):
	str_column_to_float(dataset, i)

3. Convert the class strings in the last column to the integers 0 and 1

def str_column_to_int(dataset, column):
   class_values = [row[column] for row in dataset]  # list of all class labels
   # print(class_values)
   unique = set(class_values)  # set() keeps only the distinct labels
   print(unique)

   lookup = dict()  # maps each label to an integer
   # print(enumerate(unique))
   for i, value in enumerate(unique):
       lookup[value] = i
   # print(lookup)
   for row in dataset:
       row[column] = lookup[row[column]]
   print(lookup['M'])  # integer assigned to the label 'M' in the sonar data
   return lookup
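One caveat with this approach: the iteration order of a set of strings is not guaranteed to be the same from one Python run to the next, so which label ends up as 0 and which as 1 can change between runs. A sketch of a deterministic variant that sorts the labels first (the function name `str_column_to_int_sorted` is hypothetical):

```python
def str_column_to_int_sorted(dataset, column):
    """Map class labels to integers in sorted order, so the mapping is reproducible."""
    labels = sorted({row[column] for row in dataset})
    lookup = {label: i for i, label in enumerate(labels)}
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

rows = [[0.1, 'R'], [0.2, 'M'], [0.3, 'R']]
lookup = str_column_to_int_sorted(rows, -1)
print(lookup)  # {'M': 0, 'R': 1}
print(rows)    # [[0.1, 1], [0.2, 0], [0.3, 1]]
```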


4. Split the data set into k folds

# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
	dataset_split = list()  # start with an empty list
	
	dataset_copy = list(dataset)
	print(len(dataset_copy))
	print(len(dataset))
	#print(dataset_copy)
	fold_size = int(len(dataset) / n_folds)
	for i in range(n_folds):
		fold = list()
		while len(fold) < fold_size:
			index = randrange(len(dataset_copy))
			# print(index)
			fold.append(dataset_copy.pop(index))  # pop() moves the row out of the copy, so no row appears in two folds
		dataset_split.append(fold)
	return dataset_split

n_folds = 5
folds = cross_validation_split(dataset, n_folds)  # k folds with no rows in common

5. Compute the accuracy

# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
	correct = 0
	for i in range(len(actual)):
		if actual[i] == predicted[i]:
			correct += 1
	return correct / float(len(actual)) * 100.0  # accuracy in percent; works for any number of classes

6. Split the rows in two on one column

# Split a data set based on an attribute and an attribute value
def test_split(index, value, dataset):
	left, right = list(), list()  # start with two empty lists
	for row in dataset:
		if row[index] < value:
			left.append(row)
		else:
			right.append(row)
	return left, right  # two lists: rows whose column `index` is below value, and the rest

7. Use the Gini index to find the best split point

# Calculate the Gini index for a split dataset
def gini_index(groups, class_values):
	gini = 0.0
	for class_value in class_values:
		for group in groups:
			size = len(group)
			if size == 0:
				continue
			proportion = [row[-1] for row in group].count(class_value) / float(size)
			gini += (proportion * (1.0 - proportion))
	return gini

# Select the best split point for a dataset
def get_split(dataset):
	class_values = list(set(row[-1] for row in dataset))
	b_index, b_value, b_score, b_groups = 999, 999, 999, None
	for index in range(len(dataset[0])-1):
		for row in dataset:
			groups = test_split(index, row[index], dataset)
			gini = gini_index(groups, class_values)
			if gini < b_score:
				b_index, b_value, b_score, b_groups = index, row[index], gini, groups
	# print(groups)
	print({'index': b_index, 'value': b_value, 'score': b_score})  # report the best score, not the last one computed
	return {'index':b_index, 'value':b_value, 'groups':b_groups}

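The Gini computation applies the definition directly and is easy to follow. Finding the best split point is harder to read: it uses two nested loops, the outer one over columns (features) and the inner one over rows (candidate thresholds), and whenever a candidate split has a lower Gini score, the best split found so far is updated.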

8. Grow the decision tree

# Create child splits for a node or make terminal
def split(node, max_depth, min_size, depth):
	left, right = node['groups']
	del(node['groups'])
	# check for a no split
	if not left or not right:
		node['left'] = node['right'] = to_terminal(left + right)
		return
	# check for max depth
	if depth >= max_depth:
		node['left'], node['right'] = to_terminal(left), to_terminal(right)
		return
	# process left child
	if len(left) <= min_size:
		node['left'] = to_terminal(left)
	else:
		node['left'] = get_split(left)
		split(node['left'], max_depth, min_size, depth+1)
	# process right child
	if len(right) <= min_size:
		node['right'] = to_terminal(right)
	else:
		node['right'] = get_split(right)
		split(node['right'], max_depth, min_size, depth+1)


This uses recursion to grow the left and right subtrees in turn.
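The result of this recursion is a nested dictionary: internal nodes are dicts with `index`, `value`, `left` and `right`, and leaves are plain class labels. A small helper can walk that structure with the same recursive pattern. The name `describe_tree` and the hand-written example tree are hypothetical, chosen only to show the shape.

```python
def describe_tree(node, depth=0):
    """Return one text line per node, indented by depth."""
    indent = "  " * depth
    if isinstance(node, dict):
        # Internal node: show the test, then recurse into both children
        lines = ["%s[X%d < %.3f]" % (indent, node['index'], node['value'])]
        lines += describe_tree(node['left'], depth + 1)
        lines += describe_tree(node['right'], depth + 1)
        return lines
    # Leaf: a plain class label
    return ["%s-> %s" % (indent, node)]

# Hand-written tree in the same format the recursive split produces
tree = {'index': 0, 'value': 2.5,
        'left': 'R',
        'right': {'index': 1, 'value': 0.7, 'left': 'R', 'right': 'M'}}

lines = describe_tree(tree)
print("\n".join(lines))
```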

9. Build the decision tree

# Build a decision tree
def build_tree(train, max_depth, min_size):
	root = get_split(train)
	split(root, max_depth, min_size, 1)
	return root	
	
tree=build_tree(train_set, max_depth, min_size)
print(tree)

10. Predict the test set

# Build a decision tree
def build_tree(train, max_depth, min_size):
	root = get_split(train)  # best split: column index, threshold and groups
	split(root, max_depth, min_size, 1)
	return root

# tree=build_tree(train_set, max_depth, min_size)
# print(tree)

# Make a prediction with a decision tree
def predict(node, row):
	print(row[node['index']])
	print(node['value'])
	if row[node['index']] < node['value']:  # compare the test row with this node's threshold, then descend left or right
		if isinstance(node['left'], dict):  # a dict is an internal node, so keep descending
			return predict(node['left'], row)
		else:
			return node['left']
	else:
		if isinstance(node['right'], dict):
			return predict(node['right'], row)
		else:
			return node['right']

tree = build_tree(train_set, max_depth, min_size)
predictions = list()
for row in test_set:
	prediction = predict(tree, row)
	predictions.append(prediction)

11. Evaluate the decision tree

# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
	folds = cross_validation_split(dataset, n_folds)
	scores = list()
	for fold in folds:
		train_set = list(folds)
		train_set.remove(fold)
		train_set = sum(train_set, [])
		test_set = list()
		for row in fold:
			row_copy = list(row)
			test_set.append(row_copy)
			row_copy[-1] = None
		predicted = algorithm(train_set, test_set, *args)
		actual = [row[-1] for row in fold]
		accuracy = accuracy_metric(actual, predicted)
		scores.append(accuracy)
	return scores	

