
Machine Learning Algorithms: Random Forest


  • 1. Definition:

A random forest is a classifier that trains and predicts on samples with many decision trees; it can be used for both regression and classification. Random forest is an ensemble algorithm built on multiple decision trees. The common decision-tree algorithms are ID3 (feature selection by information gain), C4.5 (information gain ratio = g(D,A)/H(A)), and CART (Gini index).

The larger a feature's information gain, the stronger its ability to reduce the entropy of the samples, i.e. the better that feature is at moving the data from uncertainty toward certainty.
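As a quick illustration of these quantities, here is a minimal sketch that computes the entropy H(D), the information gain g(D,A), and the C4.5 gain ratio g(D,A)/H(A) on a made-up label vector and a made-up binary feature (all names and numbers below are illustrative only):

import numpy as np

def entropy(labels):
    # H(D) = -sum(p * log2(p)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# toy labels y and a toy binary feature a (hypothetical data)
y = np.array([1, 1, 1, 0, 0, 0, 1, 0])
a = np.array([0, 0, 0, 0, 1, 1, 1, 1])

h_d = entropy(y)                                                         # H(D)
h_d_a = sum((a == v).mean() * entropy(y[a == v]) for v in np.unique(a))  # H(D|A)
gain = h_d - h_d_a                                                       # g(D, A)
gain_ratio = gain / entropy(a)                                           # C4.5: g(D, A) / H(A)
print(gain, gain_ratio)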

  • 2. Bagging and Boosting in random forests: concepts and differences:

Ensemble learning methods divide into bagging algorithms and boosting algorithms; random forest belongs to the bagging family of ensemble learning.

The bagging algorithm proceeds as follows:

   1. From the original sample set, use Bootstrapping (the bootstrap method, i.e. sampling with replacement) to randomly draw N training samples; repeat for K rounds to obtain K training sets (the K training sets are mutually independent and may contain repeated elements) — see the sampling sketch after this list.

   2. Train one model on each of the K training sets (the model depends on the problem, e.g. a decision tree or kNN), giving K models.

   3. For classification, the final result is produced by voting; for regression, the mean of the K models' predictions is taken as the final prediction.
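A minimal sketch of step 1, bootstrap sampling (the helper name bootstrap_samples and the toy matrix are made up purely for illustration and are not tied to any library API):

import numpy as np

def bootstrap_samples(X, y, k, rng=None):
    # draw k bootstrap training sets: each round samples N rows with replacement
    rng = np.random.default_rng(rng)
    n = len(X)
    sets = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)   # row indices may repeat
        sets.append((X[idx], y[idx]))
    return sets

# hypothetical usage: 5 resampled training sets from a toy matrix
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)
train_sets = bootstrap_samples(X, y, k=5, rng=0)
print(len(train_sets), train_sets[0][0].shape)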

Boosting:

Each training sample is given a weight Wi that controls how much attention it receives. When a sample has a high probability of being misclassified, its weight is increased. This is iterated repeatedly; every iteration produces a small weak classifier, and at the end the weak classifiers are combined by some strategy into the final model (AdaBoost assigns each weak classifier a weight and combines them linearly into the final classifier).
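A simplified sketch of a single reweighting round in the discrete-AdaBoost style (the function name and the toy numbers are made up, and this shows only the weight-update step, not the full algorithm):

import numpy as np

def boosting_round(weights, y_true, y_pred):
    # misclassified samples get larger weights so the next weak learner focuses on them
    miss = (y_true != y_pred).astype(float)
    err = np.sum(weights * miss) / np.sum(weights)   # weighted error rate of this weak learner
    alpha = np.log((1.0 - err) / max(err, 1e-10))    # weight given to this weak classifier
    weights = weights * np.exp(alpha * miss)         # boost the mistakes
    return weights / weights.sum(), alpha

# toy round: the third sample was misclassified, so its weight grows
w = np.full(4, 0.25)
y = np.array([1, 0, 1, 1])
pred = np.array([1, 0, 0, 1])
print(boosting_round(w, y, pred))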

Differences:

            1. Bagging draws each training set by random sampling with replacement; in Boosting the training set stays the same from round to round and only the sample weights change.

            2. Bagging samples uniformly (all samples carry equal weight); Boosting adjusts the sample weights according to the error rate.

            3. In Bagging every predictor gets equal weight; in Boosting a predictor with a smaller error gets a larger weight.

Bagging + decision trees = random forest

AdaBoost + decision trees = boosted trees

Gradient Boosting + decision trees = GBDT
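In scikit-learn these three combinations map onto ready-made ensemble classes. A minimal sketch on synthetic data (the toy dataset and parameter values are made up for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=20, random_state=0)  # toy data

models = {
    'Bagging + decision trees (random forest)': RandomForestClassifier(n_estimators=50, random_state=0),
    'AdaBoost + decision trees (boosted trees)': AdaBoostClassifier(n_estimators=50, random_state=0),
    'Gradient boosting + decision trees (GBDT)': GradientBoostingClassifier(n_estimators=50, random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y).mean())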

  • Summary: when a random forest is used for classification, N decision trees each classify the sample, and the final class is obtained from their individual results by a simple majority vote.

ExtraTrees (extremely randomized trees) versus random forest: ET builds each tree from all of the training samples, and when splitting a node it chooses the split value completely at random,

whereas a random forest searches a random subset of features for the best attribute to split on.
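For comparison, a brief sketch of the two corresponding sklearn classes (parameter choices here are illustrative only):

from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# random forest: bootstrap resampling + best split searched within a random feature subset
rf = RandomForestClassifier(n_estimators=50, random_state=0)
# ExtraTrees: bootstrap=False by default (each tree sees all samples) and split thresholds are drawn at random
et = ExtraTreesClassifier(n_estimators=50, random_state=0)
print(rf.bootstrap, et.bootstrap)   # True False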

To validate the results we rely on the cross-validation utilities in the machine-learning package: the training data is divided into 9 training folds and 1 test fold, the folds are rotated through in a loop, and the mean of the resulting scores is reported as the final result.
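A minimal KFold sketch of that 9-train / 1-test rotation (the toy array is made up; only the fold sizes are being shown):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(50, 2)          # hypothetical data: 50 rows
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    print(len(train_idx), len(test_idx))   # 45 training rows, 5 test rows each round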

Parameters of the random-forest constructor:

class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)

A few of the important parameters:

   n_estimators: default 10, the number of decision trees in the forest.

criterion: default 'gini', the function used to measure split quality (Gini impurity; 'entropy' uses information gain instead).

### Random forest on the sonar.all-data dataset, using sklearn
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

def load_data(filename):
    data_set = []
    with open(filename, 'r') as file:
        for line in file.readlines():
            data_set.append(line.strip('\n').split(','))
    return data_set

def column_to_float(dataSet):
    # convert the feature columns to float and map the label column ('R'/'M') to 1/0
    featLen = len(dataSet[0]) - 1
    X = []
    y = []
    for data in dataSet:
        for column in range(featLen):
            data[column] = float(data[column].strip())
        if data[-1] == 'R':
            y.append(1)
        elif data[-1] == 'M':
            y.append(0)
        X.append(np.array(data[0:featLen]))  # keep all feature columns
    y = np.array(y)
    return X, y

if __name__ == '__main__':
    dataSet = load_data('sonar.all-data')
    X, y = column_to_float(dataSet)
    # With sklearn's random forest and cross-validation the mean score is around 0.6;
    # using more decision trees improves the accuracy slightly.
    clf2 = RandomForestClassifier(n_estimators=10, max_depth=15, max_features=15,
                                  min_samples_split=2, random_state=0)
    scores2 = cross_val_score(clf2, X, y)
    print(scores2.mean())

With these parameters, the random forest plus cross-validation reaches a prediction accuracy of about 0.63.

Hand-coding the random forest with CART:

     The final accuracy hovers around 0.64.

### Random forest on the sonar.all-data dataset, hand-coded with CART
import csv
from random import randrange
from random import seed

def loadCSV(filename):
    # load the data, storing each row in a list
    dataSet = []
    with open(filename, 'r') as file:
        csvReader = csv.reader(file)
        for line in csvReader:
            dataSet.append(line)
    return dataSet

# convert every column except the label column to float
def column_to_float(dataSet):
    featLen = len(dataSet[0]) - 1
    for data in dataSet:
        for column in range(featLen):
            data[column] = float(data[column].strip())

def splitDataSet(dataSet, n_folds):
    '''
    Split the data into equally sized folds. Every fold must hold the same
    number of rows, so when the total row count is not divisible by n_folds
    the leftover rows are discarded.
    :param dataSet:
    :param n_folds:
    :return:
    '''
    print(len(dataSet))
    fold_size = int(len(dataSet) / n_folds)
    dataSet_copy = list(dataSet)
    dataSet_split = []
    for i in range(n_folds):
        fold = []
        while len(fold) < fold_size:  # while, not if: keep drawing rows until the fold is full
            index = randrange(len(dataSet_copy))
            fold.append(dataSet_copy.pop(index))  # pop() removes the chosen row and returns it
        dataSet_split.append(fold)
    return dataSet_split

def get_subsample(dataSet, ratio):
    '''
    Build a random data subset: the bagging sample used to grow one tree
    (rows are drawn with replacement, so duplicates are allowed).
    :param dataSet:
    :param ratio: fraction of rows to draw (float)
    :return:
    '''
    subdataSet = []
    lenSubdata = round(len(dataSet) * ratio)
    while len(subdataSet) < lenSubdata:
        index = randrange(len(dataSet))  # draw over the full range of rows
        subdataSet.append(dataSet[index])
    return subdataSet

def data_split(dataSet, index, value):
    '''
    Split the data into two branches on the given feature and threshold value.
    :param dataSet:
    :param index:
    :param value:
    :return:
    '''
    left = []
    right = []
    for row in dataSet:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right

def split_loss(left, right, class_values):
    # Gini-style impurity of a candidate split: sum of p * (1 - p) over both branches
    loss = 0.0
    for class_value in class_values:
        left_size = len(left)
        if left_size != 0:
            prop = [row[-1] for row in left].count(class_value) / float(left_size)
            loss += (prop * (1.0 - prop))
        right_size = len(right)
        if right_size != 0:
            prop = [row[-1] for row in right].count(class_value) / float(right_size)
            loss += (prop * (1.0 - prop))
    return loss

def get_best_split(dataSet, n_features):
    '''
    Pick n_features features at random and, among them, choose the feature
    and threshold that give the best (lowest-loss) split.
    :param dataSet:
    :param n_features:
    :return:
    '''
    features = []
    class_values = list(set(row[-1] for row in dataSet))
    b_index, b_value, b_loss, b_left, b_right = 999, 999, 999, None, None
    while len(features) < n_features:
        index = randrange(len(dataSet[0]) - 1)
        if index not in features:
            features.append(index)  # randomly chosen candidate feature columns
    for index in features:  # find the column whose best threshold has the lowest split loss
        for row in dataSet:
            left, right = data_split(dataSet, index, row[index])  # split on this value
            loss = split_loss(left, right, class_values)
            if loss < b_loss:  # keep the split with the smallest cost
                b_index, b_value, b_loss, b_left, b_right = index, row[index], loss, left, right
    return {'index': b_index, 'value': b_value, 'left': b_left, 'right': b_right}

def decide_label(data):
    # majority class of the rows reaching a leaf
    output = [row[-1] for row in data]
    return max(set(output), key=output.count)

def sub_split(root, n_features, max_depth, min_size, depth):
    '''
    Keep splitting recursively to build one decision tree.
    :param root:
    :param n_features:
    :param max_depth:
    :param min_size:
    :return:
    '''
    left = root['left']
    right = root['right']
    del(root['left'])
    del(root['right'])
    if not left or not right:
        root['left'] = root['right'] = decide_label(left + right)
        return
    if depth > max_depth:
        root['left'] = decide_label(left)
        root['right'] = decide_label(right)
        return
    if len(left) < min_size:
        root['left'] = decide_label(left)
    else:
        root['left'] = get_best_split(left, n_features)
        sub_split(root['left'], n_features, max_depth, min_size, depth + 1)
    if len(right) < min_size:
        root['right'] = decide_label(right)
    else:
        root['right'] = get_best_split(right, n_features)
        sub_split(root['right'], n_features, max_depth, min_size, depth + 1)

def build_tree(dataSet, n_features, max_depth, min_size):
    '''
    Build one decision tree.
    :param dataSet:
    :param n_features:
    :param max_depth:
    :param min_size:
    :return:
    '''
    root = get_best_split(dataSet, n_features)
    sub_split(root, n_features, max_depth, min_size, 1)
    return root

def predict(tree, row):
    # walk down the tree until a leaf (non-dict) label is reached
    if row[tree['index']] < tree['value']:
        if isinstance(tree['left'], dict):
            return predict(tree['left'], row)
        else:
            return tree['left']
    else:
        if isinstance(tree['right'], dict):
            return predict(tree['right'], row)
        else:
            return tree['right']

def bagging_predict(trees, row):
    # majority vote over all trees in the forest
    predictions = [predict(tree, row) for tree in trees]
    return max(set(predictions), key=predictions.count)

def random_forest(train, test, ratio, n_features, max_depth, min_size, n_trees):
    '''
    Random forest prediction; each individual tree is a CART tree.
    :param train:
    :param test:
    :param ratio:
    :param n_features:
    :param max_depth:
    :param min_size:
    :param n_trees:
    :return:
    '''
    trees = []
    for i in range(n_trees):
        sample = get_subsample(train, ratio)  # draw a fresh subsample from the training fold for each tree
        tree = build_tree(sample, n_features, max_depth, min_size)
        trees.append(tree)
    predict_values = [bagging_predict(trees, row) for row in test]
    return predict_values

def accuracy(predict_values, actual):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predict_values[i]:
            correct += 1
    return correct / float(len(actual))

if __name__ == '__main__':
    seed(1)
    dataSet = loadCSV('sonar.all-data')
    column_to_float(dataSet)
    n_folds = 5  # number of cross-validation folds
    max_depth = 15
    min_size = 1
    ratio = 1.0
    n_features = 15
    n_trees = 10
    ### each fold gets an equal number of rows; rows that do not divide evenly are dropped
    dataSetChunk = splitDataSet(dataSet, n_folds)
    scores = []
    for chunk in dataSetChunk:
        train_set = dataSetChunk[:]
        train_set.remove(chunk)
        train_set = sum(train_set, [])
        test_set = []
        for row in chunk:
            row_copy = list(row)
            row_copy[-1] = None  # hide the label from the model
            test_set.append(row_copy)
        actual = [row[-1] for row in chunk]
        predict_values = random_forest(train_set, test_set, ratio, n_features, max_depth, min_size, n_trees)
        accur = accuracy(predict_values, actual)
        scores.append(accur)
    print('Trees is %d' % n_trees)
    print('scores:%s' % scores)
    print('mean score:%s' % (sum(scores) / float(len(scores))))

Dataset link:
