Extremely Randomized Trees: ExtraTreesClassifier

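1 Disclaimer

The data in this article comes from the web, and parts of the code are adapted from other sources. I have added comments and extensions for the purpose of technical exchange. If anything here infringes on your rights, please contact the author so it can be dealt with promptly.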

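2 Introduction to ExtraTreesClassifier

The Extremely Randomized Trees Classifier (Extra Trees) is an ensemble learning technique that aggregates the results of many de-correlated decision trees in a forest to produce a classification. Each tree is built from the original training sample (scikit-learn's ExtraTreesClassifier does not bootstrap by default). At each test node, a tree is given a random sample of k features from the feature set and must choose the best of them to split the data according to some mathematical criterion (typically the Gini index). Drawing a fresh random subset of features at every node is what makes the trees de-correlated.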
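
The node-splitting step described above can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's actual implementation: the helper names `gini` and `pick_random_split` are hypothetical. It also assumes, following the scikit-learn documentation, that each candidate feature gets a threshold drawn uniformly at random between its minimum and maximum.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def pick_random_split(X, y, k, rng):
    """Draw k candidate features, give each one a random threshold, and
    return (feature index, threshold, impurity decrease) for the best one."""
    n_samples, n_features = X.shape
    candidates = rng.choice(n_features, size=k, replace=False)
    best = None
    for j in candidates:
        lo, hi = X[:, j].min(), X[:, j].max()
        if lo == hi:
            continue  # constant feature: no split possible
        threshold = rng.uniform(lo, hi)  # random threshold (assumption, see lead-in)
        left = X[:, j] <= threshold
        right = ~left
        weighted = (left.sum() * gini(y[left]) + right.sum() * gini(y[right])) / n_samples
        decrease = gini(y) - weighted
        if best is None or decrease > best[2]:
            best = (int(j), float(threshold), float(decrease))
    return best

# Toy data: the label depends only on feature 0
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)
split = pick_random_split(X, y, k=2, rng=rng)
print(split)
```

Because both the feature subset and the thresholds are random, repeated calls generally return different splits, which is the source of the de-correlation between trees.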

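While the forest is being built, the normalized total reduction in the split criterion (for example, the Gini index) attributable to each feature is computed. This value is called the Gini importance of the feature. For feature selection, sort the features by Gini importance in descending order and keep the top k.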
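
As a minimal sketch of this selection step, the snippet below fits an `ExtraTreesClassifier`, reads its `feature_importances_`, and keeps the indices of the top k features. The synthetic data from `make_classification`, the forest size, and `k = 3` are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data (illustrative assumption): 6 features, 3 of them informative
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

forest = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1
importances = forest.feature_importances_

# Sort in descending order and keep the k most important feature indices
k = 3
top_k = np.argsort(importances)[::-1][:k]
for idx in top_k:
    print(f"feature {idx}: importance {importances[idx]:.3f}")
```

The selected indices can then be used to slice the feature matrix, for example `X[:, top_k]`.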

3 ExtraTreesClassifier code example

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.ensemble import ExtraTreesClassifier

    # Load the tab-separated PlayTennis data and drop the row identifier
    df_pre = pd.read_csv('../input/PlayTennis.txt', sep="\t")
    df = df_pre.drop('Day', axis=1)

    # Convert the categorical columns to numeric codes via dictionary mappings
    weather_mapper = {'Sunny': 1, 'Overcast': 2, 'Rain': 3}
    df['Outlook'] = df['Outlook'].map(weather_mapper)
    temperature_mapper = {'Hot': 1, 'Mild': 2, 'Cool': 3}
    df['Temperature'] = df['Temperature'].map(temperature_mapper)
    humidity_mapper = {'High': 1, 'Normal': 2}
    df['Humidity'] = df['Humidity'].map(humidity_mapper)
    wind_mapper = {'Weak': 1, 'Strong': 0}
    df['Wind'] = df['Wind'].map(wind_mapper)
    playTennis_mapper = {"Yes": 1, "No": 0}
    df['PlayTennis'] = df['PlayTennis'].map(playTennis_mapper)
    print(df.head())

    # Split into X (independent variables) and y (dependent variable)
    y = df['PlayTennis']
    X = df.loc[:, 'Outlook':'Wind']

    # 5 trees, 2 candidate features per split, entropy as the split criterion
    extra_tree_forest = ExtraTreesClassifier(n_estimators=5,
                                             criterion='entropy', max_features=2)
    extra_tree_forest.fit(X, y)

    # Importance of each feature, averaged over the trees (already sums to 1)
    feature_importance = extra_tree_forest.feature_importances_
    # Spread of each feature's importance across the individual trees
    importance_std = np.std([tree.feature_importances_ for tree in
                             extra_tree_forest.estimators_],
                            axis=0)

    # Bar chart comparing the features, with the per-tree spread as error bars
    plt.bar(X.columns, feature_importance, yerr=importance_std)
    plt.xlabel('Feature')
    plt.ylabel('Feature importance')
    plt.title('Comparison of feature importances')
    plt.show()

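4 Worked calculation

Entropy formula:

$$Entropy(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$

where c is the number of unique class labels and p_i is the proportion of rows that belong to class i.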

    -- Build the sample data
    CREATE TABLE PlayTennis(
      DayNo varchar(10),
      Outlook varchar(10),
      Temperature varchar(10),
      Humidity varchar(10),
      Wind varchar(10),
      PlayTennis varchar(10)
    );
    INSERT INTO `PlayTennis`(`DayNo`,`Outlook`,`Temperature`,`Humidity`,`Wind`,`PlayTennis`) VALUES
      ('D1','Sunny','Hot','High','Weak','No'),
      ('D2','Sunny','Hot','High','Strong','No'),
      ('D3','Overcast','Hot','High','Weak','Yes'),
      ('D4','Rain','Mild','High','Weak','Yes'),
      ('D5','Rain','Cool','Normal','Weak','Yes'),
      ('D6','Rain','Cool','Normal','Strong','No'),
      ('D7','Overcast','Cool','Normal','Strong','Yes'),
      ('D8','Sunny','Mild','High','Weak','No'),
      ('D9','Sunny','Cool','Normal','Weak','Yes'),
      ('D10','Rain','Mild','Normal','Weak','Yes'),
      ('D11','Sunny','Mild','Normal','Strong','Yes'),
      ('D12','Overcast','Mild','High','Strong','Yes'),
      ('D13','Overcast','Hot','Normal','Weak','Yes'),
      ('D14','Rain','Mild','High','Strong','No');

    -- Compute the entropy of the PlayTennis label
    WITH CTE1 AS
    (
      -- One row per class: gp = rows in that class, total = all rows.
      -- PlayTennis is selected so DISTINCT cannot merge two classes of equal size.
      SELECT DISTINCT PlayTennis,
             COUNT(PlayTennis) OVER (PARTITION BY PlayTennis) gp,
             total
      FROM PlayTennis, (SELECT COUNT(*) total FROM PlayTennis) A
    )
    SELECT SUM(-(gp/total)*LOG(2,gp/total)) entropy_s
    FROM CTE1;
    -- 0.940285959354754
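
The same entropy value can be cross-checked outside the database. The short Python sketch below applies the entropy definition to the label column of the table above, which contains 9 "Yes" rows and 5 "No" rows. The function name `entropy` is just an illustrative choice.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

# PlayTennis column of the 14-row table: 9 "Yes" and 5 "No"
play_tennis = ["Yes"] * 9 + ["No"] * 5
entropy_s = entropy(play_tennis)
print(round(entropy_s, 6))  # 0.940286
```

The result agrees with the value returned by the SQL query.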

Suppose the first decision tree is given the features Outlook and Temperature. Then:

    -- Information gain of the Outlook feature
    WITH CTE2 AS
    (
      SELECT DISTINCT
             COUNT(PlayTennis) OVER (PARTITION BY Outlook, PlayTennis) gp,  -- rows per (Outlook, label) pair
             COUNT(1) OVER (PARTITION BY Outlook) num,                      -- rows per Outlook value
             Outlook, PlayTennis,
             (SELECT COUNT(*) FROM PlayTennis) total                        -- all rows
      FROM PlayTennis
    )
    -- Gain = Entropy(S) - weighted entropy of the Outlook subsets
    SELECT 0.940285959354754 - SUM(-(num/total)*(gp/num)*LOG(2,gp/num)) Gain_S_Outlook
    FROM CTE2;
    -- 0.246749820735977
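
To check the gain for every feature at once, here is a small pure-Python sketch that applies the same definition (entropy of the whole set minus the weighted entropy of the subsets) to the 14 rows of the PlayTennis table. The names `ROWS`, `FEATURES`, and `information_gain` are illustrative choices, not part of any library.

```python
import math
from collections import Counter, defaultdict

# (Outlook, Temperature, Humidity, Wind, PlayTennis) for D1..D14
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
FEATURES = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, feature_index):
    """Entropy of the labels minus the weighted entropy after
    grouping the rows by the value of one feature."""
    labels = [row[-1] for row in rows]
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature_index]].append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

gains = {name: information_gain(ROWS, i) for i, name in enumerate(FEATURES)}
for name, gain in gains.items():
    print(f"{name}: {gain:.4f}")
```

The printed gains should be close to 0.2467 for Outlook, 0.0292 for Temperature, 0.1518 for Humidity, and 0.0481 for Wind. The article quotes these truncated to three decimals (0.246, 0.029, 0.151, 0.048).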

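The gains for the remaining trees are computed in the same way:

The second decision tree is given Temperature and Wind: Gain(S, Temperature) = 0.029 and Gain(S, Wind) = 0.048.

The third decision tree is given Outlook and Humidity: Gain(S, Outlook) = 0.246 and Gain(S, Humidity) = 0.151.

The fourth decision tree is given Temperature and Humidity: Gain(S, Temperature) = 0.029 and Gain(S, Humidity) = 0.151.

The fifth decision tree is given Wind and Humidity: Gain(S, Wind) = 0.048 and Gain(S, Humidity) = 0.151.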

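Summing the gain (information gain) of each feature over the trees that used it gives:

Outlook: 0.246 + 0.246 = 0.492

Temperature: 0.029 + 0.029 + 0.029 = 0.087

Humidity: 0.151 + 0.151 + 0.151 = 0.453

Wind: 0.048 + 0.048 = 0.096

So the most important variable according to the Extra Trees forest is Outlook.

Note: because the features are chosen at random, the importances computed in an actual run may differ from these figures.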
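
The bookkeeping in this section can be reproduced with a short Python sketch. It takes the per-feature gains and the five feature pairs exactly as given in the walkthrough and sums each feature's gain over the trees that received it. The names `GAIN`, `TREES`, and `totals` are illustrative.

```python
from collections import defaultdict

# Information gain of each feature, truncated to three decimals as in the text
GAIN = {"Outlook": 0.246, "Temperature": 0.029, "Humidity": 0.151, "Wind": 0.048}

# Feature pair assumed for each of the five trees in the walkthrough
TREES = [
    ("Outlook", "Temperature"),
    ("Temperature", "Wind"),
    ("Outlook", "Humidity"),
    ("Temperature", "Humidity"),
    ("Wind", "Humidity"),
]

# Add up each feature's gain once per tree that used it
totals = defaultdict(float)
for features in TREES:
    for feature in features:
        totals[feature] += GAIN[feature]

# Rank the features by total gain, highest first
ranking = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for feature, total in ranking:
    print(f"{feature}: {total:.3f}")
```

With a different random assignment of features to trees, the totals and possibly the ranking would change, which is the point of the note above.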

5 Summary

References:

https://www.geeksforgeeks.org/ml-extra-tree-classifier-for-feature-selection/

https://machinelearningmastery.com/extra-trees-ensemble-with-python/
