Random forests measure feature importance as the averaged impurity decrease computed from all of the decision trees in the forest, without making any assumption about whether the data is linearly separable or not.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Load the Wine dataset and assign column names
df_wine = pd.read_csv("xxx\\wine.data", header=None)
df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue',
                   'OD280/OD315 of diluted wines', 'Proline']
# print(df_wine['Class label'])
# print('Class labels', np.unique(df_wine['Class label']))
# print(df_wine.head())

# Split into features/labels and a stratified train/test split
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Min-max scaling and standardization (not required by the random forest,
# kept here for completeness)
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)

feat_labels = df_wine.columns[1:]

# Fit a random forest and read off the impurity-based feature importances
forest = RandomForestClassifier(n_estimators=500, random_state=1)
forest.fit(X_train, y_train)

importances = forest.feature_importances_
print(importances)

# Rank the features by importance, from largest to smallest
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 60,
                            feat_labels[indices[f]],
                            importances[indices[f]]))

# Plot the ranked feature importances
plt.title('Feature Importance')
plt.bar(range(X_train.shape[1]), importances[indices], align='center')
plt.xticks(range(X_train.shape[1]), feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
# plt.savefig('images/04_09.png', dpi=300)
plt.show()

# To wrap up feature importances and random forests, it is worth mentioning
# that scikit-learn also implements a SelectFromModel object, which selects
# features after model fitting according to a user-specified threshold.
sfm = SelectFromModel(forest, threshold=0.1, prefit=True)  # prefit=True: the already-fitted estimator is passed directly to the constructor
X_selected = sfm.transform(X_train)
print('Number of features that meet this threshold criterion:',
      X_selected.shape[1])

# Since indices is sorted in descending order of importance, the first
# X_selected.shape[1] entries are exactly the features above the threshold
for f in range(X_selected.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,
                            feat_labels[indices[f]],
                            importances[indices[f]]))
Output:
[0.11852942 0.02564836 0.01327854 0.02236594 0.03135708 0.05087243
0.17475098 0.01335393 0.02556988 0.1439199 0.058739 0.13616194
0.1854526 ]
Result plot: the features of the Wine dataset ranked by their relative importance. Note that the feature importance values are normalized so that they sum to 1.
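The normalization can be checked directly from the fitted forest. The short sketch below (assuming the forest and importances variables from the code above are still in scope) also illustrates that the reported values are, up to renormalization, the per-tree importances averaged over the ensemble, which is what "average impurity decrease over all trees" refers to.

# Sketch: check that the impurity-based importances sum to 1 and that they
# match the average of the per-tree importances (renormalized).
print(np.isclose(importances.sum(), 1.0))            # expected: True

avg_tree_importances = np.mean(
    [tree.feature_importances_ for tree in forest.estimators_], axis=0)
avg_tree_importances /= avg_tree_importances.sum()   # renormalize the average
print(np.allclose(avg_tree_importances, importances))  # expected: True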
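Finally, the fitted SelectFromModel object can also be applied to the test data so that the training and test sets keep the same columns. The following is a minimal sketch of that idea; the names X_test_selected and forest_selected are illustrative and not part of the original code.

# Sketch: reuse the fitted selector on the test set and retrain a forest
# on the selected features only (illustrative variable names).
X_test_selected = sfm.transform(X_test)

forest_selected = RandomForestClassifier(n_estimators=500, random_state=1)
forest_selected.fit(X_selected, y_train)
print('Test accuracy using only the selected features: %.3f'
      % forest_selected.score(X_test_selected, y_test))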