
Random Forests in Practice: Handwritten Digit Recognition and Daily Maximum Temperature Prediction


1. Basic parameters of RandomForestClassifier

Before using RandomForestClassifier for classification, it helps to understand its basic parameters.

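    RandomForestClassifier(n_estimators=10,
                           criterion='gini',
                           max_depth=None,
                           bootstrap=True,
                           random_state=None,
                           min_samples_split=2)

  - n_estimators: integer, optional. Number of trees in the forest. The default was 10 in older scikit-learn releases and has been 100 since version 0.22. Commonly tried values: 120, 200, 300, 500, 800, 1200.
  - criterion: string, optional (default = "gini"). The function used to measure the quality of a split.
  - max_depth: integer or None, optional (default = None). Maximum depth of each tree. Commonly tried values: 5, 8, 15, 25, 30.
  - max_features: number of features considered when looking for the best split.
      - If "auto", then max_features = sqrt(n_features). This alias of "sqrt" was deprecated in scikit-learn 1.1 and removed in 1.3.
      - If "sqrt", then max_features = sqrt(n_features).
      - If "log2", then max_features = log2(n_features).
      - If None, then max_features = n_features.
  - bootstrap: boolean, optional (default = True). Whether to sample with replacement when building each tree.
  - min_samples_split: minimum number of samples required to split an internal node.
  - min_samples_leaf: minimum number of samples required at a leaf node.

  Hyperparameters that are usually tuned: n_estimators, max_depth, min_samples_split, min_samples_leaf.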
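To make the parameter list concrete, the sketch below sets each of these arguments explicitly and inspects the fitted forest. The synthetic data from `make_classification` and the particular values (100 trees, depth limit 8) are assumptions chosen for illustration; they do not come from the original article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used here only as a stand-in for a real dataset.
X, y = make_classification(n_samples=300, n_features=16, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,        # number of trees in the forest
    criterion="gini",        # split quality measure
    max_depth=8,             # upper bound on the depth of every tree
    max_features="sqrt",     # features considered at each split
    bootstrap=True,          # sample with replacement for each tree
    min_samples_split=2,     # minimum samples needed to split a node
    min_samples_leaf=1,      # minimum samples required at a leaf
    random_state=0,          # make the run reproducible
)
clf.fit(X_train, y_train)

n_trees = len(clf.estimators_)
deepest = max(tree.get_depth() for tree in clf.estimators_)
test_accuracy = clf.score(X_test, y_test)
print("trees:", n_trees, "deepest tree:", deepest,
      "test accuracy:", round(test_accuracy, 3))
```

The fitted `estimators_` list holds one decision tree per `n_estimators`, and no tree can grow past `max_depth`.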

2. Predicting Titanic survival with a random forest (simple example code)

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # x_train, x_test, y_train and y_test are assumed to hold an existing
    # train/test split of the Titanic data; they are not defined in this snippet.

    # 1> Instantiate an estimator
    estimator = RandomForestClassifier()
    # 2> Tune the random forest with a grid search
    param_dict = {"n_estimators": [120, 200, 300, 500, 800, 1200],
                  "max_depth": [5, 8, 15, 25, 30]}
    estimator = GridSearchCV(estimator, param_grid=param_dict, cv=5)
    # 3> Fit the model on the training set
    estimator.fit(x_train, y_train)
    # 4> Evaluate the model
    # Method 1: compare the predictions with the true values
    y_predict = estimator.predict(x_test)
    print("Predictions:\n", y_predict)
    print("Predictions equal to the true values:\n", y_predict == y_test)
    # Method 2: compute the accuracy
    print("Test accuracy:\n", estimator.score(x_test, y_test))
    print("Best cross-validation score:\n", estimator.best_score_)
    print("Best estimator:\n", estimator.best_estimator_)
    print("Detailed cross-validation results:\n", estimator.cv_results_)
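The snippet above relies on a train/test split that the article does not show. As a self-contained sketch of the same grid-search workflow, the example below substitutes synthetic data and a much smaller grid so that it runs quickly; the data, the grid values, and `cv=3` are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the Titanic features and labels.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# A deliberately small grid: 2 x 2 = 4 candidate models.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid=param_grid, cv=3)
search.fit(X_train, y_train)

# cv_results_ is a dict of arrays; a DataFrame makes it easier to read.
results = pd.DataFrame(search.cv_results_)[
    ["param_n_estimators", "param_max_depth",
     "mean_test_score", "rank_test_score"]]
print(results.sort_values("rank_test_score"))
print("best parameters:", search.best_params_)
print("test accuracy of the refitted best model:", search.score(X_test, y_test))
```

Because `refit=True` is the default, `search` behaves like the best model retrained on the whole training set, which is why `predict` and `score` can be called on it directly.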

3. Handwritten digit recognition

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits

    digits = load_digits()

    # Display the first 64 digit images
    fig = plt.figure(figsize=(6, 6))
    fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
    for i in range(64):
        ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
        ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
        # Write the true label in the lower-left corner of each image
        ax.text(0, 7, str(digits.target[i]))
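The plotting code uses `digits.images`, while a classifier is trained on `digits.data`. A short check, offered as a sketch, shows how the two attributes relate: each 8x8 image is flattened into one row of 64 pixel features.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.images.shape)  # (1797, 8, 8): one 8x8 grayscale image per sample
print(digits.data.shape)    # (1797, 64): the same pixels, one row per sample

# Flattening every image row by row reproduces the feature matrix.
flattened = digits.images.reshape(len(digits.images), -1)
same = np.array_equal(flattened, digits.data)
print(same)  # True
```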


    # Quickly classify the digits with a random forest
    # (sklearn.cross_validation was removed in scikit-learn 0.20;
    # train_test_split now lives in sklearn.model_selection)
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier

    x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=0)
    model = RandomForestClassifier(n_estimators=1000)
    model.fit(x_train, y_train)
    ypre = model.predict(x_test)

    # Print the classification report (true labels first, then predictions)
    from sklearn import metrics
    print(metrics.classification_report(y_test, ypre))

Output of the original run. That run called classification_report(ypre, y_test), so in this table the precision and recall columns are swapped and support counts predicted labels rather than true labels:

                 precision    recall  f1-score   support

              0       1.00      0.97      0.99        38
              1       0.98      0.98      0.98        43
              2       0.95      1.00      0.98        42
              3       0.98      0.96      0.97        46
              4       0.97      1.00      0.99        37
              5       0.98      0.96      0.97        49
              6       1.00      1.00      1.00        52
              7       1.00      0.96      0.98        50
              8       0.94      0.98      0.96        46
              9       0.98      0.98      0.98        47

    avg / total       0.98      0.98      0.98       450

    # Plot the confusion matrix
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.metrics import confusion_matrix

    mat = confusion_matrix(y_test, ypre)
    # Transpose so that true labels run along the x-axis and predictions along the y-axis
    sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False)
    plt.xlabel('true label')
    plt.ylabel('predicted label')
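The classification report and the confusion matrix carry the same information. As a minimal sketch on a small hand-made label set (the labels below are invented for illustration), per-class precision and recall can be read straight off the matrix, and passing the two arguments in the wrong order swaps the two metrics.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels for a 3-class problem.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 1])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 0, 2, 1])

# Rows are true classes, columns are predicted classes.
mat = confusion_matrix(y_true, y_pred)
true_positives = np.diag(mat)

# Precision: correct predictions of a class / all predictions of that class.
precision = true_positives / mat.sum(axis=0)
# Recall: correct predictions of a class / all true members of that class.
recall = true_positives / mat.sum(axis=1)

print(mat)
print("precision:", precision)
print("recall:   ", recall)

# Swapping the argument order transposes the matrix, so the value reported
# as "precision" is really the recall.
swapped_precision = precision_score(y_pred, y_true, average=None)
```

This is why the argument order `(y_true, y_pred)` matters for `classification_report` and `confusion_matrix` alike.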


4. Predicting the daily maximum temperature with a random forest

4.1. Loading the data

    import pandas as pd

    # Load the data
    features = pd.read_csv('data/temps.csv')
    # First five rows
    features.head(5)
    # Shape and summary statistics
    print('The shape of our features is:', features.shape)
    features.describe()

4.2. Data preprocessing

    import numpy as np

    # One-hot encode the categorical columns
    features = pd.get_dummies(features)
    features.head(5)

    # Separate the labels from the features
    # Labels: the actual maximum temperature
    labels = np.array(features['actual'])
    # Features: everything except the label column
    features = features.drop('actual', axis=1)
    # Keep the column names for later use
    feature_list = list(features.columns)
    # Convert to a NumPy array
    features = np.array(features)
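As a sketch of what `pd.get_dummies` does in the step above, the toy frame below has one text column. The `week` column and all of its values are hypothetical; only `temp_1` and `actual` are column names that appear in the article's code.

```python
import pandas as pd

# Hypothetical three-row frame with one categorical (text) column.
toy = pd.DataFrame({
    "temp_1": [45, 44, 41],
    "week": ["Fri", "Sat", "Sun"],
    "actual": [44, 41, 40],
})

encoded = pd.get_dummies(toy)
print(list(encoded.columns))
# ['temp_1', 'actual', 'week_Fri', 'week_Sat', 'week_Sun']
print(encoded)
```

Numeric columns pass through unchanged, and each category of the text column becomes its own indicator column, which is the form a random forest in scikit-learn needs.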

4.3. Splitting the data

    # Split into training and test sets
    from sklearn.model_selection import train_test_split

    train_features, test_features, train_labels, test_labels = train_test_split(
        features, labels, test_size=0.25, random_state=42)
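To see what `test_size=0.25` and `random_state=42` do in this call, here is a small sketch on made-up arrays of 100 rows (the array contents are assumptions for illustration): a quarter of the rows go to the test set, features and labels stay aligned, and repeating the call with the same seed gives the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 hypothetical samples: row i is [2*i, 2*i + 1] and its label is i.
features = np.arange(200).reshape(100, 2)
labels = np.arange(100)

train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.25, random_state=42)
print(train_features.shape, test_features.shape)  # (75, 2) (25, 2)

# Rows and labels are shuffled together, so they still match.
aligned = np.array_equal(train_features[:, 0] // 2, train_labels)

# The same random_state reproduces exactly the same split.
_, test_features_again, _, _ = train_test_split(
    features, labels, test_size=0.25, random_state=42)
reproducible = np.array_equal(test_features, test_features_again)
print(aligned, reproducible)
```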

4.4. Building the model

    from sklearn.ensemble import RandomForestRegressor

    # Create the model
    rf = RandomForestRegressor(n_estimators=1000, random_state=42)
    # Train it
    rf.fit(train_features, train_labels)

4.5. Predicting the temperature

    # Predict with the trained random forest
    predictions = rf.predict(test_features)
    # Absolute error of each prediction
    errors = abs(predictions - test_labels)
    # Mean absolute error, rounded to two decimals
    print('Mean Absolute Error:', round(np.mean(errors), 2), 'degrees.')
    # Absolute percentage error of each prediction; its mean is the
    # mean absolute percentage error (MAPE)
    mape = 100 * (errors / test_labels)
    # "Accuracy" is defined here as 100 minus the MAPE
    accuracy = 100 - np.mean(mape)
    print('Accuracy:', round(accuracy, 2), '%.')

The result is: Accuracy: 93.99 %.
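The "accuracy" reported here is not classification accuracy; it is 100 minus the mean absolute percentage error. The worked sketch below traces that arithmetic on four invented predictions (the numbers are assumptions, not values from the temperature data).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical predicted and actual temperatures.
predictions = np.array([70.0, 62.0, 55.0, 48.0])
actual = np.array([68.0, 65.0, 55.0, 50.0])

errors = np.abs(predictions - actual)   # [2, 3, 0, 2]
mae = errors.mean()                     # (2 + 3 + 0 + 2) / 4 = 1.75

# Percentage error of each prediction relative to the actual value.
percentage_errors = 100 * errors / actual
mape = percentage_errors.mean()
accuracy = 100 - mape

print("MAE:", mae)
print("MAPE:", round(mape, 2), "%")
print("Accuracy (100 - MAPE):", round(accuracy, 2), "%")

# scikit-learn's helper gives the same MAE.
sklearn_mae = mean_absolute_error(actual, predictions)
```

Since each error is divided by the actual value, this measure breaks down when an actual value is zero and becomes unstable near zero, so it only makes sense on scales where the target stays well away from zero.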

4.6. Feature importance

    # Numeric feature importances
    importances = list(rf.feature_importances_)
    # Pair each feature name with its importance, rounded to two decimals
    feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
    # Sort from most to least important
    feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)
    for pair in feature_importances:
        print('Variable: {:20} Importance: {}'.format(*pair))

The output shows that the most important feature is temp_1. Next, we rebuild the model using only the most important features.
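To illustrate how `feature_importances_` behaves, the sketch below fits a forest on synthetic data in which one feature drives the target by construction. The data and the feature names `strong`, `weak`, and `noise` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# The target depends heavily on column 0, slightly on column 1,
# and not at all on column 2.
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)
names = ["strong", "weak", "noise"]  # hypothetical feature names

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X, y)

# Importances are normalized so that they sum to 1.
total = rf.feature_importances_.sum()
ranked = sorted(zip(names, rf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print("Variable: {:10} Importance: {:.3f}".format(name, importance))
```

These scores are relative shares, so a value near 1 for one feature means the forest relies almost entirely on it for its splits.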

4.7. Building a model with the most important features

    # Build a model that uses only the two most important features
    rf_most_important = RandomForestRegressor(n_estimators=1000, random_state=42)
    # Select the columns of these two features in the training and test sets
    important_indices = [feature_list.index('temp_1'), feature_list.index('average')]
    train_important = train_features[:, important_indices]
    test_important = test_features[:, important_indices]
    # Train
    rf_most_important.fit(train_important, train_labels)
    # Predict
    predictions = rf_most_important.predict(test_important)
    # Errors
    errors = abs(predictions - test_labels)
    print('Mean Absolute Error:', round(np.mean(errors), 2), 'degrees.')
    mape = np.mean(100 * (errors / test_labels))
    # Accuracy, defined as 100 minus the MAPE
    accuracy = 100 - mape
    print('Accuracy:', round(accuracy, 2), '%.')

The result is:

    Mean Absolute Error: 3.9 degrees.
    Accuracy: 93.8 %.

4.8. Complete code for building the random forest

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestRegressor

    # Load the data and one-hot encode it
    original_features = pd.read_csv('data/temps.csv')
    original_features = pd.get_dummies(original_features)
    # Labels
    original_labels = np.array(original_features['actual'])
    # Drop the label column from the features
    original_features = original_features.drop('actual', axis=1)
    # Feature names
    original_feature_list = list(original_features.columns)
    # Convert to a NumPy array
    original_features = np.array(original_features)
    # Split the dataset
    original_train_features, original_test_features, original_train_labels, original_test_labels = train_test_split(
        original_features, original_labels, test_size=0.25, random_state=42)
    # Create the model
    rf = RandomForestRegressor(n_estimators=1000, random_state=42)
    # Train the model
    rf.fit(original_train_features, original_train_labels)
    # Predict
    predictions = rf.predict(original_test_features)
    # Absolute errors
    errors = abs(predictions - original_test_labels)
    # Mean absolute error (MAE)
    print('Average model error:', round(np.mean(errors), 2), 'degrees.')
    # Absolute percentage errors; their mean is the MAPE
    mape = 100 * (errors / original_test_labels)
    # Accuracy, defined as 100 minus the MAPE
    accuracy = 100 - np.mean(mape)
    print('Accuracy:', round(accuracy, 2), '%.')

 
