
Machine Learning Projects in Practice 11: Predicting Titanic Survival with Ensemble Learning Algorithms


The dataset is the public Titanic dataset from the Kaggle competition.

Each of the machine learning methods covered earlier in this series is applied to it in turn, and the mean 3-fold cross-validation accuracy is reported for each.

The scores are:

- Logistic regression: 0.7901234567901234
- Neural network (MLP): 0.7878787878787877
- KNN: 0.8125701459034792
- Decision tree: 0.8080808080808081
- Random forest: 0.7991021324354657 (10 trees), 0.8181818181818182 (100 trees)
- Bagging: 0.8282828282828283 (ensembled over the random forest)
- AdaBoost: 0.8181818181818182 (boosting the bagging ensemble)
- Stacking/voting: 0.8125701459034792

Full code:

import pandas
titanic = pandas.read_csv("titanic_train.csv")
# Fill missing Age values with the median of the Age column
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
print(titanic.describe())
print(titanic["Sex"].unique())
# Encode male as 0 and female as 1
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
print(titanic["Embarked"].unique())
# Fill missing Embarked values with 'S', the most common port
titanic["Embarked"] = titanic["Embarked"].fillna('S')
# Map the port categories to numbers
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2

from sklearn.preprocessing import StandardScaler
# Selected feature columns
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
x_data = titanic[predictors]
y_data = titanic["Survived"]
# Standardize the features to zero mean and unit variance
scaler = StandardScaler()
x_data = scaler.fit_transform(x_data)
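
# Note: Embarked is a nominal category, so mapping it to 0/1/2 imposes an
# arbitrary order on the ports. An optional alternative (not used for the
# scores below) is to skip that mapping and one-hot encode instead:
# x_data = pandas.get_dummies(titanic[predictors], columns=["Embarked"])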

# Logistic regression
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression()
# Mean accuracy over 3-fold cross-validation
scores = model_selection.cross_val_score(LR, x_data, y_data, cv=3)
print(scores.mean())
# 0.7901234567901234

# Neural network (multi-layer perceptron)
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(20, 10), max_iter=1000)
scores = model_selection.cross_val_score(mlp, x_data, y_data, cv=3)
print(scores.mean())
# 0.7878787878787877

# KNN with k = 21 neighbors
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(21)
scores = model_selection.cross_val_score(knn, x_data, y_data, cv=3)
print(scores.mean())
# 0.8125701459034792

# Decision tree
from sklearn import tree
dtree = tree.DecisionTreeClassifier(max_depth=5, min_samples_split=4)
scores = model_selection.cross_val_score(dtree, x_data, y_data, cv=3)
print(scores.mean())
# 0.8080808080808081

# Random forest with 10 trees
from sklearn.ensemble import RandomForestClassifier
RF1 = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2)
scores = model_selection.cross_val_score(RF1, x_data, y_data, cv=3)
print(scores.mean())
# 0.7991021324354657

# Random forest with 100 trees
RF2 = RandomForestClassifier(n_estimators=100, min_samples_split=4)
scores = model_selection.cross_val_score(RF2, x_data, y_data, cv=3)
print(scores.mean())
# 0.8181818181818182
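
# Optional: RF1 and RF2 differ only in n_estimators and min_samples_split.
# A small grid search could tune both (illustrative grid, not from the
# original run):
# from sklearn.model_selection import GridSearchCV
# param_grid = {"n_estimators": [10, 50, 100], "min_samples_split": [2, 4, 8]}
# grid = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=3)
# grid.fit(x_data, y_data)
# print(grid.best_params_, grid.best_score_)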

# Bagging, with the 100-tree random forest as the base estimator
from sklearn.ensemble import BaggingClassifier
bagging_clf = BaggingClassifier(RF2, n_estimators=20)
scores = model_selection.cross_val_score(bagging_clf, x_data, y_data, cv=3)
print(scores.mean())
# 0.8282828282828283

# AdaBoost, boosting the bagging ensemble
from sklearn.ensemble import AdaBoostClassifier
adaboost = AdaBoostClassifier(bagging_clf, n_estimators=10)
scores = model_selection.cross_val_score(adaboost, x_data, y_data, cv=3)
print(scores.mean())
# 0.8181818181818182

# Stacking and voting
from sklearn.ensemble import VotingClassifier
from mlxtend.classifier import StackingClassifier
# Stack bagging, MLP, and logistic regression, with a logistic
# regression as the meta-classifier
sclf = StackingClassifier(classifiers=[bagging_clf, mlp, LR],
                          meta_classifier=LogisticRegression())
# Majority-vote ensemble of five of the models above
sclf2 = VotingClassifier([('adaboost', adaboost), ('mlp', mlp), ('LR', LR),
                          ('knn', knn), ('dtree', dtree)])
# Mean accuracy of the voting ensemble over 3-fold cross-validation
scores = model_selection.cross_val_score(sclf2, x_data, y_data, cv=3)
print(scores.mean())
# 0.8125701459034792
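
Note that the final score printed above comes from the voting ensemble sclf2; the mlxtend stacking model sclf is built but never evaluated. A minimal addition to score it the same way (this result is not part of the run above):

    # Mean 3-fold cross-validation accuracy of the mlxtend stacking model
    scores = model_selection.cross_val_score(sclf, x_data, y_data, cv=3)
    print(scores.mean())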
