
Data Analysis: scikit-learn Pipelines (dataset splitting, Pipeline, StandardScaler)


Contents

Preface

Pipeline

Using a pipeline in grid search

Pipeline with model selection

Pipeline with feature selection

Pipeline with PCA

Combined pipeline


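Preface

The Pipeline class in sklearn.pipeline wraps all the steps of a machine learning workflow into a single estimator that is fitted and applied as one unit, which greatly reduces the amount of code.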

Pipeline

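A Pipeline typically contains the following steps:
  1. Preprocessing estimators (e.g. standardization, encoding). Each must implement fit and transform, where transform converts the data.
  2. Feature selection estimators (these must also implement fit and transform).
  3. The estimator that makes the predictions. As the last step, it only needs fit.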
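The three stages above can be sketched as follows. This is a minimal illustration, not code from the original: the iris dataset and SelectKBest with f_classif are stand-ins chosen so the example runs on its own. It checks that fitting the pipeline gives the same predictions as chaining fit and transform by hand.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed example data (not from the original article)
X, y = load_iris(return_X_y=True)

# Three-stage pipeline: preprocessing -> feature selection -> predictor
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("selector", SelectKBest(f_classif, k=2)),
    ("knn", KNeighborsClassifier()),
])
pipe.fit(X, y)

# The same computation written out by hand
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
selector = SelectKBest(f_classif, k=2).fit(X_scaled, y)
X_selected = selector.transform(X_scaled)
knn = KNeighborsClassifier().fit(X_selected, y)

# Both routes should produce identical predictions
same_predictions = np.array_equal(pipe.predict(X), knn.predict(X_selected))
print("Pipeline matches manual chain:", same_predictions)
```

The manual version needs six calls and three intermediate variables; the pipeline needs one fit and one predict.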

The following pipeline bundles standardization and KNN classification together.

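    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Build the pipeline: standardize the features, then classify with KNN
    pipe = Pipeline(steps=[('scaler', StandardScaler()),
                           ('knn', KNeighborsClassifier())])
    # Train (X_train, y_train, X_test, y_test come from an earlier train/test split)
    pipe.fit(X_train, y_train)
    # Evaluate
    print("Test set accuracy:", round(pipe.score(X_test, y_test), 2))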
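The snippet above uses X_train, X_test, y_train and y_test without showing where they come from. One possible way to produce them is sketched below, assuming the iris dataset, a 70/30 stratified split and random_state=0 (all three are illustrative choices, not taken from the original). It also checks that the scaler inside the pipeline learned its mean from the training rows only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed example data, loaded as a DataFrame so column names are available
X, y = load_iris(return_X_y=True, as_frame=True)

# Hold out 30% of the rows for testing, keeping class proportions equal
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("knn", KNeighborsClassifier())])
pipe.fit(X_train, y_train)

# The scaler's statistics come from the training rows, never the test rows
scaler_mean = pipe.named_steps["scaler"].mean_
fitted_on_train_only = np.allclose(scaler_mean, X_train.mean().to_numpy())

accuracy = pipe.score(X_test, y_test)
print("Scaler fitted on training data only:", fitted_on_train_only)
print("Test set accuracy:", round(accuracy, 2))
```

Because scaling happens inside the pipeline, the test set is transformed with the training statistics automatically when score is called.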

Using a pipeline in grid search

    from sklearn.model_selection import GridSearchCV

    # Parameter grid: each key is the step name ('knn'), a double underscore,
    # and then the parameter name of that step's estimator
    param_grid = {'knn__n_neighbors': [2, 4, 6, 8, 10],
                  'knn__weights': ['uniform', 'distance']}
    # Grid search with 5-fold cross-validation
    grid_search = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=5)
    grid_search.fit(X_train, y_train)
    # Score on the test set
    grid_search.score(X_test, y_test)
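To see where the step__parameter names come from and what the search found, the fitted objects can be inspected. The sketch below is self-contained and assumes the iris dataset with an illustrative split; only get_params, best_params_, best_score_ and cv_results_ are used.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed example data and split (not from the original article)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("knn", KNeighborsClassifier())])

# get_params lists every tunable name; nested ones use the step__param form
knn_param_names = sorted(name for name in pipe.get_params()
                         if name.startswith("knn__"))
print(knn_param_names)

param_grid = {"knn__n_neighbors": [2, 4, 6, 8, 10],
              "knn__weights": ["uniform", "distance"]}
grid_search = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# 5 values of n_neighbors x 2 weighting schemes = 10 candidates
n_candidates = len(grid_search.cv_results_["params"])
print("Candidates evaluated:", n_candidates)
print("Best parameters:", grid_search.best_params_)
print("Best mean CV accuracy:", round(grid_search.best_score_, 3))
```

Note that best_score_ is the mean cross-validated score on the training folds, while grid_search.score(X_test, y_test) evaluates the refitted best pipeline on the held-out test set.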

Pipeline with model selection

Implementation

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler, StandardScaler
    from sklearn.svm import SVC

    # The estimators given here are placeholders; the grid below swaps them out
    pipe = Pipeline(steps=[("scaler", MinMaxScaler()), ("model", LogisticRegression())])
    scale_selector = [StandardScaler(), MinMaxScaler()]
    model_selector = [KNeighborsClassifier(), SVC(), LogisticRegression()]
    # Using a step name as the key makes the whole step a search dimension
    param_grid = {"scaler": scale_selector, "model": model_selector}
    grid_search = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=5)
    # X_train_s and X_test_s are defined outside this excerpt
    grid_search.fit(X_train_s, y_train)
    print(grid_search.best_estimator_)
    grid_search.score(X_test_s, y_test)
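Printing best_estimator_ shows only the winning combination. To compare all scaler/model pairs, cv_results_ can be loaded into a DataFrame. The sketch below assumes the iris dataset and sets max_iter=1000 on LogisticRegression to avoid convergence warnings; both are illustrative choices.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# Assumed example data (not from the original article)
X, y = load_iris(return_X_y=True)

pipe = Pipeline(steps=[("scaler", MinMaxScaler()),
                       ("model", LogisticRegression(max_iter=1000))])
param_grid = {
    "scaler": [StandardScaler(), MinMaxScaler()],
    "model": [KNeighborsClassifier(), SVC(), LogisticRegression(max_iter=1000)],
}
grid_search = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=5)
grid_search.fit(X, y)

# One row per scaler/model combination, best-ranked first
results = pd.DataFrame(grid_search.cv_results_)
summary = (results[["param_scaler", "param_model",
                    "mean_test_score", "rank_test_score"]]
           .sort_values("rank_test_score")
           .reset_index(drop=True))
print(summary)
```

With 2 scalers and 3 models the table has 6 rows, which makes it easy to see whether the winner is clearly ahead or only marginally better than the alternatives.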

Pipeline with feature selection

Implementation

    import pandas as pd
    from sklearn.feature_selection import RFECV
    from sklearn.tree import DecisionTreeClassifier

    # Add feature selection to the pipeline
    pipe_new = Pipeline(steps=[('scaler', StandardScaler()),
                               ('selector', RFECV(DecisionTreeClassifier(random_state=10), cv=5)),
                               ('model', KNeighborsClassifier())])
    scale_selector = [StandardScaler(), MinMaxScaler()]
    # Parameter grid: one sub-grid per model family, because
    # KNeighborsClassifier accepts neither C nor class_weight
    param_grid = [
        {'scaler': scale_selector,
         'model': [KNeighborsClassifier()]},
        {'scaler': scale_selector,
         'model': [SVC(), LogisticRegression()],
         'model__class_weight': ['balanced', None],
         'model__C': [0.01, 0.1, 0.2, 0.5, 1]},
    ]
    # Grid search
    grid_search = GridSearchCV(estimator=pipe_new, param_grid=param_grid, cv=5)
    grid_search.fit(X_train, y_train)
    # Feature ranking from the best pipeline's selector (1 = selected);
    # X_train must be a DataFrame for .columns to exist
    pd.Series(grid_search.best_estimator_.named_steps['selector'].ranking_, index=X_train.columns)
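Besides ranking_, a fitted RFECV also exposes support_ (a boolean mask of the kept features) and n_features_. The sketch below, which assumes the iris dataset and skips the grid search to stay short, shows how the two attributes relate.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumed example data, loaded as a DataFrame to keep the column names
X, y = load_iris(return_X_y=True, as_frame=True)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("selector", RFECV(DecisionTreeClassifier(random_state=10), cv=5)),
    ("model", KNeighborsClassifier()),
])
pipe.fit(X, y)

selector = pipe.named_steps["selector"]
# ranking_: 1 marks a selected feature, larger values were eliminated earlier
ranking = pd.Series(selector.ranking_, index=X.columns)
# support_: boolean mask that is True exactly where ranking_ == 1
selected = X.columns[selector.support_].tolist()

print(ranking)
print("Selected features:", selected)
```

Only the selected columns reach the final model step, so the KNN classifier here is trained on len(selected) features rather than all four.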

Pipeline with PCA

Implementation

    from sklearn.decomposition import PCA

    # Add PCA to the pipeline
    pipe_new = Pipeline(steps=[('scaler', StandardScaler()),
                               ('decomposition', PCA(3)),
                               ('model', KNeighborsClassifier())])
    # Settings shared by every candidate (scale_selector is defined above)
    shared = {'scaler': scale_selector,
              'decomposition__n_components': [2, 3, 4, 5, 6]}
    # C and class_weight apply only to SVC and LogisticRegression,
    # so KNeighborsClassifier gets its own sub-grid
    param_grid = [
        {**shared, 'model': [KNeighborsClassifier()]},
        {**shared, 'model': [SVC(), LogisticRegression()],
         'model__class_weight': ['balanced', None],
         'model__C': [0.01, 0.1, 0.2, 0.5, 1]},
    ]
    # Grid search
    grid_search = GridSearchCV(estimator=pipe_new, param_grid=param_grid, cv=5)
    grid_search.fit(X_train, y_train)
    # Total explained variance ratio of the retained components
    grid_search.best_estimator_.named_steps['decomposition'].explained_variance_ratio_.sum()
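The summed explained variance ratio says how much of the total variance the retained components keep. A per-component view can be more informative; the sketch below assumes the iris dataset and a fixed PCA(n_components=3) instead of a grid search.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed example data (not from the original article)
X, y = load_iris(return_X_y=True)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("decomposition", PCA(n_components=3)),
    ("model", KNeighborsClassifier()),
])
pipe.fit(X, y)

# Share of the total variance captured by each retained component
ratios = pipe.named_steps["decomposition"].explained_variance_ratio_
# Running total: how much variance the first k components keep together
cumulative = np.cumsum(ratios)

for k, total in enumerate(cumulative, start=1):
    print(f"first {k} component(s): {total:.3f}")
```

The last cumulative value equals explained_variance_ratio_.sum(), the quantity computed at the end of the block above.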

Combined pipeline

Implementation

    from sklearn.decomposition import PCA
    from sklearn.feature_selection import RFECV
    from sklearn.tree import DecisionTreeClassifier

    # Every step takes an estimator instance, e.g. StandardScaler()
    pipe_new2 = Pipeline(steps=[("scaler", StandardScaler()), ("selector", PCA(3)),
                                ("model", KNeighborsClassifier())])
    scaler_selector = [StandardScaler(), MinMaxScaler()]
    selector_selector = [PCA(3), RFECV(DecisionTreeClassifier(random_state=10), cv=5)]
    shared = {"scaler": scaler_selector, "selector": selector_selector}
    # C and class_weight exist only on LogisticRegression and SVC
    param_grid_2 = [
        {**shared, "model": [KNeighborsClassifier()]},
        {**shared, "model": [LogisticRegression(random_state=10), SVC()],
         "model__class_weight": ["balanced", None],
         "model__C": [0.01, 0.1, 0.2, 0.5, 1]},
    ]
    grid_search = GridSearchCV(estimator=pipe_new2, param_grid=param_grid_2, cv=5)
    grid_search.fit(X_train, y_train)
    # explained_variance_ratio_ exists only if PCA won; RFECV exposes ranking_ instead
    best_selector = grid_search.best_estimator_.named_steps["selector"]
    if isinstance(best_selector, PCA):
        print(best_selector.explained_variance_ratio_.sum())
    else:
        print(best_selector.ranking_)
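Two details matter when whole steps are search dimensions. First, a parameter such as C only exists on some models: KNeighborsClassifier does not have it, so set_params rejects it. Second, the winning selector may be either PCA or RFECV, and the two expose different attributes. The sketch below illustrates both points under assumptions: the iris dataset, a reduced grid, and max_iter=1000 are illustrative choices, and passing param_grid as a list of dicts is one way to give each model family its own parameters.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumed example data (not from the original article)
X, y = load_iris(return_X_y=True)

# KNeighborsClassifier has no C parameter, so set_params raises ValueError
try:
    KNeighborsClassifier().set_params(C=1.0)
    knn_accepts_c = True
except ValueError:
    knn_accepts_c = False
print("KNN accepts C:", knn_accepts_c)

pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("selector", PCA(2)),
                       ("model", KNeighborsClassifier())])
selectors = [PCA(2), RFECV(DecisionTreeClassifier(random_state=10), cv=5)]

# A list of dicts: each dict is searched separately, so C is only
# ever combined with models that actually have it
param_grid = [
    {"selector": selectors, "model": [KNeighborsClassifier()]},
    {"selector": selectors,
     "model": [SVC(), LogisticRegression(max_iter=1000)],
     "model__C": [0.1, 1]},
]
grid_search = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=5)
grid_search.fit(X, y)

# 2 candidates from the first dict + 2 * 2 * 2 from the second = 10
n_candidates = len(grid_search.cv_results_["params"])

# Branch on the type of the winning selector before reading its attributes
best_selector = grid_search.best_estimator_.named_steps["selector"]
if isinstance(best_selector, PCA):
    summary = best_selector.explained_variance_ratio_.sum()
else:
    summary = best_selector.ranking_
print(type(best_selector).__name__, summary)
```

Splitting the grid also keeps the search smaller, since KNN is not fitted repeatedly for C values it would ignore anyway.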
