Python's sklearn.pipeline.Pipeline() chains several estimators into a pipeline: the output of one step is handed to the next step for further processing. A typical pipeline takes the form of a data-standardization estimator, then a feature-extraction estimator, then a final estimator that makes the predictions. Every estimator except the last must implement both fit and transform; the last estimator only needs to implement fit. When training data is passed to the Pipeline, it calls fit and transform on each intermediate estimator in turn, and then calls fit on the last estimator to fit the data.
Whatever preprocessing we apply to the training set has to be reapplied, with the same fitted parameters, to the test set. A Pipeline wraps and manages all of these steps as a single unit, which makes it easy to reuse the fitted parameters on new data.
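As a minimal sketch of this mechanism (my own illustration, not taken from the original post), fitting a two-step pipeline is equivalent to fitting a scaler on the training data and then reusing that same fitted scaler when new data is transformed:

# Illustrative sketch: the scaler fitted inside the pipeline on the training data
# is reused, unchanged, whenever new data flows through the pipeline.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([0, 0, 1, 1])
X_new = np.array([[2.5]])

pipe = Pipeline([('sc', StandardScaler()), ('clf', LogisticRegression())])
pipe.fit(X_train, y_train)      # calls sc.fit/transform, then clf.fit
print(pipe.predict(X_new))      # calls sc.transform (with training-set mean/std), then clf.predict

# Equivalent manual version, showing the fitted parameters being reused:
sc = StandardScaler().fit(X_train)
clf = LogisticRegression().fit(sc.transform(X_train), y_train)
print(clf.predict(sc.transform(X_new)))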
A Pipeline is useful in the following situations:
1. Modular Feature Transform
'''Classify the Breast Cancer Wisconsin dataset: 569 samples; column 1 is the ID,
column 2 is the class label (M = malignant, B = benign), and columns 3-32 are
real-valued features.'''
import pandas as pd
from sklearn.model_selection import train_test_split  # cross_validation was removed in newer sklearn
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Breast Cancer Wisconsin dataset
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/'
                 'breast-cancer-wisconsin/wdbc.data', header=None)

X, y = df.values[:, 2:], df.values[:, 1]
encoder = LabelEncoder()
y = encoder.fit_transform(y)  # encode the M/B labels as integers
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)

# standardize -> reduce to 2 principal components -> classify
pipe_lr = Pipeline([('sc', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('clf', LogisticRegression(random_state=1))])
pipe_lr.fit(X_train, y_train)
print('Test accuracy: %.3f' % pipe_lr.score(X_test, y_test))
2. Automated Grid Search
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression(solver='liblinear'))  # liblinear supports both l1 and l2 penalties
])
# grid-search keys use the form <step name>__<parameter name>
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}

df = pd.read_csv('./sms.csv')
X = df['message']
y = df['label']
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y)

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
grid_search.fit(X_train, y_train)

print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))

predictions = grid_search.predict(X_test)
print('Accuracy: %s' % accuracy_score(y_test, predictions))
print('Precision: %s' % precision_score(y_test, predictions))
print('Recall: %s' % recall_score(y_test, predictions))
3. Automated Ensemble Generation
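The post gives no code for this case. As a rough, hedged sketch (my own illustration, not from the source): because a Pipeline behaves like a single estimator, several pipelines can be combined into an ensemble, for example with VotingClassifier:

# Illustrative sketch only: each Pipeline acts as one estimator, so multiple
# pipelines can be fed into an ensemble learner such as VotingClassifier.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

pipe1 = Pipeline([('sc', StandardScaler()),
                  ('clf', LogisticRegression(random_state=1))])
pipe2 = Pipeline([('pca', PCA(n_components=2)),
                  ('clf', DecisionTreeClassifier(random_state=1))])

# Majority-vote ensemble over the two pipelines; fit and predict as usual.
ensemble = VotingClassifier(estimators=[('lr_pipe', pipe1), ('dt_pipe', pipe2)],
                            voting='hard')
# ensemble.fit(X_train, y_train)
# print(ensemble.score(X_test, y_test))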