In this tutorial, you will learn how to use pipelines to clean up your modeling code.
Pipelines are a simple way to keep your data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.
Many data scientists hack together models without pipelines, but pipelines have some important benefits. Those include:

- Cleaner code: accounting for data at each step of preprocessing can get messy. With a pipeline, you won't need to manually keep track of your training and validation data at each step.
- Fewer bugs: there are fewer opportunities to misapply a step or forget a preprocessing step.
- Easier to productionize: it can be surprisingly hard to transition a model from a prototype to something deployable at scale, and pipelines help.
- More options for model validation: for example, a pipeline can be handed directly to cross-validation utilities.
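To make the "single step" idea concrete, here is a minimal sketch (not the tutorial's example) of a two-step pipeline on purely numerical data; X_num_train, y_num_train, and X_num_valid are hypothetical numeric-only datasets:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

# Bundle an imputer and a model into one estimator
tiny_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),      # fill missing values with column means
    ('model', RandomForestRegressor(random_state=0))
])

# The bundle behaves like a single step:
# fit() imputes and then trains; predict() imputes and then predicts.
tiny_pipeline.fit(X_num_train, y_num_train)
preds = tiny_pipeline.predict(X_num_valid)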
As in the previous tutorial, we will work with the Melbourne Housing dataset.
We won't focus on the data loading step. Instead, you can imagine you are at a point where you already have the training and validation data in X_train, X_valid, y_train, and y_valid.
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

# Separate target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and
                    X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
We take a peek at the training data with the head() method below. Notice that the data contains both categorical data and columns with missing values. With a pipeline, it's easy to deal with both!
X_train.head()
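If you'd like to verify that claim yourself, a quick check along these lines (not part of the original tutorial) shows the missing-value counts and the categorical columns:

# Count missing values per training column (non-zero counts confirm missing data)
missing_counts = X_train.isnull().sum()
print(missing_counts[missing_counts > 0])

# The object-dtype columns we kept are the categorical ones
print(categorical_cols)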
We construct the full pipeline in three steps.
Similar to how a pipeline bundles together preprocessing and modeling steps, we use the ColumnTransformer class to bundle together different preprocessing steps. The code below imputes missing values in the numerical data, and imputes missing values and applies a one-hot encoding to the categorical data.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
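If it helps to see what the ColumnTransformer does on its own, a quick standalone check (not part of the original tutorial) is to fit and apply it directly. The output has one column per numerical feature plus one per one-hot-encoded category:

# Fit the preprocessor on the training features and transform them.
# The result is a (possibly sparse) array: imputed numerical columns
# followed by the one-hot-encoded categorical columns.
X_train_transformed = preprocessor.fit_transform(X_train)
print(X_train_transformed.shape)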
Next, we define a random forest model with the familiar RandomForestRegressor class.
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=0)
Finally, we use the Pipeline class to define a pipeline that bundles the preprocessing and modeling steps. There are a few important things to notice:

- With the pipeline, we preprocess the training data and fit the model in a single line of code. (In contrast, without a pipeline, we have to do imputation, one-hot encoding, and model training in separate steps. This becomes especially messy if we have to deal with both numerical and categorical variables!)
- With the pipeline, we supply the unprocessed features in X_valid to the predict() command, and the pipeline automatically preprocesses the features before generating predictions. (However, without a pipeline, we have to remember to preprocess the validation data before making predictions.)

from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)
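One benefit mentioned above, more options for model validation, comes almost for free: because the pipeline redoes preprocessing inside each fold, it can be handed directly to scikit-learn's cross-validation utilities. A minimal sketch (not part of the original tutorial; the fold count is arbitrary):

from sklearn.model_selection import cross_val_score

# Each fold fits the preprocessing and the model on its own training split,
# so no information leaks from the held-out split.
# Scores are negated MAE, so multiply by -1 to get positive MAE values.
scores = -1 * cross_val_score(my_pipeline, X[my_cols], y,
                              cv=5,
                              scoring='neg_mean_absolute_error')
print("Average MAE across folds:", scores.mean())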
Pipelines are valuable for cleaning up machine learning code and avoiding errors, and are especially useful for workflows with sophisticated data preprocessing.
Use a pipeline in the next exercise to apply advanced data preprocessing techniques and improve your predictions!