
Update: the examples in this post have been updated.

I am reposting this question here after not getting a clear answer in a previous SO post.

I am looking for help building a data preprocessing pipeline using sklearn's ColumnTransformer, where some of the features are preprocessed sequentially. I am well aware of how to build separate pipelines for different subsets of features. For example, my pipeline may look something like this:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer

ColumnTransformer(remainder='passthrough', transformers=[
    ('num_impute', SimpleImputer(strategy='median'), ['feat_1', 'feat_2']),
    ('Std', StandardScaler(), ['feat_3', 'feat_4']),
    ('Norm', Normalizer(), ['feat_5', 'feat_6']),
])

Notice that each transformer is provided a unique set of features.

The issue I am encountering is how to apply sequential transformations to the same features, i.e. different combinations of transformations and features. For example:

ColumnTransformer(remainder='passthrough', transformers=[
    ('num_impute', SimpleImputer(strategy='median'), ['feat_1', 'feat_2', 'feat_5']),
    ('Std', StandardScaler(), ['feat_1', 'feat_2', 'feat_3', 'feat_4', 'feat_6']),
    ('Norm', Normalizer(), ['feat_1', 'feat_6']),
])

Notice that feat_1 was provided to three transformers, feat_2 was provided to two transformers (impute and Std), and feat_6 was provided to two transformers (Std and Norm).

A ColumnTransformer like this will produce two copies each of feat_2 and feat_6, and three copies of feat_1, in the transformed output. Building a separate pipeline for each transformation/feature combination is not scalable.
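
For illustration, here is a minimal sketch that reproduces the duplication (the toy DataFrame below is made up just for this example):

# Each transformer in a ColumnTransformer is applied to the *original* columns
# independently and the results are concatenated, so overlapping columns show
# up multiple times in the output.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, Normalizer

X = pd.DataFrame(np.random.rand(5, 6),
                 columns=[f'feat_{i}' for i in range(1, 7)])

ct = ColumnTransformer(remainder='passthrough', transformers=[
    ('num_impute', SimpleImputer(strategy='median'), ['feat_1', 'feat_2', 'feat_5']),
    ('Std', StandardScaler(), ['feat_1', 'feat_2', 'feat_3', 'feat_4', 'feat_6']),
    ('Norm', Normalizer(), ['feat_1', 'feat_6']),
])

print(ct.fit_transform(X).shape)  # (5, 10) -- 6 input columns become 10 output columns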

  • You want a sequential pipeline, so use Pipeline: scikit-learn.org/stable/modules/generated/… – Ben Reiniger, Sep 10, 2020 at 1:18
  • This will not resolve the issue. A Pipeline like this Pipeline(steps=[('PreProc', ColumnTransformer(....)), ('model', SVC())]) will still have the same issue. Can you please provide an example? – Sep 10, 2020 at 4:53
  • Can you list, for each feature, which operations you want to apply? E.g. feat_1: SimpleImputer, StandardScaler; feat_2: SimpleImputer; etc.? – qmeeus, Sep 10, 2020 at 10:05
  • feat_1: SimpleImputer and StandardScaler; feat_2: SimpleImputer and StandardScaler; feat_3: StandardScaler. – Sep 10, 2020 at 13:48

2 Answers


When you want to do sequential transformations, you should use Pipeline.

from sklearn.pipeline import Pipeline

imp_std = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

ColumnTransformer(remainder='passthrough', transformers=[
    ('imp_std', imp_std, ['feat_1', 'feat_2']),
    ('std', StandardScaler(), ['feat_3']),
])

or

imp = ColumnTransformer(remainder='passthrough', transformers=[
    ('imp', SimpleImputer(strategy='median'), ['feat_1', 'feat_2']),
])

Pipeline(steps=[
    ('imp', imp),
    ('std', StandardScaler()),
])
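
Applied to the feature/step combinations in the updated question, here is a sketch of how the first pattern can be kept manageable (the grouping and the names below are just one possible reading of which steps each feature needs): group the features that share the same sequence of steps and give each group a single nested Pipeline inside one ColumnTransformer.

# Sketch: one nested Pipeline per distinct sequence of steps, based on the
# question's second example (feature-to-step mapping assumed from that example).
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, Normalizer

imp_std_norm = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
    ('norm', Normalizer()),
])
imp_std = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])
std_norm = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('norm', Normalizer()),
])

preproc = ColumnTransformer(remainder='passthrough', transformers=[
    ('imp_std_norm', imp_std_norm, ['feat_1']),             # impute -> scale -> normalize
    ('imp_std', imp_std, ['feat_2']),                       # impute -> scale
    ('imp', SimpleImputer(strategy='median'), ['feat_5']),  # impute only
    ('std', StandardScaler(), ['feat_3', 'feat_4']),        # scale only
    ('std_norm', std_norm, ['feat_6']),                     # scale -> normalize
])

With this layout the number of entries grows with the number of distinct step sequences rather than with the number of features. One caveat: Normalizer normalizes each row across the columns it is given, so applying it inside per-group pipelines (to feat_1 and feat_6 separately) is not equivalent to applying it to ['feat_1', 'feat_6'] jointly as in the question's example.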
  • Ben, as I indicated in my comment to Julio, this solution is not scalable and will require me to build separate pipelines for each combination of preprocessing steps and features. I will update the original question to reflect the possible complexity. – Sep 11, 2020 at 18:30
  • As Julio says, most commonly you won't have very many distinct pipelines, and so the first approach above works reasonably well. If you have many different pipelines that overlap quite a bit, you could have a "master" pipeline that you slice down; see stackoverflow.com/a/62234209/10495893. – Ben Reiniger, Sep 11, 2020 at 19:05
  • The link you provided was indeed helpful. As you indicated in your answer, the assumption is that the last transformation operates on the entire frame; otherwise it will fail. A good solution would be to find the new order of the columns between each pipeline step and update the indices. Is this possible? – Sep 11, 2020 at 21:26
  • Not programmatically, I think. If your steps don't mess with the columns much, then you can do it by hand. Imputing and scaling, e.g., preserve column order, and a ColumnTransformer puts the results in the order of the transformations (remainder last). But if you one-hot encode (so the number of columns becomes data-dependent), you'll have a very hard time. – Ben Reiniger, Sep 11, 2020 at 21:36
  • This makes sense. That is why my question was about a scalable (aka programmatic) approach. Thanks for the help! – Sep 11, 2020 at 21:41
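
For reference, a minimal sketch of the "master pipeline that you slice down" idea mentioned in the comments above (an illustration only, not the code from the linked answer): scikit-learn Pipelines support slicing by step, so sub-pipelines that reuse a common prefix of steps can be derived from one definition.

# Illustration: deriving sub-pipelines from one "master" definition by slicing
# (Pipeline supports __getitem__ with slices in recent scikit-learn versions).
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, Normalizer

master = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
    ('norm', Normalizer()),
])

imp_only     = master[:1]   # impute
imp_std      = master[:2]   # impute -> scale
imp_std_norm = master[:]    # impute -> scale -> normalize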

One way to do this is by creating separate preprocessing steps for each data type; the most common case is that you have categorical and continuous variables:

from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector as selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

cont_prepro = Pipeline([("imputer", SimpleImputer(strategy="median")),
                        ("scaler", StandardScaler())])

cat_prepro = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                       ("encoder", OneHotEncoder(handle_unknown="ignore"))])

preprocessing = make_column_transformer(
    (cont_prepro, selector(dtype_exclude="object")),
    (cat_prepro, selector(dtype_include="object")),
)

pipe = Pipeline([("preprocessing", preprocessing), ("model", LogisticRegression())])

If you want to select the features for each step by listing them explicitly instead of by type, create a list with the specific columns, as you already did in your example, and remove the selector part.

In your case:

pipe_one = Pipeline([("num_impute", SimpleImputer(strategy='median')),
                     ('Std', StandardScaler())])

preprocessing = make_column_transformer((pipe_one, ["feat_1", "feat_2"]),
                                        remainder='passthrough')

pipe = Pipeline([("preprocessing", preprocessing), ("model", LogisticRegression())])
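
A quick usage sketch with a made-up DataFrame, just to show how the pieces fit together (feat_3 is passed through untouched):

import numpy as np
import pandas as pd

X = pd.DataFrame({"feat_1": [1.0, np.nan, 3.0],
                  "feat_2": [4.0, 5.0, np.nan],
                  "feat_3": [7.0, 8.0, 9.0]})
y = [0, 1, 0]

pipe.fit(X, y)          # impute -> scale feat_1/feat_2, then fit the model
print(pipe.predict(X))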
  • Thank you for your answer. However, this solution is not scalable and will require me to build separate pipelines for each combination of preprocessing steps and features. I was hoping to find a solution where a series of preprocessing steps take place sequentially and, for each step, a separate sublist of features is provided. This way, for a given feature, the transformation at step n would be applied on top of the transformation at step n-1. – Sep 11, 2020 at 18:24
