Update: the examples in this post have been updated.
I am reposting this question here after not getting a clear answer in a previous SO post.
I am looking for help building a data preprocessing pipeline using sklearn's ColumnTransformer, where some features are preprocessed sequentially. I am well aware of how to build separate pipelines for different subsets of features. For example, my pipeline may look something like this:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer

ColumnTransformer(remainder='passthrough', transformers=[
    ('num_impute', SimpleImputer(strategy='median'), ['feat_1', 'feat_2']),
    ('Std', StandardScaler(), ['feat_3', 'feat_4']),
    ('Norm', Normalizer(), ['feat_5', 'feat_6']),
])
Notice that each transformer is provided a unique set of features.
The issue I am encountering is how to apply sequential transformations to the same features (different combinations of transformations and features). For example:
ColumnTransformer(remainder='passthrough', transformers=[
    ('num_impute', SimpleImputer(strategy='median'), ['feat_1', 'feat_2', 'feat_5']),
    ('Std', StandardScaler(), ['feat_1', 'feat_2', 'feat_3', 'feat_4', 'feat_6']),
    ('Norm', Normalizer(), ['feat_1', 'feat_6']),
])
Notice that feat_1 is provided to three transformers, feat_2 to two (impute and Std), and feat_6 to two (Std and Norm).
A pipeline like this will produce two duplicate columns for feat_2 and feat_6, and three duplicate columns for feat_1. Building a separate pipeline for each transformation/feature combination is not scalable.