Combining sklearn pipelines with different output shape

Question

As part of a data preprocessing step, I'm trying to create a "master pipeline" from two separate pipelines, one for numerical features and one for datetime features. The numerical pipeline removes outlier rows based on an IQR filter, whereas the datetime pipeline doesn't remove any rows; it only engineers a day-of-week feature.

The issue arises when I try to combine these into a master pipeline that performs both of these steps. I've tried using both ColumnTransformer and FeatureUnion, but both raise the same error (7991 is the number of rows left after removing numerical outliers, 13400 is the number of rows output by the datetime pipeline):

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 7991 and the array at index 1 has size 13400

These are my pipeline objects:

    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin


    class FeatureSelector(BaseEstimator, TransformerMixin):
        def __init__(self, feature_names):
            self.feature_names = feature_names

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            return X[self.feature_names]


    class IQRFilter(BaseEstimator, TransformerMixin):
        def __init__(self, factor=2):
            self.factor = factor

        def outlier_detector(self, X, y=None):
            X = pd.Series(X).copy()
            q1 = X.quantile(0.25)
            q3 = X.quantile(0.75)
            iqr = q3 - q1
            self.lower_bound.append(q1 - (self.factor * iqr))
            self.upper_bound.append(q3 + (self.factor * iqr))

        def fit(self, X, y=None):
            self.lower_bound = []
            self.upper_bound = []
            X.apply(self.outlier_detector)
            return self

        def transform(self, X, y=None):
            X = pd.DataFrame(X).copy()
            for i in range(X.shape[1]):
                x = X.iloc[:, i].copy()
                x[(x < self.lower_bound[i]) | (x > self.upper_bound[i])] = 'OUTLIER'
                X.iloc[:, i] = x
            return X


    class RemoveIQROutliers(BaseEstimator, TransformerMixin):
        def __init__(self):
            pass

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            for col in X.columns:
                X = X[X[col] != 'OUTLIER']
            return X


    class ExtractDay(BaseEstimator, TransformerMixin):
        def __init__(self):
            pass

        def is_business_day(self, date):
            return bool(len(pd.bdate_range(date, date)))

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            X['day_of_week_wdd'] = X['wanted_delivery_date'].dt.dayofweek
            return X

And these are my two pipelines:

    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    numerical_pipeline = Pipeline([
        ('FeatureSelector', FeatureSelector(num_cols)),
        ('iqr_filter', IQRFilter()),
        ('remove_outliers', RemoveIQROutliers()),
        ('imputer', SimpleImputer(strategy='median')),
        ('std_scaler', StandardScaler())
    ])

    date_pipeline = Pipeline([
        ('FeatureSelector', FeatureSelector(date_cols)),
        ('Extract_day', ExtractDay()),
    ])

Trying to combine them like this causes the mentioned error message:

    from sklearn.pipeline import FeatureUnion, Pipeline

    full_pipeline = Pipeline([
        ('features', FeatureUnion(transformer_list=[
            ('numerical_pipeline', numerical_pipeline),
            ('date_pipeline', date_pipeline)
        ]))
    ])

    full_pipeline.fit_transform(X_train)

What is the correct way to go about this?

You've removed rows from the numeric columns, but not the datetime ones, hence the error. - Ben Reiniger, Commented Sep 8, 2022 at 11:38
Thanks @BenReiniger. Yes, I agree that that is the issue. However, I don't really know a good way of solving it. - fendrbud, Commented Sep 8, 2022 at 12:51
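
To make the comment above concrete: FeatureUnion and ColumnTransformer both place their transformers' outputs side by side, which only works when every output has the same number of rows. The sketch below reproduces that kind of failure with plain NumPy stand-ins. The row counts come from the error message; the column counts (3 and 1) are arbitrary assumptions, and the exact wording of the message depends on the NumPy version.

```python
import numpy as np

# Stand-ins for the two pipeline outputs; only the row counts matter here.
numeric_out = np.zeros((7991, 3))   # rows left after outlier removal (3 columns is arbitrary)
date_out = np.zeros((13400, 1))     # the datetime pipeline keeps every row

try:
    # Horizontal stacking requires equal row counts, so this raises.
    np.hstack([numeric_out, date_out])
    stacked_ok = True
except ValueError as exc:
    stacked_ok = False
    print(exc)
```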

Ben Reiniger · Accepted Answer · 2022-09-08 13:46:44Z

sklearn doesn't yet provide a good way to remove rows inside a pipeline; SLEP001 proposes one. imblearn has some ways to make this work, but its API is semantically specific to resampling data. If you don't need to modify the target (that is, if you'll only use this transformer on X, and not in a pipeline with a supervised model), you can make this work. One more caveat: you probably won't want to throw away outliers in production, so consider how you'll rework this transformer after training.
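
If the target does have to lose the same rows, one option is to prune outside the pipeline, before fitting. The following is a minimal sketch, not part of sklearn: `prune_iqr_outliers` is a hypothetical helper, and it assumes X is a DataFrame and y a Series sharing the same index.

```python
import pandas as pd


def prune_iqr_outliers(X, y, columns, factor=2.0):
    """Hypothetical helper: drop rows of X (and the matching entries of y)
    that fall outside the IQR fences in any of `columns`."""
    q1 = X[columns].quantile(0.25)
    q3 = X[columns].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    # NaN compares False on both sides, so missing values are kept for a later imputer.
    is_outlier = (X[columns].lt(lower, axis=1) | X[columns].gt(upper, axis=1)).any(axis=1)
    keep = ~is_outlier
    return X.loc[keep], y.loc[keep]


# Toy data: the value 100 in column "a" lies far outside the fences.
X_toy = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})
y_toy = pd.Series([0, 1, 0, 1, 1])

X_kept, y_kept = prune_iqr_outliers(X_toy, y_toy, columns=["a", "b"])
```

Because this runs only on the training data, the fitted preprocessing pipeline itself never drops rows at prediction time.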

The point is that you should wait to remove the rows with OUTLIER entries until after you've joined the datetime features back on. (One alternative is to try to pass the information about which rows were removed to the datetime processor, but that would then require a custom alternative to FeatureUnion or ColumnTransformer.) Unfortunately, despite all of your custom transformers returning dataframes, the ways to recombine them (ColumnTransformer and FeatureUnion) won't preserve the dataframe format yet (but see pandas-out PR and some linked issues/PRs). Until that's remedied, your best bet might be to modify your transformers to accept an __init__ parameter columns on which to operate, removing the FeatureSelector step.

    from sklearn.compose import ColumnTransformer

    # assumes IQRFilter has been modified to accept `columns`, as described above
    outlier_prune = Pipeline([
        ('iqr_filter', IQRFilter(columns=num_cols)),
        ('remove_outliers', RemoveIQROutliers()),
    ])  # important: the output of this is a frame

    numerical_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('std_scaler', StandardScaler())
    ])

    preproc_pipeline = ColumnTransformer([
        ('numerical_pipeline', numerical_pipeline, num_cols),
        ('date_eng', ExtractDay(), date_cols),
    ])

    full_pipeline = Pipeline([
        ('outliers', outlier_prune),
        ('preproc', preproc_pipeline),
    ])
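
The snippet above relies on transformers that take a columns parameter, which the question's classes don't have yet. Here is one way that modification might look, as a self-contained sketch under some assumptions: `IQRRowFilter` is a hypothetical class that merges IQRFilter and RemoveIQROutliers into one step, `DayOfWeek` is a simplified stand-in for ExtractDay that returns only the day-of-week values, and the `price` column and toy data are made up for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class IQRRowFilter(BaseEstimator, TransformerMixin):
    """Hypothetical: learn IQR fences on `columns`, then drop offending rows.
    Returns a DataFrame with all columns, so later steps can select by name."""

    def __init__(self, columns, factor=2):
        self.columns = columns
        self.factor = factor

    def fit(self, X, y=None):
        q1 = X[self.columns].quantile(0.25)
        q3 = X[self.columns].quantile(0.75)
        iqr = q3 - q1
        self.lower_bound_ = q1 - self.factor * iqr
        self.upper_bound_ = q3 + self.factor * iqr
        return self

    def transform(self, X):
        num = X[self.columns]
        outlier = (num.lt(self.lower_bound_, axis=1)
                   | num.gt(self.upper_bound_, axis=1)).any(axis=1)
        return X.loc[~outlier].copy()


class DayOfWeek(BaseEstimator, TransformerMixin):
    """Simplified stand-in for ExtractDay: Monday=0 ... Sunday=6 per column."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.apply(lambda col: col.dt.dayofweek)


# Toy data (assumed): 2022-09-05 is a Monday; price 100.0 is an outlier.
X_train = pd.DataFrame({
    "price": [1.0, 2.0, 3.0, 4.0, 100.0],
    "wanted_delivery_date": pd.to_datetime(
        ["2022-09-05", "2022-09-06", "2022-09-07", "2022-09-08", "2022-09-09"]),
})
num_cols = ["price"]
date_cols = ["wanted_delivery_date"]

numerical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
full_pipeline = Pipeline([
    ("outliers", IQRRowFilter(columns=num_cols)),  # rows are dropped first ...
    ("preproc", ColumnTransformer([                # ... so both branches see the same rows
        ("numerical_pipeline", numerical_pipeline, num_cols),
        ("date_eng", DayOfWeek(), date_cols),
    ])),
])

out = full_pipeline.fit_transform(X_train)
```

Since the rows are removed before the ColumnTransformer splits the frame, both branches produce the same number of rows and can be stacked. The caveats from the answer still apply: y is not filtered, and the same pipeline would also drop rows from new data at prediction time.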

Thank you, this made it clearer to me. I ended up splitting the pipelines into two steps: one doing the outlier pruning and deleting the corresponding target values, and one doing the data processing, omitting the target variable. - fendrbud, Commented Sep 29, 2022 at 11:40
