As part of a data preprocessing step, I'm trying to create a "master pipeline" from two separate pipelines, one for numerical features and one for datetime features. The numerical pipeline removes outlier rows based on an IQR filter, whereas the datetime pipeline doesn't remove any rows; it only engineers a day-of-week feature.
The issue arises when I try to combine these into a master pipeline that performs both steps. I've tried both ColumnTransformer and FeatureUnion, but both raise the same error (7991 is the number of rows left after removing numerical outliers; 13400 is the number of rows output by the datetime pipeline):
    ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 7991 and the array at index 1 has size 13400
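For context, the same kind of failure can be reproduced without scikit-learn. The sketch below assumes (as far as I understand it) that FeatureUnion and ColumnTransformer stack their dense outputs side by side with something like `np.hstack`; the row counts are small stand-ins for my 7991 and 13400, and `try_hstack` is just a hypothetical helper name for this illustration.

```python
import numpy as np


def try_hstack(n_rows_a, n_rows_b):
    """Horizontally stack two single-column arrays and report the outcome.

    Returns the stacked shape on success, or the error text on failure.
    """
    a = np.zeros((n_rows_a, 1))  # stand-in for the numerical pipeline output
    b = np.zeros((n_rows_b, 1))  # stand-in for the datetime pipeline output
    try:
        return np.hstack([a, b]).shape
    except ValueError as exc:
        return str(exc)


same = try_hstack(5, 5)       # equal row counts: stacking works
different = try_hstack(3, 5)  # unequal row counts: numpy raises ValueError

print(same)
print(different)
```

So the error seems to come from the two branches returning a different number of rows, not from anything specific to the individual transformers.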
These are my pipeline objects:
    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin


    class FeatureSelector(BaseEstimator, TransformerMixin):
        def __init__(self, feature_names):
            self.feature_names = feature_names

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            return X[self.feature_names]


    class IQRFilter(BaseEstimator, TransformerMixin):
        def __init__(self, factor=2):
            self.factor = factor

        def outlier_detector(self, X, y=None):
            # Store the per-column bounds q1 - factor*IQR and q3 + factor*IQR
            X = pd.Series(X).copy()
            q1 = X.quantile(0.25)
            q3 = X.quantile(0.75)
            iqr = q3 - q1
            self.lower_bound.append(q1 - (self.factor * iqr))
            self.upper_bound.append(q3 + (self.factor * iqr))

        def fit(self, X, y=None):
            self.lower_bound = []
            self.upper_bound = []
            X.apply(self.outlier_detector)
            return self

        def transform(self, X, y=None):
            # Replace values outside the fitted bounds with the flag 'OUTLIER'
            X = pd.DataFrame(X).copy()
            for i in range(X.shape[1]):
                x = X.iloc[:, i].copy()
                x[(x < self.lower_bound[i]) | (x > self.upper_bound[i])] = 'OUTLIER'
                X.iloc[:, i] = x
            return X


    class RemoveIQROutliers(BaseEstimator, TransformerMixin):
        def __init__(self):
            pass

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            # Drop every row flagged 'OUTLIER' in any column (changes the row count)
            for col in X.columns:
                X = X[X[col] != 'OUTLIER']
            return X


    class ExtractDay(BaseEstimator, TransformerMixin):
        def __init__(self):
            pass

        def is_business_day(self, date):
            return bool(len(pd.bdate_range(date, date)))

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            # Add the day of week (Monday=0 ... Sunday=6); no rows are removed
            X['day_of_week_wdd'] = X['wanted_delivery_date'].dt.dayofweek
            return X
And these are my two pipelines:
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    numerical_pipeline = Pipeline([
        ('FeatureSelector', FeatureSelector(num_cols)),
        ('iqr_filter', IQRFilter()),
        ('remove_outliers', RemoveIQROutliers()),
        ('imputer', SimpleImputer(strategy='median')),
        ('std_scaler', StandardScaler())
    ])

    date_pipeline = Pipeline([
        ('FeatureSelector', FeatureSelector(date_cols)),
        ('Extract_day', ExtractDay()),
    ])
Trying to combine them like this causes the mentioned error message:
    from sklearn.pipeline import FeatureUnion

    full_pipeline = Pipeline([
        ('features', FeatureUnion(transformer_list=[
            ('numerical_pipeline', numerical_pipeline),
            ('date_pipeline', date_pipeline)
        ]))
    ])

    full_pipeline.fit_transform(X_train)
What is the correct way to go about this?