I am working on the Kaggle House Price Prediction competition and have built a Scikit-Learn pipeline that includes:
Preprocessing (handling missing values, scaling, encoding)
Feature Engineering
Encoding (LabelEncoder, OrdinalEncoder, etc.)
My full pipeline looks like this:
```python
# Full pipeline
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),             # Preprocessing pipeline
    ('feature_engineering', feature_engineer),  # Feature engineering pipeline
    ('encoder', encoding_pipeline),             # Encoding pipeline
])
```
I have tested this pipeline with a DecisionTreeRegressor baseline using 5-fold cross-validation, keeping all preprocessing inside the pipeline so that each fold is fit without data leakage:
```python
# Create DecisionTreeRegressor model
decision_tree = DecisionTreeRegressor(random_state=42)

# Create baseline decision tree pipeline
baseline_decision_tree_pipeline = Pipeline([
    ('preprocessing', full_pipeline),
    ('decision_tree', decision_tree)
])

# Define CV strategy
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Get CV RMSE scores
cv_scores = cross_val_score(
    baseline_decision_tree_pipeline,
    X,  # Raw unprocessed data
    y,  # Target
    cv=kf,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1
)

print("CV RMSE scores:", -cv_scores)
print("Average CV RMSE:", (-cv_scores).mean())
```
I now want to integrate RFECV (Recursive Feature Elimination with Cross-Validation) to select the best features while preventing data leakage from pre-processing steps like imputation. However, I am unsure about the best approach:
Option 1: RFECV inside the cross-validation loop
If I use RFECV inside the pipeline, then for each of the 5 folds, it will select a different subset of "optimal" features.
How do I decide on the final set of selected features across all folds?
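One way I could inspect the per-fold selections is scikit-learn's `cross_validate(..., return_estimator=True)`, which returns each fold's fitted pipeline so the `support_` mask of that fold's RFECV can be read off. The snippet below is a self-contained sketch on synthetic data (`make_regression` stands in for the housing data, and my preprocessing steps are omitted), not my actual pipeline:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed feature matrix
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
pipe = Pipeline([
    ('feature_selection', RFECV(DecisionTreeRegressor(random_state=42),
                                step=1, cv=3,
                                scoring='neg_root_mean_squared_error')),
    ('decision_tree', DecisionTreeRegressor(random_state=42)),
])

# return_estimator=True keeps each outer fold's fitted pipeline
results = cross_validate(pipe, X, y, cv=kf, return_estimator=True,
                         scoring='neg_root_mean_squared_error')

# Count how often each feature index was kept across the 5 outer folds
counts = np.zeros(X.shape[1], dtype=int)
for est in results['estimator']:
    counts += est.named_steps['feature_selection'].support_.astype(int)
print(counts)  # features kept in all 5 folds are strong candidates
```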
Option 2: Fit the pre-processing pipeline on the training set and then perform RFECV
This seems like it would introduce data leakage: if the pre-processing is fitted on the entire training set before cross-validation, then the transforms applied to each CV training fold were computed from data that includes its corresponding validation fold.
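To make the leakage in Option 2 concrete, here is a minimal self-contained contrast on synthetic data, with mean imputation standing in for the pre-processing step. The "leaky" variant fits the imputer on the full matrix before cross-validation, so validation rows influence the fold's imputation statistics; the pipeline variant refits the imputer inside each fold:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[rng.random(X.shape) < 0.1] = np.nan  # inject missing values
y = rng.normal(size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Leaky (Option 2 pattern): impute once on all rows, then cross-validate;
# each fold's "training" data was imputed using its validation rows too
X_leaky = SimpleImputer(strategy='mean').fit_transform(X)
leaky_scores = cross_val_score(DecisionTreeRegressor(random_state=42),
                               X_leaky, y, cv=kf)

# Safe (Option 1 pattern): the imputer is refit on each fold's
# training portion only, because it lives inside the pipeline
safe_pipe = Pipeline([('impute', SimpleImputer(strategy='mean')),
                      ('tree', DecisionTreeRegressor(random_state=42))])
safe_scores = cross_val_score(safe_pipe, X, y, cv=kf)
```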
Questions
Is it correct to include RFECV inside the pipeline? This ensures feature selection happens within each fold, but how do I extract the final set of optimal features? I have attempted a few different things but cannot get what I am looking for.
Or am I way off, and is there a better way to do this altogether?
This is my current attempt at including RFECV in the pipeline (Option 1) which returns the 5 RMSE scores:
```python
# Perform Recursive Feature Elimination with Cross-Validation
selector = RFECV(estimator=decision_tree,
                 step=1,
                 cv=kf,
                 scoring="neg_root_mean_squared_error")

# Create baseline decision tree pipeline with feature selection
dt_pipeline = Pipeline([
    ('preprocessing', full_pipeline),
    ('feature_selection', selector),
    ('decision_tree', decision_tree)
])

cv_scores_rfecv = cross_val_score(
    dt_pipeline,
    X,  # Raw unprocessed data
    y,  # Target
    cv=kf,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1
)

print("CV RMSE scores:", -cv_scores_rfecv)
print("Average CV RMSE:", (-cv_scores_rfecv).mean())
```
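One pattern I am considering (a sketch, not necessarily the right answer): treat the cross-validation above purely as a performance estimate, then fit the same pipeline once on the full training set and read a single final feature set off that fit via the selector's `support_` and `n_features_` attributes. The names from my code (`dt_pipeline`, `X`, `y`) are recreated below with synthetic stand-ins, minus my preprocessing, so the snippet runs on its own:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the full training data
X, y = make_regression(n_samples=150, n_features=6, n_informative=2,
                       random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

dt_pipeline = Pipeline([
    ('feature_selection', RFECV(DecisionTreeRegressor(random_state=42),
                                step=1, cv=kf,
                                scoring='neg_root_mean_squared_error')),
    ('decision_tree', DecisionTreeRegressor(random_state=42)),
])

# Single fit on all training data -> one final feature subset
dt_pipeline.fit(X, y)
selector = dt_pipeline.named_steps['feature_selection']
selected = np.flatnonzero(selector.support_)
print("n_features_:", selector.n_features_)
print("selected feature indices:", selected)
```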