Featuretools provides an automated way to generate features from your data: you specify the relationships within your data, and it applies its so-called Deep Feature Synthesis (DFS), which generates aggregations such as the mean or mode of existing features over all possible groups.

The max_depth parameter controls how deeply the generating process stacks primitives, so it also determines the total number of features produced.
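Roughly, a DFS call looks like the sketch below; the tables, names, and values are a made-up minimal example, not my actual data:

```python
import featuretools as ft
import pandas as pd

# Hypothetical toy data: two related tables
customers = pd.DataFrame({"customer_id": [1, 2], "join_year": [2019, 2020]})
transactions = pd.DataFrame({
    "transaction_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "amount": [5.0, 12.5, 3.2],
    "category": ["food", "tech", "food"],
})

es = ft.EntitySet(id="shop")
es = es.add_dataframe(dataframe_name="customers", dataframe=customers,
                      index="customer_id")
es = es.add_dataframe(dataframe_name="transactions", dataframe=transactions,
                      index="transaction_id",
                      logical_types={"category": "categorical"})
es = es.add_relationship("customers", "customer_id",
                         "transactions", "customer_id")

# A larger max_depth stacks primitives deeper, so more features come out
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "mode"],
    max_depth=2,
)
print(len(feature_defs))
```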

I benchmarked different parameter settings with a scikit-learn SVM classifier and noticed that, although the total number of features varied from about 64 to 184, model performance on both the test and the validation set varied by only around one percentage point. Other models, such as random forest or XGBoost, showed even smaller differences.
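A schematic version of that comparison is sketched below. It uses synthetic stand-in data rather than my actual DFS output, and it assumes the extra columns are redundant, which is exactly the point in question:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples = 500

# Hypothetical stand-in for DFS output: a small informative core plus
# noisy columns, mimicking feature matrices of different widths
X_core = rng.normal(size=(n_samples, 8))
y = (X_core[:, :3].sum(axis=1) > 0).astype(int)

for n_features in (64, 184):
    noise = rng.normal(size=(n_samples, n_features - 8))
    X = np.hstack([X_core, noise])
    clf = make_pipeline(StandardScaler(), SVC())
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{n_features} features: CV accuracy {scores.mean():.3f}")
```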

My question is: why is this the case? My gut feeling was that the differences would be larger.

Here you can find the link to my notebook. Thanks in advance.

  • Especially with the tree-based models, I suspect the model just recognizes the additional features as having little additional value (even on the training set) and doesn't use them much. – Ben Reiniger, Sep 27, 2022 at 14:45
  • Thanks. Good point, I should add some feature importance metrics. – holzben, Sep 28, 2022 at 7:34
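
Picking up the suggestion from the comments, a minimal sketch of such a feature-importance check follows; it assumes synthetic data via make_classification rather than the actual DFS output:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical data: many features, only a few of them informative
X, y = make_classification(n_samples=500, n_features=64, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                random_state=0)

# Features whose permutation barely moves the score carry little signal;
# if most DFS features rank near zero, that would explain the flat benchmarks
ranked = result.importances_mean.argsort()[::-1]
for idx in ranked[:10]:
    print(f"feature {idx}: {result.importances_mean[idx]:.4f}")
```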
