Featuretools provides an automated way to generate features from your data: you specify the relationships within your data, and it applies its so-called Deep Feature Synthesis (DFS), which generates aggregations such as the mean or mode of existing features over all possible groups.

The max_depth parameter controls how deeply the generating process stacks primitives, so it also determines the total number of features produced.
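Roughly, a DFS call looks like the sketch below; the tables, names, and values are a made-up minimal example, not my actual data:

```python
import featuretools as ft
import pandas as pd

# Hypothetical toy data: two related tables
customers = pd.DataFrame({"customer_id": [1, 2], "join_year": [2019, 2020]})
transactions = pd.DataFrame({
    "transaction_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "amount": [5.0, 12.5, 3.2],
    "category": ["food", "tech", "food"],
})

es = ft.EntitySet(id="shop")
es = es.add_dataframe(dataframe_name="customers", dataframe=customers,
                      index="customer_id")
es = es.add_dataframe(dataframe_name="transactions", dataframe=transactions,
                      index="transaction_id",
                      logical_types={"category": "categorical"})
es = es.add_relationship("customers", "customer_id",
                         "transactions", "customer_id")

# A larger max_depth stacks primitives deeper, so more features come out
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "mode"],
    max_depth=2,
)
print(len(feature_defs))
```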

I benchmarked different parameter settings with a scikit-learn SVM classifier and noticed that, although the total number of features varied from about 64 to 184, model performance on both the test and the validation set varied by only around one percentage point. Other models, such as random forest or XGBoost, showed even smaller differences.
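A schematic version of that comparison is sketched below. It uses synthetic stand-in data rather than my actual DFS output, and it assumes the extra columns are redundant, which is exactly the point in question:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples = 500

# Hypothetical stand-in for DFS output: a small informative core plus
# noisy columns, mimicking feature matrices of different widths
X_core = rng.normal(size=(n_samples, 8))
y = (X_core[:, :3].sum(axis=1) > 0).astype(int)

for n_features in (64, 184):
    noise = rng.normal(size=(n_samples, n_features - 8))
    X = np.hstack([X_core, noise])
    clf = make_pipeline(StandardScaler(), SVC())
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{n_features} features: CV accuracy {scores.mean():.3f}")
```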

My question is: why is this the case? My gut feeling was that the differences would be larger.

Here you can find the link to my notebook. Thanks in advance.

  • Especially with the tree-based models, I suspect the model just recognizes the additional features as having little additional value (even on the training set) and doesn't use them much. – Ben Reiniger, Sep 27, 2022 at 14:45
  • Thanks. Good point, I should add some feature importance metrics. – holzben, Sep 28, 2022 at 7:34
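
Picking up the suggestion from the comments, a minimal sketch of such a feature-importance check follows; it assumes synthetic data via make_classification rather than the actual DFS output:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical data: many features, only a few of them informative
X, y = make_classification(n_samples=500, n_features=64, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                random_state=0)

# Features whose permutation barely moves the score carry little signal;
# if most DFS features rank near zero, that would explain the flat benchmarks
ranked = result.importances_mean.argsort()[::-1]
for idx in ranked[:10]:
    print(f"feature {idx}: {result.importances_mean[idx]:.4f}")
```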
