
I have plotted the feature importances of a random forest in scikit-learn. To improve the predictions of the random forest, how can I use the plot information to remove features? That is, how do I spot, based on the plot, whether a feature is useless or, even worse, decreases the random forest's performance? The plot is based on the attribute feature_importances_, and I use the classifier sklearn.ensemble.RandomForestClassifier.

I am aware that other feature selection techniques exist, but in this question I want to focus on how to use feature_importances_.


Examples of such feature importance plots:

[Two example feature importance plots]
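For reference, plots of this kind can be produced directly from the fitted classifier's feature_importances_. This is only a minimal sketch of such plotting code (the dataset and styling are illustrative, not the exact code behind the figures above):

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    iris = load_iris()
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(iris.data, iris.target)

    # Sort features so the bars run from most to least important.
    order = np.argsort(model.feature_importances_)[::-1]
    plt.bar(range(len(order)), model.feature_importances_[order])
    plt.xticks(range(len(order)), np.array(iris.feature_names)[order],
               rotation=45, ha="right")
    plt.ylabel("feature importance")
    plt.tight_layout()
    plt.show()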


1 Answer


You can simply use the feature_importances_ attribute to select the features with the highest importance scores. For example, the following function selects the k most important features.

    def selectKImportance(model, X, k=5):
        # Keep the k columns of X with the largest feature importances.
        return X[:, model.feature_importances_.argsort()[::-1][:k]]

Or, if you're using a pipeline, the following transformer class:

    from sklearn.base import BaseEstimator, TransformerMixin

    class ImportanceSelect(BaseEstimator, TransformerMixin):
        def __init__(self, model, n=1):
            self.model = model
            self.n = n

        def fit(self, *args, **kwargs):
            # Fit the underlying model so its feature_importances_ become available.
            self.model.fit(*args, **kwargs)
            return self

        def transform(self, X):
            # Keep the n columns of X with the largest feature importances.
            return X[:, self.model.feature_importances_.argsort()[::-1][:self.n]]
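As a usage sketch (not from the original answer), the transformer can be dropped into a scikit-learn Pipeline; the step names and the downstream logistic regression are arbitrary choices here:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    iris = load_iris()

    # Use a random forest's importances to keep the 2 most important features,
    # then fit a logistic regression on the reduced feature matrix.
    pipe = Pipeline([
        ("select", ImportanceSelect(RandomForestClassifier(n_estimators=100), n=2)),
        ("clf", LogisticRegression()),
    ])
    pipe.fit(iris.data, iris.target)
    print(pipe.score(iris.data, iris.target))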

So, for example:

    >>> from sklearn.datasets import load_iris
    >>> from sklearn.ensemble import RandomForestClassifier
    >>> iris = load_iris()
    >>> X = iris.data
    >>> y = iris.target
    >>>
    >>> model = RandomForestClassifier()
    >>> model.fit(X, y)
    RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                max_depth=None, max_features='auto', max_leaf_nodes=None,
                min_samples_leaf=1, min_samples_split=2,
                min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
                oob_score=False, random_state=None, verbose=0, warm_start=False)
    >>>
    >>> newX = selectKImportance(model, X, 2)
    >>> newX.shape
    (150, 2)
    >>> X.shape
    (150, 4)

And of course, if you want to select based on some criterion other than "top k features", you can adjust the functions accordingly.
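For instance, a thresholded variant (a sketch; the function name and the cutoff value are arbitrary) could keep every feature whose importance exceeds a fixed value:

    def selectAboveThreshold(model, X, threshold=0.1):
        # Keep only the columns of X whose importance exceeds the cutoff.
        return X[:, model.feature_importances_ > threshold]

    # e.g. newX = selectAboveThreshold(model, X, threshold=0.15)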

• Thanks David. Any insight on how to choose the threshold above which features are useful? (Aside from removing the least useful feature, re-running the RF, and seeing how that affects prediction performance.) – Aug 4, 2015 at 18:02
• As with most automated feature selection, I'd say most people use a tuning grid. But using domain expertise when selecting (and engineering) features is probably the most valuable, though that isn't really automatable. – David, Aug 4, 2015 at 18:05
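To make the tuning-grid suggestion from the comments concrete, here is a hedged sketch that cross-validates the number of retained features n for the ImportanceSelect pipeline above (the grid values and the downstream classifier are arbitrary):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    iris = load_iris()

    pipe = Pipeline([
        ("select", ImportanceSelect(RandomForestClassifier(n_estimators=100), n=1)),
        ("clf", LogisticRegression()),
    ])

    # Treat the number of retained features as a hyperparameter and pick it by cross-validation.
    grid = GridSearchCV(pipe, param_grid={"select__n": [1, 2, 3, 4]}, cv=5)
    grid.fit(iris.data, iris.target)
    print(grid.best_params_, grid.best_score_)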
