import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from scipy.io import arff

data = arff.loadarff("C:\\Users\\manib\\Desktop\\Python Job\\Project Work\\Breast\\Breast.arff")
df = pd.DataFrame(data[0])
df.head()
df["Class"].value_counts()

X = df.iloc[:,:24481].values
y = df.iloc[:, -1].values

from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
y = y.astype('str')
y = label_encoder.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

sel = SelectFromModel(RandomForestClassifier(n_estimators = 100))
sel.fit(X_train, y_train)
sel.get_support()
selected_feat= X_train.columns[(sel.get_support())]
len(selected_feat)
print(selected_feat)
2 Answers
The problem is that train_test_split(X, y, ...) returns NumPy arrays here, not pandas DataFrames, because X was already converted to an array with .values. NumPy arrays have no attribute named columns. If you want to see which features SelectFromModel kept, you need to replace X_train (which is a numpy.ndarray) with X, provided X is kept as a pandas.DataFrame (that is, built without .values):
selected_feat= X.columns[(sel.get_support())]
This returns a pandas Index holding the names of the columns kept by the feature selector.
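To check the root cause directly, here is a minimal sketch on a tiny made-up DataFrame (the column names gene_a, gene_b, gene_c are placeholders, not columns from the question's ARFF file). It shows that .values drops the column labels before the split, while train_test_split keeps a DataFrame as a DataFrame when it is given one.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in for the question's data; the column names are made up.
df = pd.DataFrame(
    np.arange(40).reshape(10, 4),
    columns=["gene_a", "gene_b", "gene_c", "Class"],
)

X_arr = df.iloc[:, :3].values   # NumPy array: the column labels are gone
X_df = df.iloc[:, :3]           # DataFrame: the column labels are kept

# Splitting a single array-like returns a [train, test] pair.
arr_train, arr_test = train_test_split(X_arr, test_size=0.4, random_state=0)
df_train, df_test = train_test_split(X_df, test_size=0.4, random_state=0)

# The array split has no .columns; the DataFrame split still has it.
print(type(arr_train).__name__, hasattr(arr_train, "columns"))
print(type(df_train).__name__, list(df_train.columns))
```

So the AttributeError comes from the .values call that happens before the split, not from the split itself.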
If you want to see how many features were kept, you can run:
sel.get_support().sum() # by default this will count 'True' as 1 and 'False' as 0
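As a rough end-to-end sketch of the above, the following uses synthetic data from make_classification in place of the ARFF file, with hypothetical column names gene_0 to gene_19. Because X stays a DataFrame, X_train.columns is available and the boolean mask from get_support() can index it directly. The random_state on the forest is an addition here, only to make the run repeatable.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ARFF data (assumption: 20 numeric features).
X_raw, y = make_classification(
    n_samples=100, n_features=20, n_informative=5, random_state=0
)
X = pd.DataFrame(X_raw, columns=[f"gene_{i}" for i in range(20)])

# X is a DataFrame, so X_train and X_test are DataFrames too.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

sel = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
sel.fit(X_train, y_train)

mask = sel.get_support()              # boolean array, one entry per column
selected_feat = X_train.columns[mask]  # names of the kept columns
n_kept = int(mask.sum())               # True counts as 1, False as 0

print(n_kept, "features kept:", list(selected_feat))
```

The count from mask.sum() matches len(selected_feat) and the number of columns that sel.transform(X_train) returns.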
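The error occurs because of these lines:
X = df.iloc[:,:24481].values
y = df.iloc[:, -1].values
You should remove .values, or create extra labeled variables X_col and y_col like this:
X_col = df.iloc[:,:24481]
y_col = df.iloc[:, -1]
You can then look up the selected names with X_col.columns[sel.get_support()].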
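A minimal sketch of the second option, assuming the class label is the last column and using a small random DataFrame with made-up gene_* columns instead of the real file. The model is fitted on plain arrays, and X_col is kept only to translate the selector's mask back into names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: 6 features, label driven by gene_0 and gene_3.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(60, 6)),
                  columns=[f"gene_{i}" for i in range(6)])
df["Class"] = np.where(df["gene_0"] + df["gene_3"] > 0,
                       "relapse", "non-relapse")

X_col = df.iloc[:, :-1]   # labeled copy, used only to look up names
y_col = df.iloc[:, -1]

X = X_col.values          # plain NumPy array passed to scikit-learn
y = LabelEncoder().fit_transform(y_col.astype(str))

sel = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
sel.fit(X, y)

# The mask has one entry per column of X, in the same order as X_col.columns.
selected_feat = X_col.columns[sel.get_support()]
print(list(selected_feat))
```

This works because X and X_col share the same column order, so the mask positions line up with the labels.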