I want to use RandomForestClassifier from Sklearn to predict a categorical variable (credit risk). But one of the predictors has missing values:
    Saving accounts
    little        603
    NaN           183
    moderate      103
    quite rich     63
    rich           48
This predictor seems to be the most powerful one for predicting credit risk, but almost 20% of its values are missing (183 of 1000). The predictor is naturally ordered (little < moderate < quite rich < rich), so I don't want to create a separate 'NaN' category.
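For concreteness, here is a minimal sketch of the encoding I have in mind: the categories are mapped to ordered integer codes and the missing entries stay as NaN instead of becoming a level of their own. The toy Series is rebuilt from the counts above (row order is arbitrary), and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Toy column rebuilt from the value counts above (hypothetical data, arbitrary row order).
values = (
    ["little"] * 603
    + [np.nan] * 183
    + ["moderate"] * 103
    + ["quite rich"] * 63
    + ["rich"] * 48
)
savings = pd.Series(values, name="Saving accounts")

# Declare the natural ordering: little < moderate < quite rich < rich.
order = ["little", "moderate", "quite rich", "rich"]
cat = pd.Categorical(savings, categories=order, ordered=True)

# Categorical codes mark missing values with -1; turn those back into NaN
# so that no artificial 'NaN' category is created.
encoded = pd.Series(cat.codes, name="Saving accounts").astype("float64")
encoded = encoded.replace(-1.0, np.nan)

# Share of missing values in the predictor.
missing_share = encoded.isna().mean()
print(encoded.value_counts(dropna=False).sort_index())
print(f"missing share: {missing_share:.1%}")
```

With this encoding the column is numeric and ordered, but the NaN entries still have to be handled by the model itself, which is where surrogate splits would come in.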
Some decision tree implementations allow surrogate variables to be used to handle such missing values, but the trees and forests in Sklearn don't have this feature. So my question is: is there a Python library similar to Sklearn (ideally an extension of Sklearn) that supports surrogate splits?