$\begingroup$

I want to use RandomForestClassifier from Sklearn to predict a categorical variable (credit risk), but one of the predictors has missing values:

    Saving accounts
    little        603
    NaN           183
    moderate      103
    quite rich     63
    rich           48

This seems to be the strongest predictor of credit risk, but almost 20% of its values are missing. The predictor is naturally ordered, so I don't want to create a separate 'NaN' category.
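To make the "ordered, but with gaps" idea concrete, here is a minimal sketch of encoding the levels as ordered integer codes while leaving missing entries as NaN rather than giving them their own category. The level order in `LEVELS` and the helper name `encode_saving_accounts` are assumptions for illustration, not something taken from the dataset's documentation.

```python
import numpy as np
import pandas as pd

# Assumed ordering of the levels, from lowest to highest savings.
LEVELS = ["little", "moderate", "quite rich", "rich"]


def encode_saving_accounts(values: pd.Series) -> pd.Series:
    """Map each level to its rank (0..3) and keep missing values as NaN."""
    cat = pd.Categorical(values, categories=LEVELS, ordered=True)
    # pandas uses code -1 for missing entries; convert those back to NaN.
    codes = pd.Series(cat.codes, index=values.index, dtype="float64")
    return codes.where(codes >= 0)


# Toy input, not real data.
sample = pd.Series(["little", np.nan, "rich", "moderate", "quite rich"])
encoded = encode_saving_accounts(sample)
print(encoded)
```

The encoded column preserves the natural order, but it is only usable directly by an estimator that accepts NaN inputs; otherwise the gaps still have to be imputed or merged into a level.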

Some decision tree implementations allow surrogate variables to handle such missing values, but the trees/forests in Sklearn don't have this feature. So the question is: is there a Python library similar to Sklearn (ideally an extension of Sklearn) that supports surrogate splits?

$\endgroup$

    1 Answer

    $\begingroup$

    In credit scoring, a missing amount often means 0 (or that the account was never opened). If that is the case here, the intuitive choice is to merge NaN with 'little' (or to create a '0' category, which is equivalent to a 'missing' category). Start by comparing the average default rate per category to see whether 'NaN'/'0' differs from 'little'.
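The per-category check can be sketched with pandas as follows. The column names `Saving accounts` and `Risk`, the label `bad`, and the helper name `bad_rate_by_category` are assumptions for illustration, and the toy frame is made up purely to show the call.

```python
import numpy as np
import pandas as pd


def bad_rate_by_category(df, feature="Saving accounts", target="Risk", bad_label="bad"):
    """Return the row count and share of 'bad' outcomes per level, with NaN as its own row."""
    is_bad = df[target].eq(bad_label)
    # Give missing values an explicit label so groupby does not drop them.
    key = df[feature].fillna("NaN")
    summary = is_bad.groupby(key).agg(n="count", bad_rate="mean")
    return summary.sort_values("bad_rate")


# Toy data, not the real credit dataset.
toy = pd.DataFrame({
    "Saving accounts": ["little", "little", np.nan, np.nan, "rich", "rich"],
    "Risk": ["bad", "good", "good", "good", "bad", "good"],
})
rates = bad_rate_by_category(toy)
print(rates)
```

If the NaN row sits close to 'little', merging the two is defensible; if it sits elsewhere, the "missing means 0" assumption probably does not hold for this data.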

    The main catch is that the target is often tied to the account itself, so having no money (or no account) can lead to fewer defaults, which breaks the natural monotonicity (richer = lower risk).

    $\endgroup$
    • $\begingroup$According to the 'bad' Risk rate, a NaN 'Saving accounts' value is closer to 'quite rich' than to 'little'.$\endgroup$
      – Ars ML
      Commented Feb 13, 2023 at 14:56
