Surrogate splits in Python

Question

I want to use RandomForestClassifier from Sklearn to predict a categorical variable (credit risk), but one of the predictors has missing values:

Saving accounts
little        603
NaN           183
moderate      103
quite rich     63
rich           48

This predictor seems to be the most powerful one for predicting credit risk, but almost 20% of its values are missing. The predictor is naturally ordered, so I don't want to create a 'NaN' category.

Some decision tree implementations allow surrogate variables to handle such missing values, but the trees/forests in Sklearn don't have this feature. So the question is: is there a Python library similar to Sklearn (ideally an extension of Sklearn) that supports surrogate splits?

Lucas Morin · Accepted Answer · 2023-02-13 13:46:43Z

In credit scoring, a missing amount often means 0 (or that the account was never opened). That is probably the case here, so the intuitive merge would be with 'little' (or creating a '0' category, which would be equivalent to a 'missing' category). Start by checking the average default rate per category to see whether 'NaN'/'0' differs from 'little'.
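The per-category check could look roughly like the sketch below. The toy rows are invented for illustration, and the column names "Saving accounts" and "Risk" (with values 'good'/'bad') are assumptions about the dataset, not something confirmed in the question.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only; column names are assumptions.
df = pd.DataFrame({
    "Saving accounts": ["little", np.nan, "moderate", "little", np.nan,
                        "quite rich", "rich", "little", np.nan, "moderate"],
    "Risk": ["bad", "good", "good", "bad", "good",
             "good", "good", "good", "bad", "bad"],
})

# Turn missing values into an explicit group so groupby does not drop them.
savings = df["Saving accounts"].fillna("NaN")

# Share of 'bad' outcomes and number of rows per category.
is_bad = df["Risk"].eq("bad")
summary = (
    is_bad.groupby(savings)
    .agg(bad_rate="mean", n="count")
    .sort_values("bad_rate")
)
print(summary)
```

If the 'NaN' row sits next to 'little' in this table, merging the two is easy to justify; if it sits elsewhere, the missing values probably carry their own signal.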

The main catch is often that the target is related to the account, and having no money (or no account) leads to fewer defaults, which breaks the natural monotonicity (richer = lower risk).
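One way to keep the ordering while still letting missing values behave differently is to encode the levels as integers and give missing values their own lowest level. This is only a sketch: the helper `encode_savings` and its `missing_as` parameter are hypothetical names, and the level order is assumed from the category labels in the question.

```python
import numpy as np
import pandas as pd

# Assumed ordering of the observed levels, poorest to richest.
ORDER = ["little", "moderate", "quite rich", "rich"]

def encode_savings(values: pd.Series, missing_as: str = "zero") -> pd.Series:
    """Map ordered savings levels to integers (hypothetical helper).

    missing_as="zero"   -> NaN becomes its own lowest level, 0
    missing_as="little" -> NaN is merged with 'little' (level 1)
    """
    codes = values.map({level: i + 1 for i, level in enumerate(ORDER)})
    fill = {"zero": 0, "little": 1}[missing_as]
    return codes.fillna(fill).astype(int)

s = pd.Series(["little", np.nan, "rich", "moderate", np.nan, "quite rich"])
print(encode_savings(s, "zero").tolist())    # [1, 0, 4, 2, 0, 3]
print(encode_savings(s, "little").tolist())  # [1, 1, 4, 2, 1, 3]
```

With the "zero" variant, a tree can separate the missing group with a single threshold between 0 and 1, so a non-monotonic default rate for that group does not disturb the ordering of the other levels.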

According to the 'bad' Risk rate, NaN 'Saving account' is closer to 'quite rich' than to 'little'. (Ars ML, commented Feb 13, 2023 at 14:56)
