
I have an imbalanced dataset for intrusion detection: 3668045 samples in the attack class and 477 samples in the benign class. I made a 70:30 train-test split. My problem is to predict whether a given node belongs to the attack class or the benign class. As a first step, I trained a decision tree model on the dataset without using any balancing technique, and obtained the following results for my model on the test set using the sklearn metrics.

Scores for Decision Tree

    Accuracy:        0.9998991419799247
    True positives:  1100391
    True negatives:  55
    False positives: 86
    False negatives: 25
    F2-score:        0.9999661949775551
    Precision:       0.9999218520696025
    Recall:          0.9999772813190648
    F1-score:        0.9999495659261946
    Log loss:        0.0034835750853569407
    AUROC (ROC curve):              0.999
    AUPR (precision/recall curve):  1.000

Classification report

                  precision    recall  f1-score   support
               0       0.69      0.39      0.50       141
               1       1.00      1.00      1.00   1100416
        accuracy                           1.00   1100557
       macro avg       0.84      0.70      0.75   1100557
    weighted avg       1.00      1.00      1.00   1100557
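
For reference, the setup above boils down to roughly the following sketch (stand-in data generated with make_classification, not the actual dataset; variable names are placeholders):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import classification_report, roc_auc_score, average_precision_score

    # Stand-in data with a similarly extreme class ratio (class 1 = majority "attack" class).
    X, y = make_classification(n_samples=100_000, weights=[0.001, 0.999],
                               flip_y=0, random_state=42)

    # Plain (non-stratified) 70:30 split, as described above.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=42)

    clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_prob = clf.predict_proba(X_test)[:, 1]

    print(classification_report(y_test, y_pred))            # per-class precision/recall
    print("AUROC:", roc_auc_score(y_test, y_prob))
    print("AUPR :", average_precision_score(y_test, y_prob))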

Why am I getting high, almost perfect AUROC and AUPR scores, even though the precision and recall for my minority class are very low? What measures can I take to improve the results so that they are not biased and my model generalizes well, and how can I verify that?


    1 Answer


    As you point out, your results in this case are biased. With such a skewed target distribution, even a simple train-test split means the model sees far more samples of one class than the other during training, and metrics computed with the majority (attack) class as the positive class, or averaged over all samples, are dominated by that class, so they can look almost perfect while the minority class is handled poorly.

    There are multiple ways to address this, depending on how imbalanced (skewed) your dataset is. If the imbalance is not severe (e.g. a 40:60 target distribution) you could use one of the following (a short sketch of the stratified options follows the list):

    • Cross-validation
    • Stratified Cross-validation
    • A stratified train-test split
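
    For example, with scikit-learn a stratified split and stratified cross-validation look roughly like this (the data are a stand-in, not your dataset):

        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
        from sklearn.tree import DecisionTreeClassifier

        X, y = make_classification(n_samples=20_000, weights=[0.95, 0.05], random_state=0)

        # Stratified split: both classes keep the same proportions in train and test.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.30, stratify=y, random_state=0)

        # Stratified 5-fold cross-validation, scored on the minority (positive) class F1.
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
        scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X_train, y_train, cv=cv, scoring="f1")
        print(scores.mean(), scores.std())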

    If the dataset is severely imbalanced, then you might try balancing techniques such as the following (two of them are sketched after the list):

    • Perform under- or over-sampling. Given the amount of data you have, under-sampling is probably the better option.
    • Generate synthetic data based on prior knowledge
    • Create synthetic data using SMOTE
    • Use penalised (cost-sensitive) models that penalise classification mistakes on the minority class more heavily
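
    As a rough sketch, the penalised-model and under-sampling options could look like this with scikit-learn and the separate imbalanced-learn package (again on stand-in data):

        from sklearn.datasets import make_classification
        from sklearn.tree import DecisionTreeClassifier
        from imblearn.under_sampling import RandomUnderSampler

        X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=0)

        # Penalised model: class_weight="balanced" makes mistakes on the rare class cost more.
        weighted_tree = DecisionTreeClassifier(class_weight="balanced", random_state=0)
        weighted_tree.fit(X, y)

        # Random under-sampling: shrink the majority class instead of inflating the minority.
        rus = RandomUnderSampler(random_state=0)
        X_res, y_res = rus.fit_resample(X, y)
        print(y_res.sum(), len(y_res) - y_res.sum())   # both classes now have the same size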

    Depending on your goal and the severity of the imbalance, you might consider simply focusing on either high precision or high recall. It might also be worth changing your perspective (i.e. maybe this is not so much a classification problem as a "detection" problem).

    • Thanks. If we use SMOTE, do we need to apply it to both the training and test data, or only the training data (and if so, why not the test data)? And by "detection problem", do you mean we focus on high recall, i.e. that it is more critical to identify malicious nodes? I will be using this in a resource-sharing environment, so I believe it is equally essential that honest nodes are not misclassified as malicious, so that their resources can be shared with other nodes.
      – Zal, Oct 14, 2022 at 7:11
    • SMOTE can be seen as a pre-processing step. As such, you always fit the pre-processing on the training data. Your test set should always be left untouched until the very end; that way you can measure the actual performance of the model. If you use the test set to balance the data, you will introduce bias. Regarding the detection problem, yes, that's what I meant. If both are important, then aim for a high F1-score rather than favouring one over the other.
      – DCrown, Oct 14, 2022 at 12:30
    • I need a little clarification on SMOTE. Let's say I made the 70:30 split, so we have X_train, y_train and X_test, y_test. First I apply SMOTE to the training data (X_train, y_train); should I then also apply SMOTE to the test data? Would applying SMOTE to the train and test sets separately also introduce bias? Or is resampling (SMOTE) the test data simply not required, since we only need to ensure that the model learns both classes, and how the test data looks does not really matter?
      – Zal, Oct 14, 2022 at 21:48
    • What would be the point of balancing the test set? Remember that the test set in any machine learning problem is supposed to represent new, never-before-seen data: you want your model to perform well in the real world with real data. If the data you acquire is itself imbalanced, that is a different discussion. The test set is only used to evaluate the true performance of the model; it is not used to train it. SMOTE creates synthetic data that helps during the training process.
      – DCrown, Oct 17, 2022 at 8:44
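
    To summarise the comment thread as a sketch: SMOTE is fitted and applied to the training data only (here via imbalanced-learn's Pipeline, which resamples during fit but not during predict), and the model is then evaluated on the untouched, still-imbalanced test set. The data below are stand-ins.

        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.metrics import classification_report
        from imblearn.over_sampling import SMOTE
        from imblearn.pipeline import Pipeline

        X, y = make_classification(n_samples=20_000, weights=[0.95, 0.05], random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.30, stratify=y, random_state=0)

        # The imblearn Pipeline applies SMOTE only when fitting, never when predicting,
        # so the test set stays exactly as it was sampled.
        pipe = Pipeline([
            ("smote", SMOTE(random_state=0)),
            ("tree", DecisionTreeClassifier(random_state=0)),
        ])
        pipe.fit(X_train, y_train)

        print(classification_report(y_test, pipe.predict(X_test)))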
