I have a dataset for which I am trying to predict a target variable.
Col1  Col2  Col3  Col4  Col5
1     2     23    11    1
2     22    12    14    1
22    11    43    38    3
14    22    25    19    3
12    42    11    14    1
22    11    43    38    2
1     2     23    11    4
2     22    12    14    2
22    11    43    38    3
I have provided sample data, but my real dataset has thousands of records distributed in a similar way. Here Col1, Col2, Col3 and Col4 are my features and Col5 is the target variable, so each prediction should be 1, 2, 3 or 4, as these are the values the target takes. I have tried algorithms such as random forest and decision tree for the predictions.
As you can see, values 1, 2 and 3 occur far more often than 4. As a result, my model is biased towards 1, 2 and 3, and I get very few predictions for 4 (in the confusion matrix, only 1 record out of thousands was predicted as 4, i.e. policy4).
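To show the kind of check I mean, here is a minimal sketch using sklearn's `confusion_matrix` and `recall_score`. The labels below are made up for illustration and are not my real predictions; the point is only that per-class recall exposes a weak minority class even when most predictions are correct.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels for illustration only: class 4 is rare and partly misclassified
y_true = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 1]
y_pred = [1, 1, 1, 2, 2, 1, 3, 3, 3, 1, 4, 1]

labels = [1, 2, 3, 4]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Recall per class: the share of each true class that was predicted correctly
per_class_recall = recall_score(y_true, y_pred, labels=labels, average=None)
for label, recall in zip(labels, per_class_recall):
    print(f"class {label}: recall = {recall:.2f}")
```

The last row of the matrix and the last recall value are the ones that matter for value 4.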
To make my model generalize better, I randomly removed an equal percentage of the records belonging to values 1, 2 and 3. I grouped the data by each value in Col5 and removed a certain percentage from each of those groups, which brought down the number of records. After this I saw some increase in accuracy and also a reasonable increase in predictions for value 4 in the confusion matrix.
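The removal step described above can be sketched roughly as follows. This is a minimal illustration with pandas, not the exact code I ran: the helper name `undersample_classes`, the 50% drop fraction, the seed and the toy data are all assumptions made for the example.

```python
import pandas as pd


def undersample_classes(df, target="Col5", classes=(1, 2, 3), drop_frac=0.5, seed=0):
    """Randomly drop `drop_frac` of the rows in each listed class; keep other classes intact."""
    parts = []
    for label, group in df.groupby(target):
        if label in classes:
            # Keep a random (1 - drop_frac) share of this frequent class
            group = group.sample(frac=1.0 - drop_frac, random_state=seed)
        parts.append(group)
    # Shuffle so the classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1.0, random_state=seed).reset_index(drop=True)


# Toy data shaped like my sample (feature values are made up for illustration)
df = pd.DataFrame({
    "Col1": range(100),
    "Col2": range(100, 200),
    "Col3": range(200, 300),
    "Col4": range(300, 400),
    "Col5": [1] * 40 + [2] * 30 + [3] * 26 + [4] * 4,
})

reduced = undersample_classes(df, drop_frac=0.5)
print(reduced["Col5"].value_counts().sort_index())
```

Value 4 is left untouched, so its share of the data grows as the other groups shrink. In a setup like this the removal would normally be applied to the training split only, so that the test data keeps the original class distribution.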
Is this the right approach (randomly removing data from the groups towards which the model is biased)?
I also tried boosting algorithms from sklearn such as AdaBoost and gradient boosting, because I read that these algorithms can handle class imbalance. However, they did not improve my accuracy, whereas randomly removing data gave me some improvement.
Is this reduction an undersampling technique, and is it the right way to undersample?
If my random removal is wrong, are there any predefined packages in sklearn, or any logic I can implement in Python, to get this done?
Also, I learnt about the SMOTE technique, which deals with oversampling. Should I try this for value 4? And can this be done using any built-in packages in Python? It would be great if someone could help me with this.
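As far as I understand it, SMOTE creates synthetic minority rows by interpolating between an existing minority row and one of its nearest minority neighbours. Below is a simplified sketch of that idea using only NumPy. It is not a library implementation: the function name `smote_like`, the choice of `k`, and the class 4 rows (apart from the first, which comes from my sample) are assumptions made for illustration.

```python
import numpy as np


def smote_like(X_minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (requires k < number of rows)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X))             # pick a random minority row
        nb = neighbours[base, rng.integers(k)]  # pick one of its k neighbours
        gap = rng.random()                      # position along the segment base -> neighbour
        synthetic[i] = X[base] + gap * (X[nb] - X[base])
    return synthetic


# Hypothetical class 4 feature rows (Col1..Col4)
X_class4 = np.array([
    [1, 2, 23, 11],
    [3, 4, 20, 12],
    [2, 5, 25, 10],
    [4, 3, 22, 13],
])

new_rows = smote_like(X_class4, n_new=6)
print(new_rows.shape)
```

Each synthetic row lies on a line segment between two real class 4 rows, so it stays inside the range the minority class already covers.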