
I have a dataset for which I am trying to predict target variables.

Col1  Col2  Col3  Col4  Col5
1     2     23    11    1
2     22    12    14    1
22    11    43    38    3
14    22    25    19    3
12    42    11    14    1
22    11    43    38    2
1     2     23    11    4
2     22    12    14    2
22    11    43    38    3

I have provided sample data above, but my real dataset has thousands of records distributed in a similar way. Here Col1, Col2, Col3 and Col4 are my features and Col5 is the target variable, so a prediction should be 1, 2, 3 or 4, since these are the values the target takes. I have tried algorithms such as random forest and decision trees for the predictions.

As you can see, the values 1, 2 and 3 occur far more often than 4. As a result, my model is biased towards 1, 2 and 3 while predicting, and I get very few predictions of 4 (only one record was predicted as class 4 out of thousands when I looked at the confusion matrix).

To make my model generalize better, I randomly removed an equal percentage of the records belonging to values 1, 2 and 3: I grouped by each value of Col5 and then removed a certain percentage from each group, bringing down the number of records. After that I saw a certain increase in accuracy, and also a reasonable increase in the number of class-4 predictions in the confusion matrix.
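For illustration, a minimal pandas sketch of this kind of per-class random undersampling (the file name and the cap are made-up placeholders):

    import pandas as pd

    # df holds the features Col1..Col4 and the target Col5
    df = pd.read_csv("data.csv")  # hypothetical file name

    # cap every class at a multiple of the rarest class's size; the rare class is kept intact
    n_keep = df["Col5"].value_counts().min() * 3  # arbitrary cap, tune as needed
    balanced = (
        df.groupby("Col5", group_keys=False)
          .apply(lambda g: g.sample(min(len(g), n_keep), random_state=0))
    )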

Is this the right approach to deal with the problem (removing data randomly from the groups towards which the model is biased)?

I also tried built-in Python algorithms such as AdaBoost and gradient boosting from sklearn, since I read that these algorithms can handle imbalanced classes. However, I could not improve my accuracy with them, whereas randomly removing data did give me some improvement.

Is this reduction an under-sampling technique, and is it the right approach to under-sampling?

Are there any pre-defined packages in sklearn, or any logic I can implement in Python, to get this done if my random removal is wrong?

Also, I learnt about the SMOTE technique, which deals with over-sampling. Should I try this for class 4? And can this be done using any built-in packages in Python? It would be great if someone could help me with this situation.


    6 Answers


    This paper suggests using ranking (I wrote it). Instead of using, for instance, an SVM directly, you would use RankSVM. Since rankers compare observation against observation, training is necessarily balanced. There are two "buts", however: training is much slower, and, in the end, what these models do is rank your observations from how likely they are to belong to one class to how likely they are to belong to another, so you still need to apply a threshold afterwards.

    If you are going to use pre-processing to fix your imbalance, I would suggest you look into MetaCost. This algorithm involves building a bagging of models and then changing the class priors to make them balanced, based on the hard-to-predict cases. It is very elegant. The cool thing about methods like SMOTE is that, by fabricating new observations, they can make small datasets more robust.

    Anyhow, even though I have written some things on class imbalance, I am still skeptical that it is an important problem in the real world. I would think it is very uncommon to have imbalanced priors in your training set but balanced priors in your real-world data. Do you? What usually happens is that type I errors are different from type II errors, and I would bet most people would be better off using a cost matrix, which most training methods accept, or which you can apply by pre-processing with MetaCost or SMOTE. I think that, many times, "fixing imbalance" is shorthand for "I do not want to bother thinking about the relative trade-off between type I and type II errors."

    Addendum:

    I tried for in-built python algorithms like Adaboost, GradientBoost techniques using sklearn. I read these algorithms are for handling imbalance class.

    AdaBoost gives better results for class imbalance when you initialize the weight distribution with the imbalance in mind. I can dig up the thesis where I read this if you want.
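    As an illustration of that idea (not necessarily the exact scheme from that thesis), sklearn's AdaBoostClassifier accepts an initial sample_weight in fit, so you can start boosting from class-balanced weights; X and y below stand for your feature matrix and Col5 labels:

        import numpy as np
        from sklearn.ensemble import AdaBoostClassifier

        # give every class the same total weight, regardless of how many samples it has
        classes, counts = np.unique(y, return_counts=True)
        per_class = {c: 1.0 / n for c, n in zip(classes, counts)}
        w0 = np.array([per_class[label] for label in y])
        w0 /= w0.sum()  # normalise to a proper weight distribution

        clf = AdaBoostClassifier(n_estimators=200, random_state=0)
        clf.fit(X, y, sample_weight=w0)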

    Anyhow, of course, those methods alone won't give good accuracies. Do you have class imbalance in both your training and your validation datasets? You should use metrics such as the F1 score, or pass a cost matrix to the accuracy function. "Fixing" class imbalance is for when your priors are different in your training and your validation cases.


      Some of sklearn's algorithms have a parameter called class_weight that you can set to "balanced". That way sklearn will adjust its class weights depending on the number of samples that you have of each class.

      For the random forest classifier, try the following and see if it improves your score:

      from sklearn.ensemble import RandomForestClassifier

      rf = RandomForestClassifier(class_weight="balanced")  # also add your other parameters!
      • class_weight="balanced" is not giving sufficient improvements when I tried to use it. – SRS, Apr 25, 2016 at 15:49
      • @Srinath what do you understand by improvement? What metric are you using? If both your training and your validation data are imbalanced, you cannot rely on plain accuracy scores. What class_weight="balanced" does is build a cost matrix for you in which each class $k$ gets weight $C_k = \frac{N}{K\,N_k}$, where $N$ is the number of samples, $K$ the number of classes and $N_k$ the number of samples of class $k$. You should either pass sample_weight=[C_k for k in y] to accuracy_score or use something like f1_score (a rough sketch follows below). – Apr 29, 2016 at 13:37
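      A rough sketch of those two evaluation options (assuming y_true and y_pred come from your fitted model):

          import numpy as np
          from sklearn.metrics import accuracy_score, f1_score

          # macro-averaged F1 treats every class equally
          print(f1_score(y_true, y_pred, average="macro"))

          # or weight each test case inversely to its class frequency
          classes, counts = np.unique(y_true, return_counts=True)
          w = {c: len(y_true) / (len(classes) * n) for c, n in zip(classes, counts)}
          sw = np.array([w[label] for label in y_true])
          print(accuracy_score(y_true, y_pred, sample_weight=sw))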

      Yes, this is a fine technique to tackle the problem of class imbalance. However, under-sampling methods do lead to a loss of information from the dataset (say, you just removed an interesting pattern among the remaining variables which could have contributed to better training of the model). This is why over-sampling methods are preferred, particularly in the case of a smaller dataset.

      In response to your query regarding Python packages, the imbalanced-learn toolbox is dedicated to exactly this task. It provides several under-sampling and over-sampling methods. I would recommend trying the SMOTE technique.
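      A minimal sketch with imbalanced-learn, assuming X and y hold your Col1-Col4 features and Col5 labels (older versions of the package call the method fit_sample instead of fit_resample):

          from imblearn.over_sampling import SMOTE

          # oversample only the training data, never the held-out evaluation set
          X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)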


        It depends on the ensemble technique you want to use. The basic problem is that you are working with a multi-class data-imbalance problem. Under-sampling can be used efficiently in bagging as well as in boosting techniques, and the SMOTE algorithm is very efficient at generating new samples. The data-imbalance problem has been widely studied in the literature; I recommend you read about one of these algorithms:

        - SMOTE-Boost
        - SMOTE-Bagging
        - RUS-Boost
        - EUS-Boost

        These are boosting/bagging techniques designed specifically for the imbalanced-data problem. Instead of SMOTE you can try ADA-SMOTE or Borderline-SMOTE. I have used and modified Borderline-SMOTE for the multi-class case and it is very efficient. If your database is very large and the problem is easy, try the Viola-Jones classifier; I have also used it with a data-imbalance problem and it is really efficient.
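        For reference, Borderline-SMOTE is also available in imbalanced-learn; a minimal sketch, assuming a recent version of the package and X, y as your features and labels:

            from imblearn.over_sampling import BorderlineSMOTE

            # oversamples only the minority examples that lie close to the class boundary
            X_resampled, y_resampled = BorderlineSMOTE(random_state=0).fit_resample(X, y)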

        • Thanks for the guidance. I am looking into the topics you mentioned. But is the technique I used to under-sample (reducing the data randomly) a right way of doing it? – SRS, Apr 25, 2016 at 15:23
        • You can use it if your database is very large, but if your database is small you will lose some of the information. Read about RUS-Boost: in that method they use random under-sampling as part of the boosting algorithm to avoid losing information. They under-sample the subset that will be used for training the next base learner, but not the whole database. – Apr 25, 2016 at 15:41
        • My dataset has nearly 80k records which I am using as the training set. I am implementing this in Python and was looking for some packages in sklearn or elsewhere in Python, but couldn't find them. Is this something for which I should write some logic myself to have it implemented? – SRS, Apr 25, 2016 at 15:56
        • I do not think there is a ready implementation of these methods; the data-imbalance problem is still under research. If you have a good implementation of AdaBoost.M1 or M2, you can easily modify it to become RUS-Boost. – Apr 25, 2016 at 16:01
        • I think the database you have is quite large, and if you want you can use the Viola-Jones classifier. For that one you may find an available implementation. – Apr 25, 2016 at 16:02

        There are already some good answers here. I just thought I would add one more technique, since you look to be using ensembles of trees. In many cases you are looking to optimize the lift curve or the AUC of the ROC. For this I would recommend the Hellinger distance criterion for splitting the branches in your trees. At the time of writing it is not in the imbalanced-learn package, but it looks like there is a plan to add it.
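        For context, the Hellinger distance criterion (Cieslak and Chawla's Hellinger distance decision trees) scores a candidate binary split by the distance between the class-conditional distributions over the two branches, roughly $d_H = \sqrt{\sum_{b \in \{\mathrm{left},\,\mathrm{right}\}} \left(\sqrt{P(b \mid +)} - \sqrt{P(b \mid -)}\right)^2}$; because it does not use the class priors, it is largely insensitive to how skewed the classes are.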


          When dealing with a class-imbalance problem you should mainly concentrate on the error metric, and you should choose the F1 score as your error metric.
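          For reference, the F1 score is the harmonic mean of precision and recall, $F_1 = 2\,\frac{\mathrm{precision}\cdot\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}}$, computed per class and then averaged (e.g. macro-averaged) in the multi-class case, so a model that ignores the rare class is penalised.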

          After choosing the correct metric, we can use different techniques for dealing with this issue.

          If you are interested you can look into this blog post; it explains very nicely the techniques used to solve this class-imbalance problem.

