I am trying to do highly imbalanced binary classification using Linear Genetic Programming (LGP) to detect a specific spoken word. I use mel coefficients as features. The instruction set includes basic arithmetic (excluding division), sine, cosine, and a select instruction (a = a if a >= 0 else b).
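For context, here is a simplified sketch of the instruction semantics I described above; the register layout, encoding, and function names are only illustrative, not my actual implementation:

```python
import math

def execute(program, features, n_registers=8):
    # Registers are initialized from the mel-coefficient features.
    regs = [features[i % len(features)] for i in range(n_registers)]
    for op, dst, src1, src2 in program:
        a, b = regs[src1], regs[src2]
        if op == "add":
            regs[dst] = a + b
        elif op == "sub":
            regs[dst] = a - b
        elif op == "mul":
            regs[dst] = a * b
        elif op == "sin":          # unary; second source is ignored
            regs[dst] = math.sin(a)
        elif op == "cos":
            regs[dst] = math.cos(a)
        elif op == "select":       # a if a >= 0 else b
            regs[dst] = a if a >= 0 else b
    return regs[0]                 # output register; its sign decides the class
```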
The positive class is the spoken word, and the negative class is anything else, such as noise or other spoken words.
I have a dataset of about 50K entries, of which about 2K are positive and the rest are negative. I train on about 3K of them: 500 positive and 2.5K negative. During training I get 90%-99% positive accuracy, depending on the word, and 100% (or close to 100%) negative accuracy.
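Since the classes are imbalanced, I report the two accuracies separately. Roughly, I compute them like this (a sketch, assuming 0/1 labels with 1 = target word):

```python
def per_class_accuracy(y_true, y_pred):
    # Accuracy on the positive and negative classes, computed separately
    # so the large negative class cannot mask poor positive performance.
    pos = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
    neg = [(t, p) for t, p in zip(y_true, y_pred) if t == 0]
    pos_acc = sum(t == p for t, p in pos) / len(pos)
    neg_acc = sum(t == p for t, p in neg) / len(neg)
    return pos_acc, neg_acc
```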
As for the test set, it also includes samples of words that do not appear in the negative training set at all. For example, if my negative training set contains "cat" and "dog", the test set includes both of those plus about 3K samples of other words such as "yes".
The problem is that the best program found performs almost as well on the test set as in training: positive and negative accuracy are each at most 1-2% below their training values.
This looks suspicious to me. Could the program I've written have a bug somewhere?