
Suppose I initially have a dataset with 50 samples of type A and 50 samples of type B, each with several features. I built a neural network model on this data and recorded the prediction accuracy separately for each group (type A and type B). Next, I added another 100 samples of type A to the dataset and retrained the model. I want to compare the per-group performance (type A vs. type B) before and after adding the additional data. Note that the model is not trying to predict whether a sample is type A or B; rather, it uses the type (A or B) as an input feature to predict something else, such as a continuous value. A rough sketch of this setup is below.
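Here is a minimal, purely illustrative sketch of what I mean, using synthetic data and scikit-learn's `MLPRegressor` as a stand-in for my network (the features, data-generating function, and hyperparameters are made up for illustration; my real setup is different):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def make_samples(n, group):
    """Synthetic samples: three numeric features plus a 0/1 group indicator."""
    X = rng.normal(size=(n, 3))
    g = np.full((n, 1), group)  # group feature: 0 = type A, 1 = type B
    y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 * group + rng.normal(scale=0.1, size=n)
    return np.hstack([X, g]), y

def per_group_mse(model, X, y):
    """Evaluate separately on type A (group == 0) and type B (group == 1) rows."""
    out = {}
    for label, group in [("A", 0), ("B", 1)]:
        mask = X[:, -1] == group
        out[label] = mean_squared_error(y[mask], model.predict(X[mask]))
    return out

# Held-out test set with both groups represented.
X_test_A, y_test_A = make_samples(200, 0)
X_test_B, y_test_B = make_samples(200, 1)
X_test, y_test = np.vstack([X_test_A, X_test_B]), np.concatenate([y_test_A, y_test_B])

# Before: 50 type A + 50 type B.
X_A, y_A = make_samples(50, 0)
X_B, y_B = make_samples(50, 1)
model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
model.fit(np.vstack([X_A, X_B]), np.concatenate([y_A, y_B]))
print("before:", per_group_mse(model, X_test, y_test))

# After: add 100 more type A samples and retrain from scratch.
X_A2, y_A2 = make_samples(100, 0)
model2 = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
model2.fit(np.vstack([X_A, X_A2, X_B]), np.concatenate([y_A, y_A2, y_B]))
print("after: ", per_group_mse(model2, X_test, y_test))
```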

Now I am curious how adding the new data will affect model performance. My guess is that it will improve overall accuracy, because more data is generally better. However, how will it affect each group? I expect type A performance to improve, since that group gets more data. But what about type B? My guess is that it will neither improve nor worsen for type B, because the predictions for A and B samples should be independent of each other (please correct me if I am wrong). My reasoning: if we trained two separate models instead of one, adding more type A data would surely improve the first model and would not affect the second model at all, right? (See the sketch of that comparison below.)
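To make the two-separate-models comparison concrete, here is a continuation of the sketch above (it reuses `make_samples`, the `X_A`, `X_A2`, `X_B` arrays, and the test set defined there; again purely for illustration):

```python
# Two separate models: the type-B model never sees any type-A rows, so adding
# the 100 extra type-A samples cannot change its predictions in any way.
# With a single shared network, the extra type-A rows update weights that are
# also used when predicting for type-B inputs.
model_A = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
model_A.fit(np.vstack([X_A, X_A2])[:, :-1], np.concatenate([y_A, y_A2]))  # drop group column

model_B = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
model_B.fit(X_B[:, :-1], y_B)

print("type B, separate model:",
      mean_squared_error(y_test_B, model_B.predict(X_test_B[:, :-1])))
```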

A more general question: do we need to make sure the training data is balanced across feature values, i.e., contains roughly equal numbers of type A and type B samples? I know a highly skewed outcome distribution causes a data imbalance problem, but what about the feature distribution? Is balance there a requirement?

Any suggestions or insights would be greatly appreciated!
