
My data set consists of a categorical output variable with 4 different values and roughly 100 input variables, all of which are boolean, i.e. True/False. The data set has about 10 million rows. This looks like a typical use case for a neural network. There is still a good amount of work to be done to get a useful prediction, but that part is well documented.

What is different is what I intend to do with the model. I don't want to apply it to new data. Rather, I suspect that a small fraction of my data set (say less than 0.1%) is mislabelled, and I want to find candidates in the training data whose recorded output value looks wrong. These candidates would then still have to be evaluated by a human; the goal is to help the human decide which candidates to look at.

Can such a method work? What are keywords to search for? (Everything I tried just turns up material about errors in the model.) I'm mostly looking for references or examples.

  • Maybe look into anomaly detection? An autoencoder could do the job: train it on a correctly labeled subset and see where the reconstruction fails. – pyrochlor, Apr 1 at 10:53
  • @pyrochlor The problem is that I don't know which ones are correctly labeled. I know the error rate is fairly low, but I don't have a training set that I know is correctly labeled. – quarague, Apr 1 at 11:47
  • Not knowing which ones are anomalies is the standard setting for anomaly detection, so I agree with @pyrochlor. But if you have humans sometimes look at specific rows and come to a conclusion, it sounds like you do have some labeled training data after all? – Apr 1 at 19:07
  • cleanlab is a useful library for discovering and remediating label errors in a fixed dataset, using methods described in peer-reviewed publications: github.com/cleanlab/cleanlab (see the sketch after this list). – Sycorax, Apr 2 at 3:59
  • @Sycorax Good call! I have used this method myself for exactly this use case. – Lynchian, Apr 2 at 9:09
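A minimal sketch of the cleanlab workflow mentioned in the comments, assuming out-of-sample predicted probabilities from cross-validation; the classifier, data, and variable names here are placeholders, while `find_label_issues` and its arguments are cleanlab's documented API:

```python
# Sketch: flag likely label errors with cleanlab (placeholder data/model).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

# X: boolean features; labels: integer class labels in {0, 1, 2, 3}.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5_000, 100)).astype(float)
labels = rng.integers(0, 4, size=5_000)

# cleanlab works best with out-of-sample predicted probabilities,
# e.g. obtained via cross-validation.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, labels,
    cv=5, method="predict_proba",
)

# Indices of likely label errors, ranked most-suspicious first.
issue_indices = find_label_issues(
    labels=labels, pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(issue_indices[:20])
```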

1 Answer


Q1: Can such a method work?

Yes. If only a few labels in your training data are incorrect and you've used early stopping, training should halt before the model overfits, so it never memorizes the rare mislabeled examples. The model still generalizes well, and the mislabeled examples remain poorly predicted, which makes them stand out by their loss.

To find mislabeled data:

  1. Run inference on your training set.
  2. Compare predictions with labels.
  3. Calculate the loss for each sample.
  4. Either:
  • Use a fixed loss threshold, or
  • Sort samples by loss (descending) and review the top entries — these are likely mislabeled.
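A minimal sketch of these four steps, assuming scikit-learn and a gradient-boosting classifier in place of the neural network; the data shapes, model choice, and size of the review list are illustrative, not part of the answer itself:

```python
# Sketch of steps 1-4: score every training sample by its loss.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

# X: (n_samples, ~100) boolean features; y: labels in {0, 1, 2, 3}.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(10_000, 100)).astype(float)
y = rng.integers(0, 4, size=10_000)

# Early stopping keeps the model from memorizing rare bad labels.
model = HistGradientBoostingClassifier(early_stopping=True).fit(X, y)

# Steps 1-2: run inference on the training set itself.
proba = model.predict_proba(X)                      # shape (n_samples, 4)

# Step 3: per-sample cross-entropy loss w.r.t. the given label.
per_sample_loss = -np.log(proba[np.arange(len(y)), y] + 1e-12)

# Step 4: sort by loss (descending) and hand the top entries to a human.
suspects = np.argsort(per_sample_loss)[::-1][:100]
print(suspects)
```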

You can also plot the per-sample losses sorted in descending order. A sudden "jump" in the curve often indicates where the suspicious entries begin.
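For example, reusing `per_sample_loss` from the sketch above:

```python
# Plot the sorted per-sample losses; a sharp "jump" in the curve
# marks roughly where the suspicious entries begin.
import matplotlib.pyplot as plt

plt.plot(np.sort(per_sample_loss)[::-1])
plt.xlabel("samples, sorted by loss (descending)")
plt.ylabel("per-sample loss")
plt.show()
```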

Cleaning these up typically gives your model a small performance boost.

Note: If you don't want to fully train a model on the uncleaned data, a smaller model trained on a subset of the data, or for fewer epochs, may be enough. This will produce more false positives, but can still help you sort out the easy-to-spot faulty entries.

Sidenote

The per-sample loss (after training) reflects how "difficult" a sample is — i.e., how much it deviates from the learned pattern.

You can use this to:

  1. Sort data by loss (ascending), and
  2. Skip shuffling during an initial pretraining phase.

This forms a curriculum-like learning setup: the model sees easy examples first, then harder ones — usually more stable and efficient.

Later, reintroduce randomization or gradually add harder samples to avoid forgetting the basics. This can enable higher learning rates and faster convergence.
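A rough sketch of such a curriculum phase, reusing `X`, `y`, and `per_sample_loss` from above; the incremental classifier and batch size are placeholders:

```python
# Hypothetical curriculum phase: present samples easy-to-hard by
# per-sample loss, with no shuffling during this initial pass.
from sklearn.linear_model import SGDClassifier

order = np.argsort(per_sample_loss)              # ascending: easy first
X_sorted, y_sorted = X[order], y[order]

clf = SGDClassifier(loss="log_loss")
classes = np.unique(y)
for start in range(0, len(y), 1_000):            # fixed order, no shuffling
    batch = slice(start, start + 1_000)
    clf.partial_fit(X_sorted[batch], y_sorted[batch], classes=classes)
# Afterwards, switch back to shuffled batches so the model
# does not forget the easy examples.
```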

Q2: What are keywords to search for (anything I tried just talks about errors in the model)? I'm mostly looking for references or examples.

Keywords:

  • data cleaning
  • data cleansing
  • label noise (e.g. "learning with noisy labels")

Topic "anomaly detection" is related, but not the same. Anomaly detection is more about detecting unexpected input data. For example, you trained your model for images of cats and dogs to label them correctly and then you want to detect cases where this look like neither cat or dog... e.g. an image of a horse or human is presented to the model. Anomaly detection can help to detect such cases, but while it is similar, it does not really detect wrongly labeled data. It would help however with entries which do not fit any of the predefined 100 categories.
