Spotting and Fixing Label Noise in ML Training Data

Your model's validation accuracy plateaued two epochs ago and you can't figure out why. You've tuned hyperparameters, tried different architectures, and added regularization. The problem might not be your model at all — it might be your labels. A dataset with even five to ten percent mislabeled examples can silently degrade performance in ways that look exactly like an underfitting or overfitting problem.

What you'll learn

What label noise is and the three main types you'll encounter
How to detect noisy labels using loss-based and cross-validation methods
How to measure the actual impact of noise on your model
Practical strategies for cleaning or downweighting noisy examples
Common pitfalls to avoid when applying label-cleaning pipelines

What Label Noise Actually Is

A label is noisy when the class assigned to a training example differs from its true class. That sounds obvious, but in practice noise sneaks in through annotation fatigue, ambiguous class boundaries, rushed crowdsourced labeling, or even programmatic mistakes in data pipelines.

There are three flavors worth knowing:

Random noise — a label is flipped to any other class with some probability, independent of the input features. This is the easiest to model mathematically but rare in real datasets.
Systematic (class-conditional) noise — class A examples frequently get labeled as class B because the two classes look similar. Common in medical imaging and sentiment analysis.
Instance-dependent noise — the probability of mislabeling depends on the actual content of the example. This is the most realistic and hardest to handle.
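The first two flavors can be written as a noise transition matrix, where row i gives the probability that a true class i ends up with each observed label. The sketch below is illustrative only: the function names, the class count, and the noise rates are assumptions, not values from any real dataset. Instance-dependent noise has no single matrix, because the flip probability changes from example to example.

```python
import numpy as np

def uniform_noise_matrix(n_classes, noise_rate):
    """Random noise: a label flips to each other class with equal probability."""
    T = np.full((n_classes, n_classes), noise_rate / (n_classes - 1))
    np.fill_diagonal(T, 1.0 - noise_rate)
    return T

def pairwise_noise_matrix(n_classes, src, dst, noise_rate):
    """Class-conditional noise: only class `src` is mislabeled, always as `dst`."""
    T = np.eye(n_classes)
    T[src, src] = 1.0 - noise_rate
    T[src, dst] = noise_rate
    return T

def apply_noise(y, T, seed=0):
    """Sample an observed label for each true label from its row of T."""
    rng = np.random.default_rng(seed)
    return np.array([rng.choice(len(T), p=T[label]) for label in y])

# Hypothetical 3-class setup
T_random = uniform_noise_matrix(3, noise_rate=0.2)
T_pair = pairwise_noise_matrix(3, src=0, dst=1, noise_rate=0.3)

y = np.array([0, 1, 2] * 100)
y_noisy = apply_noise(y, T_pair)
```

With T_pair, only examples of class 0 can change label, and they can only become class 1, which mirrors the "A looks like B" confusion described above.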

Prerequisites

The examples below use Python with scikit-learn, numpy, and cleanlab. Install them if you haven't already:

pip install scikit-learn numpy cleanlab

You should be comfortable with basic classification workflows and know what a cross-validated predicted probability is. Nothing here requires deep learning, though the same ideas apply to neural networks.

Detecting Noise with Loss Inspection

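The fastest first pass is to look at per-sample training loss after a model has converged. Genuinely mislabeled examples tend to have high loss because the model learns the dominant pattern in the data and then struggles to fit the wrong label. This works best with a low-capacity model such as logistic regression; a flexible model can memorize wrong labels and drive their training loss down.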

Train a simple baseline model and collect the per-sample cross-entropy loss on the training set:

import numpy as np
from sklearn.linear_model import LogisticRegression

def per_sample_log_loss(y_true, y_prob, classes):
    # Columns of y_prob follow the sorted order of `classes` (model.classes_),
    # so look up each sample's own label column. This also works for binary
    # problems, where a one-hot encoding would collapse to a single column.
    cols = np.searchsorted(classes, y_true)
    # Clip probabilities to avoid log(0)
    p_label = np.clip(y_prob[np.arange(len(y_true)), cols], 1e-7, 1 - 1e-7)
    return -np.log(p_label)

# X_train and y_train are assumed to be NumPy arrays
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
probs = model.predict_proba(X_train)

losses = per_sample_log_loss(y_train, probs, model.classes_)
# Flag the top 5% as suspicious
threshold = np.percentile(losses, 95)
suspicious_idx = np.where(losses > threshold)[0]
print(f"{len(suspicious_idx)} suspicious examples found")

This is a rough heuristic, not a definitive detector. Hard examples near a decision boundary also get high loss, so you'll need a second filter before doing anything drastic.
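One possible second filter, sketched here under assumptions, is to separate flagged samples where the model confidently predicts a different class from those where it is merely unsure. The function name and the min_top_prob cutoff of 0.8 are hypothetical choices for illustration, and the code assumes integer labels 0 to K-1 that match the probability columns.

```python
import numpy as np

def split_confident_disagreements(probs, y, suspicious_idx, min_top_prob=0.8):
    """Split flagged samples into likely-mislabeled and ambiguous groups."""
    suspicious_idx = np.asarray(suspicious_idx)
    top_class = probs[suspicious_idx].argmax(axis=1)
    top_prob = probs[suspicious_idx].max(axis=1)
    # Likely mislabeled: the model prefers another class AND is confident
    confident = (top_class != y[suspicious_idx]) & (top_prob >= min_top_prob)
    return suspicious_idx[confident], suspicious_idx[~confident]

# Toy probabilities for three samples, all annotated as class 1
probs = np.array([
    [0.95, 0.05],  # model is sure this is class 0: likely mislabeled
    [0.55, 0.45],  # near the decision boundary: ambiguous
    [0.10, 0.90],  # model agrees with the label
])
y = np.array([1, 1, 1])

likely_wrong, ambiguous = split_confident_disagreements(probs, y, [0, 1])
```

The ambiguous group is where hard but correctly labeled examples tend to land, so it is a reasonable set to keep rather than delete.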

Cross-Validation Confidence Scores

A more reliable method is to use out-of-fold predicted probabilities. If a model trained on everything except a given example still assigns low probability to that example's label, that's a strong signal the label is wrong.

from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
import numpy as np

def get_oof_probabilities(X, y, n_splits=5):
    n_classes = len(np.unique(y))
    oof_probs = np.zeros((len(y), n_classes))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

    for train_idx, val_idx in skf.split(X, y):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        oof_probs[val_idx] = model.predict_proba(X[val_idx])

    return oof_probs

oof_probs = get_oof_probabilities(X_train, y_train)
# Confidence in the given label for each sample.
# Assumes labels are integers 0..K-1 so they can index the columns directly.
label_confidence = oof_probs[np.arange(len(y_train)), y_train]
low_confidence_idx = np.where(label_confidence < 0.1)[0]

Examples where the out-of-fold model assigns less than ten percent probability to the annotated label are worth a manual review, especially in smaller datasets where every sample counts.

Using Cleanlab for Automated Detection

The cleanlab library formalizes the cross-validation approach into a framework called Confident Learning. It estimates the joint distribution of noisy labels and true labels, then ranks examples by their likelihood of being mislabeled.

from cleanlab.filter import find_label_issues
import numpy as np

# oof_probs must be out-of-fold probabilities, not in-sample
label_issues = find_label_issues(
    labels=y_train,
    pred_probs=oof_probs,
    return_indices_ranked_by="self_confidence"
)

print(f"Cleanlab flagged {len(label_issues)} potential label issues")
print("Top 10 most suspicious indices:", label_issues[:10])

find_label_issues returns indices sorted by how suspicious the label is. The self_confidence ranking sorts by the model's predicted probability for the given label — lower means more suspicious.
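To make the idea behind Confident Learning more concrete, here is a simplified sketch of its counting step. It is not cleanlab's implementation, which includes further calibration and pruning; the function name and toy data are assumptions. Each class gets a threshold equal to the mean predicted probability of that class among examples carrying that label, and an example is counted toward a (given label, likely true label) cell only when some class clears its threshold.

```python
import numpy as np

def confident_joint_sketch(labels, pred_probs):
    """Count (given label, confidently predicted label) pairs."""
    n_classes = pred_probs.shape[1]
    # Per-class threshold: average self-confidence of examples labeled k
    thresholds = np.array(
        [pred_probs[labels == k, k].mean() for k in range(n_classes)]
    )
    above = pred_probs >= thresholds
    # Among classes that clear their threshold, take the most probable one
    masked = np.where(above, pred_probs, -np.inf)
    confident_class = masked.argmax(axis=1)
    has_confident = above.any(axis=1)

    joint = np.zeros((n_classes, n_classes), dtype=int)
    for given, guess in zip(labels[has_confident], confident_class[has_confident]):
        joint[given, guess] += 1
    return thresholds, joint

# Toy out-of-fold probabilities (made up for illustration)
labels = np.array([0, 0, 0, 1, 1, 1])
pred_probs = np.array([
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],  # labeled 0, but confidently looks like class 1
    [0.10, 0.90],
    [0.05, 0.95],
    [0.30, 0.70],  # clears no threshold, so it is not counted
])

thresholds, joint = confident_joint_sketch(labels, pred_probs)
print(joint)
```

Off-diagonal cells of the resulting matrix count examples whose given label disagrees with the confidently predicted class, which are the candidates for label issues.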

Measuring the Impact Before You Fix Anything

Before removing or relabeling anything, confirm that noise is actually hurting you. The simplest approach is to inject a known amount of artificial noise into a subset whose labels you trust, retrain, and watch how test accuracy changes. The code below treats y_train as that trusted subset and assumes integer labels 0 to K-1.

def inject_noise(y, noise_rate, random_state=42):
    rng = np.random.default_rng(random_state)
    y_noisy = y.copy()
    n_classes = len(np.unique(y))
    n_flip = int(len(y) * noise_rate)
    flip_idx = rng.choice(len(y), size=n_flip, replace=False)
    for i in flip_idx:
        # Flip to a random different class
        other_classes = [c for c in range(n_classes) if c != y_noisy[i]]
        y_noisy[i] = rng.choice(other_classes)
    return y_noisy

for rate in [0.0, 0.05, 0.10, 0.20]:
    y_noisy = inject_noise(y_train, noise_rate=rate)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_noisy)
    acc = model.score(X_test, y_test)
    print(f"Noise rate {rate:.0%} -> Test accuracy: {acc:.4f}")

If your accuracy drops sharply at noise rates similar to what you estimated in your real dataset, you have strong justification to invest time in cleaning. If accuracy barely moves, noise may not be your bottleneck right now.

Strategies for Handling Noisy Labels

Once you've confirmed noise is a problem, you have several options depending on dataset size and annotation budget.

Remove flagged examples

The bluntest fix: drop the suspicious indices and retrain. This works well when your dataset is large enough that losing a few percent of samples doesn't starve the model of information.

clean_mask = np.ones(len(y_train), dtype=bool)
clean_mask[label_issues] = False

X_clean = X_train[clean_mask]
y_clean = y_train[clean_mask]

Downweight rather than remove

If you can't afford to discard examples, assign a sample weight inversely proportional to suspicion. Many scikit-learn estimators accept a sample_weight parameter.

weights = np.ones(len(y_train))
weights[label_issues] = 0.1  # Downweight, not remove

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train, sample_weight=weights)

Relabel with a second annotator

For high-value examples you don't want to lose, use the suspicious list as a queue for a second human review pass. You get more signal per annotation dollar by targeting suspected errors rather than random sampling.
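A review queue can be as simple as the flagged indices sorted by how little the model believes the given label, capped at your annotation budget. The sketch below is one way to build it; the function name, the dictionary fields, and the budget parameter are hypothetical, and labels are assumed to be integers 0 to K-1.

```python
import numpy as np

def build_review_queue(labels, pred_probs, flagged_idx, budget=100):
    """Return the most suspicious flagged samples, worst first."""
    labels = np.asarray(labels)
    flagged_idx = np.asarray(flagged_idx)
    given_conf = pred_probs[flagged_idx, labels[flagged_idx]]
    # Lowest confidence in the given label goes to the front of the queue
    order = np.argsort(given_conf)[:budget]
    return [
        {
            "index": int(i),
            "given_label": int(labels[i]),
            "suggested_label": int(pred_probs[i].argmax()),
            "given_label_confidence": float(pred_probs[i, labels[i]]),
        }
        for i in flagged_idx[order]
    ]

# Toy example with made-up probabilities
labels = np.array([0, 1, 1])
pred_probs = np.array([[0.2, 0.8], [0.9, 0.1], [0.4, 0.6]])
queue = build_review_queue(labels, pred_probs, flagged_idx=[0, 1], budget=2)
```

Showing the suggested label to the reviewer speeds things up but can also bias them, so it may be worth hiding that field for a blind second opinion.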

Use noise-robust loss functions

For deep learning workflows, symmetric cross-entropy and generalized cross-entropy losses are designed to be less sensitive to noisy labels. This is a model-side mitigation rather than a data-cleaning approach, and it pairs well with the detection methods above.
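As an illustration of why these losses are more tolerant, here is a NumPy sketch of generalized cross-entropy, (1 - p^q) / q, where p is the predicted probability of the given label. The value q = 0.7 is an assumed setting for the example. In a real deep learning workflow you would write the same formula on your framework's tensors so gradients flow through it; NumPy is used here only to show the numbers.

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-7):
    """Standard per-sample cross-entropy: -log(p_label)."""
    p = np.clip(probs[np.arange(len(labels)), labels], eps, 1.0)
    return -np.log(p)

def generalized_cross_entropy(probs, labels, q=0.7, eps=1e-7):
    """Per-sample GCE: (1 - p_label**q) / q, bounded above by 1/q."""
    p = np.clip(probs[np.arange(len(labels)), labels], eps, 1.0)
    return (1.0 - p ** q) / q

# Two samples labeled class 1: one strongly contradicted, one well fit
probs = np.array([[0.99, 0.01], [0.20, 0.80]])
labels = np.array([1, 1])

ce = cross_entropy(probs, labels)
gce = generalized_cross_entropy(probs, labels, q=0.7)
print("cross-entropy:", ce)
print("generalized  :", gce)
```

The strongly contradicted sample dominates under cross-entropy, while its generalized cross-entropy stays below 1/q, so a single wrong label pulls on the model less. As q approaches 0 the loss approaches ordinary cross-entropy.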

Common Pitfalls

Removing ambiguous examples, not just mislabeled ones. Hard examples near class boundaries also score as suspicious. Blindly dropping everything flagged by a loss heuristic can hurt generalization on genuinely difficult inputs. Always do a spot-check of flagged samples before mass deletion.
Using in-sample probabilities instead of out-of-fold. If the model has already seen an example during training, it can memorize the training label and assign it high confidence even when it's wrong, especially with flexible models. Use held-out predictions for any confidence-based detection.
Cleaning the test set with model output. Apply automated cleaning to training data only. If you drop or relabel test examples based on what a model predicts, the evaluation becomes biased toward that model and is no longer trustworthy.
Treating noise rate estimation as exact. Methods like Confident Learning give you an estimate, not a ground truth. Treat the output as a ranked list of candidates, not a certified list of errors.
Running one cleaning pass and stopping. Noise detection improves when you iterate: clean a round, retrain, re-detect. Two or three passes often surface errors that the first model was too noisy to find.
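The iterative pass from the last pitfall can be sketched as a loop that recomputes out-of-fold confidence on the examples still kept and stops when nothing new is flagged. This is a minimal sketch under assumptions: the function name, the 0.1 threshold, and the three-round cap are illustrative, X and y are NumPy arrays, labels are integers 0 to K-1, and every class still has enough examples for each fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def iterative_flagging(X, y, threshold=0.1, max_rounds=3, cv=5):
    """Repeatedly flag low-confidence labels among the examples still kept."""
    keep = np.ones(len(y), dtype=bool)
    history = []  # number of newly flagged examples per round
    for _ in range(max_rounds):
        idx = np.where(keep)[0]
        # Out-of-fold probabilities computed on the current kept set only
        probs = cross_val_predict(
            LogisticRegression(max_iter=1000),
            X[idx], y[idx], cv=cv, method="predict_proba",
        )
        conf = probs[np.arange(len(idx)), y[idx]]
        newly_flagged = idx[conf < threshold]
        history.append(len(newly_flagged))
        if len(newly_flagged) == 0:
            break  # flagged count has stabilized
        keep[newly_flagged] = False
    return keep, history
```

Calling keep, history = iterative_flagging(X_train, y_train) returns a boolean mask of retained examples plus the per-round counts. A count that keeps growing each round is a hint that the threshold is removing hard examples rather than errors.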

Wrapping Up

Label noise rarely announces itself. Your model just underperforms in ways that look like every other problem. The good news is that systematic detection is straightforward once you have out-of-fold predicted probabilities in hand.

Here are five concrete actions to take next:

Run the out-of-fold confidence scoring on your current training set and manually inspect the five percent of examples with the lowest confidence, starting from the bottom. If the dataset has real label noise, genuine errors tend to show up early in that list.
Use cleanlab's find_label_issues as a second pass to cross-check your manual inspection. Where both methods flag the same example, put it at the front of the review queue. Both signals come from the same out-of-fold probabilities, so agreement is strong evidence rather than proof.
Inject synthetic noise at your estimated real-world noise rate and measure the accuracy drop. Use this as the business case for investing annotation time in cleaning.
Adopt an iterative cleaning loop: clean, retrain, detect again, repeat until the flagged count stabilizes.
Document every example you remove or relabel, including the reason. This audit trail is useful when a colleague asks why certain samples aren't in the training set.
