Why SHAP Values Flip Sign Across Similar Samples

You pull SHAP values for two rows that look nearly identical — same age, same income bracket, one digit different in one feature — and the explanation flips completely. Feature X contributes +0.4 to the prediction on the first row and -0.3 on the second. You stare at it for a while and wonder if the library is broken.

It isn't broken. But what's happening reveals something deep about how your model actually works, and ignoring it will lead you to give stakeholders explanations that are subtly wrong.

What you'll learn

How SHAP values are computed and why they are locally defined, not globally stable
The four main reasons a SHAP value can flip sign between similar rows
How interaction effects quietly drive sign flips in tree-based models
Practical debugging steps to trace a flip back to its root cause
When a sign flip is actually the correct, honest answer

Prerequisites

This article assumes you are already using SHAP in a Python project — either shap.TreeExplainer for gradient-boosted trees or shap.LinearExplainer / shap.KernelExplainer for other model types. Basic familiarity with how a prediction is decomposed into feature contributions will help. If you haven't read the original Lundberg & Lee paper, you don't need to — but understanding that SHAP values satisfy the local accuracy property (they sum to the difference between the prediction and the expected model output) is essential context.
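Here is a minimal sketch of local accuracy. It uses a hypothetical linear model with independent features, where SHAP values have a closed form: each feature's value is its weight times the feature's distance from its background mean. The weights, bias, and background rows are made up. The only point is that the baseline plus the SHAP values reproduces the prediction.

```python
# Hypothetical linear model f(x) = bias + sum(w_i * x_i); all numbers are made up.
weights = [0.5, -2.0, 1.0]
bias = 0.1
# Hypothetical background rows that define the baseline E[f(X)].
background = [
    [1.0, 0.2, 3.0],
    [2.0, 0.4, 1.0],
    [3.0, 0.6, 2.0],
]


def predict(row):
    return bias + sum(w * v for w, v in zip(weights, row))


def linear_shap(row):
    # For a linear model with independent features:
    # phi_i = w_i * (x_i - background mean of feature i)
    means = [sum(col) / len(col) for col in zip(*background)]
    return [w * (v - m) for w, v, m in zip(weights, row, means)]


x = [2.5, 0.3, 2.0]
phi = linear_shap(x)
expected_value = sum(predict(b) for b in background) / len(background)

# Local accuracy: baseline + sum of SHAP values equals the prediction.
print("prediction:", predict(x))
print("baseline + sum(phi):", expected_value + sum(phi))
```

Both lines print the same value (2.75, up to floating-point rounding).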

SHAP values are local, not global

The most important thing to internalize is that a SHAP value is not a property of a feature. It is a property of a specific prediction. When you compute SHAP for row A, you're asking: given this particular combination of inputs, how much did each feature move the output away from the baseline?

That baseline (the expected model output over your background dataset) stays fixed. But the path from that baseline to the final prediction runs through a nonlinear model surface. Two rows that differ by a small amount can sit on opposite sides of a decision boundary, a threshold, or a steep gradient in that surface. As a result, each feature can push the two predictions in different directions.

Think of two hikers standing one meter apart on either side of a ridge line. The ground slopes east under one and west under the other, even though they are almost in the same spot.

Reason 1: You're near a nonlinear boundary

Tree-based models (XGBoost, LightGBM, Random Forest) carve feature space into rectangular regions. When two samples sit in different leaf nodes, even with similar raw feature values, their SHAP attributions are computed from different parts of the tree structure.

Consider a gradient-boosted model predicting loan default. A feature like debt_to_income might have a sharp threshold at 0.43. A value of 0.42 lands in a low-risk leaf; 0.44 lands in a high-risk leaf. The SHAP value for debt_to_income flips from negative (reducing predicted risk) to positive (increasing it) across that boundary. The two samples look nearly identical to you, but to the model they are in entirely different neighborhoods.

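import numpy as np
import shap

# `model` is assumed to be an already-trained tree model, such as an
# xgboost.XGBClassifier, whose first column is debt_to_income followed by
# income and one more numeric feature.
DTI = 0  # column index of debt_to_income

# Two near-identical samples straddling a threshold
X_pair = np.array([
    [0.42, 55000, 3],  # debt_to_income just below threshold
    [0.44, 55000, 3],  # debt_to_income just above threshold
])

explainer = shap.TreeExplainer(model)
# For a single-output model (regression or binary XGBoost), the result has
# shape (n_samples, n_features).
shap_values = explainer.shap_values(X_pair)

print("Sample 1 SHAP for debt_to_income:", shap_values[0, DTI])
print("Sample 2 SHAP for debt_to_income:", shap_values[1, DTI])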

If the model has learned a split near 0.43, this will typically show a sign flip on debt_to_income and little change in the other features. That's the boundary at work.
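To see the effect without a trained model, the sketch below computes exact interventional Shapley values by brute force over every coalition. It uses a hypothetical one-split model, toy_model, with an invented four-row background. It stands in for what TreeExplainer computes efficiently and is not the library's algorithm. Brute force is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def toy_model(x):
    # Hypothetical one-split "tree": predicted default probability jumps when
    # debt_to_income (column 0) crosses 0.43. Columns 1 and 2 are ignored.
    return 0.8 if x[0] > 0.43 else 0.2


# Hypothetical background rows: half below and half above the threshold.
background = [[0.30, 50000, 2], [0.50, 60000, 4], [0.35, 55000, 3], [0.60, 70000, 5]]


def coalition_value(x, coalition):
    """Mean output with features in `coalition` taken from x, the rest from each background row."""
    outputs = [toy_model([x[j] if j in coalition else b[j] for j in range(len(x))])
               for b in background]
    return sum(outputs) / len(outputs)


def exact_shap(x):
    """Brute-force interventional Shapley values (fine for a handful of features)."""
    n = len(x)
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (coalition_value(x, s | {i}) - coalition_value(x, s))
        phis.append(phi)
    return phis


for row in ([0.42, 55000, 3], [0.44, 55000, 3]):
    print(row, [round(p, 3) for p in exact_shap(row)])
```

Because toy_model only looks at debt_to_income, the other two columns get zero credit. The debt_to_income value moves from about -0.3 to about +0.3 as the row crosses the threshold. In both cases the attributions sum to the prediction minus the background average of 0.5.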

Reason 2: Interaction effects are redistributing credit

This one catches people off guard. Even if a feature's value doesn't change between two rows, its SHAP value can flip because another feature changed and they interact inside the model.

Suppose your model uses both age and employment_type. Tree models often contain a path of splits that amounts to: if age > 35 and employment_type == 'self-employed', increase the risk score. For a 36-year-old who is self-employed, age might get a large positive SHAP contribution. For a 36-year-old who is salaried, the same age value can get a near-zero or even negative contribution, because that branch of the tree never fires.
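A small two-feature sketch makes this concrete. The toy_risk function, its branches, and the four background rows are all invented. With only two features, the exact Shapley values and the single pairwise interaction term can be written out directly from the four coalition values.

```python
# Hypothetical toy model (not a trained model): age matters only through its
# interaction with employment type (1 = self-employed, 0 = salaried).
def toy_risk(age, self_employed):
    if self_employed:
        return 1.0 if age > 35 else 0.0   # older self-employed: high risk
    return 0.0 if age > 35 else 0.4       # younger salaried: moderate risk


# Hypothetical background rows (age, self_employed) that define the baseline.
BACKGROUND = [(30, 0), (40, 0), (30, 1), (40, 1)]


def value(x, keep):
    """Average output with features in `keep` fixed to x and the rest taken
    from each background row (interventional expectation)."""
    outs = [toy_risk(*[x[j] if j in keep else b[j] for j in range(2)])
            for b in BACKGROUND]
    return sum(outs) / len(outs)


def two_feature_shap(x):
    """Exact Shapley values plus the age x employment interaction for 2 features."""
    v0, va, ve, vae = value(x, set()), value(x, {0}), value(x, {1}), value(x, {0, 1})
    phi_age = 0.5 * ((va - v0) + (vae - ve))
    phi_emp = 0.5 * ((ve - v0) + (vae - va))
    interaction = 0.5 * (vae - va - ve + v0)  # one off-diagonal cell
    return phi_age, phi_emp, interaction, phi_age - interaction  # last = age main effect


for row in [(36, 1), (36, 0)]:
    phi_age, phi_emp, inter, main_age = two_feature_shap(row)
    print(row, f"age={phi_age:+.3f} employment={phi_emp:+.3f} "
               f"age main={main_age:+.3f} age x employment={inter:+.3f}")
```

With these made-up numbers, age gets about +0.325 for the self-employed row and about -0.025 for the salaried row. Its main effect is +0.15 in both rows. The flip comes entirely from the interaction term, which changes from +0.175 to -0.175.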

The SHAP library can expose this directly through interaction values:

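# Requires TreeExplainer with a tree model. Assumes `explainer` and `X_pair`
# now come from a model whose column 1 is age and column 2 is a numerically
# encoded employment_type (hypothetical column positions).
AGE, EMPLOYMENT = 1, 2
shap_interaction = explainer.shap_interaction_values(X_pair)

# For a single-output model the result has shape (n_samples, n_features, n_features):
# diagonal = main effects, off-diagonal = pairwise interactions
print("Age main effect, sample 1:", shap_interaction[0, AGE, AGE])
print("Age x employment interaction, sample 1:", shap_interaction[0, AGE, EMPLOYMENT])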

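Each row of a sample's interaction matrix sums to that feature's ordinary SHAP value. Each pairwise interaction is split evenly between the (i, j) and (j, i) cells, so the full interaction between two features is twice one off-diagonal entry. If the off-diagonal values in a feature's row are large relative to its diagonal entry, interaction redistribution is the primary driver of your sign flip. This is common in boosted trees that are allowed to grow deep.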
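Once you have the tensor from shap_interaction_values, a small helper can separate a feature's main effect from its total interaction for every row. This sketch assumes a single-output model, so the result is one array of shape (n_samples, n_features, n_features). The demo tensor is hand-written to illustrate the pattern and is not real model output.

```python
import numpy as np


def main_vs_interaction(interactions, feature):
    """Split one feature's SHAP value into its main effect and total interaction.

    `interactions` is assumed to have shape (n_samples, n_features, n_features).
    Row `feature` of each matrix sums to that feature's SHAP value.
    """
    rows = interactions[:, feature, :]            # (n_samples, n_features)
    main = rows[:, feature]                       # diagonal entry
    total_interaction = rows.sum(axis=1) - main   # sum of off-diagonal entries
    return main, total_interaction


# Illustrative tensor (hand-written, not real model output): two rows,
# two features (0 = age, 1 = employment_type).
demo = np.array([
    [[0.150, 0.175], [0.175, 0.150]],
    [[0.150, -0.175], [-0.175, -0.150]],
])
main, inter = main_vs_interaction(demo, feature=0)
print("age main effect: ", main)
print("age interactions:", inter)
print("age SHAP value:  ", main + inter)
```

In the demo, the age main effect is 0.15 in both rows, while the summed interaction moves from +0.175 to -0.175. That is the signature of an interaction-driven flip.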

Reason 3: The reference baseline doesn't represent your sample well

SHAP values measure deviation from a background distribution. If your background dataset (the data you pass to TreeExplainer or KernelExplainer at initialization) is not representative of the subgroup your two samples belong to, the baseline can be misleading.

Imagine you trained on a balanced dataset but your explainer background is dominated by one class. The expected model output will be skewed, and features that are typical for the minority subgroup will appear to have unusually large SHAP magnitudes — sometimes in unexpected directions — just because the baseline is far from them.
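With a single input feature, the SHAP value reduces to the prediction minus the average prediction over the background, which makes the baseline's influence easy to see. In this sketch the risk curve and both background sets are hypothetical. The same row gets a positive attribution against a representative background and a negative one against a background dominated by high debt-to-income applicants.

```python
def risk(dti):
    # Hypothetical smooth risk curve, increasing in debt_to_income.
    return dti ** 2


def single_feature_shap(x, background):
    # With one feature, SHAP = f(x) - average of f over the background rows.
    baseline = sum(risk(b) for b in background) / len(background)
    return risk(x) - baseline


representative = [0.2, 0.3, 0.4, 0.5, 0.6]  # spans the population
skewed = [0.5, 0.6, 0.6, 0.7, 0.7]          # dominated by high-DTI applicants

row = 0.45
print("representative background:", single_feature_shap(row, representative))
print("skewed background:        ", single_feature_shap(row, skewed))
```

The model and the row are identical in both calls. Only the baseline moves, and the sign of the attribution moves with it.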

A practical fix is to stratify your background sample:

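import shap
from sklearn.model_selection import train_test_split

# Draw a 100-row background stratified on the label, so class proportions
# match the training data. X_train, y_train, and model are assumed to exist.
background, _, _, _ = train_test_split(
    X_train, y_train, train_size=100, stratify=y_train, random_state=42
)
explainer = shap.TreeExplainer(model, background)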

This won't eliminate sign flips caused by genuine model nonlinearity, but it removes baseline drift as a confounding factor.

Reason 4: Correlated features splitting the attribution

When two features are highly correlated, SHAP has to distribute credit across both of them. How it splits that credit depends on which of the two features the trees actually split on along the row's path. For sampling-based explainers such as KernelExplainer, it also depends on which random coalitions were drawn.

Two similar rows can follow different paths through the trees. One feature might then absorb most of the positive contribution in row A, while its correlated partner absorbs it in row B. The combined contribution of the pair barely changes, but the individual signs can flip as credit moves between the correlated features.

This is a known limitation of SHAP when features are not independent. If this is what you're seeing, the practical response is to group correlated features and interpret them together rather than individually. Typically that means summing the SHAP values across each feature cluster before reporting them.
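Summing attributions within a group takes only a few lines once you know which columns belong together. The helper below is a sketch and is not part of the SHAP API. It takes a SHAP matrix of shape (n_samples, n_features) and a hypothetical mapping from group names to feature names. The demo values are illustrative.

```python
import numpy as np


def group_shap(shap_values, feature_names, groups):
    """Sum SHAP columns within each group of correlated features.

    shap_values: array of shape (n_samples, n_features).
    groups: dict mapping a group label to the feature names it contains.
    """
    index = {name: i for i, name in enumerate(feature_names)}
    return {
        label: shap_values[:, [index[name] for name in members]].sum(axis=1)
        for label, members in groups.items()
    }


# Illustrative values only: credit moves between two hypothetical correlated
# income columns, but the group total is the same for both rows.
names = ["annual_income", "monthly_income", "age"]
phi = np.array([
    [0.30, -0.05, 0.10],
    [-0.04, 0.29, 0.10],
])
grouped = group_shap(phi, names, {"income": ["annual_income", "monthly_income"],
                                  "age": ["age"]})
print(grouped)
```

Both rows get the same total of 0.25 for the income group, even though the individual columns swap signs.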

How to debug a sign flip systematically

When you spot a flip, work through this checklist before drawing conclusions:

Check the raw prediction values. Are the two samples actually predicted differently? If the model output is nearly the same, a sign flip in individual SHAP values is just internal redistribution — the net is stable.
Isolate the differing feature. Compare the two rows feature by feature on the model's actual inputs to identify exactly which values differ. Sometimes what looks like a similar sample actually has a categorical feature encoded differently after preprocessing.
Plot the SHAP dependence plot for the suspect feature. shap.dependence_plot('feature_name', shap_values, X) will show you whether there's a clear nonlinear break or whether coloring by an interaction feature reveals the real driver.
Compute interaction values. Use shap_interaction_values (TreeExplainer only) to confirm whether the flip is main-effect driven or interaction-driven.
Verify your background dataset. Confirm it is representative of the data range your samples sit in.

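import shap

# Dependence plot to visualize the sign-flip region. Assumes shap_values was
# computed on the full dataset and X_display is a DataFrame with the same rows
# and named columns (including employment_type).
shap.dependence_plot(
    "debt_to_income",
    shap_values,
    X_display,
    interaction_index="employment_type"  # color by a suspected interacting feature
)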
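The first two checklist steps are easy to automate. This hypothetical helper takes two preprocessed rows, their SHAP values, and the explainer's base value (for TreeExplainer, explainer.expected_value). It rebuilds each prediction from local accuracy and lists which inputs differ and which attributions flipped sign. The feature names and numbers in the demo are invented.

```python
def compare_rows(names, row_a, row_b, shap_a, shap_b, base_value):
    """Summarize how two rows differ in inputs, predictions, and attributions.

    Predictions are rebuilt from local accuracy: base_value + sum of SHAP values.
    Pass rows after preprocessing, since that is what the model actually sees.
    """
    pred_a = base_value + sum(shap_a)
    pred_b = base_value + sum(shap_b)
    return {
        "prediction_gap": pred_b - pred_a,
        "differing_inputs": [n for n, a, b in zip(names, row_a, row_b) if a != b],
        "sign_flips": [n for n, a, b in zip(names, shap_a, shap_b) if a * b < 0],
    }


# Hypothetical feature names and illustrative SHAP values (log-odds scale).
report = compare_rows(
    names=["debt_to_income", "income", "num_accounts"],
    row_a=[0.42, 55000, 3],
    row_b=[0.44, 55000, 3],
    shap_a=[-0.31, 0.02, 0.01],
    shap_b=[0.28, 0.02, 0.01],
    base_value=-1.2,
)
print(report)
```

A tiny prediction_gap paired with a non-empty sign_flips list points to redistribution between features. A large gap means the model really treats the two rows differently.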

When the sign flip is correct and expected

Not every sign flip is a problem to fix. If your model genuinely learned that a feature is beneficial in one context and harmful in another, SHAP is doing its job by reporting that honestly. A feature like prior_claims in an insurance model might reduce predicted risk for people with one prior claim (perhaps the model associates it with experience navigating the claims process) but strongly increase it for people with five. The SHAP value for the same feature can legitimately have different signs for those two groups.

The mistake is assuming that a feature should always push the prediction in one direction. That assumption comes from linear models, where each coefficient has a single fixed sign. It does not carry over to the nonlinear models most teams are running today. If stakeholders expect a single consistent direction for each feature, reset that expectation with a simple explanation: the model learned conditional relationships, and SHAP faithfully reports them.

SHAP tells you what the model did on this row. It doesn't promise to tell you a simple story about the feature in general.

Common pitfalls to avoid

Averaging SHAP values across groups to explain a sign flip. Averaging washes out the local information SHAP was designed to provide. If you want a global view, use shap.summary_plot on the full population, not a mean of two opposing rows.
Assuming preprocessing is identical for both rows. Check that your pipeline — scalers, encoders, imputers — produces the same transformation for both. A one-hot encoder behaving differently for an unseen category is a common hidden culprit.
Using KernelExplainer when TreeExplainer is available. KernelExplainer is slower and estimates SHAP values from sampled coalitions, so its output carries sampling noise. A sign flip on a small KernelExplainer value can be noise rather than model behavior. Use TreeExplainer for tree models, since it computes SHAP values exactly.
Comparing SHAP values from different explainer instances. If your two rows were run through explainers initialized on different background datasets, the baselines differ and the values are not comparable.
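To see why averaging hides sign flips, compare a plain mean of SHAP values with the mean absolute value. The bar form of shap.summary_plot ranks features by the mean absolute value. The matrix here is hand-written for illustration.

```python
import numpy as np

# Illustrative SHAP matrix (hand-written): feature 0 flips sign across rows,
# feature 1 is small and consistent.
phi = np.array([
    [0.30, 0.05],
    [-0.30, 0.05],
    [0.25, 0.04],
    [-0.25, 0.06],
])

signed_mean = phi.mean(axis=0)          # opposite signs cancel out
mean_abs = np.abs(phi).mean(axis=0)     # magnitude-based global importance
print("mean SHAP:  ", signed_mean)
print("mean |SHAP|:", mean_abs)
```

The signed mean makes feature 0 look irrelevant. The mean absolute value shows that it is the most influential feature.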

Wrapping up

SHAP sign flips are not bugs. They are messages from your model telling you that a feature's role is context-dependent. Understanding why they happen makes you a more credible communicator of model behavior — both to technical teammates and to stakeholders who want simple answers.

Here are concrete next steps:

Run shap.dependence_plot on any feature showing sign flips and look for discontinuities or strong coloring by an interaction feature.
For tree models, compute shap_interaction_values on your two problem rows and compare the diagonal versus off-diagonal magnitudes.
Audit your background dataset for class imbalance or distribution mismatch against the rows you're explaining.
If correlated features are splitting attribution in confusing ways, define feature groups and sum attributions within each group before reporting to stakeholders.
Document sign-flip behavior as a model characteristic in your model card or explainability report so downstream users understand the local nature of SHAP.
