Feature Importance With Correlated Inputs: How to Fix Misleading Scores

You trained a model, checked the feature importance chart, and the top features look plausible — until you notice that two nearly identical columns are both ranked near the bottom, each carrying half the weight they should. Correlation is splitting the credit between them, and your importance scores are lying to you.

This is one of the most common and quietly damaging problems in applied machine learning. The model may still predict well, but the explanation is wrong, and wrong explanations lead to wrong decisions about which features to collect, which to drop, and what the model is actually doing.

What you'll learn

Why correlated features distort standard feature importance scores
How to detect problematic correlations before they mislead you
The difference between impurity-based, permutation, and SHAP-based importance
Practical techniques to correct or account for correlation
When to drop features versus when to keep them and adjust your analysis

Why Correlation Breaks Feature Importance

Most tree-based models compute feature importance by measuring how much each feature reduces impurity across all splits — this is called mean decrease in impurity (MDI), or sometimes Gini importance. The fundamental problem: when two features carry the same information, the model can use either one at each split. It spreads its usage across both, so each feature looks less important than it actually is.

Imagine you have income_annual and income_monthly in your dataset, correlated at 0.99. The tree might split on income_annual at the top of some trees and on income_monthly at others. The actual predictive information is fully captured by either one, but the importance chart shows both at half-strength. If you use that chart to decide which features to invest in collecting, you undervalue income entirely.

The same effect happens with less obvious correlations: age and years of work experience, page views and session duration, temperature and humidity in sensor data. Any pair with an absolute Pearson correlation above roughly 0.7 is a candidate for this problem, and strong negative correlations count just as much as positive ones.
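The credit-splitting effect is easy to reproduce on synthetic data. The sketch below is an illustration, not a benchmark: the data, the noise column, and the mdi helper are all made up for the demonstration, and income_monthly is built as an exact rescaling of income_annual so the two columns carry identical information.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
income_annual = rng.normal(60_000, 15_000, n)
y_toy = (income_annual > 60_000).astype(int)  # label depends on income only

# Dataset A: income appears once. Dataset B: the same signal appears twice.
X_single = pd.DataFrame({
    "income_annual": income_annual,
    "noise": rng.normal(0, 1, n),
})
X_dup = X_single.assign(income_monthly=income_annual / 12)

def mdi(X, y):
    """Fit a forest and return its impurity-based importances as a Series."""
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    return pd.Series(forest.feature_importances_, index=X.columns)

mdi_single = mdi(X_single, y_toy)
mdi_dup = mdi(X_dup, y_toy)

print(mdi_single.round(3))
print(mdi_dup.round(3))
```

Compare the income_annual row in the two printouts. The underlying signal has not changed between the datasets; only the number of columns that carry it has.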

The Three Types of Feature Importance (and Their Blind Spots)

Impurity-based (MDI)

This is the default feature_importances_ attribute on scikit-learn's RandomForestClassifier and GradientBoostingClassifier. It is fast and intuitive, but it has two known problems: it favors high-cardinality features, and it splits importance across correlated features, often in arbitrary proportions. It is also computed from the training data, so it can reward features the model overfit to. Do not use it as your only signal.

Permutation importance

Permutation importance shuffles one feature at a time and measures the drop in model performance. This is more reliable than MDI for correlated features — but only partially. If two features are highly correlated, shuffling one of them doesn't fully break the signal, because the model can lean on the other. You'll still underestimate the combined importance of a correlated group.
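To see how far single-feature permutation can undershoot, here is a small synthetic check. Everything in it is an assumption made for the demonstration (the data, the variable names, and the choice of accuracy as the metric); it compares the drop from shuffling each income column alone with the drop from shuffling both with the same row order.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 3000
income_annual = rng.normal(60_000, 15_000, n)
X_demo = pd.DataFrame({
    "income_annual": income_annual,
    "income_monthly": income_annual / 12,  # exact rescaling of income_annual
    "noise": rng.normal(0, 1, n),
})
y_demo = (income_annual > 60_000).astype(int)  # label depends on income only

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
baseline_acc = forest.score(X_te, y_te)

# Standard permutation importance: one column shuffled at a time.
perm = permutation_importance(
    forest, X_te, y_te, n_repeats=10, random_state=0, scoring="accuracy"
)
single_drops = pd.Series(perm.importances_mean, index=X_te.columns)

# Joint shuffle: both income columns reordered with the same row permutation.
income_cols = ["income_annual", "income_monthly"]
X_joint = X_te.copy()
order = rng.permutation(len(X_joint))
X_joint[income_cols] = X_joint[income_cols].to_numpy()[order]
joint_drop = baseline_acc - forest.score(X_joint, y_te)

print(single_drops.round(3))
print(f"joint drop for income columns: {joint_drop:.3f}")
```

If the single-column drops come out much smaller than the joint drop, that gap is the importance a per-feature chart fails to show.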

SHAP values

SHAP (SHapley Additive exPlanations) assigns each feature a contribution to each individual prediction. It is theoretically grounded and handles nonlinear models well. With correlated features, SHAP distributes credit between them in a mathematically consistent way, but the distributed credit can still mask how important the underlying concept is. You need to read correlated SHAP values as a group, not individually.

Step 1 — Detect Correlation in Your Feature Set

Before you trust any importance chart, check the correlation structure. For continuous features, a heatmap of Pearson correlations is the fastest starting point.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assume X is your feature DataFrame; numeric_only=True (pandas 1.5+) skips non-numeric columns
corr_matrix = X.corr(numeric_only=True)

plt.figure(figsize=(12, 10))
sns.heatmap(
    corr_matrix,
    annot=False,
    cmap="coolwarm",
    center=0,
    vmin=-1,
    vmax=1
)
plt.title("Feature Correlation Matrix")
plt.tight_layout()
plt.show()

To get a quick list of pairs above a threshold rather than reading a heatmap:

import numpy as np

threshold = 0.75

# Keep only the upper triangle (k=1 excludes the diagonal) to avoid duplicate pairs.
# pd.np was removed from pandas, so call NumPy directly.
upper = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)

# Masked cells are NaN, and NaN > threshold is False, so they drop out here
high_corr_pairs = [
    (row, col, upper.loc[row, col])
    for col in upper.columns
    for row in upper.index
    if abs(upper.loc[row, col]) > threshold
]

for a, b, r in sorted(high_corr_pairs, key=lambda x: -abs(x[2])):
    print(f"{a} <-> {b}: {r:.3f}")

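For categorical or mixed data, or for nonlinear relationships that Pearson correlation misses, consider mutual information instead. Scikit-learn's mutual_info_classif and mutual_info_regression estimate mutual information between each feature and a target, so to compare two features you pass one of them as the target.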
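As a rough sketch of the mutual information route for a pair of continuous features: the pairwise_mi helper and the synthetic columns below are hypothetical, and the trick of passing one feature as the target of mutual_info_regression is an adaptation, since that function is designed for feature-versus-target scoring.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(-1, 1, n)
features = pd.DataFrame({
    "x": x,
    "x_squared": x ** 2 + rng.normal(0, 0.01, n),  # depends on x, but not linearly
    "unrelated": rng.normal(0, 1, n),
})

def pairwise_mi(df, a, b, random_state=0):
    """Estimate mutual information between two continuous columns."""
    # Column b plays the role of the target; the estimate is in nats.
    return mutual_info_regression(df[[a]], df[b], random_state=random_state)[0]

pearson_r = features["x"].corr(features["x_squared"])
mi_related = pairwise_mi(features, "x", "x_squared")
mi_unrelated = pairwise_mi(features, "x", "unrelated")

print(f"Pearson r (x, x_squared): {pearson_r:.3f}")
print(f"MI (x, x_squared):        {mi_related:.3f}")
print(f"MI (x, unrelated):        {mi_unrelated:.3f}")
```

The point of the toy data is that x_squared is fully determined by x while their Pearson correlation stays near zero, which is exactly the kind of redundancy a correlation heatmap would not flag.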

Step 2 — Compute Permutation Importance Correctly

Once you've trained your model, compute permutation importance on a held-out validation set, not the training set. Using training data inflates the scores for features the model memorized.

from sklearn.inspection import permutation_importance
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

result = permutation_importance(
    model, X_val, y_val,
    n_repeats=30,       # repeat shuffling to get stable estimates
    random_state=42,
    scoring="roc_auc"
)

importances = pd.Series(
    result.importances_mean,
    index=X.columns
).sort_values(ascending=False)

print(importances.head(15))

The n_repeats=30 setting matters. With fewer repeats, permutation importance has high variance and correlated features can produce inconsistent rankings across runs.
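One way to check whether the repeats have settled is to look at the spread as well as the mean. The object returned by permutation_importance also exposes importances (one row per feature, one column per repeat) and importances_std. The helper below is a hypothetical sketch that summarizes such an array and flags features whose mean is within two standard deviations of zero; the toy numbers and feature names are invented for illustration, not results from a real model.

```python
import numpy as np
import pandas as pd

def summarize_permutation(importances, feature_names):
    """importances: array of shape (n_features, n_repeats), like result.importances."""
    importances = np.asarray(importances, dtype=float)
    summary = pd.DataFrame(
        {
            "mean": importances.mean(axis=1),
            "std": importances.std(axis=1),
        },
        index=feature_names,
    )
    # Flag features whose score is not clearly above zero across repeats.
    summary["unstable"] = (summary["mean"] - 2 * summary["std"]) <= 0
    return summary.sort_values("mean", ascending=False)

# Made-up scores for three hypothetical features over four repeats.
toy_importances = np.array([
    [0.10, 0.12, 0.11, 0.09],
    [0.02, -0.01, 0.03, 0.00],
    [0.05, 0.05, 0.06, 0.04],
])
toy_summary = summarize_permutation(
    toy_importances, ["tenure", "page_views", "age"]
)
print(toy_summary)
```

With a real model you would call summarize_permutation(result.importances, X_val.columns). The two-standard-deviation cutoff is a rule of thumb, not a formal test.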

Step 3 — Use SHAP for Individual-Level Insight

SHAP values show you what each feature contributes to each prediction. This makes it easier to spot when two correlated features are both contributing in the same direction — a sign they're measuring the same thing.

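import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val)

# For binary classification, older SHAP releases return a list [class_0, class_1];
# newer releases return one array of shape (n_samples, n_features, n_classes).
# Either way, select the positive class (index 1).
if isinstance(shap_values, list):
    shap_pos = shap_values[1]
elif shap_values.ndim == 3:
    shap_pos = shap_values[:, :, 1]
else:
    shap_pos = shap_values

shap.summary_plot(shap_pos, X_val, plot_type="bar")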

Look at the summary plot and identify features with similar SHAP distributions. If income_annual and income_monthly have nearly identical bar heights, and the same directional pattern in the beeswarm plot (the default summary_plot, without plot_type="bar"), they're sharing credit for the same signal.
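Because SHAP values are additive, one way to read a correlated group as a unit is to sum its members' SHAP values row by row and then average the absolute totals. The helper below is a hypothetical sketch of that idea, not part of the shap library; it expects a 2-D array of SHAP values for a single class, and the toy matrix is invented for illustration.

```python
import numpy as np

def grouped_shap_importance(shap_matrix, feature_names, groups):
    """
    shap_matrix: array of shape (n_samples, n_features) for one output or class.
    groups: dict of {group_name: [feature names]}.
    Returns {group_name: mean absolute summed SHAP value}.
    """
    shap_matrix = np.asarray(shap_matrix, dtype=float)
    col_index = {name: i for i, name in enumerate(feature_names)}
    out = {}
    for group_name, members in groups.items():
        idx = [col_index[m] for m in members]
        # A group's contribution to one prediction is the sum of its
        # members' SHAP values for that row.
        group_contrib = shap_matrix[:, idx].sum(axis=1)
        out[group_name] = float(np.abs(group_contrib).mean())
    return out

# Made-up SHAP values: two samples, three features.
toy_shap = np.array([
    [0.20, 0.20, -0.10],
    [-0.30, -0.10, 0.05],
])
toy_names = ["income_annual", "income_monthly", "age"]
shap_groups = {"income": ["income_annual", "income_monthly"], "age": ["age"]}

grouped = grouped_shap_importance(toy_shap, toy_names, shap_groups)
print(grouped)
```

In the toy matrix the two income columns have mean absolute values of 0.25 and 0.15 on their own, but the income group as a whole contributes 0.40 per prediction on average, which is the number to compare against other features.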

Step 4 — Group Correlated Features and Measure Group Importance

The cleanest fix for correlated-feature credit-splitting is to measure the importance of the entire group at once. You shuffle all correlated features together and observe the performance drop. This tells you how much the group matters, regardless of how the model divided attention between its members.

import numpy as np
from sklearn.metrics import roc_auc_score

def group_permutation_importance(model, X, y, groups, n_repeats=20,
                                 scoring=roc_auc_score, random_state=None):
    """
    groups: dict of {group_name: [list of column names]}
    Returns a dict of {group_name: mean drop in score when the group is shuffled}
    Assumes a binary classifier with predict_proba and a pandas DataFrame X.
    """
    rng = np.random.default_rng(random_state)  # seedable for reproducible results
    baseline = scoring(y, model.predict_proba(X)[:, 1])
    results = {}

    for group_name, cols in groups.items():
        drops = []
        for _ in range(n_repeats):
            X_permuted = X.copy()
            # Shuffle all columns in the group together (same row permutation)
            perm_idx = rng.permutation(len(X_permuted))
            X_permuted[cols] = X_permuted[cols].to_numpy()[perm_idx]
            score = scoring(y, model.predict_proba(X_permuted)[:, 1])
            drops.append(baseline - score)
        results[group_name] = np.mean(drops)

    return results

# Example usage
feature_groups = {
    "income": ["income_annual", "income_monthly"],
    "engagement": ["page_views", "session_duration", "click_rate"],
    "age_experience": ["age", "years_experience"]
}

group_scores = group_permutation_importance(model, X_val, y_val, feature_groups)
for name, score in sorted(group_scores.items(), key=lambda x: -x[1]):
    print(f"{name}: {score:.4f}")

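Shuffling all group members with the same row permutation is the key detail. If you shuffle them independently, you break the correlation between them as well as their link to the target, so the model is scored on combinations that never occur in real data (an annual income that no longer matches the monthly one). That produces a different kind of distortion.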
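If you would rather not write the groups dict by hand, one option is to derive it from the correlation matrix by linking every pair above a threshold and taking connected components. The correlation_groups helper below is a hypothetical sketch of that approach; the 0.75 default and the toy matrix are assumptions for illustration.

```python
import pandas as pd

def correlation_groups(corr_matrix, threshold=0.75):
    """Group columns connected by |correlation| > threshold (connected components)."""
    cols = list(corr_matrix.columns)
    parent = {c: c for c in cols}

    def find(c):
        # Follow parent links to the root, compressing the path as we go.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if abs(corr_matrix.loc[a, b]) > threshold:
                parent[find(a)] = find(b)  # merge the two components

    components = {}
    for c in cols:
        components.setdefault(find(c), []).append(c)
    # Name each group after its first member; uncorrelated columns
    # come back as single-column groups.
    return {members[0]: members for members in components.values()}

toy_names = ["income_annual", "income_monthly", "age"]
toy_corr = pd.DataFrame(
    [[1.00, 0.99, 0.10],
     [0.99, 1.00, 0.05],
     [0.10, 0.05, 1.00]],
    index=toy_names, columns=toy_names,
)
auto_groups = correlation_groups(toy_corr)
print(auto_groups)
```

Connected components are transitive: if A links to B and B links to C, all three land in one group even when A and C are only weakly correlated. Review the resulting groups before passing them to group_permutation_importance, and rename them to something meaningful.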

Step 5 — Decide Whether to Drop, Combine, or Keep

Once you know a correlated group carries meaningful importance, you have three options:

Drop redundant features: If two features are correlated above 0.95 and they represent the same concept (like annual vs. monthly income), keep the one that is easier to collect or explain. The model should lose little or no predictive power, and your importance scores will be cleaner. Retrain and compare validation scores to confirm.
Combine them: Create a single feature via PCA, averaging, or domain knowledge. For example, a composite
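As a rough sketch of the averaging route for combining features: the helper below replaces a group of columns with the mean of their z-scores. The function name, the income_score column, and the toy data are hypothetical, and the approach assumes the columns are positively correlated (a negatively correlated member would need its sign flipped first).

```python
import pandas as pd

def combine_by_zscore_mean(df, cols, new_name):
    """Replace `cols` with a single column: the mean of their z-scores."""
    # Standardize each column so differing units (annual vs. monthly) do not matter.
    z = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)
    out = df.drop(columns=cols)
    out[new_name] = z.mean(axis=1)
    return out

# Made-up rows for illustration.
toy = pd.DataFrame({
    "income_annual": [36_000.0, 48_000.0, 60_000.0, 72_000.0],
    "income_monthly": [3_000.0, 4_000.0, 5_000.0, 6_000.0],
    "age": [25, 32, 41, 50],
})
combined = combine_by_zscore_mean(
    toy, ["income_annual", "income_monthly"], "income_score"
)
print(combined)
```

In a real pipeline, compute the means and standard deviations on the training split only and reuse them on the validation split, so the composite does not leak information across the split.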
