Pandas melt and stack Producing Duplicate Rows: Reshaping Pitfalls Fixed
You call df.melt() or df.stack(), glance at the output, and the row count is double what you expected. The data looks fine on the surface, but something is clearly wrong. This is one of those bugs that wastes an afternoon if you don't know where to look.
What you'll learn
- Why
meltandstacksilently produce duplicate rows - How a non-unique index is the most common hidden culprit
- How to detect duplicates before they corrupt your analysis
- Concrete fixes for each scenario, with working code
- How to validate your reshaped DataFrame before moving on
Prerequisites
You should be comfortable with basic Pandas operations — creating DataFrames, selecting columns, and using groupby. The examples below use Pandas 1.5 or later, though the pitfalls exist in older versions too. Python 3.8+ is assumed throughout.
A Quick Recap of melt and stack
melt converts a wide DataFrame into a long one. You pick one or more identifier columns (id_vars) and the rest get "melted" into a single variable column and a value column. It's the go-to tool when you need tidy data for plotting or analysis.
stack rotates the innermost column level of a DataFrame down into the row index, turning column labels into a new index level. Both operations change shape dramatically, and both share a common set of traps.
The Most Common Culprit: A Non-Unique Index
When you import data from a CSV, database query, or a previous merge, Pandas assigns an integer index automatically — but any filtering, concatenation, or reset can leave that index in a state where values repeat. When melt or stack operates on a DataFrame with a non-unique index, it can multiply rows in ways that feel like magic.
Here's a minimal example:
import pandas as pd
# Simulate a DataFrame where the index has been corrupted
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'jan': [100, 200, 300],
'feb': [110, 210, 310],
}, index=[0, 1, 0]) # index 0 appears twice
melted = df.melt(id_vars='name', var_name='month', value_name='sales')
print(melted)
print(f"Row count: {len(melted)}")
You'd expect 6 rows (3 people × 2 months). You still get 6 here, but the index duplication becomes a problem the moment you try to join, merge, or unstack the result later. The real explosion happens with stack:
stacked = df.set_index('name', append=True).stack()
print(stacked)
Because the outer index (0, 1, 0) is not unique, Pandas treats the two rows with index 0 as distinct entries but carries that ambiguity into every downstream operation. Fix it before you reshape:
df = df.reset_index(drop=True) # Give every row a clean, unique integer index
Duplicate Columns Feeding melt
If your DataFrame has two columns with the same name — which is more common than you'd think after a pd.concat across DataFrames that share a column — melt will produce a row for each occurrence of that column name, doubling your output.
import numpy as np
df_a = pd.DataFrame({'id': [1, 2], 'score': [80, 90]})
df_b = pd.DataFrame({'id': [1, 2], 'score': [85, 95]}) # same column name
combined = pd.concat([df_a, df_b], axis=1)
print(combined.columns.tolist()) # ['id', 'score', 'id', 'score']
melted = combined.melt(id_vars='id')
print(f"Row count: {len(melted)}") # 8 rows instead of 4
Check for duplicate column names before melting:
assert combined.columns.is_unique, f"Duplicate columns found: {combined.columns[combined.columns.duplicated()].tolist()}"
If you intentionally combined two DataFrames and want to keep both score columns, rename them first:
combined.columns = ['id', 'score_a', 'id_b', 'score_b']
# then drop or handle 'id_b' as needed
The value_vars Trap in melt
When you pass value_vars explicitly to melt, you're telling Pandas which columns to unpivot. If a column name appears in both id_vars and value_vars, Pandas will include it twice in the output — once as an identifier and once as a value — producing rows that look like duplicates but aren't exactly.
df = pd.DataFrame({
'region': ['North', 'South'],
'q1': [500, 600],
'q2': [550, 620],
})
# Accidental overlap: 'region' in both id_vars and value_vars
bad_melt = df.melt(id_vars=['region'], value_vars=['region', 'q1', 'q2'])
print(bad_melt)
The fix is to make sure id_vars and value_vars are mutually exclusive:
id_cols = ['region']
val_cols = [c for c in df.columns if c not in id_cols]
good_melt = df.melt(id_vars=id_cols, value_vars=val_cols, var_name='quarter', value_name='revenue')
print(good_melt)
MultiIndex Mismatches with stack
When your DataFrame has a MultiIndex on the columns — common after a groupby().agg() with multiple functions — calling stack() without specifying the level can rotate the wrong level and produce far more rows than intended.
df = pd.DataFrame({
('sales', 'mean'): [100, 200],
('sales', 'sum'): [1000, 2000],
('returns', 'mean'): [10, 20],
('returns', 'sum'): [100, 200],
}, index=['StoreA', 'StoreB'])
df.columns = pd.MultiIndex.from_tuples(df.columns)
# Stacking level 0 rotates metric names ('sales', 'returns') into rows
stacked_l0 = df.stack(level=0)
print(stacked_l0)
# Stacking level 1 rotates aggregation functions into rows
stacked_l1 = df.stack(level=1)
print(stacked_l1)
Neither result is wrong on its own, but if you call stack() without a level argument on a MultiIndex column DataFrame, the default is to stack the innermost level, which may not be what you want. Always pass level explicitly when working with MultiIndex columns.
NaN Rows After stack — and Why They Matter
By default, stack() drops rows where all values are NaN. That sounds helpful, but if your DataFrame has a sparse MultiIndex — where not every combination of row index and column index has data — this default behavior can make your output look like it has missing rows rather than duplicates.
The flip side: if you call stack(dropna=False), you'll see every combination, including NaN-filled rows you didn't expect, which can mimic the "duplicate" problem.
df_sparse = pd.DataFrame({
'metric_a': [1.0, None],
'metric_b': [None, 2.0],
}, index=['row1', 'row2'])
print(df_sparse.stack()) # drops NaN rows: 2 rows
print(df_sparse.stack(dropna=False)) # keeps NaN rows: 4 rows
Choose dropna=False only when you need to account for all combinations explicitly, such as when you're going to fill NaN values in the next step.
How to Detect Duplicates Before They Damage Your Analysis
Build a validation step into your reshaping workflow. It takes three lines and has saved hours of debugging time.
def validate_reshaped(df: pd.DataFrame, key_cols: list, label: str = "") -> pd.DataFrame:
"""Assert that the combination of key_cols is unique in df."""
dupes = df.duplicated(subset=key_cols, keep=False)
if dupes.any():
print(f"[{label}] WARNING: {dupes.sum()} duplicate rows on {key_cols}")
print(df[dupes].head(10))
else:
print(f"[{label}] OK — {len(df)} rows, no duplicates on {key_cols}")
return df
# Usage after melt
melted = df.melt(id_vars='name', var_name='month', value_name='sales')
validate_reshaped(melted, key_cols=['name', 'month'], label='melt result')
Run this after every reshape operation until you're confident in the pipeline. When you find duplicates early, you fix the upstream data issue rather than spending hours diagnosing corrupted aggregations downstream.
Common Pitfalls at a Glance
| Pitfall | Symptom | Fix |
|---|---|---|
| Non-unique index | Row count is correct but joins multiply rows | df.reset_index(drop=True) |
| Duplicate column names | melt output is a multiple of expected rows | Assert df.columns.is_unique, rename before melting |
| id_vars / value_vars overlap | Identifier column appears as a value row | Build value_vars as set difference of all columns minus id_vars |
| Wrong stack level on MultiIndex | Unexpected number of rows or columns | Pass level= explicitly |
| stack dropna default | Sparse data appears to have missing or extra rows | Set dropna=False and fill NaN deliberately |
Wrapping Up
Duplicate rows from melt or stack are almost always caused by something upstream: a dirty index, duplicate column names, or ambiguous arguments. The operations themselves are doing exactly what you told them to do.
Here are four concrete actions to take right now:
- Reset the index before reshaping. Add
df = df.reset_index(drop=True)as a standard first step in any reshape pipeline you didn't build from scratch. - Assert column uniqueness before calling melt. One assertion line prevents hours of downstream confusion.
- Build
value_varsprogrammatically. Never hard-code the column list if your data source can change shape over time. - Always pass
level=explicitly to stack when working with MultiIndex columns. The default is fine for simple DataFrames, but explicit is always safer. - Add a validation function to your pipeline. Check uniqueness on the logical key columns immediately after every reshape operation.
📤 Share this article
Sign in to saveRelated Articles
Comments (0)
No comments yet. Be the first!