Fixing Python Pandas concat That Resets Your Index and Creates Duplicates
You call pd.concat() on a handful of DataFrames, print the result, and find row numbers that jump from 0 back to 0 again, or a DataFrame that's suddenly three times bigger than it should be. Neither of those is a bug in Pandas itself β both are predictable consequences of how concat works by default, and both are fixable in seconds once you know which knob to turn.
What You'll Learn
- Why
pd.concat()resets the index and when that's actually fine - How to use
ignore_index,verify_integrity, andkeysto control index behavior - The most common source of unexpected duplicate rows after a concat
- How to audit and remove duplicates that have already crept in
- When you should reach for
merge()instead ofconcat()
Prerequisites
You need Python 3.8 or later and Pandas 1.3 or later. The examples also assume you're comfortable creating DataFrames and reading basic tracebacks. No NumPy knowledge is required beyond understanding that Pandas is built on top of it.
What's Actually Happening When concat Misbehaves
pd.concat() is a stacking operation. You hand it a list of DataFrames and it physically appends them, row by row (or column by column if you set axis=1). It does not try to align or deduplicate anything on its own. Whatever indexes those individual DataFrames carry, they all come along for the ride.
That design is intentional. Pandas has no way of knowing whether your index values are meaningful primary keys or just the default 0, 1, 2 range that every new DataFrame starts with. So it makes no assumptions and touches nothing unless you tell it to.
The two symptoms you'll hit most often flow directly from this:
- Index reset / overlap β three DataFrames with indexes 0β9 produce a combined DataFrame with three separate 0β9 sequences, for a final index of [0,1,2,...9, 0,1,2,...9, 0,1,2,...9].
- Duplicate rows β if the same DataFrame (or the same CSV file) ends up in your list twice, every row appears twice in the output.
Why concat Resets the Index by Default
"Resets" is slightly misleading. concat doesn't actively reset anything β it preserves the existing index of every input DataFrame. When all your inputs share the default RangeIndex (0, 1, 2 β¦), stacking them produces an index with repeated values. That's not a reset; it's faithful preservation of something you probably didn't intend to keep.
Here's a minimal example to make this concrete:
import pandas as pd
df1 = pd.DataFrame({'value': [10, 20, 30]})
df2 = pd.DataFrame({'value': [40, 50, 60]})
result = pd.concat([df1, df2])
print(result)
print(result.index.tolist())
Output:
value
0 10
1 20
2 30
0 40
1 50
2 60
[0, 1, 2, 0, 1, 2]
The index [0, 1, 2, 0, 1, 2] is not a mistake β it's exactly what Pandas promised. The problem is that downstream code often expects a unique, monotonically increasing index. Anything that calls df.loc[0] now returns two rows instead of one, which is rarely what you want.
Fixing the Reset Index with ignore_index
If your index has no intrinsic meaning (it's just position), pass ignore_index=True. Pandas discards all input indexes and assigns a fresh RangeIndex to the result.
result = pd.concat([df1, df2], ignore_index=True)
print(result.index.tolist())
# [0, 1, 2, 3, 4, 5]
One line, clean sequential index. This is the right fix roughly 80% of the time when you're combining data from multiple files or API pages where the individual indexes were never meaningful.
A common place this saves you is after reading multiple CSV files in a loop:
import glob
frames = [pd.read_csv(f) for f in glob.glob('data/chunk_*.csv')]
full_df = pd.concat(frames, ignore_index=True)
Without ignore_index=True, every file's 0-based index stacks on top of the previous one. With it, you get a single clean 0-to-N index over all rows.
When You Want to Keep the Original Index
Sometimes the original index is meaningful β a date, a user ID, a database primary key. Throwing it away with ignore_index=True would lose information you need. In that case, keep the index but use the keys parameter to add a level of labeling so you can trace which source each row came from.
df_jan = pd.DataFrame({'revenue': [100, 200]}, index=['2024-01-01', '2024-01-02'])
df_feb = pd.DataFrame({'revenue': [150, 250]}, index=['2024-02-01', '2024-02-02'])
result = pd.concat([df_jan, df_feb], keys=['January', 'February'])
print(result)
Output:
revenue
January 2024-01-01 100
2024-01-02 200
February 2024-02-01 150
2024-02-02 250
The result has a MultiIndex. The outer level tells you which source the row came from; the inner level preserves the original date index. You can drop back to a flat index later with result.reset_index(level=0, drop=True) if you no longer need the source label.
If your dates or IDs are already unique across all inputs, you can also just call result.reset_index(drop=False) after a plain concat to get a clean RangeIndex while moving the original index into a column for safekeeping.
Duplicate Rows: The Other concat Trap
A reordered index is annoying; duplicate rows are actively dangerous. Aggregations silently double-count, joins explode in size, and reports look plausible but are wrong. There are three common ways duplicates enter a concat operation:
1. The same source appears in the list twice
This happens more than you'd expect, especially in scripts that build a list of DataFrames inside a loop and accidentally read the same file twice, or append the same variable twice.
# Bug: df1 added twice
result = pd.concat([df1, df2, df1], ignore_index=True)
Every row from df1 now appears twice. The fix is auditing what's in your list before you concatenate.
2. Overlapping date ranges in time-series data
If you pull monthly snapshots and the export windows overlap by a day, that boundary date appears in both DataFrames. After concat it appears twice in the result, but looks perfectly innocent in the index because the dates match.
3. concat with a merge-like intention
Developers sometimes reach for concat when they actually need merge. If you want to join DataFrames on a shared key column, concat will stack rows rather than match them β creating duplicates for any rows that were already in both DataFrames. More on this in the section below.
Catching Duplicates Before They Spread
The cleanest defense is verify_integrity=True. It makes concat raise a ValueError immediately if the resulting index contains any duplicate values.
try:
result = pd.concat([df1, df2], verify_integrity=True)
except ValueError as e:
print(f"Duplicate index detected: {e}")
Use this in data pipelines and ETL scripts where silently producing duplicates would corrupt downstream output. The cost is a slight performance overhead on large DataFrames, so disable it once you're confident in the pipeline and save it for debugging sessions.
If duplicates have already entered your DataFrame, the standard two-step cleanup is:
# Check how many duplicates exist
print(result.duplicated().sum())
# Drop exact row duplicates, keep the first occurrence
result = result.drop_duplicates()
# Or if duplicates are key-based (same ID, different data), be explicit:
result = result.drop_duplicates(subset=['user_id', 'date'], keep='last')
Always pass subset when you have a logical primary key. Without it, drop_duplicates() only removes rows that are identical across every column, which misses near-duplicates that differ in a trailing whitespace or a float precision artifact. For a related deduplication scenario, see how non-unique keys cause duplicate rows in Pandas merge() β the diagnostic approach is the same.
concat vs merge: Choosing the Right Tool
concat stacks DataFrames physically. merge aligns them on shared key values, like a SQL JOIN. Picking the wrong one is the root cause of many unexpected duplicate situations.
| Situation | Right function |
|---|---|
| Combining rows from multiple sources with the same columns | pd.concat(axis=0) |
| Combining new columns onto existing rows | pd.concat(axis=1) or merge |
| Joining two DataFrames on a shared key column | df.merge() |
| Appending new rows to the bottom of a DataFrame | pd.concat(axis=0) |
If you mean to align on a key but use concat instead, the key column does not act as a join key β Pandas will just stack rows independently, which typically creates a bloated result with lots of NaN values where the columns don't line up, or silently doubles up rows that existed in both frames.
For data cleaning workflows where you're reading from a spreadsheet before the concat step, the pitfalls in reading Excel files where the header isn't on row one are worth reviewing too β malformed headers cause column mismatches that make concat produce far more NaN columns than expected.
Common Pitfalls and Gotchas
Concatenating along axis=1 with mismatched row counts
When you set axis=1 to add columns side by side, Pandas aligns on the row index. If the two DataFrames have different numbers of rows or different index values, you'll get NaN-filled gaps. Always check that both DataFrames share the same index before a column-wise concat, or use join='inner' to keep only matching rows.
result = pd.concat([df_a, df_b], axis=1, join='inner')
Mixing dtypes silently downgrades columns
If one DataFrame has an integer column and another has the same column as float (or object), concat will upcast to the broader dtype. An integer ID column can silently become a float, turning 1001 into 1001.0. Check dtypes on the result with result.dtypes and cast back explicitly if needed.
Empty DataFrames in the list
An empty DataFrame in your list is harmless most of the time, but it can affect column dtype inference if the empty frame was created without specifying dtypes. Filter empty frames out before concatenating:
frames = [df for df in frames if not df.empty]
result = pd.concat(frames, ignore_index=True)
Forgetting to reset the index after drop_duplicates
After calling drop_duplicates(), the index still has gaps where rows were removed (e.g., [0, 1, 3, 5, 6]). If anything downstream assumes a contiguous index, call reset_index(drop=True) after deduplication. This is a near-identical gotcha to the one described for pivot_table outputs that carry unexpected NaN values β the underlying cause is an index that no longer matches what you think it does.
Concatenating inside a loop without pre-allocating
Building a result by repeatedly calling pd.concat inside a loop is slow and memory-heavy because it creates a new DataFrame object on every iteration. Collect frames into a Python list, then do a single concat at the end.
# Slow β avoid
result = pd.DataFrame()
for chunk in chunks:
result = pd.concat([result, chunk])
# Fast β do this
frames = []
for chunk in chunks:
frames.append(chunk)
result = pd.concat(frames, ignore_index=True)
For pipelines that also write the final DataFrame to Excel, column order after a concat can shift unexpectedly β the approach for keeping column order correct when exporting to Excel applies directly here too.
Wrapping Up: Next Steps
Most pd.concat() problems come down to three things: the default index preservation behavior, accidental duplicate sources, and reaching for concat when a merge is what you actually need. Here's what to do right now:
- Add
ignore_index=Trueto everyconcatcall where the individual indexes are just positional and carry no meaning. - Add
verify_integrity=Trueto concat calls in critical pipelines during development to catch index collisions immediately. - Audit your input list β print
len(frames)and spot-check a few entries to make sure no source appears twice. - Run
result.duplicated().sum()on any combined DataFrame before it feeds downstream aggregations. If it's non-zero, investigate before proceeding. - Review dtypes after concat with
result.dtypesand cast any silently upcasted columns back to their intended type.
Frequently Asked Questions
Why does pd.concat() produce duplicate index values instead of a clean sequential index?
By default, pd.concat() preserves the original index of every input DataFrame. When all inputs have the same default RangeIndex starting at 0, the stacked result contains repeated values like [0, 1, 2, 0, 1, 2]. Pass ignore_index=True to get a fresh sequential index over the combined result.
How do I check if pd.concat() introduced duplicate rows into my DataFrame?
Call result.duplicated().sum() on the concatenated DataFrame β if the result is greater than zero, duplicates exist. You can then call result.drop_duplicates() to remove exact duplicates, or pass a subset parameter to deduplicate based on specific key columns.
What does verify_integrity=True do in pd.concat() and when should I use it?
Setting verify_integrity=True makes pd.concat() raise a ValueError immediately if the resulting index contains any duplicate values, rather than silently returning a broken DataFrame. Use it during development and in ETL pipelines where duplicate index entries would corrupt downstream processing.
When should I use pd.concat() instead of df.merge() to combine DataFrames?
Use pd.concat() when you want to stack rows from DataFrames that share the same column structure, such as combining monthly export files. Use df.merge() when you need to align two DataFrames on a shared key column, similar to a SQL JOIN β using concat for a join scenario will stack rows independently rather than matching them.
How do I concatenate DataFrames inside a loop without making it slow?
Avoid calling pd.concat() on every loop iteration, as it creates a new DataFrame object each time and becomes very slow on large datasets. Instead, append each chunk to a Python list inside the loop and call pd.concat(frames, ignore_index=True) once after the loop finishes.
π€ Share this article
Sign in to saveRelated Articles
How-To Guides
Fixing Python xlsxwriter That Silently Skips Conditional Formatting on Filtered Rows
9m read
How-To Guides
Fixing Python xlsxwriter That Ignores Cell Formats When Writing Merged Cells
8m read
How-To Guides
Fixing Python Pandas read_excel That Skips Rows When Header Is Not on Row One
8m read
Comments (0)
No comments yet. Be the first!