Fixing Python Pandas concat That Resets Your Index and Creates Duplicates

June 24, 2026 9 min read 3 views

You call pd.concat() on a handful of DataFrames, print the result, and find row numbers that jump from 0 back to 0 again, or a DataFrame that's suddenly three times bigger than it should be. Neither of those is a bug in Pandas itself β€” both are predictable consequences of how concat works by default, and both are fixable in seconds once you know which knob to turn.

What You'll Learn

  • Why pd.concat() resets the index and when that's actually fine
  • How to use ignore_index, verify_integrity, and keys to control index behavior
  • The most common source of unexpected duplicate rows after a concat
  • How to audit and remove duplicates that have already crept in
  • When you should reach for merge() instead of concat()

Prerequisites

You need Python 3.8 or later and Pandas 1.3 or later. The examples also assume you're comfortable creating DataFrames and reading basic tracebacks. No NumPy knowledge is required beyond understanding that Pandas is built on top of it.

What's Actually Happening When concat Misbehaves

pd.concat() is a stacking operation. You hand it a list of DataFrames and it physically appends them, row by row (or column by column if you set axis=1). It does not try to align or deduplicate anything on its own. Whatever indexes those individual DataFrames carry, they all come along for the ride.

That design is intentional. Pandas has no way of knowing whether your index values are meaningful primary keys or just the default 0, 1, 2 range that every new DataFrame starts with. So it makes no assumptions and touches nothing unless you tell it to.

The two symptoms you'll hit most often flow directly from this:

  • Index reset / overlap β€” three DataFrames with indexes 0–9 produce a combined DataFrame with three separate 0–9 sequences, for a final index of [0,1,2,...9, 0,1,2,...9, 0,1,2,...9].
  • Duplicate rows β€” if the same DataFrame (or the same CSV file) ends up in your list twice, every row appears twice in the output.

Why concat Resets the Index by Default

"Resets" is slightly misleading. concat doesn't actively reset anything β€” it preserves the existing index of every input DataFrame. When all your inputs share the default RangeIndex (0, 1, 2 …), stacking them produces an index with repeated values. That's not a reset; it's faithful preservation of something you probably didn't intend to keep.

Here's a minimal example to make this concrete:

import pandas as pd

df1 = pd.DataFrame({'value': [10, 20, 30]})
df2 = pd.DataFrame({'value': [40, 50, 60]})

result = pd.concat([df1, df2])
print(result)
print(result.index.tolist())

Output:

   value
0     10
1     20
2     30
0     40
1     50
2     60

[0, 1, 2, 0, 1, 2]

The index [0, 1, 2, 0, 1, 2] is not a mistake β€” it's exactly what Pandas promised. The problem is that downstream code often expects a unique, monotonically increasing index. Anything that calls df.loc[0] now returns two rows instead of one, which is rarely what you want.

Fixing the Reset Index with ignore_index

If your index has no intrinsic meaning (it's just position), pass ignore_index=True. Pandas discards all input indexes and assigns a fresh RangeIndex to the result.

result = pd.concat([df1, df2], ignore_index=True)
print(result.index.tolist())
# [0, 1, 2, 3, 4, 5]

One line, clean sequential index. This is the right fix roughly 80% of the time when you're combining data from multiple files or API pages where the individual indexes were never meaningful.

A common place this saves you is after reading multiple CSV files in a loop:

import glob

frames = [pd.read_csv(f) for f in glob.glob('data/chunk_*.csv')]
full_df = pd.concat(frames, ignore_index=True)

Without ignore_index=True, every file's 0-based index stacks on top of the previous one. With it, you get a single clean 0-to-N index over all rows.

When You Want to Keep the Original Index

Sometimes the original index is meaningful β€” a date, a user ID, a database primary key. Throwing it away with ignore_index=True would lose information you need. In that case, keep the index but use the keys parameter to add a level of labeling so you can trace which source each row came from.

df_jan = pd.DataFrame({'revenue': [100, 200]}, index=['2024-01-01', '2024-01-02'])
df_feb = pd.DataFrame({'revenue': [150, 250]}, index=['2024-02-01', '2024-02-02'])

result = pd.concat([df_jan, df_feb], keys=['January', 'February'])
print(result)

Output:

               revenue
January  2024-01-01  100
         2024-01-02  200
February 2024-02-01  150
         2024-02-02  250

The result has a MultiIndex. The outer level tells you which source the row came from; the inner level preserves the original date index. You can drop back to a flat index later with result.reset_index(level=0, drop=True) if you no longer need the source label.

If your dates or IDs are already unique across all inputs, you can also just call result.reset_index(drop=False) after a plain concat to get a clean RangeIndex while moving the original index into a column for safekeeping.

Duplicate Rows: The Other concat Trap

A reordered index is annoying; duplicate rows are actively dangerous. Aggregations silently double-count, joins explode in size, and reports look plausible but are wrong. There are three common ways duplicates enter a concat operation:

1. The same source appears in the list twice

This happens more than you'd expect, especially in scripts that build a list of DataFrames inside a loop and accidentally read the same file twice, or append the same variable twice.

# Bug: df1 added twice
result = pd.concat([df1, df2, df1], ignore_index=True)

Every row from df1 now appears twice. The fix is auditing what's in your list before you concatenate.

2. Overlapping date ranges in time-series data

If you pull monthly snapshots and the export windows overlap by a day, that boundary date appears in both DataFrames. After concat it appears twice in the result, but looks perfectly innocent in the index because the dates match.

3. concat with a merge-like intention

Developers sometimes reach for concat when they actually need merge. If you want to join DataFrames on a shared key column, concat will stack rows rather than match them β€” creating duplicates for any rows that were already in both DataFrames. More on this in the section below.

Catching Duplicates Before They Spread

The cleanest defense is verify_integrity=True. It makes concat raise a ValueError immediately if the resulting index contains any duplicate values.

try:
    result = pd.concat([df1, df2], verify_integrity=True)
except ValueError as e:
    print(f"Duplicate index detected: {e}")

Use this in data pipelines and ETL scripts where silently producing duplicates would corrupt downstream output. The cost is a slight performance overhead on large DataFrames, so disable it once you're confident in the pipeline and save it for debugging sessions.

If duplicates have already entered your DataFrame, the standard two-step cleanup is:

# Check how many duplicates exist
print(result.duplicated().sum())

# Drop exact row duplicates, keep the first occurrence
result = result.drop_duplicates()

# Or if duplicates are key-based (same ID, different data), be explicit:
result = result.drop_duplicates(subset=['user_id', 'date'], keep='last')

Always pass subset when you have a logical primary key. Without it, drop_duplicates() only removes rows that are identical across every column, which misses near-duplicates that differ in a trailing whitespace or a float precision artifact. For a related deduplication scenario, see how non-unique keys cause duplicate rows in Pandas merge() β€” the diagnostic approach is the same.

concat vs merge: Choosing the Right Tool

concat stacks DataFrames physically. merge aligns them on shared key values, like a SQL JOIN. Picking the wrong one is the root cause of many unexpected duplicate situations.

Situation Right function
Combining rows from multiple sources with the same columns pd.concat(axis=0)
Combining new columns onto existing rows pd.concat(axis=1) or merge
Joining two DataFrames on a shared key column df.merge()
Appending new rows to the bottom of a DataFrame pd.concat(axis=0)

If you mean to align on a key but use concat instead, the key column does not act as a join key β€” Pandas will just stack rows independently, which typically creates a bloated result with lots of NaN values where the columns don't line up, or silently doubles up rows that existed in both frames.

For data cleaning workflows where you're reading from a spreadsheet before the concat step, the pitfalls in reading Excel files where the header isn't on row one are worth reviewing too β€” malformed headers cause column mismatches that make concat produce far more NaN columns than expected.

Common Pitfalls and Gotchas

Concatenating along axis=1 with mismatched row counts

When you set axis=1 to add columns side by side, Pandas aligns on the row index. If the two DataFrames have different numbers of rows or different index values, you'll get NaN-filled gaps. Always check that both DataFrames share the same index before a column-wise concat, or use join='inner' to keep only matching rows.

result = pd.concat([df_a, df_b], axis=1, join='inner')

Mixing dtypes silently downgrades columns

If one DataFrame has an integer column and another has the same column as float (or object), concat will upcast to the broader dtype. An integer ID column can silently become a float, turning 1001 into 1001.0. Check dtypes on the result with result.dtypes and cast back explicitly if needed.

Empty DataFrames in the list

An empty DataFrame in your list is harmless most of the time, but it can affect column dtype inference if the empty frame was created without specifying dtypes. Filter empty frames out before concatenating:

frames = [df for df in frames if not df.empty]
result = pd.concat(frames, ignore_index=True)

Forgetting to reset the index after drop_duplicates

After calling drop_duplicates(), the index still has gaps where rows were removed (e.g., [0, 1, 3, 5, 6]). If anything downstream assumes a contiguous index, call reset_index(drop=True) after deduplication. This is a near-identical gotcha to the one described for pivot_table outputs that carry unexpected NaN values β€” the underlying cause is an index that no longer matches what you think it does.

Concatenating inside a loop without pre-allocating

Building a result by repeatedly calling pd.concat inside a loop is slow and memory-heavy because it creates a new DataFrame object on every iteration. Collect frames into a Python list, then do a single concat at the end.

# Slow β€” avoid
result = pd.DataFrame()
for chunk in chunks:
    result = pd.concat([result, chunk])

# Fast β€” do this
frames = []
for chunk in chunks:
    frames.append(chunk)
result = pd.concat(frames, ignore_index=True)

For pipelines that also write the final DataFrame to Excel, column order after a concat can shift unexpectedly β€” the approach for keeping column order correct when exporting to Excel applies directly here too.

Wrapping Up: Next Steps

Most pd.concat() problems come down to three things: the default index preservation behavior, accidental duplicate sources, and reaching for concat when a merge is what you actually need. Here's what to do right now:

  1. Add ignore_index=True to every concat call where the individual indexes are just positional and carry no meaning.
  2. Add verify_integrity=True to concat calls in critical pipelines during development to catch index collisions immediately.
  3. Audit your input list β€” print len(frames) and spot-check a few entries to make sure no source appears twice.
  4. Run result.duplicated().sum() on any combined DataFrame before it feeds downstream aggregations. If it's non-zero, investigate before proceeding.
  5. Review dtypes after concat with result.dtypes and cast any silently upcasted columns back to their intended type.

Frequently Asked Questions

Why does pd.concat() produce duplicate index values instead of a clean sequential index?

By default, pd.concat() preserves the original index of every input DataFrame. When all inputs have the same default RangeIndex starting at 0, the stacked result contains repeated values like [0, 1, 2, 0, 1, 2]. Pass ignore_index=True to get a fresh sequential index over the combined result.

How do I check if pd.concat() introduced duplicate rows into my DataFrame?

Call result.duplicated().sum() on the concatenated DataFrame β€” if the result is greater than zero, duplicates exist. You can then call result.drop_duplicates() to remove exact duplicates, or pass a subset parameter to deduplicate based on specific key columns.

What does verify_integrity=True do in pd.concat() and when should I use it?

Setting verify_integrity=True makes pd.concat() raise a ValueError immediately if the resulting index contains any duplicate values, rather than silently returning a broken DataFrame. Use it during development and in ETL pipelines where duplicate index entries would corrupt downstream processing.

When should I use pd.concat() instead of df.merge() to combine DataFrames?

Use pd.concat() when you want to stack rows from DataFrames that share the same column structure, such as combining monthly export files. Use df.merge() when you need to align two DataFrames on a shared key column, similar to a SQL JOIN β€” using concat for a join scenario will stack rows independently rather than matching them.

How do I concatenate DataFrames inside a loop without making it slow?

Avoid calling pd.concat() on every loop iteration, as it creates a new DataFrame object each time and becomes very slow on large datasets. Instead, append each chunk to a Python list inside the loop and call pd.concat(frames, ignore_index=True) once after the loop finishes.

πŸ“€ Share this article

Sign in to save

Comments (0)

No comments yet. Be the first!

Leave a Comment

Sign in to comment with your profile.

πŸ“¬ Weekly Newsletter

Stay ahead of the curve

Get the best programming tutorials, data analytics tips, and tool reviews delivered to your inbox every week.

No spam. Unsubscribe anytime.