Why Your Pandas groupby Aggregation Is Returning Wrong Numbers

June 19, 2026 9 min read 1 views

You run a groupby().sum(), the output looks plausible, and you move on. Three days later someone flags that the revenue numbers are 40% higher than the source system. Sound familiar? Pandas groupby is one of the most-used tools in data work, and it is also one of the most reliable sources of silent, hard-to-spot errors.

The good news is that these errors fall into a small set of repeatable categories. Once you know what to look for, you can diagnose them in minutes instead of hours.

What you'll learn

  • How Pandas handles NaN values during aggregation and when that behavior bites you
  • Why duplicate rows are the most common cause of inflated aggregates
  • How dtype coercion silently changes numeric results
  • What the observed parameter on Categorical columns does to your group counts
  • How multi-index output causes misaligned results when you merge back to your original DataFrame

Prerequisites

This article assumes you are comfortable with basic Pandas operations — loading data, filtering, and running simple aggregations. Code examples use Pandas 1.5+ (most behavior described here applies to Pandas 2.x as well, with a note where the version matters). You will need numpy installed alongside pandas.

How groupby actually works under the hood

Before debugging wrong numbers, it helps to have a mental model of what groupby actually does. When you call df.groupby("region").sum(), Pandas performs three steps internally: split the DataFrame into sub-frames by the unique values in region, apply the aggregation function to each sub-frame, and combine the results into a new DataFrame.

Every bug described below happens in one of these three steps. Either the split included rows you didn't expect (duplicates, wrong dtypes), the apply step handled edge cases silently (NaN skipping, numeric coercion), or the combine step produced an index structure that misaligns when you join back.

Keeping this split-apply-combine model in mind makes each bug obvious once you know which step it corrupts.

NaN values are silently dropped — until they aren't

Pandas skips NaN values by default in most numeric aggregations. For sum(), that means a column full of NaN returns 0, not NaN. For mean(), it returns the mean of only the non-null values. This is often what you want, but it becomes a problem when you expect a group with no valid data to be flagged rather than silently zeroed out.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "region": ["North", "North", "South"],
    "sales":  [100, np.nan, 200]
})

print(df.groupby("region")["sales"].sum())
# region
# North    100.0   ← NaN row was dropped, not treated as 0
# South    200.0

If that North row should have contributed a zero sale (not a missing sale), your aggregate is understated. The fix is to decide what NaN means in your data before you aggregate: fill nulls with fillna(0) for a zero, or keep them as-is and use min_count=1 on sum() to return NaN when all values in a group are null.

# Return NaN instead of 0 when the whole group is null
df.groupby("region")["sales"].sum(min_count=1)

A related trap: if your grouping key column contains NaN, those rows are excluded from all groups by default. You can change this with dropna=False (available since Pandas 1.1).

df2 = pd.DataFrame({
    "region": ["North", None, "South"],
    "sales":  [100, 50, 200]
})

# Without dropna=False, the row where region is None is lost
print(df2.groupby("region", dropna=False)["sales"].sum())
# region
# North     100
# South     200
# NaN        50   ← now captured

Duplicate rows inflate your aggregates

This is the single most common cause of groupby results that are too high. If your source data has duplicate rows — from a bad join, a repeated API pull, or a poorly written SQL query — every aggregation function that sums or counts will overstate.

The insidious part is that the groupby itself won't error. It just faithfully sums every row it was given, duplicates included.

df3 = pd.DataFrame({
    "region": ["North", "North", "North", "South"],
    "sales":  [100, 200, 100, 300]   # first and third rows are duplicates
})

print(df3.groupby("region")["sales"].sum())
# region
# North    400   ← should be 300 if the duplicate is accidental
# South    300

Always deduplicate before you aggregate. The minimum check is:

# Check for full-row duplicates
print(df3.duplicated().sum())   # how many duplicate rows exist?

# Drop them
df_clean = df3.drop_duplicates()
print(df_clean.groupby("region")["sales"].sum())
# North    300
# South    300

If your duplicates are on a subset of columns (e.g., the same transaction ID appearing twice), use drop_duplicates(subset=["transaction_id"]). This same class of silent error can appear in NumPy operations too — the article on fixing silently incorrect results from NumPy broadcasting mismatches covers how broadcasting can compound this kind of issue.

dtype coercion quietly changes your sums

If a column that should be numeric is stored as object (string), Pandas will either raise an error or do something unexpected depending on the aggregation function and version. With older Pandas, sum() on an object column concatenates strings rather than adding numbers. You end up with "100200300" instead of 600.

df4 = pd.DataFrame({
    "region": ["North", "North", "South"],
    "sales":  ["100", "200", "300"]   # stored as strings
})

print(df4.groupby("region")["sales"].sum())
# region
# North    100200   ← string concatenation, not addition
# South       300

Fix this by coercing to the right dtype before aggregating:

df4["sales"] = pd.to_numeric(df4["sales"], errors="coerce")
print(df4.groupby("region")["sales"].sum())
# North    300.0
# South    300.0

Use errors="coerce" so that non-numeric strings become NaN rather than raising an exception — then handle those nulls intentionally as described in the NaN section above. A quick way to audit all column dtypes before aggregating:

print(df.dtypes)

Any column you plan to sum or average should show int64, float64, or a Pandas nullable integer type like Int64. If it shows object, investigate.

Categorical columns and the observed parameter

If your grouping key is a Pandas Categorical dtype, groupby will by default include every category defined in the column, even ones that have no rows in your current DataFrame. This creates phantom groups with zero counts and empty aggregates.

region_cat = pd.Categorical(["North", "South"], categories=["North", "South", "East", "West"])
df5 = pd.DataFrame({
    "region": region_cat,
    "sales":  [100, 200]
})

print(df5.groupby("region")["sales"].sum())
# region
# North    100
# South    200
# East       0   ← phantom group
# West       0   ← phantom group

In Pandas 1.x the default is observed=False, which includes all categories. In Pandas 2.x this default is changing to observed=True, which only includes categories that actually appear in the data. If you are migrating between versions and your row counts change, this is likely the cause.

Set the parameter explicitly so your code works the same across versions:

# Only include categories present in the data
df5.groupby("region", observed=True)["sales"].sum()

Multi-index output and misaligned merges

When you group by multiple columns, groupby returns a DataFrame with a MultiIndex. If you then try to merge those results back onto your original DataFrame, the join columns won't line up unless you reset the index first.

df6 = pd.DataFrame({
    "region":   ["North", "North", "South", "South"],
    "product":  ["A", "B", "A", "B"],
    "sales":    [100, 200, 150, 250]
})

agg = df6.groupby(["region", "product"])["sales"].sum()
print(agg.index)  # MultiIndex — not a flat column

# Merging back without reset_index will fail or produce NaN-filled columns
df6_merged = df6.merge(agg, on=["region", "product"])  # KeyError

Call reset_index() on the aggregated result before merging:

agg_flat = agg.reset_index()
df6_merged = df6.merge(agg_flat, on=["region", "product"], suffixes=("", "_total"))
print(df6_merged)

Alternatively, use transform() instead of agg() when you need the aggregated value back on every original row — it preserves the index automatically and is less error-prone for this use case:

df6["region_sales_total"] = df6.groupby("region")["sales"].transform("sum")

If you are building analytical pipelines that involve multiple joins and aggregations, the principles in improving analytical insights in data analysis are worth reviewing — many of the structural recommendations apply directly to how you chain Pandas operations.

Common pitfalls checklist

Before trusting any groupby result, run through this list:

  • Check for duplicates first. Run df.duplicated().sum() before you aggregate. If it is non-zero, decide whether to drop or investigate.
  • Audit column dtypes. Any numeric column showing object needs pd.to_numeric() before it goes into a sum or mean.
  • Decide what NaN means. Use fillna() or min_count=1 deliberately. Never let the default behavior make that decision for you.
  • Check your grouping key for nulls. Add dropna=False if rows with a null key should be captured rather than silently discarded.
  • Set observed explicitly on Categorical columns. Do not rely on the version default.
  • Use reset_index() or transform() when you need to merge aggregated values back onto a DataFrame.
  • Cross-check against a known total. After any aggregation, compare agg_result[col].sum() against df[col].sum(). They should match (minus intentional NaN drops).

That last point is the fastest sanity check you can run. If the sums don't match, something in the split or apply step went wrong.

For anyone newer to data analysis workflows in Python, the Python data analyst roadmap for beginners gives useful context on where groupby fits inside a broader analytical pipeline — including the data cleaning steps that prevent these problems upstream.

A note on numeric_only in Pandas 2.x

Pandas 2.0 removed the silent behavior where non-numeric columns were simply ignored during numeric aggregations. In Pandas 1.x, calling groupby().sum() on a mixed-type DataFrame would quietly skip string columns. In Pandas 2.x, it raises a TypeError instead.

If you are seeing new errors after upgrading, select only the numeric columns you need before aggregating:

# Explicit column selection — works on any Pandas version
df.groupby("region")[["sales", "quantity"]].sum()

This is also better practice regardless of version — it makes your intent explicit and avoids accidental aggregation of columns you didn't mean to include. You can learn more about the full landscape of analysis tools and where Pandas fits among them in this overview of the top data analysis tools and their use cases.

Wrapping up: next steps

Incorrect groupby results almost always trace back to one of five root causes: NaN handling assumptions, duplicate input rows, dtype mismatches, unexpected Categorical behavior, or index misalignment on merge. None of them are obvious at first glance, which is why they survive code review.

Here are the concrete actions to take right now:

  1. Add a pre-aggregation duplicate check to any pipeline that sources data from a join or an API pull. One line — assert df.duplicated().sum() == 0 — will catch the most common cause of inflated totals.
  2. Audit dtypes at ingestion time, not at aggregation time. Fix string-encoded numbers as soon as the data is loaded so they never reach a sum().
  3. Add a post-aggregation totals check. Compare df[col].sum() to grouped[col].sum().sum() after every aggregation step.
  4. Pin the observed parameter on any groupby that uses a Categorical key, especially if your code runs across Pandas 1.x and 2.x environments.
  5. Prefer transform() over agg() + merge() whenever you are adding aggregated columns back to the original DataFrame — it eliminates the entire class of multi-index join errors.

Frequently Asked Questions

Why does pandas groupby sum return 0 instead of NaN when all values are missing?

Pandas sum() skips NaN values by default and returns 0 for a group where all values are NaN. To get NaN instead, use the min_count=1 parameter: groupby(...).sum(min_count=1), which requires at least one non-null value before returning a result.

How do I include NaN keys in a pandas groupby instead of dropping them?

By default, pandas excludes rows where the grouping key is NaN. Pass dropna=False to groupby() — available since Pandas 1.1 — to include those rows as their own group labeled NaN.

Why does my pandas groupby result have more rows than expected when using a Categorical column?

Categorical columns cause groupby to include every defined category, even ones with no data in the current DataFrame. Set observed=True in the groupby call to include only categories that actually appear in the data.

What is the difference between groupby agg and transform in pandas?

agg() reduces each group to a single summary row and returns a smaller DataFrame with a new index. transform() returns a Series or DataFrame with the same index as the original, broadcasting the aggregated value back to every row — making it safer for adding summary columns without a merge.

Why are my pandas groupby numbers too high after joining two DataFrames?

A join that produces duplicate rows is the most common reason. When two tables join on a non-unique key, rows multiply, and any subsequent sum or count will be inflated. Always run df.duplicated().sum() after a merge and before aggregating to catch this.

📤 Share this article

Sign in to save

Comments (0)

No comments yet. Be the first!

Leave a Comment

Sign in to comment with your profile.

📬 Weekly Newsletter

Stay ahead of the curve

Get the best programming tutorials, data analytics tips, and tool reviews delivered to your inbox every week.

No spam. Unsubscribe anytime.