Pandas groupby Giving Wrong Results: Common Pitfalls and Fixes
Your groupby ran cleanly, no exceptions, and the output DataFrame looks reasonable — but the numbers don't match what you calculated by hand. That's one of the more frustrating bugs in data work because there's no error to chase. The problem is in how you called groupby, not in your data.
Most groupby surprises come from a handful of predictable patterns: NaN rows disappearing quietly, multi-column group keys producing unexpected cartesian-style combinations, or using apply where transform is what you actually needed. Once you've seen these patterns, you'll recognize them instantly.
- Why NaN values in group keys silently drop rows
- How
sort=Trueandobserved=Falsecan distort results you then use downstream - The difference between
agg,transform, andapply— and when each one misleads you - How duplicate index values after a merge cause inflated aggregation totals
- How to verify your
groupbyoutput before it reaches a dashboard or report
Prerequisites
You should be comfortable with basic Pandas operations: loading a DataFrame, slicing columns, and running simple aggregations. The examples below use Pandas 1.5+ and Python 3.9+. Most behaviors described here exist in older versions too, but the exact warning messages may differ.
NaN in Group Keys Silently Drops Rows
This is the most common silent data loss in groupby. By default, if a row's group key column contains NaN, that row is excluded from every group — no warning, no error.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'region': ['North', 'South', 'North', None, 'South'],
'sales': [100, 200, 150, 400, 300]
})
result = df.groupby('region')['sales'].sum()
print(result)
# region
# North 250
# South 500
# dtype: int64
Notice that the row with sales=400 simply vanished. The total should be 1150, but you got 750. If you're not checking row counts before and after, this will slip through.
The fix is dropna=False, introduced in Pandas 1.1:
result = df.groupby('region', dropna=False)['sales'].sum()
print(result)
# region
# North 250
# South 500
# NaN 400
# dtype: int64
Now the NaN group appears as its own bucket. You can then decide whether to impute, label, or filter those rows explicitly rather than having them vanish.
Categorical Columns and observed=False
When your group key is a Pandas Categorical dtype, groupby defaults to observed=False in older Pandas versions (prior to 2.0) — meaning it returns a row for every category that exists in the dtype, even if no data rows belong to it. In Pandas 2.0 the default changed to observed=True, but if you're on an earlier version you'll see phantom zero rows.
df['region'] = pd.Categorical(
df['region'].fillna('Unknown'),
categories=['North', 'South', 'East', 'West', 'Unknown']
)
result = df.groupby('region', observed=False)['sales'].sum()
print(result)
# region
# North 250
# South 500
# East 0 <-- no data, but appears anyway
# West 0 <-- same
# Unknown 400
This becomes dangerous when you compute percentages or ratios downstream. East and West contribute zeros to totals but still appear as rows, which inflates your group count and can produce divide-by-zero issues. Use observed=True unless you specifically need the empty categories.
Using agg vs transform: Choosing the Wrong One
agg reduces each group to a single scalar and returns a shorter DataFrame. transform returns a result with the same index shape as the original DataFrame, broadcasting the group result back onto every row. Mixing them up is a frequent source of bugs.
When agg misleads you
If you call agg and then try to join the result back to the original DataFrame by index, you're likely to hit an index mismatch or accidental reindex that fills with NaN.
# Risky pattern: computing group mean then merging back
group_mean = df.groupby('region')['sales'].agg('mean')
df['group_mean'] = df['region'].map(group_mean) # works but fragile
The clean way with transform
# Correct pattern: transform keeps the original index
df['group_mean'] = df.groupby('region')['sales'].transform('mean')
print(df)
# region sales group_mean
# 0 North 100 125.0
# 1 South 200 250.0
# 2 North 150 125.0
# 3 Unknown 400 400.0
# 4 South 300 250.0
Every row gets the mean of its group's sales, aligned correctly by index. No merge required, no risk of row duplication from an accidental many-to-many join.
apply Can Run Your Function Multiple Times
When you pass a custom function to groupby().apply(), Pandas may call it more than once on the first group to detect the return type. If your function has side effects — writing to a list, incrementing a counter, calling an API — you'll get duplicated effects.
call_log = []
def logged_sum(group):
call_log.append(len(group))
return group['sales'].sum()
df.groupby('region').apply(logged_sum)
print(call_log)
# [2, 2, 1, 1] <-- first group called twice
This is a known Pandas behavior documented in the source, not a bug. The fix is to avoid side effects inside apply functions entirely. If you need to log or accumulate something, do it outside the apply call, using the resulting Series or DataFrame as input.
Duplicate Index Values Inflate Aggregations
After a merge or concat operation, your DataFrame may contain duplicate index values. When you then call groupby on a column, the aggregation still runs over those rows — but if you later do an index-based operation (like .loc on the result), you'll get unexpected multi-row returns.
df_a = pd.DataFrame({'id': [1, 2], 'sales': [100, 200]}, index=[0, 1])
df_b = pd.DataFrame({'id': [1, 2], 'region': ['North', 'South']}, index=[0, 1])
# Incorrect merge that resets nothing
merged = pd.concat([df_a, df_b], axis=0) # stacks rows, not joins
print(merged.groupby('id')['sales'].sum())
# id
# 1.0 100.0
# 2.0 200.0
# NaN NaN <-- region rows had no sales column
The real danger is when a many-to-one merge is accidentally done as many-to-many, duplicating rows before aggregation and inflating every sum. Always call df.shape and df.duplicated().sum() before and after a merge, and reset the index with df.reset_index(drop=True) when you don't need the original index.
Multi-Column Group Keys and Unexpected Combinations
When you group by multiple columns, each unique combination of values becomes a group. This is correct behavior, but it surprises people who expect a hierarchical result instead of a flat multi-index.
df2 = pd.DataFrame({
'region': ['North', 'North', 'South', 'South'],
'product': ['A', 'B', 'A', 'A'],
'sales': [100, 200, 300, 400]
})
result = df2.groupby(['region', 'product'])['sales'].sum()
print(result)
# region product
# North A 100
# B 200
# South A 700
# dtype: int64
If you expected a separate row for South/B (which has no data), it won't appear unless you're using a Categorical with observed=False. More importantly, accessing the result with a single-level key like result['North'] returns a Series, not a scalar — which then propagates type errors downstream. Use result.loc[('North', 'A')] with a tuple key when accessing a specific cell in a multi-index result.
sort=True Changes Row Order Downstream
By default, groupby(sort=True) sorts groups alphabetically or numerically. This is fine for display, but if you're relying on positional row order in any downstream step — like pairing results with a separate list or using iloc — you may be pairing the wrong values.
labels = ['South total', 'North total'] # manually ordered
result = df.groupby('region', sort=True)['sales'].sum()
# result index is ['North', 'South'] after sort — opposite of labels
result.index = labels # now North gets labeled 'South total'
Use sort=False if you need to preserve insertion order, and never assign an index or label list to a groupby result unless you've explicitly verified the group order first. Better yet, use a dictionary or named Series to pair labels with values rather than relying on positional alignment.
Common Pitfalls at a Glance
| Pitfall | Default behavior | Fix |
|---|---|---|
| NaN in group key | Rows dropped silently | groupby(dropna=False) |
| Categorical with empty categories | Phantom zero rows appear | groupby(observed=True) |
| agg + manual join | Index mismatch or NaN fill | Use transform instead |
| apply with side effects | Function called twice on first group | Remove side effects from apply |
| Duplicate rows after merge | Inflated aggregation totals | Check duplicates before groupby |
| Multi-index key access | Single key returns Series, not scalar | Use tuple key with .loc |
| sort=True with positional labels | Wrong label-to-value pairing | Use sort=False or named mapping |
How to Verify groupby Results Before They Reach Production
A few quick sanity checks can catch most of these issues before a wrong number ends up in a dashboard or report.
First, compare row counts. The sum of group sizes should equal the number of rows in your DataFrame (if dropna=False) or your DataFrame minus the NaN-key rows (if dropna=True):
group_sizes = df.groupby('region', dropna=False).size()
print(group_sizes.sum(), 'vs', len(df)) # should match
Second, verify totals against a global aggregation:
group_sum = df.groupby('region', dropna=False)['sales'].sum().sum()
global_sum = df['sales'].sum()
assert group_sum == global_sum, f"Totals diverge: {group_sum} vs {global_sum}"
Third, spot-check a known group manually. Pick one group value, filter the DataFrame directly with .loc, and compute the same metric. If the result differs from what groupby produced, you've isolated a bug.
Next Steps
Now that you know where the traps are, here are concrete things to do right now:
- Audit any existing
groupbycalls in your codebase and check whetherdropna=Trueis silently dropping rows you need. - Add a row-count assertion and total-sum assertion as automated checks in any ETL or reporting script that uses
groupby. - Replace any pattern of
agg+ manual merge-back withtransformwhere the use case fits — it's simpler and index-safe. - If you're on Pandas below 2.0 and using Categorical columns, set
observed=Trueexplicitly on everygroupbycall to avoid phantom rows. - Run
df.duplicated().sum()after every merge before any aggregation step — it takes one line and has saved countless incorrect reports.
📤 Share this article
Sign in to saveRelated Articles
Comments (0)
No comments yet. Be the first!