Pandas pivot_table vs groupby: Why Your Aggregations Don't Match
You write what looks like the same aggregation twice β once with groupby, once with pivot_table β and the totals don't match. You double-check your column names, your filter logic, your function. Everything looks right. The numbers are still different.
This is one of the most confusing moments in early Pandas work, and it has a small set of concrete explanations. Once you understand them, you'll also understand when each tool is the right choice.
What you'll learn
- How
pivot_tableandgroupbyhandleNaNvalues differently - Why the default aggregation behavior differs between the two
- How margins (subtotals) in
pivot_tablecan mislead you - When to use
groupbyand when to usepivot_table - Concrete patterns to make both produce identical results
Prerequisites
You should be comfortable with basic DataFrame operations and have Pandas installed (any version in the 1.x or 2.x series works for everything here). The examples use Python 3.9+ syntax but nothing exotic.
A Minimal Reproducible Example
Start with a small dataset so you can see exactly what's happening. This is a sales table with a missing value deliberately placed in it.
import pandas as pd
import numpy as np
data = {
"region": ["North", "North", "South", "South", "East"],
"product": ["A", "B", "A", "B", "A"],
"sales": [100, 200, 150, np.nan, 300],
}
df = pd.DataFrame(data)
print(df)
The NaN in the South/B row is the landmine. Now run both aggregations:
# groupby approach
gb = df.groupby(["region", "product"])["sales"].sum()
print(gb)
# pivot_table approach
pt = df.pivot_table(
values="sales",
index="region",
columns="product",
aggfunc="sum"
)
print(pt)
The per-cell numbers look the same here, but the moment you add margins=True to pivot_table, or change aggfunc, things diverge. Let's go through each reason why.
Reason 1: NaN Handling Is Not the Same
groupby with sum treats NaN as zero by default (in Pandas 1.x this was configurable via min_count; in Pandas 2.x the default changed β see below). pivot_table uses fill_value, which defaults to nothing, leaving NaN cells empty unless you set it explicitly.
This matters most when you're comparing row or column totals. A missing cell in pivot_table stays NaN, so a downstream .sum(axis=1) silently skips it β which is not the same as a groupby sum that included a zero-equivalent.
# pivot_table without fill_value β NaN stays NaN
pt_no_fill = df.pivot_table(
values="sales",
index="region",
columns="product",
aggfunc="sum"
)
print(pt_no_fill)
# South/B is NaN
# pivot_table with fill_value=0 β matches groupby sum behavior
pt_filled = df.pivot_table(
values="sales",
index="region",
columns="product",
aggfunc="sum",
fill_value=0
)
print(pt_filled)
# South/B is 0
If your totals need to match, set fill_value=0 whenever you're summing numeric data.
Reason 2: The Default aggfunc Is Mean, Not Sum
This catches almost everyone at least once. The default aggfunc in pivot_table is "mean", not "sum". groupby has no default aggregation β you have to call .sum(), .mean(), or whatever you want explicitly.
So if you call df.pivot_table(values="sales", index="region") without specifying aggfunc, you get means. If you then compare that to df.groupby("region")["sales"].sum(), the numbers will be different β and both are technically correct, just answering different questions.
# These answer DIFFERENT questions:
df.pivot_table(values="sales", index="region") # mean (default)
df.groupby("region")["sales"].sum() # sum (explicit)
# These answer the SAME question:
df.pivot_table(values="sales", index="region", aggfunc="sum")
df.groupby("region")["sales"].sum()
Always specify aggfunc explicitly. It's four extra characters that prevent a lot of confusion.
Reason 3: How margins=True Computes Its Totals
The margins=True parameter adds a grand total row and column to pivot_table. The problem is that margins compute their totals from the original data, not by summing the cells already in the table. When combined with fill_value, this can produce rows where the margin total looks higher than the sum of visible cells.
pt_margins = df.pivot_table(
values="sales",
index="region",
columns="product",
aggfunc="sum",
fill_value=0,
margins=True
)
print(pt_margins)
The "All" row for South shows 150, because it re-aggregates from the raw data where South/B is NaN and NaN is excluded from sum. The South row in the table shows 0 for B (from fill_value), so if you manually add the cells you get 150 + 0 = 150. They happen to match here, but with aggfunc="mean", the margin mean is computed over all South rows, not by averaging the means in the table cells β which gives a different result whenever group sizes differ.
The rule: treat margin values as independently computed from raw data, not as summaries of your pivot cells.
Reason 4: groupby Excludes NaN Keys by Default
When your grouping column itself contains NaN, groupby drops those rows entirely unless you pass dropna=False. pivot_table also drops them by default, but you can keep them with dropna=False on the pivot call.
data_with_nan_key = {
"region": ["North", None, "South"],
"sales": [100, 200, 150],
}
dkf = pd.DataFrame(data_with_nan_key)
# groupby: None row is dropped
dkf.groupby("region")["sales"].sum()
# groupby: keep the NaN group
dkf.groupby("region", dropna=False)["sales"].sum()
# pivot_table: also drops NaN keys by default
dkf.pivot_table(values="sales", index="region", aggfunc="sum")
# pivot_table: keep NaN keys
dkf.pivot_table(values="sales", index="region", aggfunc="sum", dropna=False)
If your row counts downstream don't add up to your original DataFrame length, a NaN key is often the culprit.
Reason 5: Multiple aggfunc Values Change the Column Structure
When you pass a list to aggfunc in pivot_table, you get a MultiIndex column. If you then try to compare a single column from that result against a groupby output, you're indexing incorrectly.
pt_multi = df.pivot_table(
values="sales",
index="region",
columns="product",
aggfunc=["sum", "mean"]
)
print(pt_multi.columns) # MultiIndex: (sum, A), (sum, B), (mean, A), (mean, B)
# Access the sum sub-table
sum_table = pt_multi["sum"]
print(sum_table)
The structure itself isn't wrong, but pulling a single number out of a multi-aggregation pivot and comparing it to a groupby result requires knowing exactly which level you're slicing.
When to Use groupby vs pivot_table
These tools solve slightly different problems. Picking the right one from the start avoids the reconciliation headache.
| Scenario | Best tool | Reason |
|---|---|---|
| Single grouping dimension, one aggregation | groupby | Less overhead, chainable, more explicit |
| Two grouping dimensions you want as row/column axes | pivot_table | Produces a readable cross-tab directly |
| Multiple aggregation functions at once | Either | groupby().agg() or pivot_table(aggfunc=[...]) |
| Reporting output with subtotals | pivot_table | margins=True adds grand totals |
| Programmatic downstream processing | groupby | Returns a Series/DataFrame with a flat index |
| Exploratory analysis over a wide dataset | pivot_table | Easier to scan visually |
In practice, groupby is the workhorse for computation and pivot_table is a display-oriented shortcut. When precision matters, groupby gives you fewer surprises.
Making Both Produce Identical Results
If you need them to match for a validation or audit, align all three variables: aggregation function, NaN handling, and key exclusion.
# Target: sum of sales by region and product, NaN sales treated as 0, NaN keys excluded
# groupby version
gb_result = (
df.groupby(["region", "product"], dropna=True)["sales"]
.sum(min_count=0) # treat NaN as 0 rather than skipping
.unstack(fill_value=0)
)
# pivot_table version
pt_result = df.pivot_table(
values="sales",
index="region",
columns="product",
aggfunc="sum",
fill_value=0,
dropna=True,
observed=True # Pandas 2.x: avoids including unobserved categories
)
# Compare
print(gb_result.equals(pt_result))
Note the observed=True parameter. In Pandas 2.x, when your grouping column is a Categorical dtype, groupby and pivot_table may include unobserved category combinations unless you set this flag. It's another silent source of row-count mismatches.
Common Pitfalls at a Glance
- Forgetting to set aggfunc β pivot_table defaults to mean; always be explicit.
- Comparing pivot cells to groupby totals β the margin row in pivot_table is not a sum of its cells when fill_value is involved.
- Categorical columns without observed=True β Pandas 2.x will include category combinations that don't exist in the data, inflating your row count.
- Sorting differences β groupby preserves insertion order by default in recent Pandas; pivot_table sorts index alphabetically. Use
.sort_index()before comparing. - Calling .reset_index() on one but not the other β index shape mismatches make
.equals()return False even when values match.
Wrapping Up
The aggregation mismatch between pivot_table and groupby almost always comes down to one of five things: default aggfunc, NaN value handling, NaN key exclusion, how margins are computed, or unobserved Categorical combinations. None of these are bugs β they're documented behavior that becomes intuitive once you've seen each case once.
Here are four concrete actions to take next:
- Audit any existing pivot_table calls in your codebase and confirm
aggfuncis explicit, not relying on the"mean"default. - Add
fill_value=0to any pivot_table used for financial or count data where missing cells should read as zero, not NaN. - If you're on Pandas 2.x and using Categorical grouping columns, add
observed=Trueto both your groupby and pivot_table calls. - When auditing totals, verify the margins row by re-computing it independently with
groupbyrather than trusting the pivot margin. - Use
groupby().agg()for programmatic pipelines where correctness is critical, and reservepivot_tablefor human-readable cross-tabs.
π€ Share this article
Sign in to saveRelated Articles
Comments (0)
No comments yet. Be the first!