Pandas pivot_table vs groupby: Why Your Aggregations Don't Match

May 20, 2026 6 min read 39 views
Two side-by-side data tables with highlighted mismatched cells on a soft blue gradient background, representing aggregation differences in data analysis.

You write what looks like the same aggregation twice β€” once with groupby, once with pivot_table β€” and the totals don't match. You double-check your column names, your filter logic, your function. Everything looks right. The numbers are still different.

This is one of the most confusing moments in early Pandas work, and it has a small set of concrete explanations. Once you understand them, you'll also understand when each tool is the right choice.

What you'll learn

  • How pivot_table and groupby handle NaN values differently
  • Why the default aggregation behavior differs between the two
  • How margins (subtotals) in pivot_table can mislead you
  • When to use groupby and when to use pivot_table
  • Concrete patterns to make both produce identical results

Prerequisites

You should be comfortable with basic DataFrame operations and have Pandas installed (any version in the 1.x or 2.x series works for everything here). The examples use Python 3.9+ syntax but nothing exotic.

A Minimal Reproducible Example

Start with a small dataset so you can see exactly what's happening. This is a sales table with a missing value deliberately placed in it.

import pandas as pd
import numpy as np

data = {
    "region": ["North", "North", "South", "South", "East"],
    "product": ["A", "B", "A", "B", "A"],
    "sales": [100, 200, 150, np.nan, 300],
}

df = pd.DataFrame(data)
print(df)

The NaN in the South/B row is the landmine. Now run both aggregations:

# groupby approach
gb = df.groupby(["region", "product"])["sales"].sum()
print(gb)

# pivot_table approach
pt = df.pivot_table(
    values="sales",
    index="region",
    columns="product",
    aggfunc="sum"
)
print(pt)

The per-cell numbers look the same here, but the moment you add margins=True to pivot_table, or change aggfunc, things diverge. Let's go through each reason why.

Reason 1: NaN Handling Is Not the Same

groupby with sum treats NaN as zero by default (in Pandas 1.x this was configurable via min_count; in Pandas 2.x the default changed β€” see below). pivot_table uses fill_value, which defaults to nothing, leaving NaN cells empty unless you set it explicitly.

This matters most when you're comparing row or column totals. A missing cell in pivot_table stays NaN, so a downstream .sum(axis=1) silently skips it β€” which is not the same as a groupby sum that included a zero-equivalent.

# pivot_table without fill_value β€” NaN stays NaN
pt_no_fill = df.pivot_table(
    values="sales",
    index="region",
    columns="product",
    aggfunc="sum"
)
print(pt_no_fill)
# South/B is NaN

# pivot_table with fill_value=0 β€” matches groupby sum behavior
pt_filled = df.pivot_table(
    values="sales",
    index="region",
    columns="product",
    aggfunc="sum",
    fill_value=0
)
print(pt_filled)
# South/B is 0

If your totals need to match, set fill_value=0 whenever you're summing numeric data.

Reason 2: The Default aggfunc Is Mean, Not Sum

This catches almost everyone at least once. The default aggfunc in pivot_table is "mean", not "sum". groupby has no default aggregation β€” you have to call .sum(), .mean(), or whatever you want explicitly.

So if you call df.pivot_table(values="sales", index="region") without specifying aggfunc, you get means. If you then compare that to df.groupby("region")["sales"].sum(), the numbers will be different β€” and both are technically correct, just answering different questions.

# These answer DIFFERENT questions:
df.pivot_table(values="sales", index="region")  # mean (default)
df.groupby("region")["sales"].sum()             # sum (explicit)

# These answer the SAME question:
df.pivot_table(values="sales", index="region", aggfunc="sum")
df.groupby("region")["sales"].sum()

Always specify aggfunc explicitly. It's four extra characters that prevent a lot of confusion.

Reason 3: How margins=True Computes Its Totals

The margins=True parameter adds a grand total row and column to pivot_table. The problem is that margins compute their totals from the original data, not by summing the cells already in the table. When combined with fill_value, this can produce rows where the margin total looks higher than the sum of visible cells.

pt_margins = df.pivot_table(
    values="sales",
    index="region",
    columns="product",
    aggfunc="sum",
    fill_value=0,
    margins=True
)
print(pt_margins)

The "All" row for South shows 150, because it re-aggregates from the raw data where South/B is NaN and NaN is excluded from sum. The South row in the table shows 0 for B (from fill_value), so if you manually add the cells you get 150 + 0 = 150. They happen to match here, but with aggfunc="mean", the margin mean is computed over all South rows, not by averaging the means in the table cells β€” which gives a different result whenever group sizes differ.

The rule: treat margin values as independently computed from raw data, not as summaries of your pivot cells.

Reason 4: groupby Excludes NaN Keys by Default

When your grouping column itself contains NaN, groupby drops those rows entirely unless you pass dropna=False. pivot_table also drops them by default, but you can keep them with dropna=False on the pivot call.

data_with_nan_key = {
    "region": ["North", None, "South"],
    "sales": [100, 200, 150],
}
dkf = pd.DataFrame(data_with_nan_key)

# groupby: None row is dropped
dkf.groupby("region")["sales"].sum()

# groupby: keep the NaN group
dkf.groupby("region", dropna=False)["sales"].sum()

# pivot_table: also drops NaN keys by default
dkf.pivot_table(values="sales", index="region", aggfunc="sum")

# pivot_table: keep NaN keys
dkf.pivot_table(values="sales", index="region", aggfunc="sum", dropna=False)

If your row counts downstream don't add up to your original DataFrame length, a NaN key is often the culprit.

Reason 5: Multiple aggfunc Values Change the Column Structure

When you pass a list to aggfunc in pivot_table, you get a MultiIndex column. If you then try to compare a single column from that result against a groupby output, you're indexing incorrectly.

pt_multi = df.pivot_table(
    values="sales",
    index="region",
    columns="product",
    aggfunc=["sum", "mean"]
)
print(pt_multi.columns)  # MultiIndex: (sum, A), (sum, B), (mean, A), (mean, B)

# Access the sum sub-table
sum_table = pt_multi["sum"]
print(sum_table)

The structure itself isn't wrong, but pulling a single number out of a multi-aggregation pivot and comparing it to a groupby result requires knowing exactly which level you're slicing.

When to Use groupby vs pivot_table

These tools solve slightly different problems. Picking the right one from the start avoids the reconciliation headache.

ScenarioBest toolReason
Single grouping dimension, one aggregationgroupbyLess overhead, chainable, more explicit
Two grouping dimensions you want as row/column axespivot_tableProduces a readable cross-tab directly
Multiple aggregation functions at onceEithergroupby().agg() or pivot_table(aggfunc=[...])
Reporting output with subtotalspivot_tablemargins=True adds grand totals
Programmatic downstream processinggroupbyReturns a Series/DataFrame with a flat index
Exploratory analysis over a wide datasetpivot_tableEasier to scan visually

In practice, groupby is the workhorse for computation and pivot_table is a display-oriented shortcut. When precision matters, groupby gives you fewer surprises.

Making Both Produce Identical Results

If you need them to match for a validation or audit, align all three variables: aggregation function, NaN handling, and key exclusion.

# Target: sum of sales by region and product, NaN sales treated as 0, NaN keys excluded

# groupby version
gb_result = (
    df.groupby(["region", "product"], dropna=True)["sales"]
    .sum(min_count=0)  # treat NaN as 0 rather than skipping
    .unstack(fill_value=0)
)

# pivot_table version
pt_result = df.pivot_table(
    values="sales",
    index="region",
    columns="product",
    aggfunc="sum",
    fill_value=0,
    dropna=True,
    observed=True  # Pandas 2.x: avoids including unobserved categories
)

# Compare
print(gb_result.equals(pt_result))

Note the observed=True parameter. In Pandas 2.x, when your grouping column is a Categorical dtype, groupby and pivot_table may include unobserved category combinations unless you set this flag. It's another silent source of row-count mismatches.

Common Pitfalls at a Glance

  • Forgetting to set aggfunc β€” pivot_table defaults to mean; always be explicit.
  • Comparing pivot cells to groupby totals β€” the margin row in pivot_table is not a sum of its cells when fill_value is involved.
  • Categorical columns without observed=True β€” Pandas 2.x will include category combinations that don't exist in the data, inflating your row count.
  • Sorting differences β€” groupby preserves insertion order by default in recent Pandas; pivot_table sorts index alphabetically. Use .sort_index() before comparing.
  • Calling .reset_index() on one but not the other β€” index shape mismatches make .equals() return False even when values match.

Wrapping Up

The aggregation mismatch between pivot_table and groupby almost always comes down to one of five things: default aggfunc, NaN value handling, NaN key exclusion, how margins are computed, or unobserved Categorical combinations. None of these are bugs β€” they're documented behavior that becomes intuitive once you've seen each case once.

Here are four concrete actions to take next:

  1. Audit any existing pivot_table calls in your codebase and confirm aggfunc is explicit, not relying on the "mean" default.
  2. Add fill_value=0 to any pivot_table used for financial or count data where missing cells should read as zero, not NaN.
  3. If you're on Pandas 2.x and using Categorical grouping columns, add observed=True to both your groupby and pivot_table calls.
  4. When auditing totals, verify the margins row by re-computing it independently with groupby rather than trusting the pivot margin.
  5. Use groupby().agg() for programmatic pipelines where correctness is critical, and reserve pivot_table for human-readable cross-tabs.

πŸ“€ Share this article

Sign in to save

Comments (0)

No comments yet. Be the first!

Leave a Comment

Sign in to comment with your profile.

πŸ“¬ Weekly Newsletter

Stay ahead of the curve

Get the best programming tutorials, data analytics tips, and tool reviews delivered to your inbox every week.

No spam. Unsubscribe anytime.