Fixing Python Pandas GroupBy That Silently Drops Columns With Mixed Types
You run a groupby().sum() on a DataFrame with a dozen columns, and the result comes back with only five. No warning. No error. Pandas just quietly discarded the rest. If this has happened to you mid-analysis, you know how disorienting it is β especially when you're not sure which columns disappeared or why.
The root cause is almost always mixed or non-numeric column types interacting with aggregation functions that only know how to handle numbers. Understanding exactly how Pandas makes that decision β and how to override it β is what this article is about.
What You'll Learn
- Why Pandas silently drops columns during
groupbyaggregations - How to detect which columns were dropped and why
- Five concrete fixes, from the quickest one-liner to the most explicit approach
- Which Pandas version changed the default behavior and what to expect going forward
- Pitfalls that cause the problem to reappear even after you think you've fixed it
The Problem: Columns That Disappear Without Warning
Consider a sales DataFrame with columns for region, rep_name, revenue, units, and notes. You want totals by region:
import pandas as pd
df = pd.DataFrame({
'region': ['North', 'North', 'South', 'South'],
'rep_name': ['Alice', 'Bob', 'Carol', 'Dave'],
'revenue': [12000, 9500, 11000, 8000],
'units': [120, 95, 110, 80],
'notes': ['Q1', 'Q1', 'Q1', 'Q1'],
})
result = df.groupby('region').sum()
print(result.columns.tolist())
You might expect ['rep_name', 'revenue', 'units', 'notes']. Instead you get something like ['revenue', 'units']. The string columns rep_name and notes are gone.
Why Pandas GroupBy Drops Columns Silently
When you call an aggregation method like .sum(), .mean(), or .std() on a GroupBy object, Pandas needs to decide what to do with every non-key column. For numeric columns the math is straightforward. For object (string), boolean, or mixed-type columns, the answer depends on both the operation and the Pandas version.
Before Pandas 1.5, many aggregation functions simply skipped non-numeric columns without telling you. In Pandas 2.0, this behavior was tightened further: the numeric_only parameter, which previously defaulted to None (try numeric, silently skip others), now defaults to False for most methods β meaning it raises an error if any column can't be aggregated. However, .sum() still has edge cases because string concatenation is technically valid, so the behavior isn't perfectly uniform across methods.
The practical result is that between Pandas versions, code that "worked" by silently dropping columns may now raise errors, or vice versa. Either way, you lose predictability. The fix is to be explicit about which columns you aggregate and how.
How to Reproduce the Bug
Here is a minimal reproduction that works across Pandas 1.x and 2.x and clearly shows the column-drop behavior:
import pandas as pd
df = pd.DataFrame({
'category': ['A', 'A', 'B', 'B'],
'label': ['x', 'y', 'x', 'y'], # object dtype
'value': [10, 20, 30, 40], # int64
'score': [1.1, 2.2, 3.3, 4.4], # float64
'flag': [True, False, True, True], # bool
})
print("Original dtypes:")
print(df.dtypes)
print()
result = df.groupby('category').sum(numeric_only=True)
print("GroupBy result columns:", result.columns.tolist())
With numeric_only=True, only value and score appear. The label column (object) is gone. The flag column (bool) may or may not appear depending on the Pandas version β bool is technically numeric, but some operations exclude it.
Detecting Which Columns Were Dropped
Before applying a fix, it helps to audit your DataFrame and know exactly what you're working with. This snippet compares columns before and after:
groupby_key = 'category'
agg_fn = 'sum'
expected_cols = set(df.columns) - {groupby_key}
result = df.groupby(groupby_key).sum(numeric_only=True)
actual_cols = set(result.columns)
dropped = expected_cols - actual_cols
print("Dropped columns:", dropped)
print("Retained dtypes:")
print(df[list(actual_cols)].dtypes)
You can also inspect dtypes upfront to predict which columns will be problematic:
non_numeric = df.select_dtypes(exclude='number').columns.tolist()
print("Non-numeric columns (likely to be dropped):", non_numeric)
Running this before a long aggregation pipeline saves you from discovering missing columns two steps later.
Fix 1: Select Only the Columns You Need Before GroupBy
The simplest and most readable fix is to explicitly select the columns you want before calling groupby. This removes ambiguity entirely and makes your intent obvious to anyone reading the code later.
# Only aggregate 'value' and 'score'; don't touch 'label' at all
result = df[['category', 'value', 'score']].groupby('category').sum()
print(result)
This approach is the right default for most pipelines. You know what you need; tell Pandas exactly that. If a new column appears in the source data later, it won't accidentally enter your aggregation and break things downstream.
Fix 2: Use numeric_only Parameter Explicitly
If you genuinely want all numeric columns aggregated and don't care about the rest, pass numeric_only=True so the behavior is intentional and documented in your code rather than an implicit side effect:
result = df.groupby('category').sum(numeric_only=True)
The key word here is explicitly. The old silent default was the problem. When you write numeric_only=True, a future reader understands you made a choice, not a mistake. If you're on Pandas 2.0+, this parameter is available on .sum(), .mean(), .std(), .var(), .min(), .max(), and a few others.
One gotcha: numeric_only is not available on every aggregation method. On .first() and .last(), for example, it was removed in Pandas 2.0. Check the docs for the specific method you're using.
Fix 3: Aggregate with Named Aggregations Using agg()
When you need different aggregation functions per column, .agg() with a dictionary gives you precise control and makes it impossible to accidentally drop columns you care about:
result = df.groupby('category').agg(
total_value=('value', 'sum'),
avg_score=('score', 'mean'),
first_label=('label', 'first'),
)
print(result)
This named aggregation syntax (available since Pandas 0.25) produces a clean, flat result where you control both the output column name and the aggregation function. String columns can use 'first', 'last', 'count', or any custom function β they don't have to be dropped just because they can't be summed.
If your non-numeric columns hold metadata you want to carry through, this is the most expressive approach. It also pairs well with the debugging approach in fixing silent errors in Pandas apply() on axis=1, since you can use lambdas inside agg() the same way.
Fix 4: Convert Mixed-Type Columns Before Aggregating
Sometimes a column looks numeric but is stored as object dtype because it contains a mix of integers and strings (often from a CSV import or a database query with nulls). In that case, convert first:
df['value'] = pd.to_numeric(df['value'], errors='coerce')
# NaN replaces unconvertible values; decide whether to fill or drop them
df['value'] = df['value'].fillna(0)
result = df.groupby('category').sum(numeric_only=True)
Use errors='coerce' rather than errors='raise' during diagnosis β it lets you see where the bad values land as NaN before you decide how to handle them. This is the same pattern described in detail in the article on Pandas read_csv silently truncating large integer columns, where dtype inference during import is the root cause.
You can also use df.convert_dtypes() to let Pandas infer the best nullable dtype for each column before you aggregate. It's not a silver bullet, but it catches obvious cases like integers stored as object:
df = df.convert_dtypes()
print(df.dtypes)
Fix 5: Use groupby with apply() for Full Control
When you need to apply complex logic per group and preserve all columns, groupby().apply() gives you the most flexibility. Each group is passed as a full DataFrame to your function:
def summarize_group(group):
return pd.Series({
'total_value': group['value'].sum(),
'avg_score': group['score'].mean(),
'labels': ', '.join(group['label'].unique()),
'count': len(group),
})
result = df.groupby('category').apply(summarize_group)
print(result)
This approach is slower on large DataFrames because it can't use vectorized operations internally, but it's the right tool when the transformation is too complex for a dictionary of aggregation functions. For performance-sensitive paths, prefer Fix 1 or Fix 3.
If you're frequently hitting issues where Pandas drops or mishandles rows during groupby operations, it's worth reading up on the related bug where groupby returns NaN when the group key contains None β the same root-cause pattern of implicit handling applies there too.
Common Pitfalls to Avoid
Even after applying a fix, a few recurring mistakes bring the problem back:
- Not pinning your Pandas version. Behavior around
numeric_onlychanged significantly between 1.3, 1.5, and 2.0. If you run on different environments without version pinning, you'll get different results. Lock your version inrequirements.txtorpyproject.toml. - Trusting dtype inference after a merge. Joining two DataFrames can silently change column dtypes, especially if one side has nulls that force an int column to become float or object. Always check
df.dtypesafter a merge before aggregating. The article on Pandas str.split silently dropping rows with missing delimiters shows how string/null interactions produce similar surprises. - Using
as_index=Falsewithout accounting for column name collisions. When you passas_index=False, the group key becomes a regular column. If you then reset the index manually as well, column names can collide or duplicate in unexpected ways. - Mixing aggregation methods and transform.
groupby().transform()returns a same-length DataFrame aligned with the original index β it doesn't aggregate. If you accidentally use it where you meantagg(), you'll get different shapes and downstream errors. - Forgetting
observed=Truewith Categorical columns. If any group key is a Categorical dtype, Pandas by default creates a row for every category combination, even ones that don't exist in your data. Passobserved=Trueto suppress phantom rows. This is separate from the mixed-type issue but frequently appears in the same pipelines.
One pattern worth calling out specifically: columns that hold numeric-looking strings like '100' or '3.14' will be treated as objects and dropped. If you import data from CSV without explicit dtype arguments, this is extremely common. Pair your groupby fix with explicit dtype specification at read time β see the Pandas to_datetime producing NaT on mixed formats article for a related approach to enforcing types at the import boundary.
Wrapping Up
Pandas dropping columns silently is one of those bugs that wastes hours because the operation doesn't fail β it just gives you a quietly wrong answer. The underlying cause is predictable once you know it: aggregation functions have no general way to handle non-numeric columns, so Pandas historically chose to discard them rather than error.
Here are the concrete steps to take right now:
- Audit your DataFrame first. Run
df.dtypesanddf.select_dtypes(exclude='number').columnsbefore any groupby to know exactly what types you're dealing with. - Be explicit about column selection. Slice to only the columns you need before calling
groupby. This is the safest default habit. - Use
.agg()with a dictionary or named aggregations when you need different functions per column, including string columns. - Set
numeric_only=Trueintentionally when you genuinely want to skip non-numeric columns, so the behavior is documented in the code. - Pin your Pandas version and run a quick smoke test after any upgrade that touches data processing pipelines.
Frequently Asked Questions
Why does pandas groupby sum drop string columns from the result?
Pandas drops string (object) columns from .sum() because it can't perform a numeric sum on text by default. With numeric_only defaulting to True in many older versions, non-numeric columns are silently excluded rather than raising an error.
How do I keep all columns after a pandas groupby operation?
Use .agg() with a dictionary specifying a function for each column, including string columns using 'first' or 'last'. This forces Pandas to include every column you name and prevents silent dropping.
Did the pandas groupby numeric_only default change in pandas 2.0?
Yes. In Pandas 2.0 the default for numeric_only was changed from None (silently skip non-numeric) to False for most aggregation methods, which means they now raise a TypeError instead of quietly dropping columns. Setting numeric_only=True explicitly restores the old skip behavior.
How can I tell which columns pandas groupby dropped without running the aggregation twice?
Call df.select_dtypes(exclude='number').columns before your groupby to see which columns are non-numeric. Compare that list against your expected output columns to predict exactly what will be dropped for numeric aggregation functions.
Is it possible to aggregate both numeric and string columns in the same pandas groupby call?
Yes. Use groupby().agg() with a per-column function mapping: numeric columns can use 'sum' or 'mean', while string columns can use 'first', 'last', or a custom lambda. This avoids dropping any column and keeps your aggregation logic explicit.
π€ Share this article
Sign in to saveRelated Articles
Comments (0)
No comments yet. Be the first!