Fixing Python Pandas value_counts That Silently Excludes NaN From Results
You run value_counts() on a column and the frequency table looks perfectly clean. No warnings, no errors. Then you check df[col].isna().sum() and realize you have 400 null rows that never showed up. That silence is the bug.
Pandas drops NaN from value_counts() results by default. It does this quietly, and it can silently distort every percentage and ranking in your analysis if you don't know to look for it.
What You'll Learn
- Why
value_counts()excludesNaNby default and what that means for your results - How to include missing values with
dropna=False - How normalization behaves differently once
NaNis included - How to count missing values across multiple columns efficiently
- When excluding
NaNis actually the right call
Prerequisites
You need pandas installed (any version from 1.0 onward covers everything here) and a basic familiarity with Series and DataFrame objects. The examples use Python 3.8+.
Why value_counts() Drops NaN by Default
The dropna parameter on value_counts() defaults to True. This design decision mirrors SQL's treatment of NULL in aggregations: missing values are considered non-observations and excluded unless you explicitly ask for them.
The intent is defensible in some contexts. If you're generating a report on product categories, an unknown category might be noise rather than signal. But in data cleaning and quality analysis workflows, missing values are exactly what you're looking for. Dropping them by default means you need to remember to override the behavior every single time.
The problem compounds when you use normalize=True. The percentages are computed over the non-null population, so every percentage in your output is inflated relative to the true dataset size. You end up with a frequency table that sums to 100% while hiding a potentially large chunk of your data.
Reproducing the Silent NaN Exclusion
Here's a minimal example that shows exactly what happens. Build a small Series with some missing values baked in:
import pandas as pd
import numpy as np
status = pd.Series(["active", "inactive", "active", None, "active", np.nan, "inactive", None])
print(status.value_counts())
Output:
active 3
inactive 2
dtype: int64
Three NaN-like values exist in the series (two None and one np.nan), and none of them appear in the result. You get back 5 counts from an 8-element series with no indication that anything was dropped.
Now try with normalize=True:
print(status.value_counts(normalize=True))
active 0.600000
inactive 0.400000
dtype: float64
The percentages sum to 100%, which is factually wrong relative to the full dataset. Active is 3/8 = 37.5%, not 60%. That error flows silently into any dashboard, model feature, or business report built on this output.
The Fix: dropna=False
Pass dropna=False and NaN appears in the result like any other value:
print(status.value_counts(dropna=False))
active 3
NaN 3
inactive 2
dtype: int64
Now you see the full picture. Both None and np.nan are grouped together under a single NaN label. Pandas treats them as the same missing sentinel at this level of aggregation, so you don't need to handle them separately.
You can also sort by count descending (the default) or ascending if you want to spot rare values at the top:
print(status.value_counts(dropna=False, ascending=True))
inactive 2
active 3
NaN 3
dtype: int64
This is particularly useful when you're auditing a column and want to confirm that your most common values aren't missing ones masquerading as data.
Normalizing Percentages When NaN Is Present
Once you pass dropna=False, normalization is computed over the full series length, which is what you actually want:
print(status.value_counts(dropna=False, normalize=True))
active 0.375
NaN 0.375
inactive 0.250
dtype: float64
Now the percentages reflect reality: 37.5% active, 37.5% missing, 25% inactive. These are the numbers you need to make honest decisions about imputation, data collection quality, or whether a feature is usable in a model.
If you want to display these as formatted percentages, you can chain a simple transform:
pct = status.value_counts(dropna=False, normalize=True).mul(100).round(1).astype(str) + "%"
print(pct)
active 37.5%
NaN 37.5%
inactive 25.0%
dtype: object
Counting NaN Values Across Multiple Columns
On a full DataFrame you often want missing-value frequency counts for every column at once. value_counts() operates on a Series, so you need a different approach for the column-level overview.
The fastest way to get absolute counts of NaN per column:
df = pd.DataFrame({
"status": ["active", "inactive", None, "active", np.nan],
"region": ["north", None, "south", None, "east"],
"score": [8.5, 7.0, np.nan, 9.1, np.nan],
})
print(df.isna().sum())
status 2
region 2
score 2
dtype: int64
To get missing-value percentages per column:
print(df.isna().mean().mul(100).round(1))
status 40.0
region 40.0
score 40.0
dtype: float64
When you need the full value_counts() breakdown for every column in a loop, you can use a dictionary comprehension:
freq = {col: df[col].value_counts(dropna=False) for col in df.columns}
for col, counts in freq.items():
print(f"\n--- {col} ---")
print(counts)
This pattern is useful during exploratory data analysis when you want to audit each column before deciding on a cleaning strategy. For a deeper look at how NaN propagates through row-level operations, see the article on fixing Pandas dropna that removes rows with partial NaN when you need complete cases only.
When to Keep NaN Out (and When You Shouldn't)
There are legitimate situations where the default dropna=True behavior is exactly right.
- Reporting known-good data: If missing rows have already been filtered upstream and you're summarizing a clean dataset, including
NaNin output adds noise. - Model input validation: Confirming that a feature column is fully populated before training β you want a clean count without null entries polluting the frequency table.
- Display-layer summaries: End-user dashboards often treat missing as non-existent by convention; showing a
NaNrow can confuse non-technical stakeholders.
But in data cleaning, quality audits, and any EDA phase, you should default to dropna=False. The discipline is simple: treat NaN as a first-class value until you consciously decide to remove it. The same principle applies when you're debugging other silent-drop behaviors, like the one covered in fixing Pandas groupby that returns NaN when the group key contains None.
Common Pitfalls With value_counts and Missing Data
Pitfall 1: Assuming the index tells you if NaN is present
Even with dropna=False, NaN appears in the index of the returned Series, not as a string. If you try to filter or look it up with result["NaN"] you'll get a KeyError. Use result[result.index.isna()] instead.
result = status.value_counts(dropna=False)
# Wrong:
# result["NaN"] # KeyError
# Correct:
nan_count = result[result.index.isna()].sum()
print(nan_count) # 3
Pitfall 2: Mixing pd.NA and np.nan in the same column
In columns with pandas nullable dtypes (Int64, StringDtype), missing values are represented as pd.NA, not np.nan. Pandas still groups them correctly under NaN in the value_counts output, but watch out when you're doing comparisons elsewhere in your pipeline. pd.isna() handles both; direct equality checks with np.nan do not.
Pitfall 3: Forgetting dropna=False after a resample or groupby
If you call value_counts() inside a groupby().apply() block, the dropna parameter applies to each group independently. A group might have zero nulls but another might have many β and the default behavior silently handles each group differently depending on what's in it. Be explicit in every call inside applied functions. Related: Pandas apply() can also silently swallow errors on axis=1, which compounds this kind of invisible data loss.
Pitfall 4: Using value_counts on object columns with whitespace
A string " " (space) and np.nan are different things. Columns imported from CSV sometimes have empty-string values that aren't caught by isna(). Run a quick df[col].replace("", np.nan) before counting if your source is CSV or Excel. For related CSV-to-DataFrame quirks, the article on fixing Pandas to_datetime that silently produces NaT on mixed formats is worth reading alongside this one.
Pitfall 5: Treating NaN frequency as always bad
Sometimes high NaN frequency in a column is the signal you needed to find. A column that is 80% missing might be a candidate for dropping from a model, for a targeted data collection effort, or for an upstream pipeline fix. Don't suppress that information before you've understood what it means.
Wrapping Up: Next Steps
The one-parameter fix is simple. But the habit of thinking about where missing values go is what separates reliable data pipelines from ones that produce subtly wrong results. Here's what to do now:
- Audit your existing notebooks: Search for every call to
value_counts()in your codebase and check whetherdropna=Falseshould have been there. - Add a missing-value audit step to your EDA template: Make
df.isna().sum()and per-columnvalue_counts(dropna=False)a standard first step before any analysis. - Be explicit in team code: If you're writing shared pipeline code, always pass
dropnaexplicitly (eitherTrueorFalse) so the intent is clear to reviewers. - Check your normalization denominators: Anywhere you use
normalize=True, verify the total against the original series length to confirm nothing was silently excluded. - Extend the pattern to groupby: Review how
NaNgroup keys behave in your aggregations β the behavior is different fromvalue_countsand equally easy to misread.
Frequently Asked Questions
How do I include NaN in pandas value_counts results?
Pass dropna=False to value_counts(), like series.value_counts(dropna=False). This tells pandas to treat NaN as a regular value and include it as a row in the output frequency table.
Does value_counts normalize=True include NaN in the percentage calculation?
Only when you also pass dropna=False. By default, normalize=True computes percentages over the non-null population, which inflates every percentage. With dropna=False the denominator is the full series length, giving you accurate proportions.
Why does my pandas value_counts output not add up to the total number of rows?
Pandas drops NaN values from value_counts by default, so the counts only cover non-missing rows. Add dropna=False to include missing values and make the totals match your full dataset size.
How can I count NaN values across all columns in a pandas DataFrame at once?
Use df.isna().sum() for absolute counts per column, or df.isna().mean() for the missing fraction. Both return a Series indexed by column name and are much faster than looping over value_counts for this purpose.
What is the difference between None and np.nan in pandas value_counts?
In most pandas column types, both None and np.nan are treated as missing values and grouped together under a single NaN label in the value_counts output. You do not need to handle them separately when using dropna=False.
π€ Share this article
Sign in to saveRelated Articles
Comments (0)
No comments yet. Be the first!