Fixing Python Pandas dropna That Removes Rows With Partial NaN When You Need Complete Cases Only

July 02, 2026 8 min read 2 views

You call df.dropna() expecting to keep only the rows where every column has a real value, but your result set is much smaller than it should be. Rows that were missing just one optional field got wiped out alongside genuinely incomplete records. This is a very common source of silent data loss in cleaning pipelines.

The root cause is simple: dropna() uses how='any' by default, which means a single NaN anywhere in a row triggers removal. If your dataset has columns you don't care about for completeness, that default will throw away rows you wanted to keep.

What You'll Learn

  • How dropna()'s default behavior differs from what most people expect
  • When to use how='all' instead of the default how='any'
  • How to restrict dropna() to only the columns that actually matter
  • How to use thresh to require a minimum number of non-null values
  • How to audit your DataFrame for missing values before deciding on a strategy

Prerequisites

You need Python 3.8 or later and Pandas 1.3 or later installed. The examples below work on any Pandas version from 1.x through 2.x. You should be comfortable creating DataFrames and know what NaN means in a numeric context.

How dropna Actually Decides What to Drop

Before you can fix the problem, you need to understand the decision logic. dropna() has four keyword arguments that control its behavior:

Parameter Default What it controls
axis 0 Drop rows (0) or columns (1)
how 'any' Drop if any NaN present, or only if all values are NaN
thresh None Minimum number of non-NaN values required to keep a row
subset None Restrict the check to these column labels only

Most developers call df.dropna() with no arguments and assume it means "drop rows that are entirely empty." It doesn't. It means "drop rows where any column is NaN across all columns." That distinction is where the data loss happens.

The how Parameter: any vs all

The how parameter is the first thing to check when dropna() feels too aggressive.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'user_id':  [1, 2, 3, 4],
    'email':    ['a@x.com', 'b@x.com', np.nan, 'd@x.com'],
    'phone':    [np.nan, '555-0100', '555-0101', np.nan],
    'country':  ['US', np.nan, 'CA', 'UK'],
})

print(df)
   user_id    email      phone country
0        1  a@x.com        NaN      US
1        2  b@x.com   555-0100     NaN
2        3      NaN   555-0101      CA
3        4  d@x.com        NaN      UK

With the default how='any', every row disappears because every row has at least one NaN:

df.dropna()          # Returns empty DataFrame

With how='all', only rows where every column is NaN are removed. In this dataset, none of the rows qualify, so nothing is dropped:

df.dropna(how='all') # Returns all 4 rows

Neither of those is necessarily what you want. The sweet spot is usually subset combined with the appropriate how.

The subset Parameter: Target Specific Columns

The subset parameter tells dropna() to only look at the columns you specify when deciding whether a row should be dropped. This is the single most useful parameter for complete-case analysis.

Imagine the data above represents a contact list and you require user_id and email to be present, but phone and country are optional. Pass them to subset:

# Keep rows only when user_id AND email are non-null
clean = df.dropna(subset=['user_id', 'email'])
print(clean)
   user_id    email      phone country
0        1  a@x.com        NaN      US
1        2  b@x.com   555-0100     NaN
3        4  d@x.com        NaN      UK

Row 2 (index 2) is gone because email was NaN. Rows 0, 1, and 3 survive even though they each have NaN in phone or country. This is "complete case" behavior relative to the columns that matter.

You can also combine subset with how='all' to drop rows only when all of the specified columns are null:

# Drop only when BOTH email and phone are missing
clean = df.dropna(subset=['email', 'phone'], how='all')
print(clean)
   user_id    email      phone country
0        1  a@x.com        NaN      US
1        2  b@x.com   555-0100     NaN
2        3      NaN   555-0101      CA
3        4  d@x.com        NaN      UK

All four rows survive here because no row is missing both email and phone at the same time.

The thresh Parameter: Require a Minimum of Valid Values

thresh gives you a numeric floor: keep a row only if it has at least N non-NaN values. It applies across the columns you specify in subset, or across all columns if subset is not set.

# Keep rows that have at least 3 non-null values across all 4 columns
clean = df.dropna(thresh=3)
print(clean)
   user_id    email      phone country
0        1  a@x.com        NaN      US
1        2  b@x.com   555-0100     NaN
2        3      NaN   555-0101      CA
3        4  d@x.com        NaN      UK

Every row has exactly 3 non-null values (one NaN each), so all four rows survive with thresh=3. If you raise it to 4, you're back to an empty result because no row is fully complete.

Note that thresh and how are mutually exclusive. If you pass both, Pandas will raise a TypeError. Choose one or the other.

Combining subset and thresh for Surgical Control

The most powerful pattern is using subset with thresh together. This lets you say: "among these specific columns, require at least N valid values."

df2 = pd.DataFrame({
    'order_id':   [101, 102, 103, 104, 105],
    'product_a':  [10.0, np.nan, 5.0, np.nan, 8.0],
    'product_b':  [np.nan, 20.0, np.nan, np.nan, 15.0],
    'product_c':  [7.0, 3.0, np.nan, np.nan, np.nan],
    'notes':      [np.nan, np.nan, np.nan, np.nan, 'gift'],
})

# Require at least 2 of the 3 product columns to be filled
clean = df2.dropna(subset=['product_a', 'product_b', 'product_c'], thresh=2)
print(clean)
   order_id  product_a  product_b  product_c notes
0       101       10.0        NaN        7.0   NaN
1       102        NaN       20.0        3.0   NaN
4       105        8.0       15.0        NaN  gift

Rows 102 (order_id 103) and 103 (order_id 104) are dropped because they each have fewer than two product values. The notes column is irrelevant to the decision because it's not in subset.

This approach maps directly to real analysis scenarios: survey data where you need at least half the questions answered, or sensor readings where a minimum number of channels must be active for the row to be usable.

Inspecting Your Data Before You Drop

Calling dropna() without first understanding your null pattern is how data loss sneaks in. Always audit first.

# Count nulls per column
print(df.isnull().sum())

# See what percentage of each column is null
print(df.isnull().mean().mul(100).round(1))

# Count rows that are fully complete
print(df.dropna().shape[0], 'fully complete rows out of', len(df))

If you want to see which rows would survive before committing to the drop, use boolean indexing with notna():

# Preview rows where email and user_id are both present
mask = df['email'].notna() & df['user_id'].notna()
print(df[mask])

This is the same logic that dropna(subset=['email', 'user_id']) applies internally, but it lets you inspect the mask before actually filtering. This technique is also useful for debugging pipelines, which you can read more about in the context of silent errors in Pandas apply() operations.

For a broader null-handling perspective β€” especially when you suspect upstream data ingestion is introducing unexpected nulls β€” the same principle of inspecting before acting applies to Pandas to_datetime producing NaT on mixed formats, where NaT values can accumulate silently in the same pipeline.

Common Pitfalls and Gotchas

Passing thresh and how together raises a TypeError

As mentioned, these two parameters conflict. You'll get a TypeError: cannot specify both 'how' and 'thresh' at runtime. Pick one. If you need "at least N valid values," use thresh. If you need "all or any," use how.

dropna does not catch empty strings

NaN and an empty string '' are not the same thing in Pandas. A cell containing '' will survive dropna() without complaint. If your source data uses empty strings to represent missing values, replace them first:

df.replace('', np.nan, inplace=True)
df.dropna(subset=['email'])

The same applies to sentinel values like 0, -1, or 'N/A' (as a literal string). You need to convert those to np.nan before dropna() will catch them. This is a close sibling of the issue covered in Pandas groupby returning NaN when group key contains None, where the distinction between Python None and np.nan causes similar confusion.

inplace=True modifies the original DataFrame silently

Using df.dropna(inplace=True) modifies df in place and returns None. If you assign the result to a new variable, that variable will be None. The safe default is to assign to a new name: clean_df = df.dropna(...).

Axis=1 drops columns, not rows

If you accidentally pass axis=1, dropna() will drop entire columns that contain any NaN. This is rarely what you want in a row-filtering context, and the result can be a DataFrame that looks right at a glance but is missing columns entirely.

Chaining dropna in a pipeline can compound drops

If you call dropna() multiple times at different stages of a pipeline, each call applies its own logic independently. A row that survived the first call may get dropped by the second. Keep your null-handling in one place where possible, or at least log the shape of the DataFrame after each step so you can trace where rows are disappearing.

# Defensive pipeline pattern
for step_name, step_df in [
    ('raw', df),
    ('after drop required cols', df.dropna(subset=['user_id', 'email'])),
]:
    print(f"{step_name}: {step_df.shape[0]} rows")

thresh counts non-NaN values, not NaN values

This trips people up because the name sounds like a threshold for how many NaNs are allowed. It's the opposite: thresh=3 means "keep rows that have at least 3 valid values." If you want to allow at most 1 NaN in a 4-column DataFrame, set thresh=3 (4 - 1 = 3).

Wrapping Up: Next Steps

Getting dropna() right is mostly about making your intent explicit instead of relying on defaults. Here are concrete actions to take:

  1. Audit nulls first. Run df.isnull().sum() and df.isnull().mean() before writing any dropna() call so you know the shape of the problem.
  2. Always pass subset. Identify which columns actually define a "complete case" for your analysis and pass only those. Never rely on the default all-columns behavior.
  3. Choose how or thresh, not both. Use thresh when you need a flexible minimum; use how when the logic is purely "any" or "all."
  4. Replace sentinel values before dropping. Convert empty strings, -1 placeholders, and literal 'N/A' strings to np.nan so dropna() actually sees them.
  5. Log row counts at each pipeline step. A simple print(df.shape) after each transformation makes silent data loss immediately visible.

Frequently Asked Questions

Why does pandas dropna remove all rows when my DataFrame has NaN in optional columns?

By default, dropna() uses how='any', which drops a row if even one column contains NaN regardless of whether that column matters to you. Use the subset parameter to restrict the null check to only the columns that must be non-null for your analysis.

How do I keep rows in pandas where only specific required columns have values?

Pass those column names to the subset parameter: df.dropna(subset=['col_a', 'col_b']). This tells dropna() to ignore NaN in all other columns and only drop rows where col_a or col_b is missing.

What is the difference between dropna thresh and dropna how in pandas?

how='any' or how='all' applies an all-or-nothing rule across the checked columns, while thresh=N requires at least N non-NaN values to keep the row. You cannot use both in the same call β€” passing both raises a TypeError.

Does pandas dropna treat empty strings the same as NaN?

No. An empty string '' is a valid Python object and is not considered NaN by Pandas. You need to replace empty strings with np.nan before calling dropna() if you want them to be treated as missing values.

How can I preview which rows dropna will remove before actually dropping them?

Build a boolean mask manually using notna() on the columns you care about, for example: mask = df['email'].notna() & df['user_id'].notna(), then inspect df[~mask] to see the rows that would be dropped before committing to the filter.

πŸ“€ Share this article

Sign in to save

Comments (0)

No comments yet. Be the first!

Leave a Comment

Sign in to comment with your profile.

πŸ“¬ Weekly Newsletter

Stay ahead of the curve

Get the best programming tutorials, data analytics tips, and tool reviews delivered to your inbox every week.

No spam. Unsubscribe anytime.