Fixing Python Pandas to_datetime That Silently Produces NaT on Mixed Formats
You load a CSV, call pd.to_datetime(df['date_col']), and everything looks fine β until you notice your time-series chart is missing data, or a merge on the date column drops half your rows. The dates didn't raise an error; they silently became NaT. That's the worst kind of bug: quiet, confident, and wrong.
This article walks through exactly why to_datetime does this, how to detect the damage, and four concrete strategies to parse mixed-format date columns without losing a single row.
What you'll learn
- Why
pd.to_datetimesilently producesNaTinstead of raising an error - How to diagnose the scope of the problem before you fix it
- How to handle date columns that contain two or more incompatible formats
- When to use
dateutiland when to avoid it - Patterns that prevent this class of bug from recurring
The Silent Data Loss Problem
Pandas to_datetime is designed to be fast. When you give it a column, it attempts to parse every value using either the format you specify or a format it infers from the first few rows. If a value doesn't match, and you've set errors='coerce' (or if coerce is the default behavior in your call), that value becomes NaT β Not a Time β with no warning.
In a dataset sourced from multiple systems or spreadsheets, date strings almost always vary. One system writes 2024-03-15, another writes 15/03/2024, a third writes March 15, 2024. Run to_datetime once and you'll silently lose two-thirds of your dates.
Why to_datetime Produces NaT Instead of Raising an Error
By default, pd.to_datetime uses errors='raise' β which means it should raise a ValueError on any unparseable value. The silent-NaT problem almost always happens because someone explicitly passed errors='coerce', or because a utility function in a pipeline added it quietly.
There's a second, subtler cause. When you specify a format string like '%Y-%m-%d', Pandas uses a fast C-level parser and raises an error on mismatches β unless you have errors='coerce'. Without an explicit format, Pandas uses dateutil.parser.parse under the hood, which is more lenient but also more unpredictable across locales and ambiguous strings.
The key insight:
errors='coerce'is not a minor convenience flag. It changes the function from a strict parser into one that swallows bad data silently. Use it knowingly or not at all.
How Mixed Formats Appear in Real Data
Mixed date formats show up most often when data is consolidated from multiple sources: exports from different CRMs, legacy spreadsheets, or API responses from international partners. Here's a representative example:
import pandas as pd
data = {
'order_id': [1, 2, 3, 4, 5],
'order_date': [
'2024-01-15', # ISO format
'15/01/2024', # UK day-first format
'Jan 15, 2024', # natural language
'2024-01-15T09:30', # ISO with time
'01-15-2024', # US month-first with dashes
]
}
df = pd.DataFrame(data)
print(df['order_date'])
If you now run pd.to_datetime(df['order_date'], errors='coerce'), you may get three or four NaT values depending on which format Pandas guesses first. The result looks like this:
result = pd.to_datetime(df['order_date'], errors='coerce')
print(result)
# 0 2024-01-15
# 1 NaT <-- silently lost
# 2 NaT <-- silently lost
# 3 2024-01-15 09:30:00
# 4 NaT <-- silently lost
Three valid dates are gone. No exception was raised.
The Default Behavior: errors='coerce' Is a Double-Edged Sword
The errors parameter has three modes. Understanding them is the foundation of every fix below.
| errors value | On bad input | When to use it |
|---|---|---|
'raise' |
Raises ValueError immediately |
During development; you want to know about bad data |
'coerce' |
Returns NaT silently |
Only after you've verified the column is clean except for genuine nulls |
'ignore' |
Returns original value unchanged | Rarely β leaves the column as object dtype, which causes its own bugs |
For mixed-format columns, none of these alone solves the problem. You need a parsing strategy, not just a flag.
Diagnosing How Many NaTs You Actually Have
Before you fix anything, measure the damage. This snippet tells you both the count and a sample of the original strings that failed to parse.
parsed = pd.to_datetime(df['order_date'], errors='coerce')
# Count NaTs
nat_count = parsed.isna().sum()
print(f"{nat_count} values failed to parse ({nat_count / len(parsed):.1%} of column)")
# Show the original strings that produced NaT
failed_values = df.loc[parsed.isna(), 'order_date']
print(failed_values.value_counts())
The value_counts() output is particularly useful. If you see only two or three distinct failing patterns, you have a finite number of formats to handle. If you see dozens of unique strings, you may have data quality problems that need fixing before parsing.
This diagnostic step connects to a broader pattern described in our article on fixing Pandas groupby issues caused by None in group keys β in both cases, identifying the exact rows with problematic values is the first step before applying any fix.
Fix 1: Use a Single Canonical Format Upfront
If you control the data pipeline, the cleanest fix is to normalize dates to one format before they reach Pandas. If you're reading from a CSV with a known format, specify it explicitly:
# Fast and strict β raises immediately on any mismatch
df['order_date'] = pd.to_datetime(df['order_date'], format='%Y-%m-%d')
This approach is the fastest because it bypasses the dateutil heuristic entirely and uses Pandas' C-level parser. It also makes the format contract explicit in code, which is useful documentation for future maintainers.
If your data source is inconsistent (e.g., a user-uploaded CSV), you won't be able to guarantee one format. Move on to Fix 2.
Fix 2: Try Multiple Formats With a Fallback Loop
When you know the finite set of formats in your data, loop through them in order of prevalence. Parse what each format can handle, and fill in the remaining NaT values on each pass:
import pandas as pd
import numpy as np
formats_to_try = [
'%Y-%m-%d',
'%Y-%m-%dT%H:%M',
'%d/%m/%Y',
'%m-%d-%Y',
'%b %d, %Y',
]
def parse_mixed_dates(series, formats):
result = pd.Series([pd.NaT] * len(series), index=series.index)
remaining_mask = pd.Series([True] * len(series), index=series.index)
for fmt in formats:
if not remaining_mask.any():
break
# Only attempt to parse rows not yet successfully parsed
subset = series[remaining_mask]
parsed = pd.to_datetime(subset, format=fmt, errors='coerce')
succeeded = parsed.notna()
result[subset[succeeded].index] = parsed[succeeded]
remaining_mask[subset[succeeded].index] = False
# Report anything still unparsed
still_failed = remaining_mask.sum()
if still_failed:
print(f"Warning: {still_failed} values could not be parsed by any format.")
print(series[remaining_mask].value_counts().to_string())
return result
df['order_date_parsed'] = parse_mixed_dates(df['order_date'], formats_to_try)
print(df[['order_date', 'order_date_parsed']])
This gives you full control: you define the formats, you decide the priority, and you get an explicit warning instead of silent data loss. The downside is that you must know your formats in advance.
Fix 3: Use dateutil for Flexible Parsing
When you can't enumerate every format, dateutil.parser.parse handles a surprisingly wide range of natural-language and ambiguous date strings. Apply it with apply() and wrap it in a try/except to surface failures rather than lose them:
from dateutil import parser as dateutil_parser
import pandas as pd
def safe_dateutil_parse(value):
if pd.isna(value) or str(value).strip() == '':
return pd.NaT
try:
return dateutil_parser.parse(str(value))
except (ValueError, OverflowError):
print(f"Could not parse: {value!r}")
return pd.NaT
df['order_date_parsed'] = df['order_date'].apply(safe_dateutil_parse)
The important caveat: dateutil makes assumptions about ambiguous dates. The string 01/02/03 could be January 2nd 2003, February 1st 2003, or February 3rd 2001 depending on locale settings. If your data contains dates like these, you need to pass dayfirst=True or yearfirst=True to the parser:
from functools import partial
# Enforce day-first interpretation for European date strings
parse_eu = partial(dateutil_parser.parse, dayfirst=True)
def safe_eu_parse(value):
if pd.isna(value) or str(value).strip() == '':
return pd.NaT
try:
return parse_eu(str(value))
except (ValueError, OverflowError):
return pd.NaT
df['order_date_parsed'] = df['order_date'].apply(safe_eu_parse)
Note that apply() with a Python function is slower than vectorized Pandas operations, sometimes by an order of magnitude on large datasets. For a column with millions of rows, Fix 2's format-looping approach will be significantly faster.
Fix 4: Parse Column Subsets by Format Group
If you can identify which rows follow which format β perhaps through a source system column, a filename pattern, or a string-length heuristic β you can split the column, parse each subset with the correct format, and concatenate the results:
import pandas as pd
# Heuristic: ISO dates contain a '-' in position 4; UK dates contain '/'
iso_mask = df['order_date'].str.match(r'^\d{4}-\d{2}-\d{2}')
uk_mask = df['order_date'].str.match(r'^\d{2}/\d{2}/\d{4}$')
us_mask = df['order_date'].str.match(r'^\d{2}-\d{2}-\d{4}$')
df['order_date_parsed'] = pd.NaT
df.loc[iso_mask, 'order_date_parsed'] = pd.to_datetime(
df.loc[iso_mask, 'order_date'], format='%Y-%m-%d'
)
df.loc[uk_mask, 'order_date_parsed'] = pd.to_datetime(
df.loc[uk_mask, 'order_date'], format='%d/%m/%Y'
)
df.loc[us_mask, 'order_date_parsed'] = pd.to_datetime(
df.loc[us_mask, 'order_date'], format='%m-%d-%Y'
)
# Anything not covered by a mask is left as NaT β now intentional and visible
unhandled = df['order_date_parsed'].isna() & df['order_date'].notna()
if unhandled.any():
print("Unhandled formats:", df.loc[unhandled, 'order_date'].unique())
This is the most explicit approach and the easiest to audit. Each format rule is visible in code, regex patterns are testable in isolation, and unhandled values are surfaced deliberately.
A similar philosophy applies when debugging silent row losses in other Pandas operations β for example, our walkthrough of why Pandas groupby drops rows when group keys contain None shows how isolating and identifying affected rows first leads to cleaner fixes.
Common Pitfalls to Avoid
Assuming infer_datetime_format is reliable
The infer_datetime_format=True parameter (deprecated in recent Pandas versions) samples the first few rows to guess the format. If your first rows happen to be ISO format but later rows are UK format, the inferred format silently fails on everything else. Never rely on inference for columns you haven't profiled.
Using errors='ignore' instead of errors='coerce'
Passing errors='ignore' leaves the entire column as object dtype when any parse failure occurs. This is arguably worse than NaT because the column looks like it parsed β operations that expect a datetime column simply fail later, further from the root cause. Avoid this flag unless you have a very specific reason.
Forgetting timezone handling
A column that mixes timezone-aware and timezone-naive strings will produce NaT or a TypeError depending on Pandas version. If your data includes ISO 8601 strings with offsets like 2024-01-15T09:30:00+01:00 alongside naive strings, parse them separately and convert to UTC before combining:
# Force UTC on timezone-aware subset
aware_parsed = pd.to_datetime(aware_series, utc=True)
# Then convert naive to UTC-assumed before concat
naive_utc = pd.to_datetime(naive_series).dt.tz_localize('UTC')
Silently losing data in a read_csv call
If you pass parse_dates=['order_date'] to pd.read_csv() on a mixed-format column, the result depends on the Pandas version and the data. Check the dtype of the resulting column immediately after loading. If it's object, parsing failed silently. Treat date parsing as a deliberate post-load step, not something to delegate to the CSV reader. This is similar in spirit to the silent row-skipping behaviour described in fixing PostgreSQL COPY FROM that silently skips rows on null delimiter mismatch β both issues share the same root cause: implicit parsing decisions that fail without feedback.
Treating NaT as acceptable without measuring first
Some engineers reach for errors='coerce' and then dropna() without checking how many rows they dropped. A column that is 3% null is very different from one that is 40% null after parsing. Always run the diagnostic from the earlier section before deciding what to do with failed values.
Wrapping Up
Silent NaT from to_datetime is a data quality bug, not just a Pandas quirk. The fix is always some combination of: know your formats, parse deliberately, and surface failures explicitly rather than swallowing them.
Here are your concrete next steps:
- Run the diagnostic on any existing date column you currently parse with
errors='coerce'. Find out exactly how many values are actually becomingNaT. - Choose the right fix for your situation: Fix 1 for clean single-format data, Fix 2 for known multi-format columns, Fix 3 for unpredictable natural language, Fix 4 for structurally segmented data.
- Add an assertion after date parsing in your data pipeline:
assert df['order_date'].isna().sum() == expected_nulls. This catches regressions immediately when source format changes. - Document the formats your date column accepts as a comment above the parsing call. Future maintainers will thank you.
- Profile performance if your dataset is large: use Fix 2's format-loop approach for datasets over 100k rows rather than
apply()with dateutil.
Frequently Asked Questions
Why does pd.to_datetime return NaT for some values but not others in the same column?
This usually happens when the column contains dates in multiple formats. Pandas either infers a format from the first few rows or you supply one explicitly, and any value that doesn't match that format becomes NaT when errors='coerce' is set. The fix is to parse each format group separately or use a flexible parser like dateutil.
Is it safe to use errors='coerce' with pd.to_datetime?
It's safe only after you've verified the column is clean except for genuinely null values. Using errors='coerce' on an unvalidated column masks format mismatches as NaT, causing silent data loss. Always run a diagnostic to check how many NaTs you're producing before committing to this flag in a pipeline.
How can I parse a date column that has both ISO 8601 and European date formats in the same CSV?
The most reliable approach is to split the column by format using regex masks, parse each subset with the appropriate format string, and combine the results back into one column. This gives you a strict, auditable parsing step and surfaces any strings that don't match a known format as deliberately unhandled NaT values.
Does dateutil.parser handle all mixed date formats automatically?
Dateutil handles a wide variety of formats but makes assumptions about ambiguous dates like '01/02/03', which can vary by locale. You should pass dayfirst=True or yearfirst=True to control its interpretation, and always wrap it in a try/except to catch values it cannot handle rather than letting them become silent NaTs.
Why does my date column come out as object dtype after using parse_dates in pd.read_csv?
When pd.read_csv encounters dates it cannot parse, it leaves the column as object dtype rather than raising an error. This means parsing silently failed. Treat date parsing as an explicit post-load step with pd.to_datetime and your chosen format strategy, rather than delegating it to the CSV reader.
π€ Share this article
Sign in to saveRelated Articles
How-To Guides
Fixing Python Pandas dropna That Removes Rows With Partial NaN When You Need Complete Cases Only
8m read
How-To Guides
Fixing Python Pandas apply() That Silently Ignores Errors on Axis=1
8m read
How-To Guides
Fixing PostgreSQL COPY FROM That Silently Skips Rows With Null Delimiter Mismatch
10m read
Comments (0)
No comments yet. Be the first!