Fixing Python CSV DictReader Skipping Rows When Encoding Is Wrong
You open a CSV file with csv.DictReader, loop over the rows, and something is clearly off: the first key has garbage characters prepended, rows are missing, or values show up under None instead of the column name you expected. The file looks perfectly fine in Excel. The problem is almost always encoding.
Python's csv module does not auto-detect encoding. It reads bytes and interprets them based on whatever encoding you (or the default) told it to use. When that assumption is wrong, the results range from subtle key corruption to silently dropped rows.
What you'll learn
- Why wrong encoding causes DictReader to skip or mangle rows
- How to detect the actual encoding of any CSV file
- How to open files correctly, including UTF-8 with BOM
- How to handle multi-encoding pipelines without crashing
- Common gotchas and how to avoid them
Prerequisites
You need Python 3.6 or later. The chardet library is used in Steps 1 and 4; install it with pip install chardet. Basic familiarity with reading files in Python is assumed.
Why Encoding Breaks DictReader
When Python opens a file in text mode (the default), it decodes bytes into Unicode strings. If the declared encoding does not match what's actually in the file, one of two things happens. With the default strict error handling, the decoder raises a UnicodeDecodeError as soon as it meets a byte sequence that is invalid in the declared encoding. Or, worse, every byte happens to be valid in the declared encoding (typical when UTF-8 data is read as latin-1 or Windows-1252), and the decoder silently produces the wrong characters. Substitution with the Unicode replacement character U+FFFD happens only when you opt in with errors='replace'. The silent cases are the dangerous ones because your code keeps running with corrupted data.
With DictReader, the first row is read as the header. If that row is corrupted, every subsequent row gets matched against the wrong keys. The rows are not technically skipped: they are stored under mangled or unexpected key names, so your downstream row['column_name'] lookups raise KeyError and row.get('column_name') returns None. It looks like rows are missing when they are actually there under broken keys.
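Both failure modes are easy to reproduce in memory. The following is a minimal sketch, assuming a made-up two-column sample (the column names and values are purely illustrative): reading UTF-8 bytes as Windows-1252 succeeds but mangles the header, while reading Windows-1252 bytes as UTF-8 fails loudly.

```python
import csv
import io

# Hypothetical sample: a UTF-8 encoded CSV with an accented header.
utf8_bytes = "prénom,ville\nZoé,Orléans\n".encode("utf-8")

# Wrong guess 1: cp1252 accepts these bytes, so no exception is raised,
# but each accented character turns into two wrong characters.
wrong_text = utf8_bytes.decode("cp1252")
rows = list(csv.DictReader(io.StringIO(wrong_text, newline="")))
# The row is present, but its keys are now 'prÃ©nom' and 'ville'.
lookup = rows[0].get("prénom")  # None: the expected key does not exist

# Wrong guess 2: cp1252 bytes are not valid UTF-8, so strict decoding fails.
cp1252_bytes = "prénom,ville\n".encode("cp1252")
try:
    cp1252_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

The first case is the one that looks like missing data: the row count is right, but nothing can be found under the column name you expect.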
A special case is UTF-8 with BOM (UTF-8-SIG). Files saved by Excel on Windows often include a three-byte BOM at the start. If you open them as plain utf-8, the BOM ends up glued to the first column name. You get a key like '\ufeffid' instead of 'id', so row['id'] raises KeyError and row.get('id') quietly returns None.
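Here is a small sketch of the BOM problem, assuming a hand-written byte string that stands in for an Excel export (the contents are hypothetical):

```python
import csv
import io

# Hypothetical sample: UTF-8 BOM (EF BB BF) followed by a two-column CSV.
bom_bytes = b"\xef\xbb\xbfid,name\r\n1,Ada\r\n"

# Decoding as plain utf-8 keeps the BOM as U+FEFF in front of the header.
text = bom_bytes.decode("utf-8")
reader = csv.DictReader(io.StringIO(text, newline=""))
first_row = next(reader)

print(reader.fieldnames)      # ['\ufeffid', 'name']
print("id" in first_row)      # False
print(first_row["\ufeffid"])  # 1
```

The value is there, but only under the key that includes the invisible U+FEFF character.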
Step 1: Identify the File's Actual Encoding
Before writing any fix, confirm what encoding the file actually uses. The chardet library inspects the raw bytes and makes a best-guess prediction.
import chardet

with open('data.csv', 'rb') as f:
    raw = f.read()

result = chardet.detect(raw)
print(result)
# Example output:
# {'encoding': 'UTF-8-SIG', 'confidence': 0.99, 'language': ''}
Read the file in binary mode ('rb') so Python does not try to decode it first. chardet.detect returns a dict with the detected encoding and a confidence score. A confidence above 0.90 is generally reliable for common encodings like UTF-8, UTF-8-SIG, and latin-1.
If the file is large, you can sample just the first chunk rather than reading everything into memory:
import chardet

with open('data.csv', 'rb') as f:
    sample = f.read(50_000)  # first 50 KB is usually enough

result = chardet.detect(sample)
print(result['encoding'])
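If all you need to know is whether a BOM is present, the standard library is enough. This is a sketch of a hypothetical helper, sniff_bom, that is not part of chardet or the csv module; it only recognises files that start with a BOM and returns None for everything else, so it complements statistical detection rather than replacing it.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw):
    """Return a codec name if raw starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None


print(sniff_bom(b"\xef\xbb\xbfid,name\n"))  # utf-8-sig
print(sniff_bom(b"id,name\n"))              # None
```

The returned names are codecs that consume the BOM themselves, so they can be passed straight to open() as the encoding argument.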
Step 2: Open the File With the Correct Encoding
Once you know the encoding, pass it explicitly to open(). Never rely on the system default, which varies by platform and locale.
import csv

with open('data.csv', newline='', encoding='utf-8-sig') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)
A few things to note here. The csv module documentation recommends newline='' for any file object passed to the reader: it stops universal newline translation from interfering with CSV line endings. Using utf-8-sig strips the BOM automatically if one is present, and reads the file normally if there is no BOM, so it is safe to use as a default for Excel-generated files.
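To check the claim that utf-8-sig is safe either way, here is a minimal sketch that parses the same made-up data with and without a BOM. The read_header helper is hypothetical and exists only for this demonstration; it wraps in-memory bytes so no file is needed.

```python
import csv
import io


def read_header(raw):
    """Decode raw CSV bytes as utf-8-sig and return the header names."""
    stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig", newline="")
    return csv.DictReader(stream).fieldnames


with_bom = b"\xef\xbb\xbfid,name\r\n1,Ada\r\n"
without_bom = b"id,name\r\n1,Ada\r\n"

print(read_header(with_bom))     # ['id', 'name']
print(read_header(without_bom))  # ['id', 'name']
```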
If chardet reported latin-1 or windows-1252 (common for older European exports), use that encoding instead:
import csv

with open('data.csv', newline='', encoding='windows-1252') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)
Step 3: Handle Decode Errors Gracefully
Sometimes you are processing files from untrusted sources and you cannot guarantee the encoding is clean. The errors parameter of open() controls what happens when a byte sequence cannot be decoded.
import csv

with open('data.csv', newline='', encoding='utf-8', errors='replace') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)
The 'replace' option substitutes the Unicode replacement character for any undecodable byte, keeping the row intact with a visible marker. Use 'ignore' if you want to drop the bad bytes entirely, but be careful: silently removing characters from values can corrupt data in ways that are hard to trace later.
A third option, 'backslashreplace', encodes bad bytes as escape sequences like \x9c, which makes it easy to spot and log them without crashing or silently corrupting values.
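The three handlers are easiest to compare on a single bad byte. A minimal sketch, assuming a made-up value that was encoded as latin-1 and is therefore not valid UTF-8:

```python
# "café" encoded as latin-1; the final byte 0xE9 is invalid as UTF-8.
raw = b"caf\xe9"

replaced = raw.decode("utf-8", errors="replace")          # "caf" + U+FFFD
ignored = raw.decode("utf-8", errors="ignore")            # "caf"
escaped = raw.decode("utf-8", errors="backslashreplace")  # "caf" + literal \xe9

for label, value in [("replace", replaced), ("ignore", ignored),
                     ("backslashreplace", escaped)]:
    print(f"{label:>16}: {value!a}")
```

Only 'backslashreplace' preserves which byte was bad, which helps when you later need to work out what the original encoding was.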
Step 4: Auto-Detect Encoding for Unknown Files
When you are building a pipeline that ingests files from multiple sources, hard-coding an encoding is fragile. A robust approach auto-detects the encoding and falls back to UTF-8 if detection fails.
import csv
import chardet

def read_csv_safe(filepath):
    # Detect encoding from a sample of the raw bytes
    with open(filepath, 'rb') as f:
        raw_sample = f.read(50_000)
    detected = chardet.detect(raw_sample)
    encoding = detected.get('encoding') or 'utf-8'
    confidence = detected.get('confidence') or 0
    if confidence < 0.7:
        print(f"Warning: low confidence ({confidence:.0%}) on encoding '{encoding}'. Defaulting to utf-8.")
        encoding = 'utf-8'
    # errors='replace' keeps rows readable even if a few bytes are bad
    with open(filepath, newline='', encoding=encoding, errors='replace') as f:
        reader = csv.DictReader(f)
        return list(reader)

rows = read_csv_safe('data.csv')
if rows:  # guard against a file with a header but no data rows
    print(f"Loaded {len(rows)} rows with keys: {list(rows[0].keys())}")
This function reads a sample, checks confidence, falls back to UTF-8 when uncertain, and uses errors='replace' as a safety net. It is not perfect (no auto-detection strategy is), but it is far more reliable than letting Python silently mangle your data.
Step 5: Verify Your Headers After Opening
Once the file is open, do a quick sanity check on the field names before processing any rows. This catches encoding issues that survived detection and also flags CSV structure problems like extra columns.
import csv

expected_fields = {'id', 'name', 'email', 'amount'}

with open('data.csv', newline='', encoding='utf-8-sig') as f:
    reader = csv.DictReader(f)
    actual_fields = set(reader.fieldnames or [])
    missing = expected_fields - actual_fields
    unexpected = actual_fields - expected_fields
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    if unexpected:
        print(f"Warning: unexpected columns found: {unexpected}")
    for row in reader:
        # safe to process now
        print(row['id'], row['name'])
Checking reader.fieldnames before iterating is a lightweight assertion that pays off every time a file comes in with a BOM-corrupted header or an extra column inserted by an upstream system.
Common Pitfalls
Forgetting newline=''
Opening a CSV in text mode without newline='' lets Python's universal newline handling translate \r\n and lone \r into \n before the csv module sees the data. Line breaks embedded inside quoted fields are then not preserved correctly, and when writing on Windows an extra \r is added to each line ending. Always pass newline='' and let the csv module handle line endings.
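The effect on reading can be seen with a quoted field that contains a line break. This is a minimal sketch using in-memory bytes; the read_notes helper and the sample data are hypothetical, and io.TextIOWrapper is used here because its newline argument behaves like the one on open().

```python
import csv
import io

# Hypothetical sample: a quoted field containing a Windows line break.
raw = b'id,note\r\n1,"line one\r\nline two"\r\n'


def read_notes(raw, newline):
    """Parse raw with the given newline mode and return the note values."""
    stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline=newline)
    return [row["note"] for row in csv.DictReader(stream)]


preserved = read_notes(raw, newline="")     # csv sees the original \r\n
translated = read_notes(raw, newline=None)  # default mode: \r\n becomes \n

print(repr(preserved[0]))   # 'line one\r\nline two'
print(repr(translated[0]))  # 'line one\nline two'
```

Both calls return one row, but only newline='' hands back the field exactly as it was stored.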
Trusting Excel's "Save as CSV"
Excel on Windows saves CSV files in either UTF-8 with BOM or in the system locale encoding (often Windows-1252 on Western systems). If you open the file as plain utf-8, you will either hit the BOM problem or a decode error on any accented characters. Use utf-8-sig as your default for Excel exports.
Using io.StringIO With Raw Bytes
If you are reading CSV data from an API response or a database blob, you might be tempted to wrap the raw bytes in io.StringIO. That will fail with a TypeError because StringIO expects strings, not bytes. Decode the bytes first, then wrap:
import csv
import io
# response.content is bytes from an HTTP response
decoded = response.content.decode('utf-8-sig')
reader = csv.DictReader(io.StringIO(decoded))
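For a version you can run without an HTTP client, here is a sketch in which a hypothetical payload variable stands in for response.content. It shows the TypeError, then an alternative to decoding up front: wrapping the bytes in io.BytesIO and io.TextIOWrapper so decoding happens as the reader consumes lines.

```python
import csv
import io

# Hypothetical payload standing in for response.content.
payload = b"\xef\xbb\xbfid,name\r\n1,Ada\r\n"

error = None
try:
    io.StringIO(payload)  # StringIO rejects bytes
except TypeError as exc:
    error = exc

# Wrap the bytes instead; utf-8-sig strips the BOM while decoding.
stream = io.TextIOWrapper(io.BytesIO(payload), encoding="utf-8-sig", newline="")
rows = list(csv.DictReader(stream))
print(rows[0]["id"], rows[0]["name"])  # 1 Ada
```

Decoding first, as in the snippet above, is simpler for small responses; the wrapper mostly pays off when the payload is large.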
Encoding Detection on Very Short Files
chardet needs enough bytes to make a confident prediction. Files under a few hundred bytes often come back with low confidence. For tiny files, just try utf-8-sig first and fall back to latin-1 on failure. latin-1 decodes any byte sequence without raising an error, making it a reliable last resort (though values may still look wrong if the original encoding was something else).
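That try-then-fall-back approach fits in a few lines. A minimal sketch, assuming a hypothetical helper name, decode_with_fallback, that returns both the text and the encoding it ended up using so the caller can log it:

```python
def decode_with_fallback(raw):
    """Try utf-8-sig first; fall back to latin-1, which never raises."""
    try:
        return raw.decode("utf-8-sig"), "utf-8-sig"
    except UnicodeDecodeError:
        return raw.decode("latin-1"), "latin-1"


text, used = decode_with_fallback(b"id,name\n1,caf\xe9\n")
print(used)  # latin-1
```

Logging the encoding that was used makes it much easier to trace odd-looking values back to a fallback decode later.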
Mixed Encodings in One File
Occasionally you will encounter a file where different sections were written by different tools with different encodings. No standard library function handles this gracefully. Your best option is to read the file in binary mode, split it into lines manually, and decode each line individually with error handling. This is rare but worth knowing about when chardet confidence fluctuates between rows.
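The line-by-line approach might look like the sketch below. It assumes the two encodings involved are UTF-8 and latin-1, which is an illustrative guess rather than something you can know in general, and both helper names are hypothetical. Two caveats: a latin-1 line whose bytes happen to form valid UTF-8 will be decoded as UTF-8, and a quoted field that spans several lines may be decoded with different encodings on each line.

```python
import csv


def decode_line(line):
    """Decode one line as UTF-8, falling back to latin-1 (assumed here)."""
    try:
        return line.decode("utf-8")
    except UnicodeDecodeError:
        return line.decode("latin-1")


def read_mixed_csv(raw):
    """Decode raw bytes line by line, then parse the decoded lines as CSV."""
    lines = [decode_line(line) for line in raw.splitlines(keepends=True)]
    if lines:
        lines[0] = lines[0].lstrip("\ufeff")  # drop a leading BOM if present
    # csv.DictReader accepts any iterable of strings, not just file objects.
    return list(csv.DictReader(lines))


# Hypothetical file: one row written as UTF-8, the next as latin-1.
raw = b"id,name\n" + "1,Zoé\n".encode("utf-8") + "2,José\n".encode("latin-1")
rows = read_mixed_csv(raw)
print([row["name"] for row in rows])  # ['Zoé', 'José']
```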
Wrapping Up
Encoding mismatches are one of the quietest bugs in data pipelines: no exception, just wrong results. Here are the concrete steps to take right now:
- Run chardet.detect on any CSV you did not create yourself to confirm its encoding before writing a single line of parsing code.
- Always pass newline='' and an explicit encoding argument to open() when reading CSV files.
- Use encoding='utf-8-sig' as your default for any file that may have come from Excel on Windows.
- Assert on reader.fieldnames before processing rows so encoding corruption is caught immediately, not three hours into a data run.
- For automated pipelines with mixed sources, wrap the file-open logic in a helper function that auto-detects encoding and falls back safely rather than scattering encoding assumptions across your codebase.