Fixing Silently Corrupt Parquet Files Written by pandas to S3
You write a Parquet file to S3, the upload finishes cleanly, and no exception is raised. Two hours later, a downstream job fails with a cryptic ArrowInvalid error or, worse, silently returns wrong results. The file looks fine from the outside β it has a size, a modification time, and a name ending in .parquet. But it's broken in a way that only shows up when something actually tries to read it.
This is one of the more frustrating classes of bugs in data engineering because the feedback loop is long and the error messages rarely point at the real cause. This article walks you through the mechanisms behind silent Parquet corruption, the tools you can use to catch it, and the code patterns that prevent it.
What You'll Learn
- Why Parquet files written through pandas + S3 can be corrupt without raising an error at write time
- How to validate a Parquet file before downstream jobs consume it
- Common schema and encoding pitfalls with PyArrow and fastparquet
- Safe write patterns for S3 using atomic uploads and checksums
- How to build a lightweight validation step into your pipeline
Prerequisites
You should be comfortable with pandas DataFrames, have basic familiarity with AWS S3 (boto3 or s3fs), and have PyArrow installed. The examples use Python 3.10+, pandas 2.x, and PyArrow 14+. The concepts apply equally if you're using fastparquet, though the API calls differ slightly.
Why Parquet Corruption Happens Silently
Parquet is a columnar format with a footer that contains schema and row-group metadata. When you write a file, the footer is written last. If the write process is interrupted after some row groups are flushed but before the footer is finalized, you get a file with valid-looking bytes at the start and a corrupt or missing footer at the end.
The S3 multipart upload API makes this worse. A multipart upload that is never completed leaves behind parts that are invisible to most S3 listing tools, but a partially completed upload can occasionally result in a zero-byte object or a truncated object depending on how your client handles the abort. Neither boto3 nor s3fs always raises an exception in these cases.
There is also a second failure mode that has nothing to do with network issues: schema drift. When you write multiple DataFrames to separate Parquet files that are later read as a dataset, even a single column whose dtype shifted from int64 to float64 across batches can cause a reader to fail or, more dangerously, silently coerce values.
Diagnosing the Problem
Check the footer first
PyArrow gives you a fast way to read only the Parquet metadata without pulling the full file into memory. If the footer is corrupt, this call will fail immediately.
import pyarrow.parquet as pq
import s3fs
fs = s3fs.S3FileSystem()
def check_parquet_footer(s3_path: str) -> bool:
try:
with fs.open(s3_path, "rb") as f:
meta = pq.read_metadata(f)
print(f"Rows: {meta.num_rows}, Row groups: {meta.num_row_groups}")
return True
except Exception as e:
print(f"Footer check failed for {s3_path}: {e}")
return False
This does not guarantee the data columns are intact, but it rules out the most common truncation failures immediately and costs only a few hundred milliseconds per file.
Validate the schema against a reference
If you own the schema, pin it and compare every file against it at write time or at the start of a downstream job.
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs
EXPECTED_SCHEMA = pa.schema([
pa.field("user_id", pa.int64()),
pa.field("event_ts", pa.timestamp("us", tz="UTC")),
pa.field("amount", pa.float64()),
])
def validate_schema(s3_path: str, fs: s3fs.S3FileSystem) -> bool:
with fs.open(s3_path, "rb") as f:
actual = pq.read_schema(f)
if actual != EXPECTED_SCHEMA:
print(f"Schema mismatch in {s3_path}")
print(f" Expected: {EXPECTED_SCHEMA}")
print(f" Got: {actual}")
return False
return True
Run this before any job that reads the file. Catching a float64 where you expect int64 early saves you from silent numeric coercion downstream.
Read a sample of actual row data
Footer and schema checks are fast but shallow. For a deeper check, read the first row group and verify row count and a few column statistics.
def spot_check_data(s3_path: str, fs: s3fs.S3FileSystem) -> bool:
try:
with fs.open(s3_path, "rb") as f:
pf = pq.ParquetFile(f)
first_batch = next(pf.iter_batches(batch_size=1000))
df = first_batch.to_pandas()
assert len(df) > 0, "First batch is empty"
assert df["user_id"].notna().all(), "Unexpected nulls in user_id"
return True
except Exception as e:
print(f"Spot check failed: {e}")
return False
This is particularly useful after a write that processed a large number of rows, where the chance of a row-group-level issue is higher.
Common Root Causes and Their Fixes
Interrupted multipart uploads
The most common cause of truncated files is a process that dies mid-upload. The default s3fs and boto3 multipart thresholds mean any file above around 100 MB will use multipart. If your process is killed by a timeout or OOM, the parts are uploaded but the CompleteMultipartUpload call never happens.
The fix is to always write to a temporary key and rename (copy + delete) atomically only after the write succeeds. S3 does not have a true atomic rename, but a copy-then-delete is safe enough for most pipelines because the destination key either exists completely or does not exist at all.
import boto3
import io
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
def write_parquet_atomic(df: pd.DataFrame, bucket: str, final_key: str) -> None:
staging_key = final_key + ".staging"
s3 = boto3.client("s3")
# Write to staging key
buffer = io.BytesIO()
table = pa.Table.from_pandas(df, schema=EXPECTED_SCHEMA)
pq.write_table(table, buffer)
buffer.seek(0)
s3.put_object(Bucket=bucket, Key=staging_key, Body=buffer.getvalue())
# Validate staging file before promoting
staging_path = f"s3://{bucket}/{staging_key}"
fs = s3fs.S3FileSystem()
if not check_parquet_footer(staging_path) or not validate_schema(staging_path, fs):
s3.delete_object(Bucket=bucket, Key=staging_key)
raise ValueError(f"Validation failed for staged file: {staging_path}")
# Promote: copy to final key, then delete staging
s3.copy_object(
Bucket=bucket,
CopySource={"Bucket": bucket, "Key": staging_key},
Key=final_key,
)
s3.delete_object(Bucket=bucket, Key=staging_key)
This pattern means a downstream reader will never see a half-written file at the final key.
Mixed Python object columns
When pandas writes a column with dtype object that contains a mix of strings and None, PyArrow infers the type on the first non-null value. If the first value happens to be a string, everything is fine. If it's None, PyArrow may infer null type for the entire column, which then fails to cast correctly on read.
Always cast your columns explicitly before writing. Do not rely on PyArrow's type inference for nullable columns.
import pandas as pd
import pyarrow as pa
# Bad: lets PyArrow guess
table = pa.Table.from_pandas(df)
# Good: provide an explicit schema
schema = pa.schema([
pa.field("name", pa.string()),
pa.field("score", pa.float64()),
])
table = pa.Table.from_pandas(df, schema=schema)
Timestamps without timezone information
pandas stores timezone-naive timestamps as datetime64[ns]. PyArrow, by default, writes these as timestamp[us] without a timezone. When a reader like Spark or Athena encounters this column, it may interpret the values as local time or UTC depending on its own defaults, which means the same file can return different results in different environments.
Normalize all timestamps to UTC before writing:
df["event_ts"] = pd.to_datetime(df["event_ts"]).dt.tz_localize("UTC")
If the column is already timezone-aware but in a non-UTC zone, convert it:
df["event_ts"] = df["event_ts"].dt.tz_convert("UTC")
Integer columns with NaN values
Standard pandas integer columns (int64) cannot hold NaN. When a merge or groupby introduces missing values into an integer column, pandas silently promotes it to float64. The Parquet file is written with float column, and downstream integer comparisons start returning unexpected results.
Use pandas nullable integer types (Int64 with a capital I) for columns that may have missing values:
df["user_id"] = df["user_id"].astype("Int64") # pandas nullable integer
PyArrow maps Int64 correctly to int64 with a null bitmap in Parquet, preserving the intended type.
Adding an S3 Content Checksum
S3 supports MD5 and SHA-256 checksums on PutObject and multipart uploads. If you supply a checksum, S3 validates it on receipt and rejects the object if it does not match. This catches corruption introduced by the network or S3 itself, which is rare but not impossible.
import hashlib
import base64
def put_with_checksum(s3_client, bucket: str, key: str, data: bytes) -> None:
md5 = base64.b64encode(hashlib.md5(data).digest()).decode()
s3_client.put_object(
Bucket=bucket,
Key=key,
Body=data,
ContentMD5=md5,
)
For large files using multipart, boto3's S3Transfer handles checksums at the part level if you pass ChecksumAlgorithm="SHA256" to create_multipart_upload. Check the boto3 documentation for the current multipart checksum API, as this interface has changed across versions.
Building Validation Into Your Pipeline
Running all of these checks ad hoc is useful for debugging, but what you actually want is a validation step baked into your write function that fails loudly and early. Here is a minimal validation harness:
from dataclasses import dataclass
from typing import Optional
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs
@dataclass
class ParquetValidator:
expected_schema: pa.Schema
min_rows: int = 1
fs: Optional[s3fs.S3FileSystem] = None
def __post_init__(self):
if self.fs is None:
self.fs = s3fs.S3FileSystem()
def validate(self, s3_path: str) -> None:
"""Raises ValueError on any validation failure."""
with self.fs.open(s3_path, "rb") as f:
meta = pq.read_metadata(f)
if meta.num_rows < self.min_rows:
raise ValueError(
f"{s3_path}: expected >= {self.min_rows} rows, got {meta.num_rows}"
)
with self.fs.open(s3_path, "rb") as f:
actual_schema = pq.read_schema(f)
if actual_schema != self.expected_schema:
raise ValueError(
f"{s3_path}: schema mismatch.\n"
f"Expected: {self.expected_schema}\nGot: {actual_schema}"
)
Call validator.validate(s3_path) immediately after your atomic write completes. If it raises, your pipeline fails fast at the write stage rather than silently propagating bad data.
Common Pitfalls to Watch For
- Using
df.to_parquet()directly with an S3 path β this bypasses your staging and validation logic. Wrap it instead of calling it directly from production code. - Appending to an existing Parquet dataset without schema enforcement β
pq.write_to_datasetwithschemaomitted will infer the schema from each batch independently. Pass your pinned schema explicitly. - Assuming file size equals correctness β a corrupt file can be exactly the same size as a valid one if the corruption is in column values rather than structure.
- Not cleaning up staging keys on failure β add a
try/finallyblock around your atomic write to delete the staging key even if validation fails. - Running validation in the same process that wrote the file β if the process is at risk of OOM, the validator may also fail. Consider running validation in a separate lightweight Lambda or Step Function task.
Wrapping Up
Silent Parquet corruption in S3 pipelines is almost always preventable. The fixes come down to a small set of habits: write explicitly typed schemas, normalize timestamps and nullable integers before writing, use staging keys and promote only after validation, and add a checksum to catch network-level corruption.
Here are your concrete next steps:
- Audit your existing write code and identify any call to
to_parquetorpq.write_tablethat does not supply an explicit schema. Fix those first. - Add a
ParquetValidator(or equivalent) call after every file write in your pipelines. Fail fast at the source rather than at the consumer. - Switch to the staging-key pattern for any file above 50 MB to eliminate partial-upload races.
- Run
pq.read_metadata()against your existing S3 Parquet files as a batch job to surface any files already in a broken state before a downstream job finds them for you. - Pin your PyArrow and pandas versions in your environment. Type inference behavior has changed across minor releases, and version drift between writers and readers is a frequent cause of schema mismatches in shared pipelines.
π€ Share this article
Sign in to saveRelated Articles
Comments (0)
No comments yet. Be the first!