Fixing Silently Corrupt Parquet Files Written by pandas to S3

Parquet has become the default storage format for modern data pipelines. It offers excellent compression, columnar storage, efficient querying, and compatibility with tools like Spark, Athena, DuckDB, Snowflake, BigQuery, and pandas itself.

Yet one of the most frustrating issues data engineers encounter is discovering that Parquet files written by pandas appear successful, but later fail when read by downstream systems. The write operation completes without errors, the file exists in Amazon S3, and metadata appears normal. However, hours or days later, queries fail with mysterious messages such as:

ArrowInvalid: Invalid parquet file

OSError: Couldn't deserialize thrift data

Parquet magic bytes not found

HIVE_CURSOR_ERROR: Corrupted Parquet file

Even worse, some systems may silently skip affected files, causing incomplete analytics and inaccurate reporting without any obvious indication that data was lost.

If your pandas pipeline is generating Parquet files that occasionally become unreadable or inconsistent in S3, this guide will help you identify the root causes and implement reliable fixes.

Understanding the Problem

Most developers assume that once pandas successfully executes:

df.to_parquet(...)

the resulting file is valid.

Unfortunately, multiple components participate in the write process:

pandas
PyArrow
FastParquet
s3fs
boto3
S3 multipart uploads
Network layers
ETL orchestration tools

A failure in any layer can produce partially written or malformed files.

The challenge is that corruption is often not discovered until a downstream consumer attempts to read the file.
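One cheap way to catch a truncated file before a downstream reader does is to check its magic bytes. A Parquet file starts and ends with the 4-byte marker PAR1, which is what the "magic bytes not found" error refers to. The sketch below works on a local file; the helper name has_parquet_magic is an assumption made for illustration, and files written with Parquet encryption may use a different marker and fail this check.

```python
import os

PARQUET_MAGIC = b"PAR1"

def has_parquet_magic(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    # A valid file holds at least both markers plus a 4-byte footer length
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

Passing this check does not prove the file is valid. It only rules out the most common failure, a file that was cut off before the footer was written.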

Common Symptoms

You may encounter:

Athena Query Failures

HIVE_CURSOR_ERROR

Failed to read Parquet file

Spark Errors

org.apache.parquet.io.ParquetDecodingException

PyArrow Exceptions

ArrowInvalid

Missing Rows

The file opens successfully but contains fewer rows than expected.

Inconsistent Schema

Different files in the same partition expose different column types.

Random Failures

One execution succeeds while another fails with identical code.

These symptoms often indicate an issue in the write process rather than the read process.

Root Cause #1: Interrupted Multipart Uploads

Large files uploaded to S3 typically use multipart uploads.

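An upload can be interrupted by:

Network failure
Container restart
Lambda timeout
Kubernetes pod eviction

S3 then keeps the already-uploaded parts as an incomplete multipart upload, but no object appears at the key until the upload is completed. The key is either missing or still holds an older version of the file.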

Example:

df.to_parquet(
    "s3://my-bucket/output/data.parquet"
)

The process can appear complete even though the multipart upload was never finalized, for example when the job is killed before the file is closed or the resulting exception is swallowed.

Fix

Always verify upload completion.

Using boto3:

import boto3

s3 = boto3.client("s3")

response = s3.head_object(
    Bucket="my-bucket",
    Key="output/data.parquet"
)

print(response["ContentLength"])

Compare expected and actual file sizes.
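The size comparison is easiest when the file was written locally and then uploaded, because the local size is the expected value. A minimal sketch, assuming a boto3 S3 client is passed in; the helper name upload_size_matches and the local path in the usage comment are hypothetical.

```python
import os

def upload_size_matches(s3_client, bucket, key, local_path):
    """Compare the size S3 reports for an object with the local file size."""
    # head_object raises an error if the key does not exist at all
    response = s3_client.head_object(Bucket=bucket, Key=key)
    return response["ContentLength"] == os.path.getsize(local_path)

# Usage (needs AWS credentials):
#   import boto3
#   ok = upload_size_matches(
#       boto3.client("s3"), "my-bucket", "output/data.parquet", "/tmp/data.parquet"
#   )
```

Taking the client as a parameter keeps the helper easy to test with a stub and lets it reuse a client that already has retries configured.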

Also configure retries:

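import boto3
from botocore.config import Config

# Allow up to 10 attempts per request, using the adaptive retry mode
config = Config(
    retries={
        "max_attempts": 10,
        "mode": "adaptive"
    }
)

# The config only takes effect once it is passed to the client
s3 = boto3.client("s3", config=config)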

Root Cause #2: Writing Directly to Final Destination

Many pipelines write directly to production locations.

Example:

df.to_parquet(
    "s3://analytics/orders/date=2026-06-18/data.parquet"
)

If the process crashes halfway through:

Readers may discover incomplete files.
AWS Glue crawlers may catalog bad data.
Spark jobs may fail.

Better Approach

Write to a temporary location first.

temp_path = (
    "s3://analytics-temp/orders.parquet"
)

df.to_parquet(temp_path)

Validate the file.

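Then move it into place. S3 has no rename operation, so a move is a copy followed by a delete:

copy_object(...)
delete_object(...)

The copied object becomes visible at the final key only when the copy completes, so readers see either the whole file or nothing.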
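The copy-then-delete step can be sketched as follows, assuming a boto3 S3 client is passed in. The function name publish_object is hypothetical. Note that copy_object handles objects up to 5 GB; larger files need a multipart copy instead.

```python
def publish_object(s3_client, src_bucket, src_key, dst_bucket, dst_key):
    """Copy a validated object to its final key, then remove the temporary copy."""
    source = s3_client.head_object(Bucket=src_bucket, Key=src_key)
    s3_client.copy_object(
        Bucket=dst_bucket,
        Key=dst_key,
        CopySource={"Bucket": src_bucket, "Key": src_key},
    )
    # Confirm the destination has the same size before deleting the source
    published = s3_client.head_object(Bucket=dst_bucket, Key=dst_key)
    if published["ContentLength"] != source["ContentLength"]:
        raise RuntimeError(f"size mismatch after copy to {dst_key}")
    s3_client.delete_object(Bucket=src_bucket, Key=src_key)
```

Deleting the temporary object only after the destination has been checked means a failure at any point leaves the validated source in place for a retry.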

Root Cause #3: Mixing PyArrow Versions

A surprisingly common issue is inconsistent PyArrow versions across environments.

For example:

Developer machine:
PyArrow 14

Production:
PyArrow 10

Schema metadata may differ.

Certain Parquet features introduced in newer versions may not be readable by older consumers.

Verify Versions

import pyarrow

print(pyarrow.__version__)

Standardize versions across:

Local development
CI/CD
Airflow workers
Kubernetes jobs
Spark clusters
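Printing the version helps during debugging, but a pipeline can also refuse to start when the version is not the one you pinned. A minimal sketch using the standard library; the helper name check_major_version and the pinned major version 14 are assumptions for illustration.

```python
from importlib import metadata

def check_major_version(package, expected_major, installed=None):
    """Raise if the installed major version differs from the pinned one."""
    # `installed` can be passed explicitly, which keeps the check easy to test
    version = installed if installed is not None else metadata.version(package)
    major = int(version.split(".")[0])
    if major != expected_major:
        raise RuntimeError(
            f"{package} {version} is installed, expected major version {expected_major}"
        )
    return version

# At job startup, for example:
#   check_major_version("pyarrow", 14)
```

Failing at startup is usually cheaper than finding out later that one worker wrote files the rest of the fleet cannot read.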

Root Cause #4: Concurrent Writes

Imagine multiple workers writing to the same file:

worker_1
worker_2
worker_3

All attempting:

s3://bucket/output/orders.parquet

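S3 replaces objects whole, so the last writer to finish wins and the output of the other workers is silently lost.

Readers may also see a different version of the file depending on when they run.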

Fix

Generate unique filenames.

Example:

import uuid

file_name = (
    f"{uuid.uuid4()}.parquet"
)

Result (file names shortened for readability):

orders/
    a34f.parquet
    c18d.parquet
    f782.parquet

This pattern is standard in modern data lakes.

Root Cause #5: Memory Pressure During Serialization

Parquet generation requires memory.

Large DataFrames can exceed available RAM.

Example:

df.to_parquet(...)

Internally:

Data is serialized.
Row groups are created.
Buffers are compressed.

If memory becomes constrained:

Processes crash.
Containers restart.
Files become incomplete.

Monitor Memory Usage

Using psutil:

import psutil

print(
    psutil.virtual_memory().percent
)

For large datasets, write incrementally.

Root Cause #6: Schema Drift

Schema drift causes many "corruption" reports that are actually compatibility issues.

Example Day 1:

user_id: int64

Day 2:

user_id: string

Athena and Spark may fail when reading the partition.

Example:

df["user_id"] = (
    df["user_id"].astype(str)
)

while older files remain numeric.

Fix

Enforce schemas explicitly.

schema = {
    "user_id": "int64",
    "amount": "float64"
}

df = df.astype(schema)

Validate before writing.
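Validation before writing can be as simple as checking that every expected column is present and then casting. A minimal sketch, reusing the two columns from the schema above; the helper name enforce_schema is an assumption.

```python
import pandas as pd

EXPECTED_DTYPES = {"user_id": "int64", "amount": "float64"}

def enforce_schema(df, expected):
    """Return a copy of df limited to the expected columns and dtypes."""
    missing = set(expected) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    # Select only the expected columns, in a fixed order, then cast.
    # astype raises if a value cannot be converted, e.g. "abc" to int64.
    return df[list(expected)].astype(expected)
```

Dropping unexpected columns is a design choice in this sketch. A stricter pipeline might raise on them instead, so that upstream changes are noticed rather than silently filtered out.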

Root Cause #7: FastParquet vs PyArrow Differences

pandas supports multiple engines:

df.to_parquet(
    path,
    engine="pyarrow"
)

df.to_parquet(
    path,
    engine="fastparquet"
)

Mixing engines across pipelines can create compatibility challenges.

Many organizations standardize on:

engine="pyarrow"

because it is generally more compatible with cloud analytics tools.

Recommended:

df.to_parquet(
    path,
    engine="pyarrow",
    compression="snappy"
)

Root Cause #8: S3 Eventual Consistency Assumptions

While Amazon S3 provides strong read-after-write consistency, applications sometimes introduce timing issues.

Example:

write parquet
immediately query Athena

Metadata systems such as the AWS Glue Data Catalog may not yet reflect new partitions.

Developers often interpret resulting failures as file corruption.

Fix

Validate existence first.

s3.head_object(...)

Then update metadata catalogs.

Validate Every File After Writing

The simplest protection is immediate validation.

Example:

df.to_parquet(path)

Then:

test_df = pd.read_parquet(path)

Verify:

assert len(test_df) == len(df)

This catches truncated or unreadable files immediately, at the cost of reading each file one more time.

Implement Checksum Verification

For mission-critical pipelines, generate checksums.

Example:

import hashlib

with open(
    "data.parquet",
    "rb"
) as f:
    checksum = hashlib.md5(
        f.read()
    ).hexdigest()

Store checksum metadata.

Verify after upload.

This detects corruption introduced during transfer.
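For large files, reading everything into memory just to hash it is wasteful. The sketch below streams the file in fixed-size blocks and uses SHA-256 in place of MD5; both the helper name file_sha256 and the 1 MiB block size are assumptions. Keep in mind that the S3 ETag is not an MD5 of the content for multipart uploads, so compare against a checksum you stored yourself.

```python
import hashlib

def file_sha256(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file without loading it whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read returns b""
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

The same function can run before upload and again on a downloaded copy; matching digests show the bytes survived the round trip.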

Use Row Count Validation

Before writing:

expected_rows = len(df)

After reading:

actual_rows = len(
    pd.read_parquet(path)
)

Compare:

assert expected_rows == actual_rows

Many teams include this validation in Airflow or ETL workflows.

Recommended Production Write Pattern

A robust workflow looks like this:

1. Generate DataFrame
2. Write locally
3. Validate file
4. Upload to S3
5. Verify upload
6. Read back sample
7. Move to production path
8. Update metadata catalog

This approach keeps unvalidated files out of the paths that readers query.

Example:

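import pandas as pd

local_path = "/tmp/orders.parquet"

# Steps 1-2: write locally with a fixed engine and codec
df.to_parquet(
    local_path,
    engine="pyarrow",
    compression="snappy"
)

# Step 3: read the file back and compare row counts
assert len(pd.read_parquet(local_path)) == len(df)

# Steps 4-8: placeholders for your own upload, verification, and publish helpers
upload_to_s3()
verify_upload()
publish()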

Monitoring and Alerting

Prevent future surprises by monitoring:

Upload failures
Retry counts
File size anomalies
Row count mismatches
Schema changes
Athena read failures
Spark read failures

Useful tools include:

CloudWatch
Datadog
Prometheus
Grafana
Airflow monitoring

Alerts often detect corruption long before users notice missing data.

Best Practices Checklist

Before writing Parquet files to S3:

✅ Use PyArrow consistently

✅ Write to temporary paths first

✅ Avoid concurrent writes

✅ Validate row counts

✅ Validate schemas

✅ Read files back after writing

✅ Verify S3 uploads

✅ Monitor upload failures

✅ Use compression consistently

✅ Track file sizes and checksums

Following these practices prevents most corruption scenarios seen in production data pipelines.

Conclusion

Silently corrupt Parquet files are among the most dangerous data engineering problems because they often remain undetected until analytics, dashboards, or machine learning systems begin producing incorrect results. The write operation may appear successful, yet underlying issues such as interrupted uploads, schema drift, concurrent writes, memory pressure, or inconsistent library versions can leave downstream systems struggling to read the data.

The solution is not simply writing Parquet files; it is validating them. By implementing temporary write locations, schema enforcement, upload verification, row count checks, checksum validation, and consistent PyArrow configurations, you can transform a fragile pipeline into a reliable production-grade system.

A few extra validation steps during file creation are far less expensive than discovering weeks later that critical business reports were built on corrupted data.
