Fixing Python Script Memory Leaks When Processing Large CSV Files

May 21, 2026 7 min read 40 views
Minimalist illustration of a data pipeline connecting a CSV file icon to a small memory chip on a soft gradient background

Your Python script runs fine on a 10 MB test file, but when you point it at the real 3 GB production CSV, memory climbs until the process gets killed. This is one of the most common β€” and most fixable β€” problems in data processing scripts. The root cause is almost always the same: you're holding more data in memory than you need to at any given moment.

This guide walks you through diagnosing where memory is going, then gives you practical patterns to stop the leak before it starts.

What you'll learn

  • How to profile memory usage in a running Python script
  • Why pandas.read_csv() can silently consume far more RAM than your file size suggests
  • How to use chunking and generators to process rows without loading everything at once
  • Common pitfalls that cause objects to stay alive longer than expected
  • Practical patterns for writing memory-safe CSV pipelines

Prerequisites

You should be comfortable writing Python functions and have used pandas at least once. Code examples use Python 3.9+ and pandas 1.5 or later, though the core ideas apply to earlier versions too. Install memory-profiler and tracemalloc (built into the standard library) before you start.

pip install memory-profiler pandas psutil

Why CSV Processing Leaks Memory

When you call pd.read_csv("big_file.csv"), pandas reads every row into a DataFrame that lives in memory for as long as your script holds a reference to it. A 1 GB CSV can expand to 4–6 GB in memory because pandas stores columns as typed arrays with overhead, infers dtypes, and sometimes creates intermediate copies during operations.

The leak isn't always a single call. It compounds: you read a file, transform it, store the result, and never delete the original. Now you have two full copies in RAM. Add a loop that processes batches without clearing previous batches, and memory climbs with every iteration.

Three patterns cause the vast majority of these leaks:

  • Loading the entire file at once when you only need one row at a time
  • Accumulating results in a list inside a loop without bounding the list's size
  • Holding references to DataFrames in long-lived objects like class attributes or module-level variables

Profiling Before You Fix Anything

Guessing where memory goes is slower than measuring. Python gives you two good built-in tools: tracemalloc for tracking allocations and memory-profiler for line-by-line reporting.

Using tracemalloc

Wrap your processing code with tracemalloc snapshots to see which lines allocated the most memory.

import tracemalloc
import pandas as pd

tracemalloc.start()

df = pd.read_csv("large_file.csv")
result = df[df["value"] > 100]

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")

for stat in top_stats[:10]:
    print(stat)

The output shows filename, line number, and total bytes allocated. Look for lines that show unexpectedly large allocations β€” those are your targets.

Using memory-profiler

Add the @profile decorator to any function and run it with mprof run. You get a report showing memory at each line of that function.

from memory_profiler import profile
import pandas as pd

@profile
def process_file(path):
    df = pd.read_csv(path)
    total = df["amount"].sum()
    return total

process_file("large_file.csv")
python -m memory_profiler your_script.py

Run this on a representative file. The per-line memory delta column tells you exactly which operations are expensive.

Pattern 1: Read in Chunks Instead of All at Once

The simplest fix for a whole-file read is chunking. pandas.read_csv() accepts a chunksize parameter that returns an iterator of smaller DataFrames instead of one giant one.

import pandas as pd

chunk_size = 50_000  # rows per chunk
results = []

for chunk in pd.read_csv("large_file.csv", chunksize=chunk_size):
    # Process only this chunk
    filtered = chunk[chunk["status"] == "active"]
    results.append(filtered[["id", "amount"]].copy())

final_df = pd.concat(results, ignore_index=True)

Notice the .copy() on the filtered result. Without it, the filtered slice holds a reference to the full chunk DataFrame, which keeps the entire chunk alive in memory even after the loop variable moves on.

The right chunk size depends on your machine's available RAM and your row width. Start at 50,000 rows and adjust. You want chunks large enough to keep I/O efficient but small enough that you never hold more than a few hundred MB at once.

Pattern 2: Use a Generator for Row-by-Row Processing

If you don't need pandas features at all β€” you're just reading, transforming, and writing rows β€” the built-in csv module with a generator is the most memory-efficient option. Only one row exists in memory at a time.

import csv

def read_rows(filepath):
    with open(filepath, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row

def process_rows(filepath, output_path):
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = None
        for row in read_rows(filepath):
            # Transform the row however you need
            row["amount"] = float(row["amount"]) * 1.1

            if writer is None:
                writer = csv.DictWriter(out, fieldnames=row.keys())
                writer.writeheader()
            writer.writerow(row)

process_rows("large_file.csv", "output.csv")

This pattern scales to files of any size because your memory footprint is constant: one row in, one row out. The tradeoff is that you lose pandas' vectorized operations, so it's slower on computationally heavy transformations.

Pattern 3: Reduce DataFrame Memory with Dtype Optimization

Sometimes you need pandas and you need the whole dataset. In that case, reduce the memory footprint by specifying dtypes explicitly instead of letting pandas infer them.

Pandas will store a column of small integers as int64 by default. If your values never exceed 255, you can use uint8 and use eight times less memory for that column.

import pandas as pd

dtype_map = {
    "user_id": "int32",
    "status_code": "uint8",
    "category": "category",   # string columns with few unique values
    "amount": "float32",
    "description": "string",  # uses pandas StringDtype, more efficient than object
}

df = pd.read_csv("large_file.csv", dtype=dtype_map)
print(df.memory_usage(deep=True).sum() / 1024**2, "MB")

The "category" dtype is especially effective for string columns where the same values repeat frequently (status labels, country codes, product categories). Instead of storing each string separately, pandas stores an integer code and a lookup table.

Pattern 4: Delete What You No Longer Need

Python's garbage collector frees objects when no references to them remain. The problem is that references linger in ways that aren't obvious.

import pandas as pd
import gc

def process_in_stages(path):
    # Stage 1: filter
    df = pd.read_csv(path)
    filtered = df[df["active"] == 1].copy()
    del df          # drop the original immediately
    gc.collect()    # force collection; useful after large deletions

    # Stage 2: aggregate
    summary = filtered.groupby("region")["amount"].sum().reset_index()
    del filtered
    gc.collect()

    return summary

Use del as soon as you're done with a large object, then call gc.collect() to reclaim the memory immediately rather than waiting for Python's scheduled collection cycle. This matters most inside loops.

Common Pitfalls

Accumulating chunks in a list without limit

Chunking only helps if you process and discard each chunk. If you append every filtered chunk to a list and only concatenate at the end, you still end up with the entire dataset in memory β€” plus the overhead of the list and temporary DataFrames during concat.

Fix this by writing processed chunks to disk incrementally and doing a final read only if you need the full result for a subsequent step.

import pandas as pd

output_path = "processed.csv"
header_written = False

for chunk in pd.read_csv("large_file.csv", chunksize=50_000):
    processed = chunk[chunk["status"] == "active"].copy()
    processed.to_csv(
        output_path,
        mode="a",
        header=not header_written,
        index=False
    )
    header_written = True

String columns loaded as object dtype

Pandas loads text columns as object dtype by default. An object column is essentially a Python list of Python strings β€” each string is a separate heap allocation with its own reference count overhead. On a column with millions of rows this adds up fast. Use dtype="string" or dtype="category" where appropriate.

Circular references preventing garbage collection

If you store a DataFrame inside a class and that class also stores a reference back to the processing context, you may create a reference cycle. Python's cycle collector handles most of these, but calling gc.collect() after your main processing phase ensures cycles are broken promptly.

Forgetting to close file handles

Always open files with a with block. An unclosed file handle won't cause a memory leak by itself, but in long-running scripts it can prevent the OS from reclaiming associated buffers and contributes to resource exhaustion under load.

Wrapping Up

Memory leaks in CSV processing scripts are usually structural, not mysterious. You load too much at once, hold references too long, or accumulate without flushing. The fixes are straightforward once you know where to look.

Here are your concrete next steps:

  1. Profile first. Run memory-profiler or tracemalloc on your script with a representative file before changing any code. Know which lines are responsible.
  2. Switch to chunked reads. Replace bare pd.read_csv(path) calls with a chunksize iterator and write results incrementally to disk.
  3. Specify dtypes explicitly. Add a dtype map to every read_csv call. Use "category" for low-cardinality string columns and smaller int/float types where your data allows.
  4. Delete large objects early. Use del and gc.collect() after each processing stage, especially inside loops.
  5. Benchmark before and after. Use psutil.Process().memory_info().rss at the start and end of your script to confirm you've actually reduced peak memory, not just shifted when it spikes.

πŸ“€ Share this article

Sign in to save

Comments (0)

No comments yet. Be the first!

Leave a Comment

Sign in to comment with your profile.

πŸ“¬ Weekly Newsletter

Stay ahead of the curve

Get the best programming tutorials, data analytics tips, and tool reviews delivered to your inbox every week.

No spam. Unsubscribe anytime.