Fixing Python Script Memory Leaks When Processing Large CSV Files
Your Python script runs fine on a 10 MB test file, but when you point it at the real 3 GB production CSV, memory climbs until the process gets killed. This is one of the most common β and most fixable β problems in data processing scripts. The root cause is almost always the same: you're holding more data in memory than you need to at any given moment.
This guide walks you through diagnosing where memory is going, then gives you practical patterns to stop the leak before it starts.
What you'll learn
- How to profile memory usage in a running Python script
- Why
pandas.read_csv()can silently consume far more RAM than your file size suggests - How to use chunking and generators to process rows without loading everything at once
- Common pitfalls that cause objects to stay alive longer than expected
- Practical patterns for writing memory-safe CSV pipelines
Prerequisites
You should be comfortable writing Python functions and have used pandas at least once. Code examples use Python 3.9+ and pandas 1.5 or later, though the core ideas apply to earlier versions too. Install memory-profiler and tracemalloc (built into the standard library) before you start.
pip install memory-profiler pandas psutilWhy CSV Processing Leaks Memory
When you call pd.read_csv("big_file.csv"), pandas reads every row into a DataFrame that lives in memory for as long as your script holds a reference to it. A 1 GB CSV can expand to 4β6 GB in memory because pandas stores columns as typed arrays with overhead, infers dtypes, and sometimes creates intermediate copies during operations.
The leak isn't always a single call. It compounds: you read a file, transform it, store the result, and never delete the original. Now you have two full copies in RAM. Add a loop that processes batches without clearing previous batches, and memory climbs with every iteration.
Three patterns cause the vast majority of these leaks:
- Loading the entire file at once when you only need one row at a time
- Accumulating results in a list inside a loop without bounding the list's size
- Holding references to DataFrames in long-lived objects like class attributes or module-level variables
Profiling Before You Fix Anything
Guessing where memory goes is slower than measuring. Python gives you two good built-in tools: tracemalloc for tracking allocations and memory-profiler for line-by-line reporting.
Using tracemalloc
Wrap your processing code with tracemalloc snapshots to see which lines allocated the most memory.
import tracemalloc
import pandas as pd
tracemalloc.start()
df = pd.read_csv("large_file.csv")
result = df[df["value"] > 100]
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")
for stat in top_stats[:10]:
print(stat)The output shows filename, line number, and total bytes allocated. Look for lines that show unexpectedly large allocations β those are your targets.
Using memory-profiler
Add the @profile decorator to any function and run it with mprof run. You get a report showing memory at each line of that function.
from memory_profiler import profile
import pandas as pd
@profile
def process_file(path):
df = pd.read_csv(path)
total = df["amount"].sum()
return total
process_file("large_file.csv")python -m memory_profiler your_script.pyRun this on a representative file. The per-line memory delta column tells you exactly which operations are expensive.
Pattern 1: Read in Chunks Instead of All at Once
The simplest fix for a whole-file read is chunking. pandas.read_csv() accepts a chunksize parameter that returns an iterator of smaller DataFrames instead of one giant one.
import pandas as pd
chunk_size = 50_000 # rows per chunk
results = []
for chunk in pd.read_csv("large_file.csv", chunksize=chunk_size):
# Process only this chunk
filtered = chunk[chunk["status"] == "active"]
results.append(filtered[["id", "amount"]].copy())
final_df = pd.concat(results, ignore_index=True)Notice the .copy() on the filtered result. Without it, the filtered slice holds a reference to the full chunk DataFrame, which keeps the entire chunk alive in memory even after the loop variable moves on.
The right chunk size depends on your machine's available RAM and your row width. Start at 50,000 rows and adjust. You want chunks large enough to keep I/O efficient but small enough that you never hold more than a few hundred MB at once.
Pattern 2: Use a Generator for Row-by-Row Processing
If you don't need pandas features at all β you're just reading, transforming, and writing rows β the built-in csv module with a generator is the most memory-efficient option. Only one row exists in memory at a time.
import csv
def read_rows(filepath):
with open(filepath, newline="", encoding="utf-8") as f:
reader = csv.DictReader(f)
for row in reader:
yield row
def process_rows(filepath, output_path):
with open(output_path, "w", newline="", encoding="utf-8") as out:
writer = None
for row in read_rows(filepath):
# Transform the row however you need
row["amount"] = float(row["amount"]) * 1.1
if writer is None:
writer = csv.DictWriter(out, fieldnames=row.keys())
writer.writeheader()
writer.writerow(row)
process_rows("large_file.csv", "output.csv")This pattern scales to files of any size because your memory footprint is constant: one row in, one row out. The tradeoff is that you lose pandas' vectorized operations, so it's slower on computationally heavy transformations.
Pattern 3: Reduce DataFrame Memory with Dtype Optimization
Sometimes you need pandas and you need the whole dataset. In that case, reduce the memory footprint by specifying dtypes explicitly instead of letting pandas infer them.
Pandas will store a column of small integers as int64 by default. If your values never exceed 255, you can use uint8 and use eight times less memory for that column.
import pandas as pd
dtype_map = {
"user_id": "int32",
"status_code": "uint8",
"category": "category", # string columns with few unique values
"amount": "float32",
"description": "string", # uses pandas StringDtype, more efficient than object
}
df = pd.read_csv("large_file.csv", dtype=dtype_map)
print(df.memory_usage(deep=True).sum() / 1024**2, "MB")The "category" dtype is especially effective for string columns where the same values repeat frequently (status labels, country codes, product categories). Instead of storing each string separately, pandas stores an integer code and a lookup table.
Pattern 4: Delete What You No Longer Need
Python's garbage collector frees objects when no references to them remain. The problem is that references linger in ways that aren't obvious.
import pandas as pd
import gc
def process_in_stages(path):
# Stage 1: filter
df = pd.read_csv(path)
filtered = df[df["active"] == 1].copy()
del df # drop the original immediately
gc.collect() # force collection; useful after large deletions
# Stage 2: aggregate
summary = filtered.groupby("region")["amount"].sum().reset_index()
del filtered
gc.collect()
return summaryUse del as soon as you're done with a large object, then call gc.collect() to reclaim the memory immediately rather than waiting for Python's scheduled collection cycle. This matters most inside loops.
Common Pitfalls
Accumulating chunks in a list without limit
Chunking only helps if you process and discard each chunk. If you append every filtered chunk to a list and only concatenate at the end, you still end up with the entire dataset in memory β plus the overhead of the list and temporary DataFrames during concat.
Fix this by writing processed chunks to disk incrementally and doing a final read only if you need the full result for a subsequent step.
import pandas as pd
output_path = "processed.csv"
header_written = False
for chunk in pd.read_csv("large_file.csv", chunksize=50_000):
processed = chunk[chunk["status"] == "active"].copy()
processed.to_csv(
output_path,
mode="a",
header=not header_written,
index=False
)
header_written = TrueString columns loaded as object dtype
Pandas loads text columns as object dtype by default. An object column is essentially a Python list of Python strings β each string is a separate heap allocation with its own reference count overhead. On a column with millions of rows this adds up fast. Use dtype="string" or dtype="category" where appropriate.
Circular references preventing garbage collection
If you store a DataFrame inside a class and that class also stores a reference back to the processing context, you may create a reference cycle. Python's cycle collector handles most of these, but calling gc.collect() after your main processing phase ensures cycles are broken promptly.
Forgetting to close file handles
Always open files with a with block. An unclosed file handle won't cause a memory leak by itself, but in long-running scripts it can prevent the OS from reclaiming associated buffers and contributes to resource exhaustion under load.
Wrapping Up
Memory leaks in CSV processing scripts are usually structural, not mysterious. You load too much at once, hold references too long, or accumulate without flushing. The fixes are straightforward once you know where to look.
Here are your concrete next steps:
- Profile first. Run
memory-profilerortracemallocon your script with a representative file before changing any code. Know which lines are responsible. - Switch to chunked reads. Replace bare
pd.read_csv(path)calls with achunksizeiterator and write results incrementally to disk. - Specify dtypes explicitly. Add a
dtypemap to everyread_csvcall. Use"category"for low-cardinality string columns and smaller int/float types where your data allows. - Delete large objects early. Use
delandgc.collect()after each processing stage, especially inside loops. - Benchmark before and after. Use
psutil.Process().memory_info().rssat the start and end of your script to confirm you've actually reduced peak memory, not just shifted when it spikes.
π€ Share this article
Sign in to saveRelated Articles
Comments (0)
No comments yet. Be the first!