Pandas read_sql Returning Stale or Mismatched Data: Connection and Query Pitfalls
You run pd.read_sql() and get a DataFrame that looks perfectly formatted β column names match, dtypes look sane, row count seems plausible. But the numbers are wrong. Or worse, you're looking at data from three hours ago and you don't know it yet. This kind of silent failure is far more common than an outright exception, which makes it genuinely dangerous in production pipelines.
The root causes almost always trace back to connection lifecycle issues, transaction isolation, or subtle query parameter mistakes. None of them raise an error; they just quietly return the wrong answer.
What you'll learn
- Why a reused or stale connection can return cached or uncommitted data
- How SQLAlchemy connection pooling interacts with
read_sql - The difference between passing a raw
DBAPIconnection versus an engine - How transaction isolation levels cause you to read old rows
- Practical patterns for reliable, repeatable SQL reads in pandas
Prerequisites
This article assumes you're working with pandas 1.3 or later and SQLAlchemy 1.4 or 2.x. The examples use PostgreSQL, but the same principles apply to MySQL, SQLite, and most other relational databases. You should be comfortable writing basic SQL and have a working database connection already configured.
How read_sql Actually Works
pd.read_sql() is a thin wrapper. You hand it a SQL string (or a SQLAlchemy text object) and a connection, and it executes the query, fetches all rows, and stuffs them into a DataFrame. That simplicity is part of the appeal, but it also means pandas does almost no connection management on your behalf.
Under the hood, pandas calls read_sql_query or read_sql_table depending on what you pass. The connection object you provide is used as-is. Pandas does not open a fresh connection for every call, does not automatically commit or rollback, and does not check whether the connection is healthy before using it.
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")
df = pd.read_sql("SELECT * FROM orders WHERE status = 'pending'", engine)
print(df.shape)
This works fine in a simple script. The problems start when that engine or connection object gets reused across requests, scheduled jobs, or long-running processes.
Stale Connections and Connection Pooling
SQLAlchemy's connection pool keeps a set of open database connections and hands them out on demand. By default, it uses a QueuePool that recycles connections rather than closing and reopening them. That's great for performance. It's a trap for data freshness.
If a connection in the pool has an open transaction β or if the database server has recycled the connection on its end without SQLAlchemy knowing β you can end up reading from a snapshot that doesn't reflect the current state of the table.
A particularly common scenario: another process inserts rows and commits. Your process reuses a pooled connection that still has its transaction isolation snapshot from before that commit. You query the table. You get the old data. No error, no warning.
# Risky: reusing the same engine-level connection across multiple reads
with engine.connect() as conn:
df1 = pd.read_sql("SELECT count(*) FROM events", conn)
# ... some time passes, other processes write data ...
df2 = pd.read_sql("SELECT count(*) FROM events", conn)
# df1 and df2 may return the same count due to transaction snapshot
The fix is to ensure each read happens in a fresh connection context, especially when you need to see the latest committed data.
# Better: each read_sql call gets its own connection from the pool
def fetch_pending_orders(engine):
with engine.connect() as conn:
return pd.read_sql(
"SELECT * FROM orders WHERE status = 'pending'",
conn
)
# Call this function each time you need fresh data
df = fetch_pending_orders(engine)
Opening a new with engine.connect() block for each query ensures SQLAlchemy checks out a connection, executes the query, and returns the connection to the pool β cleanly closing any implicit transaction in the process.
Passing a Raw DBAPI Connection vs. an Engine
Pandas accepts both a SQLAlchemy engine and a raw DBAPI connection (the kind you'd get directly from psycopg2.connect() or sqlite3.connect()). The behavior differs in ways that bite people.
With a raw DBAPI connection, pandas does not manage the transaction for you. If that connection is sitting inside an uncommitted transaction from earlier in your code, your read_sql call reads from that transaction's snapshot. You're not seeing the committed state of the database β you're seeing what your own transaction has accumulated so far, plus whatever was committed before your transaction started.
import psycopg2
# Direct DBAPI connection β you own the transaction lifecycle
raw_conn = psycopg2.connect("dbname=mydb user=user password=pass host=localhost")
# If there's an open transaction here, read_sql sees its snapshot
df = pd.read_sql("SELECT * FROM orders", raw_conn)
# You must manage commit/rollback yourself
raw_conn.rollback() # or raw_conn.commit()
raw_conn.close()
With a SQLAlchemy engine or Connection object, SQLAlchemy handles more of this lifecycle. Prefer engines in production code unless you have a specific reason to manage a DBAPI connection directly.
Transaction Isolation Level Mismatches
Most relational databases default to Read Committed isolation, where a query sees all data committed before the query began. But some configurations β especially in long-running analytics connections or when you've explicitly started a REPEATABLE READ or SERIALIZABLE transaction β mean your reads are pinned to an older snapshot of the database.
PostgreSQL in particular starts a new transaction implicitly the first time you execute a statement on a connection. If you never commit or roll back, subsequent reads on that same connection keep seeing the same snapshot, even as other transactions commit new data around you.
# Check your isolation level when debugging stale reads
with engine.connect() as conn:
result = conn.execute("SHOW transaction_isolation")
print(result.fetchone())
If you're doing scheduled reads inside a long-running service and you keep seeing the same data despite rows being inserted elsewhere, check whether your connection is holding an open transaction. The cleanest fix is to use a fresh connection for each read, or to explicitly call BEGIN and COMMIT around your read to force a new snapshot.
from sqlalchemy import text
with engine.begin() as conn: # engine.begin() auto-commits on block exit
df = pd.read_sql(text("SELECT * FROM orders WHERE status = 'pending'"), conn)
Using engine.begin() instead of engine.connect() wraps the block in an explicit transaction that commits (or rolls back on exception) when the block exits. This guarantees you start each read with a fresh transaction scope.
Parameterized Queries and Silent Type Coercion
Mismatched data often comes from a subtler source: incorrect query parameters. When you pass parameters to read_sql, the syntax depends on your backend, and getting it wrong doesn't always raise an exception β sometimes it just silently ignores your filter.
# Wrong: string formatting (also a SQL injection risk)
status = 'pending'
df = pd.read_sql(f"SELECT * FROM orders WHERE status = '{status}'", engine)
# Wrong: using %s without a DBAPI cursor (depends on dialect)
df = pd.read_sql(
"SELECT * FROM orders WHERE status = %s",
engine,
params=['pending'] # May silently fail with some backends
)
# Right: use SQLAlchemy text() with named parameters
from sqlalchemy import text
with engine.connect() as conn:
df = pd.read_sql(
text("SELECT * FROM orders WHERE status = :status"),
conn,
params={"status": "pending"}
)
The text() approach is dialect-independent and properly handles type coercion. When you use raw string formatting and a parameter has the wrong type β say, you're passing an integer where the column expects a string β the database might cast silently, match no rows, or return a broader result set than intended.
Schema and Column Name Drift
Another class of mismatched data: your DataFrame has the right shape but the columns mean something different from what you expect. This happens when the underlying table schema changes but your query or downstream code doesn't.
If you're using SELECT *, a column added or reordered in the table will silently change your DataFrame's structure. Code that accesses columns by position (rather than name) will start reading the wrong data without error.
# Fragile: column order depends on table definition
df = pd.read_sql("SELECT * FROM orders", engine)
price = df.iloc[:, 3] # This breaks if a column is inserted before index 3
# Robust: name the columns you need explicitly
df = pd.read_sql(
"SELECT order_id, customer_id, total_price, status FROM orders",
engine
)
price = df['total_price'] # This is stable across schema changes
Explicit column selection also makes queries faster β the database sends less data over the wire β and makes your intent clear to anyone reading the code later.
Connection Recycling in Scheduled Jobs
Scheduled pipelines that create an engine once at startup and reuse it for hours or days are particularly vulnerable. Database servers have their own timeout settings; they'll close idle connections after a period of inactivity. SQLAlchemy's pool doesn't always know about this, so it hands out what it thinks is a live connection that the server has already dropped.
SQLAlchemy provides a pool_pre_ping option precisely for this situation. When enabled, it issues a lightweight check before using a pooled connection, discards it if the server has closed it, and opens a fresh one.
engine = create_engine(
"postgresql+psycopg2://user:pass@localhost/mydb",
pool_pre_ping=True, # Test connections before use
pool_recycle=3600, # Recycle connections every hour
pool_size=5,
max_overflow=10
)
Set pool_recycle to a value shorter than your database server's idle connection timeout. Check your server's configuration β for MySQL this is wait_timeout, for PostgreSQL it's set at the system or role level.
Common Pitfalls at a Glance
| Pitfall | Symptom | Fix |
|---|---|---|
| Reused connection with open transaction | Reads return old snapshot | Use a fresh engine.connect() per read |
| Raw DBAPI connection with dirty state | Mismatched or partial data | Rollback before reading, or use SQLAlchemy engine |
| Repeatable Read isolation | Stale data despite new commits | Use engine.begin() to force new transaction scope |
| Bad parameter syntax | Wrong filter, too many or too few rows | Use text() with named parameters |
| SELECT * with schema changes | Wrong values in positional access | Always name your columns explicitly |
| Stale pooled connection after timeout | OperationalError or silent reconnect to stale state | Enable pool_pre_ping=True, set pool_recycle |
Debugging a Suspected Stale Read
When you suspect read_sql is returning old data, start by confirming what the database actually holds. Run the same query directly in a database client like DBeaver or psql and compare the result to your DataFrame. If the counts differ, the problem is in your connection or transaction state, not the query itself.
Next, add a SELECT now() or SELECT current_timestamp to your read to check when the database thinks the query ran. If it returns a time from the past, your connection is serving a cached or replicated result.
from sqlalchemy import text
with engine.connect() as conn:
ts = conn.execute(text("SELECT now()")).scalar()
print("DB time:", ts)
df = pd.read_sql(text("SELECT * FROM orders WHERE status = 'pending'"), conn)
print("Row count:", len(df))
If SELECT now() returns the current time but your data is still old, the issue is transaction isolation. If now() itself looks off, you may be hitting a read replica that's lagging behind the primary β a separate problem, but an important one to rule out.
Wrapping Up
Stale or mismatched data from read_sql is almost never a pandas bug. It's a connection lifecycle and transaction management problem, and it's fixable once you know where to look. Here are concrete steps to take right now:
- Audit your engine creation code and add
pool_pre_ping=Trueand a sensiblepool_recyclevalue to any engine used in a long-running process. - Switch from raw DBAPI connections to SQLAlchemy engine connections for all
read_sqlcalls. Let SQLAlchemy manage the transaction lifecycle. - Replace
SELECT *with explicit column lists in any query that feeds downstream code relying on specific column names or positions. - Use
text()with named parameters for any parameterized query instead of string formatting or positional%splaceholders. - Add a
SELECT now()probe to your debugging toolkit whenever data looks suspicious β it takes 30 seconds and immediately tells you whether you're looking at a lagging connection or a real data absence.
π€ Share this article
Sign in to saveRelated Articles
Comments (0)
No comments yet. Be the first!