Pandas read_sql Returning Stale or Mismatched Data: Connection and Query Pitfalls

June 03, 2026 9 min read 32 views
Flat illustration of a database cylinder linked to a code editor window with a clock overlay indicating stale data on a soft gradient background

You run pd.read_sql() and get a DataFrame that looks perfectly formatted β€” column names match, dtypes look sane, row count seems plausible. But the numbers are wrong. Or worse, you're looking at data from three hours ago and you don't know it yet. This kind of silent failure is far more common than an outright exception, which makes it genuinely dangerous in production pipelines.

The root causes almost always trace back to connection lifecycle issues, transaction isolation, or subtle query parameter mistakes. None of them raise an error; they just quietly return the wrong answer.

What you'll learn

  • Why a reused or stale connection can return cached or uncommitted data
  • How SQLAlchemy connection pooling interacts with read_sql
  • The difference between passing a raw DBAPI connection versus an engine
  • How transaction isolation levels cause you to read old rows
  • Practical patterns for reliable, repeatable SQL reads in pandas

Prerequisites

This article assumes you're working with pandas 1.3 or later and SQLAlchemy 1.4 or 2.x. The examples use PostgreSQL, but the same principles apply to MySQL, SQLite, and most other relational databases. You should be comfortable writing basic SQL and have a working database connection already configured.

How read_sql Actually Works

pd.read_sql() is a thin wrapper. You hand it a SQL string (or a SQLAlchemy text object) and a connection, and it executes the query, fetches all rows, and stuffs them into a DataFrame. That simplicity is part of the appeal, but it also means pandas does almost no connection management on your behalf.

Under the hood, pandas calls read_sql_query or read_sql_table depending on what you pass. The connection object you provide is used as-is. Pandas does not open a fresh connection for every call, does not automatically commit or rollback, and does not check whether the connection is healthy before using it.

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")

df = pd.read_sql("SELECT * FROM orders WHERE status = 'pending'", engine)
print(df.shape)

This works fine in a simple script. The problems start when that engine or connection object gets reused across requests, scheduled jobs, or long-running processes.

Stale Connections and Connection Pooling

SQLAlchemy's connection pool keeps a set of open database connections and hands them out on demand. By default, it uses a QueuePool that recycles connections rather than closing and reopening them. That's great for performance. It's a trap for data freshness.

If a connection in the pool has an open transaction β€” or if the database server has recycled the connection on its end without SQLAlchemy knowing β€” you can end up reading from a snapshot that doesn't reflect the current state of the table.

A particularly common scenario: another process inserts rows and commits. Your process reuses a pooled connection that still has its transaction isolation snapshot from before that commit. You query the table. You get the old data. No error, no warning.

# Risky: reusing the same engine-level connection across multiple reads
with engine.connect() as conn:
    df1 = pd.read_sql("SELECT count(*) FROM events", conn)
    # ... some time passes, other processes write data ...
    df2 = pd.read_sql("SELECT count(*) FROM events", conn)
    # df1 and df2 may return the same count due to transaction snapshot

The fix is to ensure each read happens in a fresh connection context, especially when you need to see the latest committed data.

# Better: each read_sql call gets its own connection from the pool
def fetch_pending_orders(engine):
    with engine.connect() as conn:
        return pd.read_sql(
            "SELECT * FROM orders WHERE status = 'pending'",
            conn
        )

# Call this function each time you need fresh data
df = fetch_pending_orders(engine)

Opening a new with engine.connect() block for each query ensures SQLAlchemy checks out a connection, executes the query, and returns the connection to the pool β€” cleanly closing any implicit transaction in the process.

Passing a Raw DBAPI Connection vs. an Engine

Pandas accepts both a SQLAlchemy engine and a raw DBAPI connection (the kind you'd get directly from psycopg2.connect() or sqlite3.connect()). The behavior differs in ways that bite people.

With a raw DBAPI connection, pandas does not manage the transaction for you. If that connection is sitting inside an uncommitted transaction from earlier in your code, your read_sql call reads from that transaction's snapshot. You're not seeing the committed state of the database β€” you're seeing what your own transaction has accumulated so far, plus whatever was committed before your transaction started.

import psycopg2

# Direct DBAPI connection β€” you own the transaction lifecycle
raw_conn = psycopg2.connect("dbname=mydb user=user password=pass host=localhost")

# If there's an open transaction here, read_sql sees its snapshot
df = pd.read_sql("SELECT * FROM orders", raw_conn)

# You must manage commit/rollback yourself
raw_conn.rollback()  # or raw_conn.commit()
raw_conn.close()

With a SQLAlchemy engine or Connection object, SQLAlchemy handles more of this lifecycle. Prefer engines in production code unless you have a specific reason to manage a DBAPI connection directly.

Transaction Isolation Level Mismatches

Most relational databases default to Read Committed isolation, where a query sees all data committed before the query began. But some configurations β€” especially in long-running analytics connections or when you've explicitly started a REPEATABLE READ or SERIALIZABLE transaction β€” mean your reads are pinned to an older snapshot of the database.

PostgreSQL in particular starts a new transaction implicitly the first time you execute a statement on a connection. If you never commit or roll back, subsequent reads on that same connection keep seeing the same snapshot, even as other transactions commit new data around you.

# Check your isolation level when debugging stale reads
with engine.connect() as conn:
    result = conn.execute("SHOW transaction_isolation")
    print(result.fetchone())

If you're doing scheduled reads inside a long-running service and you keep seeing the same data despite rows being inserted elsewhere, check whether your connection is holding an open transaction. The cleanest fix is to use a fresh connection for each read, or to explicitly call BEGIN and COMMIT around your read to force a new snapshot.

from sqlalchemy import text

with engine.begin() as conn:  # engine.begin() auto-commits on block exit
    df = pd.read_sql(text("SELECT * FROM orders WHERE status = 'pending'"), conn)

Using engine.begin() instead of engine.connect() wraps the block in an explicit transaction that commits (or rolls back on exception) when the block exits. This guarantees you start each read with a fresh transaction scope.

Parameterized Queries and Silent Type Coercion

Mismatched data often comes from a subtler source: incorrect query parameters. When you pass parameters to read_sql, the syntax depends on your backend, and getting it wrong doesn't always raise an exception β€” sometimes it just silently ignores your filter.

# Wrong: string formatting (also a SQL injection risk)
status = 'pending'
df = pd.read_sql(f"SELECT * FROM orders WHERE status = '{status}'", engine)

# Wrong: using %s without a DBAPI cursor (depends on dialect)
df = pd.read_sql(
    "SELECT * FROM orders WHERE status = %s",
    engine,
    params=['pending']  # May silently fail with some backends
)

# Right: use SQLAlchemy text() with named parameters
from sqlalchemy import text

with engine.connect() as conn:
    df = pd.read_sql(
        text("SELECT * FROM orders WHERE status = :status"),
        conn,
        params={"status": "pending"}
    )

The text() approach is dialect-independent and properly handles type coercion. When you use raw string formatting and a parameter has the wrong type β€” say, you're passing an integer where the column expects a string β€” the database might cast silently, match no rows, or return a broader result set than intended.

Schema and Column Name Drift

Another class of mismatched data: your DataFrame has the right shape but the columns mean something different from what you expect. This happens when the underlying table schema changes but your query or downstream code doesn't.

If you're using SELECT *, a column added or reordered in the table will silently change your DataFrame's structure. Code that accesses columns by position (rather than name) will start reading the wrong data without error.

# Fragile: column order depends on table definition
df = pd.read_sql("SELECT * FROM orders", engine)
price = df.iloc[:, 3]  # This breaks if a column is inserted before index 3

# Robust: name the columns you need explicitly
df = pd.read_sql(
    "SELECT order_id, customer_id, total_price, status FROM orders",
    engine
)
price = df['total_price']  # This is stable across schema changes

Explicit column selection also makes queries faster β€” the database sends less data over the wire β€” and makes your intent clear to anyone reading the code later.

Connection Recycling in Scheduled Jobs

Scheduled pipelines that create an engine once at startup and reuse it for hours or days are particularly vulnerable. Database servers have their own timeout settings; they'll close idle connections after a period of inactivity. SQLAlchemy's pool doesn't always know about this, so it hands out what it thinks is a live connection that the server has already dropped.

SQLAlchemy provides a pool_pre_ping option precisely for this situation. When enabled, it issues a lightweight check before using a pooled connection, discards it if the server has closed it, and opens a fresh one.

engine = create_engine(
    "postgresql+psycopg2://user:pass@localhost/mydb",
    pool_pre_ping=True,       # Test connections before use
    pool_recycle=3600,        # Recycle connections every hour
    pool_size=5,
    max_overflow=10
)

Set pool_recycle to a value shorter than your database server's idle connection timeout. Check your server's configuration β€” for MySQL this is wait_timeout, for PostgreSQL it's set at the system or role level.

Common Pitfalls at a Glance

PitfallSymptomFix
Reused connection with open transactionReads return old snapshotUse a fresh engine.connect() per read
Raw DBAPI connection with dirty stateMismatched or partial dataRollback before reading, or use SQLAlchemy engine
Repeatable Read isolationStale data despite new commitsUse engine.begin() to force new transaction scope
Bad parameter syntaxWrong filter, too many or too few rowsUse text() with named parameters
SELECT * with schema changesWrong values in positional accessAlways name your columns explicitly
Stale pooled connection after timeoutOperationalError or silent reconnect to stale stateEnable pool_pre_ping=True, set pool_recycle

Debugging a Suspected Stale Read

When you suspect read_sql is returning old data, start by confirming what the database actually holds. Run the same query directly in a database client like DBeaver or psql and compare the result to your DataFrame. If the counts differ, the problem is in your connection or transaction state, not the query itself.

Next, add a SELECT now() or SELECT current_timestamp to your read to check when the database thinks the query ran. If it returns a time from the past, your connection is serving a cached or replicated result.

from sqlalchemy import text

with engine.connect() as conn:
    ts = conn.execute(text("SELECT now()")).scalar()
    print("DB time:", ts)
    df = pd.read_sql(text("SELECT * FROM orders WHERE status = 'pending'"), conn)
    print("Row count:", len(df))

If SELECT now() returns the current time but your data is still old, the issue is transaction isolation. If now() itself looks off, you may be hitting a read replica that's lagging behind the primary β€” a separate problem, but an important one to rule out.

Wrapping Up

Stale or mismatched data from read_sql is almost never a pandas bug. It's a connection lifecycle and transaction management problem, and it's fixable once you know where to look. Here are concrete steps to take right now:

  • Audit your engine creation code and add pool_pre_ping=True and a sensible pool_recycle value to any engine used in a long-running process.
  • Switch from raw DBAPI connections to SQLAlchemy engine connections for all read_sql calls. Let SQLAlchemy manage the transaction lifecycle.
  • Replace SELECT * with explicit column lists in any query that feeds downstream code relying on specific column names or positions.
  • Use text() with named parameters for any parameterized query instead of string formatting or positional %s placeholders.
  • Add a SELECT now() probe to your debugging toolkit whenever data looks suspicious β€” it takes 30 seconds and immediately tells you whether you're looking at a lagging connection or a real data absence.

πŸ“€ Share this article

Sign in to save

Comments (0)

No comments yet. Be the first!

Leave a Comment

Sign in to comment with your profile.

πŸ“¬ Weekly Newsletter

Stay ahead of the curve

Get the best programming tutorials, data analytics tips, and tool reviews delivered to your inbox every week.

No spam. Unsubscribe anytime.