Fixing Silently Stale Reads When Using Redis as a Write-Through Cache
Your write-through cache is supposed to keep Redis and your database in sync. You write to the database, you write to Redis simultaneously, and reads should always return fresh data. Except they don't, and the failure is quiet enough that your users notice before your monitoring does.
Stale reads in a write-through setup are one of those bugs that feel impossible until you understand the half-dozen ways they can actually happen. This article walks through every common cause and gives you code-level fixes you can apply today.
What you'll learn
- Why write-through caching can still produce stale reads despite the name
- How partial failures between your app, Redis, and your database cause silent divergence
- How to use TTLs, versioning, and atomic writes to close the gaps
- How to detect staleness before your users do
- Concrete patterns for making cache invalidation reliable under real-world conditions
Prerequisites
This article assumes you're running Redis as an application-level cache (not Redis as a primary store), that your application writes to a relational or document database, and that you have basic familiarity with Redis commands. Code examples use Python with the redis-py library, but the patterns translate to any language.
How Write-Through Caching Is Supposed to Work
In a write-through cache, every write goes to the cache and the database together before the write is considered complete. Reads then come from the cache, which should always reflect the latest state of the database. The theory is clean: no lazy invalidation, no stale reads.
In practice, "together" is doing a lot of work in that sentence. There is no built-in two-phase commit between your application, Redis, and a relational database. That gap is where stale reads hide.
Root Cause 1: Non-Atomic Dual Writes
The most common cause of silent staleness is a write that succeeds in one store and fails in the other, with no retry or rollback logic to clean up.
Consider this naive pattern:
def update_user(user_id, data):
    db.execute("UPDATE users SET name=%s WHERE id=%s", (data["name"], user_id))
    redis_client.set(f"user:{user_id}", json.dumps(data))
If the database write succeeds and the Redis write throws a ConnectionError, the cache is now stale. Future reads return the old value. The database and cache have diverged, and nothing logged anything meaningful about it.
The fix is to treat the Redis write as a required step, not an afterthought, and to handle failures explicitly:
import json
import logging
from redis.exceptions import RedisError

logger = logging.getLogger(__name__)

# db, redis_client, and enqueue_cache_refresh are assumed to be defined elsewhere.
def update_user(user_id, data):
    db.execute("UPDATE users SET name=%s WHERE id=%s", (data["name"], user_id))
    db.commit()
    try:
        redis_client.set(
            f"user:{user_id}",
            json.dumps(data),
            ex=300  # always set a TTL as a safety net
        )
    except RedisError:
        # Log and enqueue a background job to retry the cache write
        logger.error("Cache write failed for user:%s", user_id)
        enqueue_cache_refresh("user", user_id)
This is still not atomic, but it makes the failure visible and recoverable. A background job can re-read from the database and repopulate the cache key.
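The `enqueue_cache_refresh` helper used above is not defined in this article. One way it could work is sketched below, assuming an in-process queue and a hypothetical `load_row(entity, entity_id)` callable that reads the authoritative row from the database. The `cache` argument is any object with redis-py-style `set(key, value, ex=...)` and `delete(key)` methods. A production setup would more likely use a durable job queue than an in-memory one.

```python
import json
import queue

# In-process queue of (entity, entity_id) pairs waiting for a cache refresh.
refresh_queue = queue.Queue()

def enqueue_cache_refresh(entity, entity_id):
    refresh_queue.put((entity, entity_id))

def drain_refresh_queue(cache, load_row, ttl_seconds=300):
    """Process every queued refresh once; re-queue the ones that fail again."""
    failed = []
    processed = 0
    while True:
        try:
            entity, entity_id = refresh_queue.get_nowait()
        except queue.Empty:
            break
        key = f"{entity}:{entity_id}"
        try:
            row = load_row(entity, entity_id)  # hypothetical database read
            if row is None:
                cache.delete(key)  # row is gone, so drop the cached copy
            else:
                cache.set(key, json.dumps(row), ex=ttl_seconds)
            processed += 1
        except Exception:
            # Broad on purpose for the sketch; narrow this to RedisError and
            # your database driver's error types in real code.
            failed.append((entity, entity_id))
    for item in failed:
        refresh_queue.put(item)
    return processed
```

Because the job re-reads from the database instead of replaying the original payload, a retry that runs late still writes the current value rather than the one that failed.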
Root Cause 2: Missing or Infinite TTLs
A Redis key without a TTL lives forever unless something explicitly deletes it. If your invalidation logic has a bug, or if a deployment changes the data format, you end up with stale keys that never expire.
Always set a TTL on every cache write, even in a write-through setup. The TTL is not your primary invalidation mechanism; it is your fallback when the primary mechanism fails.
# Bad: no TTL, key lives until Redis is flushed or runs out of memory
redis_client.set(f"user:{user_id}", json.dumps(data))
# Good: TTL acts as a safety net
redis_client.set(f"user:{user_id}", json.dumps(data), ex=600)
Pick a TTL that is short enough to limit blast radius but long enough to not thrash your database with reads. For most user-facing data, anywhere from 60 seconds to 10 minutes is a reasonable starting range depending on how frequently the data changes.
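To find keys that slipped through without an expiry, a small audit helper can scan a key pattern and check each key's TTL. This is a minimal sketch, assuming a redis-py-style client and a recent redis-py version in which `ttl()` returns -1 for a key with no expiry; the `user:*` pattern and the function name are illustrative.

```python
def find_keys_without_ttl(client, pattern="user:*", limit=100):
    """Return up to `limit` keys matching `pattern` that have no expiry."""
    offenders = []
    # scan_iter walks the keyspace incrementally instead of blocking like KEYS.
    for key in client.scan_iter(match=pattern, count=500):
        # TTL returns -1 for a key with no expiry and -2 for a missing key.
        if client.ttl(key) == -1:
            offenders.append(key)
            if len(offenders) >= limit:
                break
    return offenders
```

Running this on a schedule and logging the result gives you an early warning when a new code path starts writing keys without a TTL.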
Root Cause 3: Race Conditions on Concurrent Writes
Two processes writing to the same cache key at the same time can produce a result where the older value wins. Process A writes version 5 to the database, then process B writes version 6 to the database. But B's cache write finishes first, then A's cache write overwrites it. The database has version 6; Redis has version 5.
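The interleaving described above can be reproduced deterministically with two plain dictionaries standing in for the database and Redis. This is only an illustration of the ordering problem, not real client code.

```python
db = {}
cache = {}

def db_write(key, version):
    db[key] = version

def cache_write(key, version):
    cache[key] = version

# Same order of events as in the paragraph above.
db_write("user:1", 5)     # process A commits version 5
db_write("user:1", 6)     # process B commits version 6
cache_write("user:1", 6)  # B's cache write lands first
cache_write("user:1", 5)  # A's delayed cache write overwrites it

print(db["user:1"], cache["user:1"])  # prints: 6 5
```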
Redis's SET ... NX (set if not exists) only helps with the initial population of a key, because it refuses to overwrite an existing value. For updates, a more robust guard is a version number embedded in the key or value.
Option 1: Version the cache key
def update_user(user_id, data, version):
    # Store with the version number so old writes target a different key
    redis_client.set(
        f"user:{user_id}:v{version}",
        json.dumps(data),
        ex=300
    )
    # Point the canonical key at the new version. This pointer write is a plain
    # SET, so a delayed writer can still overwrite it; Option 2 closes that gap.
    redis_client.set(f"user:{user_id}:current", version, ex=300)
Option 2: Use a Lua script for compare-and-set
Redis executes Lua scripts atomically, which lets you do a compare-and-set without a race window:
CAS_SCRIPT = """
local current = redis.call('HGET', KEYS[1], 'version')
if current == false or tonumber(current) < tonumber(ARGV[1]) then
    redis.call('HSET', KEYS[1], 'data', ARGV[2], 'version', ARGV[1])
    redis.call('EXPIRE', KEYS[1], ARGV[3])
    return 1
end
return 0
"""

def update_user_atomic(user_id, data, version):
    result = redis_client.eval(
        CAS_SCRIPT,
        1,
        f"user:{user_id}",
        version,
        json.dumps(data),
        300
    )
    if result == 0:
        logger.info("Skipped stale write for user:%s at version %s", user_id, version)
The script only updates the cache if the incoming version is newer than what is already stored. An out-of-order write simply loses gracefully.
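Note that the script stores the value in a Redis hash with `data` and `version` fields, so a plain GET on the same key will not work. A matching read path might look like the sketch below, which assumes a redis-py-style client; the function name is illustrative.

```python
import json

def get_user_versioned(client, user_id):
    """Read the hash written by the compare-and-set script; None on a miss."""
    data, version = client.hmget(f"user:{user_id}", ["data", "version"])
    if data is None or version is None:
        return None
    return {"data": json.loads(data), "version": int(version)}
```

Returning the version alongside the data lets callers pass it back on the next update, so the compare-and-set check has something to compare against.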
Root Cause 4: Cache Population After a Database Read
Some teams implement write-through by reading from the database and then populating Redis on every write. This is correct. But a subtle variant of this pattern introduces staleness: reading from the cache to populate the write payload, then writing that payload back.
# Dangerous pattern: reading from cache to construct the update
cached = json.loads(redis_client.get(f"user:{user_id}"))
cached["name"] = new_name
db.execute("UPDATE users SET data=%s WHERE id=%s", (json.dumps(cached), user_id))
redis_client.set(f"user:{user_id}", json.dumps(cached), ex=300)
If the cache was already stale before this write, you just committed stale data to the database too. Always construct your write payload from the authoritative source (the database or the validated request body), never from a cached value.
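A safer version of the same update reads the current row from the database, applies the change, and only then writes the cache. The sketch below assumes a hypothetical data-access object with `fetch_user` and `save_user` methods; neither is part of any real library, and your own database layer will look different.

```python
import json

def rename_user(db, cache, user_id, new_name, ttl_seconds=300):
    # Build the payload from the database row, never from the cached copy.
    row = db.fetch_user(user_id)      # hypothetical: returns a dict or None
    if row is None:
        raise KeyError(f"user {user_id} not found")
    updated = {**row, "name": new_name}
    db.save_user(user_id, updated)    # hypothetical: writes and commits
    # The cache receives exactly what was committed.
    cache.set(f"user:{user_id}", json.dumps(updated), ex=ttl_seconds)
    return updated
```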
Root Cause 5: Replication Lag on Redis Replicas
If your application reads from Redis replicas and writes to the primary, replication lag can cause reads to return old data even when the primary is up to date. This is especially common in cloud-managed Redis setups where replicas are in different availability zones.
For data that must be strongly consistent, always read from the primary:
# redis-py: use the primary connection for reads that require freshness
import json
import redis

primary_client = redis.Redis(host="redis-primary", port=6379)

def get_user_fresh(user_id):
    raw = primary_client.get(f"user:{user_id}")
    if raw is not None:
        return json.loads(raw)
    # Cache miss: fall back to the database (db is assumed to be defined elsewhere)
    return db.query("SELECT * FROM users WHERE id=%s", user_id)
For data where eventual consistency is acceptable, replicas are fine. The key is to be deliberate about which reads go where.
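One way to make that choice explicit is to have callers state their freshness requirement at the call site. This is a minimal sketch, assuming two redis-py-style clients are passed in; the `require_fresh` flag is an illustrative name.

```python
import json

def get_user(user_id, primary, replica, require_fresh=False):
    # Route to the primary only when the caller cannot tolerate replication lag.
    client = primary if require_fresh else replica
    raw = client.get(f"user:{user_id}")
    return json.loads(raw) if raw is not None else None
```

Defaulting to the replica keeps load off the primary, while a read that follows the user's own write can opt in with `require_fresh=True`.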
Root Cause 6: Eviction Silently Removes Keys
Redis evicts keys under memory pressure based on its configured maxmemory-policy. If your policy is allkeys-lru or allkeys-random, Redis can evict a key that your write-through logic just populated. The next read misses the cache, falls through to the database, and repopulates, which is correct behavior. But if the database is also stale for any reason (a failed transaction, a replica read), the repopulated value will be stale too.
Check your Redis configuration:
redis-cli CONFIG GET maxmemory-policy
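For write-through caches where you need guarantees, volatile-lru (evict only keys with a TTL set) is generally safer than allkeys-lru. This way, keys you deliberately store without an expiry are not silently removed. Combine this with always setting TTLs so every cache key is eligible for controlled eviction. Under volatile-lru, keys without a TTL are never evicted, and Redis rejects writes once memory is full and nothing is evictable.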
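The same check can run from application code, for example in a health endpoint. The sketch below assumes a redis-py-style client and that the CONFIG command is permitted, which some managed Redis services restrict; the function name is illustrative.

```python
def eviction_report(client):
    """Return the eviction policy and the number of keys evicted so far."""
    policy = client.config_get("maxmemory-policy").get("maxmemory-policy")
    # evicted_keys is a counter in the "stats" section of INFO.
    evicted = client.info("stats").get("evicted_keys", 0)
    return {"policy": policy, "evicted_keys": int(evicted)}
```

A steadily rising `evicted_keys` counter means Redis is under memory pressure and your freshly written keys may not survive as long as their TTL suggests.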
Detecting Staleness Before Your Users Do
Passive fixes help, but you also want active detection. Two approaches work well in production.
Write a checksum into the cache value
Include a hash of the data and the write timestamp in every cached payload. A background process can periodically read a sample of cache keys, re-fetch the corresponding rows from the database, and compare:
import hashlib
import json
import time

def build_cache_payload(data):
    serialized = json.dumps(data, sort_keys=True)
    checksum = hashlib.sha256(serialized.encode()).hexdigest()[:8]
    return json.dumps({
        "data": data,
        "checksum": checksum,
        "written_at": time.time()
    })

def verify_cache_key(user_id):
    raw = redis_client.get(f"user:{user_id}")
    if not raw:
        return  # cache miss, not a staleness issue
    cached = json.loads(raw)
    # Assumes the cached data has the same shape as the database row
    db_row = db.query("SELECT * FROM users WHERE id=%s", user_id)
    db_serialized = json.dumps(db_row, sort_keys=True)
    db_checksum = hashlib.sha256(db_serialized.encode()).hexdigest()[:8]
    if cached["checksum"] != db_checksum:
        logger.warning("Stale cache detected for user:%s", user_id)
        # Re-populate the cache from the database
        redis_client.set(
            f"user:{user_id}",
            build_cache_payload(db_row),
            ex=300
        )
Track cache age in your metrics
Emit the written_at timestamp as a metric whenever you serve a cached read. If the age of served cache values is consistently higher than your expected maximum (your TTL), something is preventing normal cache refresh cycles and you want to know about it before it becomes a user complaint.
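Computing the age is a small pure function over the cached payload. The sketch below assumes the payload shape with a `written_at` field described earlier in this section; how you ship the number to your metrics system is left out, and the function names are illustrative.

```python
import json
import time

def cache_age_seconds(raw_payload, now=None):
    """Seconds since the cached payload was written."""
    payload = json.loads(raw_payload)
    now = time.time() if now is None else now
    # Clamp at zero in case of small clock differences between hosts.
    return max(0.0, now - payload["written_at"])

def exceeds_max_age(raw_payload, ttl_seconds, now=None):
    """True when a served value is older than the TTL should ever allow."""
    return cache_age_seconds(raw_payload, now) > ttl_seconds
```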
Common Pitfalls to Watch For
- Deleting instead of updating: If your invalidation path deletes the key rather than writing a fresh value, the next read will fall through to the database. That is correct, but it means your write-through guarantee only holds if the delete and the database write are also coordinated. If the delete lands while the database write is still in flight, a concurrent read misses the cache, reads the not-yet-updated row, and repopulates the cache with the old value.
- Serialization format drift: If you change the shape of the JSON you store in Redis (add a field, rename a key) without flushing or versioning the existing keys, reads will deserialize old-format data and may silently drop fields or return wrong values.
- Multiple writers without coordination: Microservices that each manage their own cache writes for the same data entity will eventually conflict. Centralize cache writes for shared entities or use a shared cache-invalidation event bus.
- Resetting the TTL on reads: If you extend the TTL on each cache read, a frequently read but rarely written key may never expire and can serve stale data indefinitely. Let TTLs count down; only reset them on writes.
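For the serialization drift pitfall, one option is to stamp each cached payload with a schema version and treat any mismatch as a cache miss. This is a sketch under that assumption; the `schema` field name and the helper names are illustrative, not an established convention.

```python
import json

SCHEMA_VERSION = 2  # bump whenever the cached JSON shape changes

def encode_cached(data):
    return json.dumps({"schema": SCHEMA_VERSION, "data": data})

def decode_cached(raw):
    """Return the cached data, or None if it is missing or in an old format."""
    if raw is None:
        return None
    payload = json.loads(raw)
    if not isinstance(payload, dict) or payload.get("schema") != SCHEMA_VERSION:
        return None  # treat old-format entries as a cache miss
    return payload["data"]
```

With this in place, a deployment that changes the payload shape does not need a cache flush: old entries are ignored on read and replaced on the next write or repopulation.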
Wrapping Up
Write-through caching reduces staleness compared to lazy invalidation, but it does not eliminate the problem. Every place where your app, Redis, and your database interact without atomicity is a potential divergence point.
Here are five concrete actions to take now:
- Audit every cache write in your codebase and confirm each one has an explicit TTL. Add one if it is missing.
- Add error handling around Redis writes so a connection failure triggers a background cache-refresh job rather than silently leaving a stale key.
- Check your Redis maxmemory-policy and switch to volatile-lru if you are not already using it.
- Implement a version or checksum field in your cached payloads so you can detect divergence programmatically.
- Write a staleness-detection job that samples cache keys against the database on a schedule and alerts when divergence exceeds a threshold.
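The sampling job from the last item could start as small as the sketch below. It assumes two hypothetical callables, `read_cached(user_id)` and `read_db(user_id)`, that return comparable values (with `None` for a cache miss), and an alert threshold of 1% chosen only for illustration.

```python
import random

def divergence_ratio(user_ids, read_cached, read_db, sample_size=100, rng=random):
    """Fraction of sampled, cached entries that differ from the database."""
    ids = list(user_ids)
    sample = rng.sample(ids, min(sample_size, len(ids)))
    checked = stale = 0
    for user_id in sample:
        cached = read_cached(user_id)
        if cached is None:
            continue  # cache miss, not a staleness issue
        checked += 1
        if cached != read_db(user_id):
            stale += 1
    return stale / checked if checked else 0.0

def should_alert(ratio, threshold=0.01):
    return ratio > threshold
```

A write that lands between the two reads can show up as a false positive, so alert on a sustained ratio rather than on a single mismatch.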
Staleness in a cache is almost always a consequence of an unhandled failure path, not a fundamental limitation of write-through caching. Once you know where to look, fixing it is methodical work.