Fixing Redis Cache Stampede When Multiple Workers Hit Expired Keys
Your Redis cache is humming along fine until a key expires at the wrong moment. Suddenly ten, fifty, or a hundred workers all miss the cache simultaneously and each one fires a full database query to rebuild the value. Your database melts, latency spikes, and users see errors. This is the cache stampede problem, and it is more common than most teams expect.
What you'll learn
- Why cache stampede happens and how to spot it in production
- How to use distributed locking to serialize cache rebuilds
- How probabilistic early expiry eliminates the problem at the source
- How SETNX-based patterns and Lua scripts keep things atomic
- Practical trade-offs between each approach so you can pick the right one
Prerequisites
You should be comfortable writing Python or reading it closely enough to translate examples to your own language. The code samples use the redis-py library. You'll need Redis 6 or later running locally or in a staging environment to test these patterns yourself.
Why Stampede Happens
Every cached value has a TTL. The moment that TTL reaches zero, Redis expires the key. If your traffic is low, one worker rebuilds the value and sets it before the next request arrives, so nothing goes wrong. But under load, dozens of concurrent requests can all call GET mykey, all receive nil, and all immediately start querying the database to compute a fresh value.
The rebuild itself is usually expensive: a complex SQL join, an external API call, a heavy aggregation. Multiplying that cost by 50 simultaneous workers turns a manageable operation into a thundering herd. The database buckles, Redis gets a flood of SET calls for the same key a millisecond apart, and latency climbs for every user on the platform.
This is sometimes called the dog-pile effect or thundering herd. All three names describe the same race condition: many processes doing redundant work because a shared signal (the cached value) disappeared at the same instant.
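To make the race concrete, here is a minimal in-process sketch. It is not Redis code: a plain dict stands in for the cache, and `slow_rebuild` is a hypothetical placeholder for the expensive database query. A `threading.Barrier` forces every worker to finish its cache read before any of them writes, which reproduces the worst-case timing described above.

```python
import threading

NUM_WORKERS = 20
cache = {}                       # stand-in for Redis; "mykey" has just expired
rebuild_count = 0
count_lock = threading.Lock()
barrier = threading.Barrier(NUM_WORKERS)


def slow_rebuild() -> str:
    """Hypothetical expensive query; it only counts how often it runs."""
    global rebuild_count
    with count_lock:
        rebuild_count += 1
    return "fresh-value"


def naive_worker() -> None:
    value = cache.get("mykey")   # every worker sees a miss
    barrier.wait()               # all reads complete before any write happens
    if value is None:
        cache["mykey"] = slow_rebuild()


threads = [threading.Thread(target=naive_worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"{rebuild_count} rebuilds for {NUM_WORKERS} workers")
```

Every worker rebuilds the same value, so the cost of one miss is multiplied by the number of concurrent readers.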
Detecting Stampede in Production
Before you fix anything, confirm you're actually looking at a stampede rather than a different bottleneck. Two signals give it away quickly.
Database query spikes tied to cache misses
Graph your database slow-query count alongside your Redis keyspace miss rate (keyspace_misses from INFO stats). During a stampede you'll see both lines spike at exactly the same moment, and the DB spike will look disproportionate: more queries than you'd expect for the traffic volume.
Redis INFO stats
```bash
redis-cli INFO stats | grep -E 'keyspace_hits|keyspace_misses'
```
A healthy cache keeps the miss rate well below five percent during steady-state traffic. A sudden jump to 30 to 80 percent during a key-expiry event is a strong indicator of stampede. Pair this with APM traces showing many identical downstream queries executing in the same 100ms window and the diagnosis is confirmed.
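The two counters are cumulative, so a lifetime ratio can hide a short spike. The sketch below computes the miss rate between two snapshots instead. The helper names and the sample numbers are made up for illustration; with redis-py the dictionaries would come from `r.info("stats")`.

```python
def miss_rate(stats: dict) -> float:
    """Fraction of lookups that missed, given keyspace_hits/keyspace_misses."""
    hits = int(stats.get("keyspace_hits", 0))
    misses = int(stats.get("keyspace_misses", 0))
    total = hits + misses
    return misses / total if total else 0.0


def window_miss_rate(before: dict, after: dict) -> float:
    """Miss rate between two snapshots of the cumulative counters."""
    delta = {
        "keyspace_hits": int(after["keyspace_hits"]) - int(before["keyspace_hits"]),
        "keyspace_misses": int(after["keyspace_misses"]) - int(before["keyspace_misses"]),
    }
    return miss_rate(delta)


# Illustrative snapshots taken a few seconds apart.
before = {"keyspace_hits": 9_500, "keyspace_misses": 500}
after = {"keyspace_hits": 9_800, "keyspace_misses": 900}

print(f"lifetime: {miss_rate(after):.1%}")
print(f"last window: {window_miss_rate(before, after):.1%}")
```

In this sample the lifetime figure still looks tolerable while the most recent window is dominated by misses, which is the pattern to alert on.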
Strategy 1: Distributed Locking (Mutex on Rebuild)
The most direct fix is to make only one worker responsible for rebuilding a given key. Every other worker that misses the cache waits for the lock holder to finish, then reads the freshly cached value.
The pattern looks like this: try to acquire a Redis lock before starting the expensive rebuild. If you get the lock, rebuild and release. If you don't get the lock, wait briefly and retry the cache read.
```python
import time

import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

LOCK_TIMEOUT = 10     # seconds the lock is held at most
WAIT_INTERVAL = 0.05  # 50ms between retries
MAX_WAIT = 5.0        # give up waiting after 5 seconds


def get_with_lock(cache_key: str, rebuild_fn, ttl: int = 300) -> str:
    value = r.get(cache_key)
    if value is not None:
        return value

    lock_key = f'lock:{cache_key}'
    acquired = r.set(lock_key, '1', nx=True, ex=LOCK_TIMEOUT)

    if acquired:
        try:
            value = rebuild_fn()
            r.set(cache_key, value, ex=ttl)
        finally:
            r.delete(lock_key)
        return value

    # Another worker holds the lock: wait for the cache to be populated.
    waited = 0.0
    while waited < MAX_WAIT:
        time.sleep(WAIT_INTERVAL)
        waited += WAIT_INTERVAL
        value = r.get(cache_key)
        if value is not None:
            return value

    # Fallback: if the lock holder died, do the rebuild ourselves.
    return rebuild_fn()
```
The nx=True flag on SET makes the lock acquisition atomic: only one caller gets True back. The ex=LOCK_TIMEOUT ensures the lock expires automatically if the worker crashes mid-rebuild, preventing a permanent lockout.
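One gap worth knowing about: an unconditional r.delete(lock_key) can remove a lock that has already expired and been re-acquired by a different worker. A common remedy is to store a random token as the lock value and release through a small Lua script that deletes the key only if the token still matches. The sketch below is one way to do that, not the only one. The helper names `acquire_lock` and `release_lock` are hypothetical, and `client` is assumed to be any object exposing redis-py's `set` and `eval` methods.

```python
import uuid

# Delete the lock only if it still holds our token. Redis runs a script
# atomically, so no other command can slip in between the GET and the DEL.
RELEASE_SCRIPT = """
if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('DEL', KEYS[1])
end
return 0
"""


def acquire_lock(client, lock_key: str, timeout: int = 10):
    """Return a unique token if the lock was acquired, otherwise None."""
    token = uuid.uuid4().hex
    if client.set(lock_key, token, nx=True, ex=timeout):
        return token
    return None


def release_lock(client, lock_key: str, token: str) -> bool:
    """Release the lock only if this caller still owns it."""
    return client.eval(RELEASE_SCRIPT, 1, lock_key, token) == 1
```

A worker would call `acquire_lock` before rebuilding and pass the returned token to `release_lock` in its finally clause. If the lock timed out and someone else took it in the meantime, the release becomes a harmless no-op.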
When this approach fits
Distributed locking works well when the rebuild is slow and you genuinely want to block redundant work. It adds latency for waiting workers, so it is less suitable when your rebuild is fast and slightly stale data is acceptable.
Strategy 2: Probabilistic Early Expiry (PER)
Locking introduces coordination overhead. Probabilistic early expiry avoids the race entirely by refreshing the value before it expires, based on a random roll that becomes more likely as the key approaches its TTL deadline.
The idea comes from the paper "Optimal Probabilistic Cache Stampede Prevention" by Vattani, Chierichetti, and Lowenstein (VLDB 2015), which calls the algorithm XFetch. Each worker that reads the key evaluates a small formula: should I preemptively rebuild now, even though the key is still valid? The probability of answering yes increases as expiry gets closer. By the time the key actually expires, the cache has almost certainly already been refreshed by one early worker.
```python
import json
import math
import random
import time

import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

BETA = 1.0  # controls how aggressively to pre-fetch; 1.0 is the standard default


def get_with_per(cache_key: str, rebuild_fn, ttl: int = 300) -> str:
    """
    Each cached entry is stored alongside its recompute time and TTL
    so workers can evaluate the early-expiry formula.
    """
    raw = r.get(cache_key)
    if raw:
        entry = json.loads(raw)
        expiry_time = entry['stored_at'] + entry['ttl']
        recompute_time = entry['recompute_time']
        remaining = expiry_time - time.time()

        # XFetch formula: log(u) is never positive, so the second term
        # shortens the remaining time by a random amount. Keep serving the
        # cached value only while the result is still positive.
        u = 1.0 - random.random()  # uniform in (0, 1]; avoids log(0)
        if remaining + BETA * recompute_time * math.log(u) > 0:
            return entry['value']

    # Key missing or early-refresh roll succeeded: rebuild and time it.
    start = time.time()
    value = rebuild_fn()
    recompute_time = time.time() - start

    entry = json.dumps({
        'value': value,
        'stored_at': time.time(),
        'ttl': ttl,
        'recompute_time': recompute_time,
    })
    r.set(cache_key, entry, ex=ttl)
    return value
```
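The logarithm of a uniform random number in (0, 1] is never positive, so scaling it by BETA and the recompute time yields a random head start that is taken off the remaining TTL. The closer the key is to expiry, and the slower the rebuild, the more likely that head start exceeds the time left and triggers a rebuild. Tune BETA upward to make early refreshes happen more aggressively, or downward to make them rarer.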
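To get a feel for how quickly the odds rise, the sketch below isolates the decision as a pure function and compares sampled frequencies with the closed form that follows from the rule. It assumes the XFetch condition "refresh when remaining + beta * recompute_time * ln(u) <= 0" with u uniform in (0, 1]. The function names and the 0.5 second recompute time are illustrative choices.

```python
import math
import random
from typing import Optional


def should_refresh(remaining: float, recompute_time: float,
                   beta: float = 1.0, u: Optional[float] = None) -> bool:
    """XFetch decision: True means rebuild now even though the key is valid."""
    if u is None:
        u = 1.0 - random.random()  # uniform in (0, 1], so log() never sees 0
    return remaining + beta * recompute_time * math.log(u) <= 0


def refresh_probability(remaining: float, recompute_time: float,
                        beta: float = 1.0) -> float:
    """Closed form of the same rule: P(u <= exp(-remaining / (beta * delta)))."""
    return min(1.0, math.exp(-remaining / (beta * recompute_time)))


TRIALS = 20_000
for remaining in (10.0, 2.0, 0.5, 0.1):
    fired = sum(should_refresh(remaining, recompute_time=0.5) for _ in range(TRIALS))
    print(f"{remaining:>5.1f}s left: sampled {fired / TRIALS:.3f}, "
          f"exact {refresh_probability(remaining, 0.5):.3f}")
```

With many seconds left the probability is practically zero, so reads stay cheap. It climbs steeply only once the remaining time is comparable to the rebuild duration.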
When this approach fits
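PER is excellent when slight read overhead is acceptable and you want zero coordination between workers. It degrades gracefully: if no worker happens to trigger an early rebuild, you fall back to an ordinary expiry miss, which is no worse than having no protection at all. Because nothing stops two workers from rolling an early refresh at the same moment, expect an occasional duplicate rebuild rather than a guarantee of exactly one.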
Strategy 3: Stale-While-Revalidate with a Background Worker
A third pattern is to serve stale data immediately on a miss and kick off an asynchronous rebuild in the background. Workers never block; they always return something, even if it is slightly outdated.
```python
import threading

import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)


def get_stale_while_revalidate(
    cache_key: str,
    stale_key: str,
    rebuild_fn,
    ttl: int = 300,
    stale_ttl: int = 600,
) -> str:
    value = r.get(cache_key)
    if value is not None:
        return value

    stale = r.get(stale_key)
    if stale is None:
        # No stale data exists at all: block, rebuild, and fill both keys.
        fresh = rebuild_fn()
        r.set(cache_key, fresh, ex=ttl)
        r.set(stale_key, fresh, ex=stale_ttl)
        return fresh

    # Primary key is gone: serve stale data and trigger one async rebuild.
    rebuild_lock = f'rebuilding:{cache_key}'
    acquired = r.set(rebuild_lock, '1', nx=True, ex=ttl + 10)

    if acquired:
        def background_rebuild():
            try:
                fresh = rebuild_fn()
                r.set(cache_key, fresh, ex=ttl)
                r.set(stale_key, fresh, ex=stale_ttl)
            finally:
                r.delete(rebuild_lock)

        threading.Thread(target=background_rebuild, daemon=True).start()

    return stale
```
Here you maintain two keys: a primary key with a short TTL and a stale key with a longer TTL. When the primary expires, requests immediately get the stale value while one background thread refreshes both keys. As soon as that rebuild finishes, fresh data is back in place. Only a completely cold key, with no stale copy either, forces a caller to block on the rebuild.
Choosing a Strategy
Each approach makes a different trade-off between complexity, latency, and data freshness. Here's a quick comparison.
| Strategy | Data freshness | Added latency on miss | Complexity |
|---|---|---|---|
| Distributed lock | Immediate on unlock | High (waiting workers block) | Medium |
| Probabilistic early expiry | Near-fresh (rarely misses) | Minimal | Low |
| Stale-while-revalidate | Slightly stale during rebuild | None (serves stale instantly) | Medium |
For most read-heavy APIs where users can tolerate a few seconds of staleness, stale-while-revalidate or probabilistic early expiry is the right call. If your data must be fresh the moment a client requests it (financial data, inventory counts), go with a lock but keep the lock timeout short and the rebuild fast.
Common Pitfalls
Forgetting to delete the lock on exception
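If your rebuild function raises an exception and you don't have a finally block around r.delete(lock_key), the lock lingers until its TTL expires. In the meantime no worker can acquire it: with the locking function shown earlier, each one polls for up to MAX_WAIT seconds and then falls back to rebuilding on its own, which is exactly the redundant work the lock was meant to prevent. Always release the lock in a finally clause.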
Setting TTL too short relative to rebuild time
If your lock TTL is 2 seconds but your database query takes 3 seconds, the lock expires before the rebuild completes. A second worker acquires the lock, runs a duplicate rebuild, and you're back to redundant work. Set LOCK_TIMEOUT to at least twice your 99th-percentile rebuild time.
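As a rough illustration of that rule of thumb, the sketch below derives a lock timeout from measured rebuild durations. The function name, the sample data, and the choice of the inclusive quantile method are assumptions made for this example. In practice the samples would come from your own timing metrics.

```python
import math
import statistics
from typing import Sequence


def suggested_lock_timeout(rebuild_seconds: Sequence[float],
                           safety_factor: float = 2.0) -> int:
    """Lock TTL in whole seconds: safety_factor times the p99 rebuild time."""
    # quantiles(n=100) returns 99 cut points; the last one is the 99th
    # percentile. It needs at least two samples.
    p99 = statistics.quantiles(rebuild_seconds, n=100, method="inclusive")[-1]
    return max(1, math.ceil(safety_factor * p99))


# Illustrative measurements: every rebuild took about three seconds.
samples = [3.0] * 20
print(suggested_lock_timeout(samples))
```

For the three-second query in the scenario above this suggests a six-second lock rather than the two-second one that expired too early.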
Using wall-clock time as your only expiry signal
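Keys do not only disappear when their TTL runs out. When your Redis instance is under memory pressure and starts evicting keys with allkeys-lru, keys can vanish long before their TTL. Your stampede prevention code must handle sudden evictions, not just scheduled expiry. A maxmemory-policy of volatile-lru limits eviction to keys that carry an explicit TTL, which protects persistent data sharing the instance. Cache entries with a TTL remain eligible, though, so treat every miss as a potential stampede trigger regardless of the policy.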
Not testing under concurrent load
Stampede patterns are invisible in unit tests and local development. Write a quick load test that fires 100 concurrent requests at the same endpoint immediately after you delete the cache key. Tools like locust or even a simple Python ThreadPoolExecutor are enough to reproduce the problem in staging before it hits production.
```python
from concurrent.futures import ThreadPoolExecutor

import requests


def hit_endpoint(_):
    return requests.get('http://localhost:8000/api/expensive-resource').status_code


with ThreadPoolExecutor(max_workers=100) as pool:
    results = list(pool.map(hit_endpoint, range(100)))

print(results)  # Expect all 200s; latency should stay flat too (not measured here).
```
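The snippet above only records status codes. If you also want to see whether latency stays flat, a small timing wrapper is enough. The sketch below is one possible shape for it. `timed_call` and `run_burst` are hypothetical helper names, and the lambda is a placeholder workload that you would replace with the real HTTP call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(fn):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def run_burst(fn, workers: int = 100):
    """Fire `workers` concurrent calls; return results and sorted latencies."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(lambda _: timed_call(fn), range(workers)))
    results = [result for result, _ in outcomes]
    latencies = sorted(elapsed for _, elapsed in outcomes)
    return results, latencies


# Placeholder workload; swap in the requests.get(...) call to test an endpoint.
results, latencies = run_burst(lambda: 200, workers=20)
median = latencies[len(latencies) // 2]
print(f"ok={results.count(200)} median={median:.4f}s max={latencies[-1]:.4f}s")
```

Comparing the median and maximum of a burst fired right after deleting the key with those of a warm-cache burst shows how much waiting the chosen strategy adds.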
Next Steps
You now have three solid strategies to choose from, along with the pitfalls that trip up most implementations. Here's where to go from here.
- Instrument your miss rate first. Add a metric counter every time your application hits nil from Redis. You cannot tune what you cannot measure.
- Start with probabilistic early expiry if you're adding this to an existing codebase: it requires the least structural change and no coordination between services.
- Add a load test to your CI pipeline that deliberately expires a key under 50-worker concurrency and asserts that your database sees only one or two rebuild queries, not fifty.
- Review your maxmemory-policy setting in Redis. Make sure unexpected evictions under memory pressure won't bypass your stampede protection.
- Consider a Redis client with built-in stampede protection if you're on a stack with mature libraries. Some client wrappers (notably in PHP and Go ecosystems) implement locking or PER natively.