Redis Cache Stampede: How to Fix It When Keys Expire

Your Redis cache is humming along fine — until a key expires at the wrong moment. Suddenly ten, fifty, or a hundred workers all miss the cache simultaneously and each one fires a full database query to rebuild the value. Your database melts, latency spikes, and users see errors. This is the cache stampede problem, and it is more common than most teams expect.

What you'll learn

Why cache stampede happens and how to spot it in production
How to use distributed locking to serialize cache rebuilds
How probabilistic early expiry eliminates the problem at the source
How SETNX-based patterns and Lua scripts keep things atomic
Practical trade-offs between each approach so you can pick the right one

Prerequisites

You should be comfortable writing Python or reading it closely enough to translate examples to your own language. The code samples use the redis-py library. You'll need Redis 6 or later running locally or in a staging environment to test these patterns yourself.

Why Stampede Happens

Every cached value has a TTL. Once that TTL elapses, the key expires and GET returns nil. If your traffic is low, one worker rebuilds the value and sets it before the next request arrives, so nothing goes wrong. But under load, dozens of concurrent requests can all call GET mykey, all receive nil, and all immediately start querying the database to compute a fresh value.

The rebuild itself is usually expensive — a complex SQL join, an external API call, a heavy aggregation. Multiplying that cost by 50 simultaneous workers turns a manageable operation into a thundering herd. The database buckles, Redis gets a flood of SET calls for the same key a millisecond apart, and latency climbs for every user on the platform.

This is sometimes called the dog-pile effect or thundering herd. All three names describe the same race condition: many processes doing redundant work because a shared signal (the cached value) disappeared at the same instant.

Detecting Stampede in Production

Before you fix anything, confirm you're actually looking at a stampede rather than a different bottleneck. Two signals give it away quickly.

Database query spikes tied to cache misses

Graph your database slow-query count alongside your Redis keyspace miss rate (keyspace_misses from INFO stats). During a stampede you'll see both lines spike at exactly the same moment, and the DB spike will look disproportionate — more queries than you'd expect for the traffic volume.

Redis INFO stats

redis-cli INFO stats | grep -E 'keyspace_hits|keyspace_misses'

A healthy cache keeps the miss rate well below five percent during steady-state traffic. A sudden jump to 30–80 percent during a key-expiry event is a strong indicator of stampede. Pair this with APM traces showing many identical downstream queries executing in the same 100ms window and the diagnosis is confirmed.
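
One detail matters when you compute that percentage: keyspace_hits and keyspace_misses are cumulative counters, so the miss rate for a given window comes from the difference between two samples, not from the raw values. The following is a minimal sketch, assuming you sample INFO stats yourself and pass the two snapshots in as plain dictionaries. The function name miss_rate and the sample numbers are illustrative only.

```python
def miss_rate(before: dict, after: dict) -> float:
    """Return the fraction of lookups that missed between two INFO stats samples."""
    hits = after["keyspace_hits"] - before["keyspace_hits"]
    misses = after["keyspace_misses"] - before["keyspace_misses"]
    total = hits + misses
    if total <= 0:
        return 0.0  # no lookups in the window (or the counters were reset)
    return misses / total


# Made-up counter values, as if sampled ten seconds apart.
earlier = {"keyspace_hits": 10_000, "keyspace_misses": 200}
later = {"keyspace_hits": 10_600, "keyspace_misses": 600}
window_rate = miss_rate(earlier, later)
print(f"{window_rate:.0%}")  # prints 40%
```

With redis-py, r.info('stats') returns a dictionary that includes both counters, so two calls a few seconds apart can be passed straight in.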

Strategy 1: Distributed Locking (Mutex on Rebuild)

The most direct fix is to make only one worker responsible for rebuilding a given key. Every other worker that misses the cache waits for the lock holder to finish, then reads the freshly cached value.

The pattern looks like this: try to acquire a Redis lock before starting the expensive rebuild. If you get the lock, rebuild and release. If you don't get the lock, wait briefly and retry the cache read.

import time
import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

LOCK_TIMEOUT = 10   # seconds the lock is held at most
WAIT_INTERVAL = 0.05  # 50ms between retries
MAX_WAIT = 5.0       # give up waiting after 5 seconds

def get_with_lock(cache_key: str, rebuild_fn, ttl: int = 300) -> str:
    value = r.get(cache_key)
    if value is not None:
        return value

    lock_key = f'lock:{cache_key}'
    acquired = r.set(lock_key, '1', nx=True, ex=LOCK_TIMEOUT)

    if acquired:
        try:
            value = rebuild_fn()
            r.set(cache_key, value, ex=ttl)
        finally:
            r.delete(lock_key)
        return value

    # Another worker holds the lock — wait for the cache to be populated.
    waited = 0.0
    while waited < MAX_WAIT:
        time.sleep(WAIT_INTERVAL)
        waited += WAIT_INTERVAL
        value = r.get(cache_key)
        if value is not None:
            return value

    # Fallback: if the lock holder died, do the rebuild ourselves.
    return rebuild_fn()

The nx=True flag on SET makes the lock acquisition atomic — only one caller gets True back. The ex=LOCK_TIMEOUT ensures the lock expires automatically if the worker crashes mid-rebuild, preventing a permanent lockout.

When this approach fits

Distributed locking works well when the rebuild is slow and you genuinely want to block redundant work. It adds latency for waiting workers, so it is less suitable when your rebuild is fast and slightly stale data is acceptable.

Strategy 2: Probabilistic Early Expiry (PER)

Locking introduces coordination overhead. Probabilistic early expiry avoids the race entirely by refreshing the value before it expires, based on a random roll that becomes more likely as the key approaches its TTL deadline.

The idea comes from the paper "Optimal Probabilistic Cache Stampede Prevention" (Vattani, Chierichetti, and Lowenstein, VLDB 2015), where the algorithm is called XFetch. Each worker that reads the key evaluates a small formula: should I preemptively rebuild now, even though the key is still valid? The probability of answering yes increases as expiry gets closer. By the time the key actually expires, a hot key has very likely already been refreshed by one early worker.

import json
import math
import random
import time
import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

BETA = 1.0  # controls how aggressively to pre-fetch; 1.0 is the standard default

def get_with_per(cache_key: str, rebuild_fn, ttl: int = 300) -> str:
    """
    Each cached entry is stored alongside its recompute time and TTL
    so workers can evaluate the early-expiry formula.
    """
    raw = r.get(cache_key)
    if raw:
        entry = json.loads(raw)
        expiry_time = entry['stored_at'] + entry['ttl']
        recompute_time = entry['recompute_time']
        remaining = expiry_time - time.time()

        # XFetch check. log() of a number in (0, 1] is zero or negative, so
        # `head_start` shrinks the remaining time by a random amount.
        # 1.0 - random.random() lies in (0, 1], which avoids log(0).
        head_start = BETA * recompute_time * math.log(1.0 - random.random())
        if remaining + head_start > 0:
            # Still comfortably before expiry: serve the cached value.
            return entry['value']
        # Otherwise fall through to rebuild early.

    start = time.time()
    value = rebuild_fn()
    recompute_time = time.time() - start

    entry = json.dumps({
        'value': value,
        'stored_at': time.time(),
        'ttl': ttl,
        'recompute_time': recompute_time,
    })
    r.set(cache_key, entry, ex=ttl)
    return value

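The logarithm of a uniform random number between 0 and 1 is never positive, and it is occasionally large in magnitude. Scaled by BETA and the measured recompute time, it acts as a random head start on expiry: the less time remains on the key, the more likely a given read is to trigger a rebuild. The roll is independent on every read; it does not accumulate across calls. Tune BETA upward to make early refreshes happen more aggressively, or downward to make them rarer.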
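
Rearranging the XFetch check shows how quickly that probability rises. With u uniform on (0, 1], the rebuild condition remaining + BETA * recompute_time * ln(u) <= 0 holds exactly when u <= exp(-remaining / (BETA * recompute_time)), so that exponential is the chance that a single read triggers an early rebuild. The helper below is a sketch for illustration only. Its name is hypothetical, and the 2-second recompute time is an assumed figure.

```python
import math


def early_refresh_probability(remaining: float, recompute_time: float, beta: float = 1.0) -> float:
    """Chance that one read triggers an early rebuild under the XFetch check."""
    if remaining <= 0:
        return 1.0  # already at or past expiry: always rebuild
    if recompute_time <= 0 or beta <= 0:
        return 0.0  # no head start at all: never rebuild early
    return math.exp(-remaining / (beta * recompute_time))


# Hypothetical entry that takes 2 seconds to recompute.
for seconds_left in (60, 10, 2, 0.5):
    p = early_refresh_probability(seconds_left, recompute_time=2.0)
    print(f"{seconds_left:>5}s left -> {p:.4f}")
```

Under these assumptions, a read 10 seconds before expiry rebuilds less than 1 percent of the time, while a read 2 seconds before expiry rebuilds roughly 37 percent of the time. Doubling BETA has the same effect as doubling the recompute time: the early-refresh window widens.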

When this approach fits

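PER is excellent when slight read overhead is acceptable and you want zero coordination between workers. Its guarantee is probabilistic, though. On a key that is read rarely, no worker may roll an early rebuild, and the key then expires like any other. On a hot key, two workers can occasionally both roll a rebuild in the same window. Both cases cost far less than a full stampede, and you can wrap the rebuild in a lock if even that is too much.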

Strategy 3: Stale-While-Revalidate with a Background Worker

A third pattern is to serve stale data immediately on a miss and kick off an asynchronous rebuild in the background. Workers never block; they always return something, even if it is slightly outdated.

import threading
import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

def get_stale_while_revalidate(
    cache_key: str,
    stale_key: str,
    rebuild_fn,
    ttl: int = 300,
    stale_ttl: int = 600,
) -> str:
    value = r.get(cache_key)
    if value is not None:
        return value

    # Primary key is gone: look for a stale copy and try to claim the rebuild.
    stale = r.get(stale_key)
    rebuild_lock = f'rebuilding:{cache_key}'
    # SET NX returns True only for the one caller that claims the rebuild.
    claimed_rebuild = r.set(rebuild_lock, '1', nx=True, ex=ttl + 10)

    def rebuild_and_store() -> str:
        try:
            fresh = rebuild_fn()
            r.set(cache_key, fresh, ex=ttl)
            r.set(stale_key, fresh, ex=stale_ttl)
            return fresh
        finally:
            r.delete(rebuild_lock)

    if stale is not None:
        if claimed_rebuild:
            # Refresh in the background; this request returns immediately.
            threading.Thread(target=rebuild_and_store, daemon=True).start()
        return stale

    # No stale data exists at all: this request must block.
    if claimed_rebuild:
        return rebuild_and_store()
    # Another worker is already rebuilding; compute inline without caching.
    return rebuild_fn()

Here you maintain two keys: a primary key with a short TTL and a stale key with a longer TTL. When the primary expires, requests immediately get the stale value while one background thread refreshes both keys. Fresh data is back in place as soon as that rebuild finishes. The one unprotected path is a cold start, when neither key exists yet and callers have to compute the value inline.

Choosing a Strategy

Each approach makes a different trade-off between complexity, latency, and data freshness. Here's a quick comparison.

Strategy	Data freshness	Added latency on miss	Complexity
Distributed lock	Immediate on unlock	High (waiting workers block)	Medium
Probabilistic early expiry	Near-fresh (rarely misses)	Minimal	Low
Stale-while-revalidate	Slightly stale during rebuild	None (serves stale instantly)	Medium

For most read-heavy APIs where users can tolerate a few seconds of staleness, stale-while-revalidate or probabilistic early expiry is the right call. If your data must be fresh the moment a client requests it (financial data, inventory counts), go with a lock but keep the lock timeout short and the rebuild fast.

Common Pitfalls

Forgetting to delete the lock on exception

If your rebuild function raises an exception and r.delete(lock_key) is not inside a finally block, the lock lingers until its TTL expires. In the meantime no worker can acquire it, so every request that misses the cache sits in its retry loop (up to MAX_WAIT) and then falls back to its own rebuild. Always release the lock in a finally clause.

Setting TTL too short relative to rebuild time

If your lock TTL is 2 seconds but your database query takes 3 seconds, the lock expires before the rebuild completes. A second worker acquires the lock, runs a duplicate rebuild, and you're back to redundant work. Set LOCK_TIMEOUT to at least twice your 99th-percentile rebuild time.
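
An expired lock causes a second problem: when the first worker finally finishes, its unconditional DEL removes the lock that now belongs to the second worker. A common remedy is to store a unique token in the lock and release it with a short Lua script that deletes the key only if the token still matches. The sketch below assumes a redis-py style client passed in as client. The helper names acquire_lock and release_lock are hypothetical, not part of any library.

```python
import uuid

# Lua runs atomically inside Redis: delete the lock only if it still holds our token.
RELEASE_SCRIPT = """
if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('DEL', KEYS[1])
end
return 0
"""


def acquire_lock(client, lock_key: str, timeout: int):
    """Return a unique token if the lock was acquired, otherwise None."""
    token = uuid.uuid4().hex
    if client.set(lock_key, token, nx=True, ex=timeout):
        return token
    return None


def release_lock(client, lock_key: str, token: str) -> bool:
    """Release the lock only if this caller still owns it."""
    # redis-py signature: eval(script, numkeys, *keys_and_args)
    return client.eval(RELEASE_SCRIPT, 1, lock_key, token) == 1
```

In get_with_lock, this would replace the plain r.set(...) and r.delete(lock_key) pair: keep the token returned by acquire_lock and pass it to release_lock in the finally block. redis-py also provides a Lock helper (r.lock(...)) built on the same token idea, which may be preferable to maintaining your own script.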

Using wall-clock time as your only expiry signal

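When your Redis instance hits its maxmemory limit under a policy such as allkeys-lru, keys can disappear long before their TTL. Your stampede prevention code must handle sudden evictions, not just scheduled expiry. Switching maxmemory-policy to volatile-lru restricts eviction to keys that carry a TTL, which protects non-expiring keys but not cache entries, since those have TTLs by design. The more dependable safeguards are sizing maxmemory with headroom and watching the evicted_keys counter in INFO stats.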
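
To tell evictions apart from ordinary expiry, it can help to record the eviction policy next to the relevant counters. This is a rough sketch, assuming a redis-py style client is passed in. The function name eviction_report is hypothetical, and some managed Redis services disable the CONFIG command, in which case the policy lookup would fail.

```python
def eviction_report(client) -> dict:
    """Summarize the eviction policy and counters from a redis-py style client."""
    stats = client.info("stats")                       # INFO stats as a dictionary
    policy = client.config_get("maxmemory-policy")     # CONFIG GET as a dictionary
    return {
        "policy": policy.get("maxmemory-policy"),
        "evicted_keys": stats.get("evicted_keys", 0),  # removed under memory pressure
        "expired_keys": stats.get("expired_keys", 0),  # removed because the TTL elapsed
    }
```

A rising evicted_keys value between two reports means keys are leaving the cache for reasons your TTLs do not explain.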

Not testing under concurrent load

Stampede patterns are invisible in unit tests and local development. Write a quick load test that fires 100 concurrent requests at the same endpoint immediately after you delete the cache key. Tools like locust or even a simple Python ThreadPoolExecutor are enough to reproduce the problem in staging before it hits production.

from concurrent.futures import ThreadPoolExecutor
import requests

def hit_endpoint(_):
    return requests.get('http://localhost:8000/api/expensive-resource').status_code

with ThreadPoolExecutor(max_workers=100) as pool:
    results = list(pool.map(hit_endpoint, range(100)))

print(results)  # All should be 200, latency should stay flat.
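
Status codes alone do not prove that the rebuild ran only once. A stricter check counts rebuilds directly. The sketch below does that entirely in process, with no Redis server or HTTP endpoint. InMemoryCache is a hypothetical stand-in that mimics only get, set with nx, and delete, without TTLs. The get_with_lock here is a simplified copy of the locking pattern with one addition: it re-checks the cache after acquiring the lock.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class InMemoryCache:
    """Tiny thread-safe stand-in for the Redis calls used here (no TTL handling)."""

    def __init__(self):
        self._data = {}
        self._mutex = threading.Lock()

    def get(self, key):
        with self._mutex:
            return self._data.get(key)

    def set(self, key, value, nx=False):
        with self._mutex:
            if nx and key in self._data:
                return None
            self._data[key] = value
            return True

    def delete(self, key):
        with self._mutex:
            self._data.pop(key, None)


def get_with_lock(cache, key, rebuild_fn, wait_interval=0.01, max_wait=5.0):
    value = cache.get(key)
    if value is not None:
        return value
    lock_key = f"lock:{key}"
    if cache.set(lock_key, "1", nx=True):
        try:
            # Re-check: another worker may have finished between our miss and our lock.
            value = cache.get(key)
            if value is None:
                value = rebuild_fn()
                cache.set(key, value)
        finally:
            cache.delete(lock_key)
        return value
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        time.sleep(wait_interval)
        value = cache.get(key)
        if value is not None:
            return value
    return rebuild_fn()  # lock holder never delivered; rebuild ourselves


rebuild_log = []  # one entry per rebuild; list.append is thread-safe in CPython


def slow_rebuild():
    rebuild_log.append(time.monotonic())
    time.sleep(0.1)  # simulate an expensive query
    return "fresh-value"


cache = InMemoryCache()
with ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(lambda _: get_with_lock(cache, "report", slow_rebuild), range(50)))

assert results == ["fresh-value"] * 50
assert len(rebuild_log) == 1, f"expected one rebuild, saw {len(rebuild_log)}"
```

Removing the lock branch from get_with_lock should make the final assertion fail, which is a quick way to confirm the test really detects a stampede.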

Next Steps

You now have three solid strategies to choose from, along with the pitfalls that trip up most implementations. Here's where to go from here.

Instrument your miss rate first. Add a metric counter every time your application hits nil from Redis. You cannot tune what you cannot measure.
Start with probabilistic early expiry if you're adding this to an existing codebase — it requires the least structural change and no coordination between services.
Add a load test to your CI pipeline that deliberately expires a key under 50-worker concurrency and asserts that your database sees only one or two rebuild queries, not fifty.
Review your maxmemory-policy setting in Redis. Make sure unexpected evictions under memory pressure won't bypass your stampede protection.
Consider a Redis client with built-in stampede protection if you're on a stack with mature libraries — some client wrappers (notably in PHP and Go ecosystems) implement locking or PER natively.
