Fixing Django Celery Tasks That Silently Fail Without Retries

May 23, 2026 6 min read 46 views

You deployed a feature that sends emails or processes uploads via a Celery task. Everything looks fine in staging. Then a user complains their file never processed, and when you check your worker logs, there's nothing. No error, no traceback: the task just vanished. This is what silent Celery failure looks like, and it's more common than you'd expect.

What you'll learn

  • Why Celery tasks fail without raising visible errors
  • How to configure automatic retries with exponential backoff
  • How to catch and re-raise exceptions correctly inside tasks
  • How to set up dead-letter queues and monitoring hooks
  • Common configuration mistakes that disable retry behavior entirely

Prerequisites

This guide assumes you're running Django 3.2+ with Celery 5.x and a Redis or RabbitMQ broker. The patterns apply to both brokers unless noted. You should already have a working Celery app object connected to Django.

Why Tasks Fail Silently

Celery does not retry a failed task unless you've configured it to do so. If you haven't, an unhandled exception marks the task as FAILURE in the result backend, but the task never retries, and nothing alerts you unless you're actively watching the result store or have a monitoring tool hooked up.

There are three common reasons a task appears to do nothing:

  • The exception is swallowed by a bare except block inside your task code.
  • The task has no autoretry_for or self.retry() call, so on failure it simply stops.
  • The result backend isn't configured, so the failure state is never persisted anywhere you can see it.

Let's fix each one systematically.

The Anatomy of a Broken Task

Here's a pattern that looks reasonable but hides failures completely:

# tasks.py: DON'T do this
from celery import shared_task
import requests

@shared_task
def fetch_data(url):
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()
        return response.json()
    except Exception:
        pass  # <-- this kills all visibility

The bare except Exception: pass means every network error, timeout, and JSON parse failure disappears. The task returns None, Celery marks it SUCCESS, and you have no idea anything went wrong. Remove silent exception swallowing before anything else.
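
If a task needs to record context about a failure, one safer shape is to log the error and then re-raise it, so Celery still sees the exception. The following is a minimal sketch in plain Python with no Celery involved; `load_json`, its `fetch` parameter, and `broken_fetch` are hypothetical names used only for illustration:

```python
import logging

logger = logging.getLogger(__name__)


def load_json(fetch, url):
    """Call fetch(url); log any failure with its traceback, then re-raise it."""
    try:
        return fetch(url)
    except Exception:
        # logger.exception records the message plus the active traceback.
        logger.exception("Fetching %s failed", url)
        raise  # re-raise so the caller (or Celery) still sees the failure


def broken_fetch(url):
    # Hypothetical failing dependency, used only to demonstrate the pattern.
    raise TimeoutError("upstream timed out")
```

Calling `load_json(broken_fetch, url)` logs the traceback and still raises `TimeoutError`, which is the property the `except Exception: pass` version loses.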

Configuring Automatic Retries

Celery's autoretry_for option, available since Celery 4.0 and so present in every 5.x release, is the cleanest way to enable retries for specific exception types. You declare which exceptions should trigger a retry directly on the decorator.

# tasks.py: correct retry configuration
from celery import shared_task
from celery.utils.log import get_task_logger
import requests

logger = get_task_logger(__name__)

@shared_task(
    bind=True,
    autoretry_for=(requests.exceptions.RequestException, ConnectionError),
    retry_kwargs={"max_retries": 5},
    retry_backoff=True,
    retry_backoff_max=600,  # cap backoff at 10 minutes
    retry_jitter=True,
)
def fetch_data(self, url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()

retry_backoff=True tells Celery to double the wait time between each retry attempt, starting at one second. retry_jitter=True randomizes each delay, picking a value between zero and the computed backoff, so workers don't all hammer the same downstream service at once. These two options together are important for any task that calls an external API.
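
To get a feel for the resulting schedule, the delays can be modeled in a few lines. This is a rough approximation of the backoff calculation, not Celery's own code; `backoff_delay` and its parameters are hypothetical names, and the defaults assume a one-second starting delay and the 600-second cap used above:

```python
import random


def backoff_delay(retries, factor=1, maximum=600, jitter=False):
    """Approximate delay in seconds before retry number `retries` (0-based)."""
    delay = min(maximum, factor * (2 ** retries))
    if jitter:
        # Full jitter: pick any whole second from 0 up to the computed delay.
        delay = random.randrange(delay + 1)
    return delay


# Delays before each of the five retries, without jitter.
schedule = [backoff_delay(n) for n in range(5)]
print(schedule)  # [1, 2, 4, 8, 16]
```

With jitter enabled, each value becomes a random number between zero and the figure shown, which is what spreads retries from many workers apart.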

Manual Retries When You Need More Control

Sometimes you need to inspect the exception before deciding whether to retry. For example, you want to retry on a 503 but not on a 404. Use self.retry() inside a try/except block for that:

from celery import shared_task
from celery.utils.log import get_task_logger
import requests

logger = get_task_logger(__name__)

@shared_task(bind=True, max_retries=4)
def fetch_data(self, url):
    try:
        response = requests.get(url, timeout=10)
        if response.status_code == 503:
            raise self.retry(countdown=30 * (2 ** self.request.retries))
        if response.status_code == 404:
            logger.warning("Resource not found at %s; not retrying", url)
            return None
        response.raise_for_status()
        return response.json()
    except requests.exceptions.ConnectionError as exc:
        raise self.retry(exc=exc, countdown=10)

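Notice that self.retry() raises a Retry exception internally, so no code after it runs. Writing raise self.retry(...) does not change that behavior; it makes the exit explicit to readers and linters. Only when you pass throw=False does self.retry() return instead of raising, in which case execution continues past that line, which is almost never what you want.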
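
The decision logic and the countdown formula from the task above can be pulled out and checked on their own. This is a sketch in plain Python; `decide`, `retry_countdown`, and the two status sets are hypothetical helpers, not part of Celery or requests:

```python
RETRYABLE_STATUSES = {503}  # temporary outage: worth another attempt
PERMANENT_STATUSES = {404}  # the resource is missing: retrying cannot help


def decide(status_code):
    """Return "retry", "skip", or "proceed" for an HTTP status code."""
    if status_code in RETRYABLE_STATUSES:
        return "retry"
    if status_code in PERMANENT_STATUSES:
        return "skip"
    return "proceed"


def retry_countdown(retries, base=30):
    """Seconds to wait before the next attempt, doubling each time."""
    return base * (2 ** retries)


# With max_retries=4, retries are scheduled while the counter is 0, 1, 2, 3.
print([retry_countdown(n) for n in range(4)])  # [30, 60, 120, 240]
```

Keeping these pieces as small pure functions makes the retry policy easy to unit test without a broker or a worker.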

Configuring the Result Backend So You Can See Failures

If CELERY_RESULT_BACKEND is not set, task state is never stored, and you cannot query whether a task succeeded or failed. Add it to your Django settings:

# settings.py
CELERY_BROKER_URL = "redis://localhost:6379/0"
CELERY_RESULT_BACKEND = "redis://localhost:6379/1"

# Keep task results for 24 hours
CELERY_RESULT_EXPIRES = 86400

# Serialize task messages and results as JSON
CELERY_TASK_SERIALIZER = "json"
CELERY_RESULT_SERIALIZER = "json"
CELERY_ACCEPT_CONTENT = ["json"]

Using a separate Redis database index (/1 vs /0) for the result backend is a small but useful habit. It makes it easier to flush results independently of your task queue without accidentally clearing jobs that haven't run yet.

Using Task Signals for Error Visibility

Celery provides signals you can hook into to log failures centrally rather than duplicating error-handling logic in every task. The task_failure signal fires when a task ends in failure, which for a retrying task means after it has exhausted all retries.

# signals.py
from celery.signals import task_failure
import logging

logger = logging.getLogger("celery.failures")

@task_failure.connect
def handle_task_failure(sender=None, task_id=None, exception=None, traceback=None, einfo=None, **kwargs):
    logger.error(
        "Task %s (id=%s) failed permanently: %s",
        sender.name,
        task_id,
        str(exception),
        exc_info=True,
    )

Connect this signal in your Django app's AppConfig.ready() method so it's registered when the worker starts. This gives you one place to pipe failures to Sentry, PagerDuty, or any alerting tool without touching individual tasks.

Dead-Letter Queues for Tasks That Exhaust Retries

When a task runs out of retries, you don't want that work to disappear. A dead-letter queue (DLQ) is a separate queue where permanently failed tasks land so you can inspect them or replay them later.

With RabbitMQ, you can configure a DLQ at the broker level. With Redis, the common approach is to catch the final failure inside the task and push it to a separate queue or log it to a database:

from celery import shared_task
from myapp.models import FailedTaskLog
import requests

@shared_task(
    bind=True,
    autoretry_for=(requests.exceptions.RequestException,),
    max_retries=3,  # set on the task itself so self.max_retries below matches
    retry_backoff=True,
)
def fetch_data(self, url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as exc:
        if self.request.retries == self.max_retries:
            FailedTaskLog.objects.create(
                task_name=self.name,
                task_id=self.request.id,
                args=str(url),
                error=str(exc),
            )
        raise

Storing failed tasks in a Django model makes them queryable, replayable via the admin, and auditable. It's a pattern worth building early in any project that relies heavily on background processing.

Common Pitfalls and Gotchas

Forgetting bind=True

If you declare self as the first parameter but forget bind=True on the decorator, Celery won't pass the task instance, and the call fails with a TypeError about a missing positional argument. That error itself might go unnoticed depending on your setup. Always pair bind=True with any task that references self.

Retrying on exceptions you shouldn't

Using autoretry_for=(Exception,) is a tempting shortcut, but it will retry on programming errors like AttributeError or KeyError. These will never succeed on retry; you're just delaying the inevitable and wasting worker cycles. Be explicit about which exception types warrant a retry.
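
One way to stay explicit is to keep the retryable exception types in a single tuple and reuse it everywhere. The sketch below uses only built-in exceptions; `TRANSIENT_ERRORS` and `is_transient` are hypothetical names, and in a real project the tuple would likely also include your HTTP client's exception classes:

```python
# Exception types that plausibly succeed on a later attempt.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)


def is_transient(exc):
    """Return True if exc is an instance of a retry-worthy exception type."""
    return isinstance(exc, TRANSIENT_ERRORS)


# ConnectionResetError subclasses ConnectionError, so it counts as transient.
print(is_transient(ConnectionResetError()))  # True
# KeyError is a programming error: retrying would only repeat the failure.
print(is_transient(KeyError("missing")))  # False
```

A tuple like this can be passed directly as the value of autoretry_for, which keeps the retry policy in one place.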

Long countdown values blocking workers

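When you use countdown=3600 (one hour), a worker fetches the task right away and holds it in a reserved slot in memory for that hour rather than leaving it in the queue. With a Redis broker, a delay longer than the visibility timeout (one hour by default) can also cause the task to be redelivered and run more than once. Use eta or countdown with care, and consider Celery Beat or a dedicated retry queue for very long delays.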

Missing CELERY_TASK_ALWAYS_EAGER in tests

In test environments, CELERY_TASK_ALWAYS_EAGER = True runs tasks synchronously. But eager mode does not simulate real retry behavior: nothing is re-queued and no countdown is waited out, and depending on the Celery version self.retry() may surface Retry as a real exception. Write dedicated tests for retry logic using task.apply() and mock the failing dependency explicitly.
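
The mocking half of that advice can be sketched with the standard library alone. In the example below, `call_with_retries` is a hypothetical stand-in for a retrying task, not Celery's retry machinery; the point is how `unittest.mock` scripts a dependency that fails a set number of times before succeeding:

```python
from unittest import mock


def call_with_retries(func, max_retries):
    """Hypothetical stand-in for a retrying task: retry on ConnectionError."""
    attempts = 0
    while True:
        try:
            return func()
        except ConnectionError:
            if attempts >= max_retries:
                raise  # retries exhausted: let the failure surface
            attempts += 1


# The mocked dependency fails twice, then returns a payload.
flaky = mock.Mock(side_effect=[ConnectionError(), ConnectionError(), {"ok": True}])
result = call_with_retries(flaky, max_retries=3)

# A dependency that never recovers should be called max_retries + 1 times.
always_down = mock.Mock(side_effect=ConnectionError)
try:
    call_with_retries(always_down, max_retries=2)
    gave_up = False
except ConnectionError:
    gave_up = True
```

The same `side_effect` technique applies when patching the HTTP call inside a real task and asserting on the mock's `call_count`.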

Worker concurrency and task timeouts

A task that hangs rather than raises an exception will never trigger retry logic. Set CELERY_TASK_SOFT_TIME_LIMIT and CELERY_TASK_TIME_LIMIT in your settings so hung tasks are interrupted. The soft limit raises SoftTimeLimitExceeded inside the task, where it can be handled or retried like any other exception; the hard limit terminates the process running the task:

# settings.py
CELERY_TASK_SOFT_TIME_LIMIT = 300   # raises SoftTimeLimitExceeded after 5 min
CELERY_TASK_TIME_LIMIT = 360        # hard kill after 6 min

Catch SoftTimeLimitExceeded in your task if you need to do cleanup before the hard kill lands.

Next Steps

You now have the building blocks for a reliable Celery task queue. Here's what to do next:

  • Audit existing tasks for bare except blocks and missing autoretry_for declarations. Start with tasks that touch external APIs or file I/O.
  • Set up Flower (the Celery monitoring tool) or integrate with your existing APM so task failures surface in real time rather than in user complaints.
  • Add a FailedTaskLog model (or equivalent) and a Django admin view so non-engineers can see and replay failed jobs without needing terminal access.
  • Write integration tests that mock the failing dependency and assert that your task retries the correct number of times before giving up.
  • Review your time limits across all task types and set sensible soft and hard limits to prevent hung workers from blocking the queue.
