Fixing ECS Task Failures That Only Appear Under Production Load

May 16, 2026

Your staging environment looks clean. Tests pass, the container starts, and your team ships with confidence. Then production traffic arrives and tasks start dying: OOM kills, health check failures, or mysterious exits with code 137 that nobody can reproduce locally. If this sounds familiar, you are in the right place.

ECS hides a class of failures that only surface when real concurrency, real data volumes, and real dependency latency combine. Tracking them down requires a different mental model than ordinary debugging.

What you'll learn

  • Why ECS task failures appear in production but not staging, and what categories they fall into
  • How to read CloudWatch Logs and ECS event streams to pinpoint the root cause
  • How to tune CPU, memory, and health check settings to match real traffic patterns
  • How to catch environment drift before it becomes an outage
  • Concrete next steps to harden your ECS setup against load-induced failures

Prerequisites

This guide assumes you have at least one ECS service running on Fargate or EC2 launch type, access to CloudWatch Logs for those tasks, and basic familiarity with task definitions. You don't need to be an AWS expert, but you should be able to navigate the ECS console or run aws ecs CLI commands.

Understand the Four Failure Categories

Before you look at a single log line, it helps to know what you're hunting. Load-induced ECS failures almost always fall into one of four buckets.

  • Resource exhaustion β€” the task runs out of CPU or memory under concurrent requests
  • Health check racing β€” the load balancer marks a task unhealthy before it finishes warming up
  • Environment drift β€” a config value, secret, or network policy differs between staging and production
  • Dependency saturation β€” a downstream service (database, cache, external API) can't keep up and your task times out or panics

Identifying the category first saves real debugging time, because each one points you to a different place in the AWS console.

Read ECS Events and Stopped Task Reasons First

The fastest first step is to pull the stopped task reason directly from ECS. In the console, navigate to your cluster, click the service, and look at the Tasks tab filtered by Stopped. Click any stopped task and read the Stopped reason field at the top.

From the CLI, you can fetch it with:

aws ecs describe-tasks \
  --cluster your-cluster-name \
  --tasks <task-arn> \
  --query 'tasks[*].{StoppedReason:stoppedReason,ExitCode:containers[0].exitCode}'

Exit code 137 means the container was sent SIGKILL, which in ECS is almost always an OOM kill. Exit code 1 points to an application error, while 143 means the container received SIGTERM and shut down. A stopped reason of Task failed ELB health checks tells you the load balancer is the judge, not the app itself.
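
If you don't have a task ARN handy, you can list recently stopped tasks for the service first and feed those ARNs into describe-tasks. A minimal sketch, with placeholder cluster and service names (ECS only keeps stopped tasks visible for a short while, so run it soon after a failure):

aws ecs list-tasks \
  --cluster your-cluster-name \
  --service-name your-service \
  --desired-status STOPPED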

Diagnose Memory and CPU Exhaustion

ECS enforces hard memory limits. When your container exceeds the value you set in the task definition, the kernel kills it immediately with no stack trace in your application logs, which is why developers often think the crash is random.

Check CloudWatch Container Insights

If you have Container Insights enabled on the cluster, open CloudWatch, go to Container Insights > ECS Services, and look at the MemoryUtilization metric over your last peak traffic window. If the line approaches 100% right before tasks die, you have your answer.
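
If you prefer the CLI to the console, you can pull the same Container Insights data from CloudWatch once it's enabled. A sketch with placeholder names and an illustrative two-hour window around your traffic peak:

aws cloudwatch get-metric-statistics \
  --namespace ECS/ContainerInsights \
  --metric-name MemoryUtilized \
  --dimensions Name=ClusterName,Value=your-cluster-name Name=ServiceName,Value=your-service \
  --statistics Average Maximum \
  --period 60 \
  --start-time 2026-05-15T18:00:00Z \
  --end-time 2026-05-15T20:00:00Z

MemoryUtilized is reported in MiB rather than as a percentage, so compare it against the hard limit in your task definition.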

If Container Insights is off, enable it now: it's a single checkbox in the ECS cluster settings, and the cost is typically trivial compared to the debugging time it saves.
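
From the CLI, turning it on is one command (cluster name is a placeholder):

aws ecs update-cluster-settings \
  --cluster your-cluster-name \
  --settings name=containerInsights,value=enabled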

Right-size your task definition

A common mistake is copying memory limits from a previous service without profiling the new one. Use this mental checklist:

  1. Run a load test locally with docker stats watching peak RSS memory.
  2. Add at least 25–30% headroom on top of that peak for GC spikes, thread stacks, and request bursts.
  3. Set the task memory (hard limit) to that padded number and optionally set memoryReservation (soft limit) lower to allow bin-packing on EC2 launch type.

For CPU, ECS uses CPU units, where 1024 units equal one full vCPU. If your service is CPU-bound during request processing, throttling shows up as increased latency rather than crashes, but it can trigger health check timeouts indirectly.
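
To make the sizing concrete, here is a trimmed task definition fragment with illustrative numbers only, assuming a service whose peak RSS under load was roughly 1.5 GB: the task gets a 2 GB hard limit, the container a lower soft reservation, and one full vCPU.

{
  "family": "your-service",
  "cpu": "1024",
  "memory": "2048",
  "containerDefinitions": [
    {
      "name": "app",
      "image": "<your-ecr-repo>:latest",
      "memory": 2048,
      "memoryReservation": 1536
    }
  ]
}

On Fargate, the task-level cpu and memory values are what count; the container-level memoryReservation mainly matters for bin-packing on the EC2 launch type, as noted above.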

Fix Health Check Timing Problems

The load balancer doesn't care that your app needs 20 seconds to load a TensorFlow model or warm a database connection pool. It starts checking immediately after the task enters the RUNNING state, and if the first few probes fail, it deregisters the target before traffic even touches it.

Tune the three health check parameters

In your ALB target group, three values control this behavior:

  • Healthy threshold: consecutive successes before a target is marked healthy (sensible default: 2)
  • Unhealthy threshold: consecutive failures before a target is marked unhealthy (sensible default: 3)
  • Interval: seconds between probes (sensible default: 30)

With an interval of 30 seconds and an unhealthy threshold of 3, the load balancer gives your task roughly 90 seconds to start responding before killing it. If your app needs more warm-up time, either increase the interval or raise the unhealthy threshold. Don't raise the threshold blindly, though: you still want real failures to be caught quickly.
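
If the target group is managed by hand rather than through infrastructure-as-code, the same knobs can be adjusted from the CLI; the ARN and values below are placeholders:

aws elbv2 modify-target-group \
  --target-group-arn <target-group-arn> \
  --health-check-interval-seconds 30 \
  --unhealthy-threshold-count 5 \
  --healthy-threshold-count 2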

Add a dedicated health endpoint

A root path / is a poor health check target if it runs database queries. Create a lightweight /health endpoint that returns 200 OK as soon as the HTTP server is listening, and move expensive readiness checks to a separate /ready path. Point the ALB probe at /health.

# FastAPI example
from fastapi import Depends, FastAPI

app = FastAPI()

@app.get("/health")
def health_check():
    # No DB calls here, just confirm the process is alive
    return {"status": "ok"}

@app.get("/ready")
def readiness_check(db=Depends(get_db)):
    # get_db is whatever DB session dependency your app already defines
    # Verify DB connectivity before accepting traffic
    db.execute("SELECT 1")
    return {"status": "ready"}

The ALB uses /health. A separate orchestration layer or startup check can use /ready before shifting traffic.
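
If you adjust this by hand, pointing the target group probe at the new endpoint is one CLI call (the ARN is a placeholder):

aws elbv2 modify-target-group \
  --target-group-arn <target-group-arn> \
  --health-check-path /health \
  --matcher HttpCode=200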

Hunt Down Environment Drift

Environment drift is the sneakiest failure category because the symptom is often a cryptic application error rather than an OOM kill or health check failure. Common sources include:

  • A secret or environment variable exists in staging but is missing or wrong in the production task definition
  • A security group rule allows outbound access in staging but blocks it in production
  • The production VPC uses private subnets that require a NAT gateway to reach external APIs, and the NAT gateway is missing or misconfigured
  • A different version of a shared library or base Docker image is used in the production ECR tag
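
The first of these is also the easiest to check mechanically: dump the environment and secrets blocks from the staging and production task definitions and diff them. A sketch assuming both revisions live in the same account, with placeholder family names:

aws ecs describe-task-definition --task-definition your-service-staging \
  --query 'taskDefinition.containerDefinitions[0].{env:environment,secrets:secrets}' > staging.json

aws ecs describe-task-definition --task-definition your-service-prod \
  --query 'taskDefinition.containerDefinitions[0].{env:environment,secrets:secrets}' > prod.json

diff staging.json prod.json

Secret values themselves aren't exposed (only their valueFrom sources), but a missing or renamed key shows up immediately.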

Validate environment variables at startup

Rather than discovering a missing variable at request time, validate your entire config at process start. A simple pattern in Python:

import os

REQUIRED_ENV_VARS = [
    "DATABASE_URL",
    "SECRET_KEY",
    "REDIS_URL",
    "ALLOWED_HOSTS",
]

def validate_config():
    missing = [v for v in REQUIRED_ENV_VARS if not os.environ.get(v)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {missing}")

validate_config()  # Called at module import time

When this fails, the task logs a clear error message and exits with code 1, making the cause immediately obvious in CloudWatch Logs instead of surfacing as a null pointer exception three layers deep.

Audit security groups and VPC routing

A quick way to rule out network issues is to exec into a running production task and test connectivity directly:

# Requires ECS Exec to be enabled on the service
aws ecs execute-command \
  --cluster your-cluster-name \
  --task <task-arn> \
  --container your-container-name \
  --interactive \
  --command "/bin/sh"

# Inside the container
# The database won't speak HTTPS; what matters is whether curl reports "Connected to"
curl -v https://your-database-endpoint:5432
curl -v https://external-api.example.com/ping

If curl hangs or returns a connection refused where it works fine in staging, you've found your network policy mismatch.
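
If ECS Exec isn't enabled and you can't redeploy right away, you can at least compare the egress rules on the security groups attached to the task in each environment; the group ID below is a placeholder:

aws ec2 describe-security-groups \
  --group-ids sg-0123456789abcdef0 \
  --query 'SecurityGroups[0].IpPermissionsEgress'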

Handle Dependency Saturation

Under production load, your tasks may be perfectly healthy themselves while they wait on a saturated downstream service. This usually surfaces as elevated response times that eventually exceed the ALB idle timeout (default 60 seconds), causing 504 errors and task churn.

Common culprits are RDS connection limits, ElastiCache max-connections, and third-party APIs with rate limiting. The fix is usually one of:

  • Add RDS Proxy in front of your database to multiplex connections from many tasks into a managed pool
  • Implement circuit breakers in your application so individual tasks fail fast rather than holding threads waiting for a slow dependency
  • Set explicit connect and read timeouts in every HTTP client and database driver; never rely on the default, which is often infinite

For example, with an httpx client:

import httpx

# Always set timeouts. Never leave them at the default.
TIMEOUT = httpx.Timeout(connect=2.0, read=10.0, write=5.0, pool=5.0)

async def fetch_data():
    async with httpx.AsyncClient(timeout=TIMEOUT) as client:
        response = await client.get("https://internal-service/api/data")
        return response.json()

With explicit timeouts, a slow dependency causes a predictable error you can handle, instead of exhausting your thread pool silently.
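
To make the circuit-breaker bullet above concrete, here is a deliberately minimal sketch of the pattern. It is illustrative only; in a real service you would reach for a maintained library such as pybreaker, or the resilience features of your HTTP client, rather than rolling your own:

import time

class CircuitBreaker:
    # Open after max_failures consecutive errors, fail fast for cooldown
    # seconds, then let a single probe call through (half-open).
    def __init__(self, max_failures=5, cooldown=30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one probe call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            return result

A task that wraps its downstream calls this way returns an error in milliseconds when the dependency is down, instead of tying up a worker for the full timeout on every request.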

Common Pitfalls When Debugging ECS Failures

Looking only at application logs. OOM kills and health check failures don't write to your app's stdout. Always check ECS events and the stopped task reason first.

Increasing resources without profiling. Doubling memory without measuring first means you might be masking a memory leak rather than fixing it. Profile under load, then size appropriately.

Forgetting the deregistration delay. When ECS stops a task, the ALB puts the target into a draining state for the deregistration delay (default 300 seconds) so in-flight requests can finish. If your app exits the moment it receives SIGTERM, those in-flight connections get dropped. Set the deregistration_delay.timeout_seconds attribute on the target group to match your app's actual drain time, and make sure the app keeps serving until existing requests complete.
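
If your service drains in well under a minute, a lower value also speeds up deployments; the ARN and value below are placeholders:

aws elbv2 modify-target-group-attributes \
  --target-group-arn <target-group-arn> \
  --attributes Key=deregistration_delay.timeout_seconds,Value=60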

Testing health checks against the wrong port. Fargate tasks with sidecar containers sometimes expose health check ports on non-standard values. Confirm the ALB target group port matches the container port mapping exactly.

Next Steps

You now have a systematic way to approach load-induced ECS failures instead of guessing. Here's what to do immediately:

  1. Enable Container Insights on any ECS cluster that doesn't have it; without per-task memory and CPU metrics you're flying blind.
  2. Add a /health endpoint to every service if you don't have one, and update ALB target groups to use it.
  3. Review task definition memory limits against your actual peak usage; add at least 25% headroom.
  4. Enable ECS Exec on production services so you can inspect a running task the next time something looks wrong.
  5. Set explicit timeouts on every database connection and HTTP client in your codebase, no exceptions.

Production load will keep finding the corners your staging environment never reached. The goal isn't to make failures impossible; it's to make them fast to diagnose and straightforward to fix.
