Diagnosing Flapping ALB Health Checks That Kill Healthy ECS Tasks
Your ECS service is running, your application responds fine when you curl it directly, but the ALB keeps cycling tasks out. New tasks start, the load balancer marks them unhealthy, they get killed, and the cycle repeats. Meanwhile, real traffic is getting dropped or served 502s.
Flapping ALB health checks are infuriating precisely because the symptom looks like a bad deployment or a crashing container when the actual problem is often a tuning mismatch, a slow startup path, or a subtle race condition in the task lifecycle.
What you'll learn
- How ALB health checks interact with the ECS task registration and deregistration flow
- The most common root causes of flapping, from startup timing to misconfigured thresholds
- Where to look first: logs, metrics, and target group events
- How to tune health check parameters without introducing new failure modes
- Deregistration delay gotchas that cause connection drops under load
Prerequisites
This guide assumes you're running ECS on Fargate or EC2 launch type with an Application Load Balancer. You should be comfortable reading the AWS Console and have CloudWatch access. The examples use the AWS CLI; version 2 is assumed throughout.
How ALB health checks interact with ECS task lifecycle
When ECS starts a new task and registers it with a target group, the ALB immediately begins sending health check requests at the configured interval. The task must return the expected HTTP status code (usually 200) within the timeout window, consecutively, before the ALB marks it healthy and starts routing real traffic to it.
The problem is that ECS considers a task "running" as soon as the container process has started, not when your application is actually ready to serve requests. That gap between process start and application readiness is where most flapping originates.
If the health check fires before your app is up, the ALB counts that as a failure. Accumulate enough consecutive failures and the target goes unhealthy. ECS then replaces it, the new task has the same cold-start delay, and the cycle repeats.
The most common root causes of flapping
Startup time exceeds health check grace period
ECS services have a healthCheckGracePeriodSeconds setting that tells the service scheduler to ignore unhealthy ALB health check results for that many seconds after a task first starts. If your app takes 30 seconds to initialize but the grace period is 10 seconds, failed checks start counting against the task while it is still booting, and the scheduler can replace it before it ever had a chance.
Check your current setting with:
    aws ecs describe-services \
      --cluster your-cluster-name \
      --services your-service-name \
      --query 'services[0].healthCheckGracePeriodSeconds'
If the value is zero or much lower than your actual startup time, this is your culprit.
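To put a number on "much lower than your actual startup time", here is a minimal Python sketch. It assumes you have collected a handful of startup durations yourself, and it applies a 1.5x-of-p95 rule of thumb. The function names and sample values are hypothetical.

```python
import math
import statistics


def recommended_grace_period(startup_seconds, multiplier=1.5):
    """Return a grace period in whole seconds: multiplier x p95 startup time."""
    if len(startup_seconds) < 2:
        raise ValueError("need at least two startup samples")
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(startup_seconds, n=20, method="inclusive")[-1]
    return math.ceil(p95 * multiplier)


def grace_period_is_too_short(current_grace_seconds, startup_seconds):
    """Compare the service's current setting with the recommendation."""
    # describe-services may report null/None when no grace period is set.
    current = current_grace_seconds or 0
    return current < recommended_grace_period(startup_seconds)


# Hypothetical startup measurements in seconds, including one slow outlier.
samples = [20, 22, 21, 19, 35, 20, 23]
print(recommended_grace_period(samples))
print(grace_period_is_too_short(10, samples))
```

The percentile method and the multiplier are judgment calls; with very few samples, using the slowest observed startup instead of p95 may be the safer input.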
Health check path returns non-200 during boot
Some frameworks initialize routes lazily. If your health check hits /health while the router hasn't registered that route yet, you'll get a 404 or 503. The ALB doesn't care why you returned a non-200; it just counts the failure.
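A quick way to verify this is to open an ECS Exec session into a task (or SSH to the container host on the EC2 launch type) and watch what the health endpoint returns shortly after the process starts. This assumes ECS Exec is enabled for the task and that watch and curl are present in the image: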
    aws ecs execute-command \
      --cluster your-cluster-name \
      --task your-task-id \
      --container your-container-name \
      --interactive \
      --command "watch -n1 curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/health"
If you see it cycle through non-200 codes before settling, your app isn't ready at registration time.
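If watch or curl is not available in the image, a small standard-library poller can record the same information. This is a sketch, not a drop-in tool: the function names are hypothetical, and the URL in the usage comment reuses the localhost:8080/health address assumed above.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, DNS failure


def seconds_until_ready(fetch_status, interval=1.0, max_attempts=120,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll until fetch_status() returns 200.

    Returns (elapsed_seconds, history); elapsed_seconds is None if the
    endpoint never returned 200 within max_attempts.
    """
    start = clock()
    history = []
    for _ in range(max_attempts):
        code = fetch_status()
        history.append(code)
        if code == 200:
            return clock() - start, history
        sleep(interval)
    return None, history


# Usage (run inside the task, as early after process start as you can):
# elapsed, history = seconds_until_ready(
#     lambda: http_status("http://localhost:8080/health"))
# print(elapsed, history)
```

The history list shows which codes the endpoint returned before settling, for example connection failures followed by 404s. That sequence tells you whether the server was down or the route was simply not registered yet.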
Container port mapping mismatch
The ALB health check sends traffic to a specific port on the task's ENI. If your task definition maps container port 8080 but the target group is configured to check port 80, every health check will time out. This is a common copy-paste error when cloning target groups across environments.
Resource exhaustion causing slow responses
A task that's CPU-throttled or hitting its memory limit will respond slowly. If the health check timeout is set too low (the default is 5 seconds), intermittent slowness under load will trigger failures even though the app is technically alive. This is the hardest variant to diagnose because it's load-dependent and the task looks fine at low traffic.
Reading the signals: where to look first
Before tuning anything, gather evidence. Guessing at configuration changes without understanding the failure mode will likely make things worse.
Start with the target group's Target health tab in the EC2 console. Each target shows its current health status and, critically, the reason code when it's unhealthy. Common reason codes:
- Target.ResponseCodeMismatch: the app returned an unexpected HTTP status
- Target.Timeout: the health check request timed out
- Target.FailedHealthChecks: consecutive check failures exceeded the threshold
- Elb.InternalError: the load balancer itself had a problem reaching the target
The reason code tells you whether the problem is the app responding wrong, not responding at all, or something at the network layer.
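When a target group has many targets, tallying the reason codes is quicker than reading the console row by row. The sketch below assumes you have the JSON from aws elbv2 describe-target-health (or the equivalent boto3 describe_target_health response) loaded as a Python dict. The hint strings paraphrase the list above and are not official AWS wording.

```python
from collections import Counter

# Rough first place to look for each reason code (hypothetical hints).
FIRST_PLACE_TO_LOOK = {
    "Target.ResponseCodeMismatch": "application: health path, auth redirects, lazy routes",
    "Target.Timeout": "network or slow app: security groups, port, CPU/memory pressure",
    "Target.FailedHealthChecks": "application or network: check logs around the failures",
    "Elb.InternalError": "load balancer side rather than the task",
}


def summarize_unhealthy(response):
    """Count unhealthy targets per reason code in a describe-target-health response."""
    reasons = Counter()
    for desc in response.get("TargetHealthDescriptions", []):
        health = desc.get("TargetHealth", {})
        if health.get("State") == "unhealthy":
            reasons[health.get("Reason", "unknown")] += 1
    return reasons


def print_summary(response):
    """Print each reason code with its count and a hint, most frequent first."""
    for reason, count in summarize_unhealthy(response).most_common():
        hint = FIRST_PLACE_TO_LOOK.get(reason, "no hint for this code")
        print(f"{count:3d}  {reason}  ->  {hint}")
```

A single dominant reason code usually points at one root cause; a mix of timeouts and response-code mismatches suggests more than one problem is in play.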
Checking target group and task-level metrics
CloudWatch gives you several metrics worth watching together. In the EC2 console under your load balancer, look at:
- UnHealthyHostCount: how many targets are currently failing checks
- HealthyHostCount: should never hit zero in a multi-task service
- TargetResponseTime: p99 response time; if this is climbing toward your timeout value, you're at risk
- HTTPCode_Target_5XX_Count: 5xx responses from tasks, distinct from 5xx generated by the ALB itself
Run this CLI query to pull the last hour of unhealthy host counts at one-minute resolution:
    # Note: date -d is GNU date syntax (Linux); on BSD/macOS use date -u -v-1H instead.
    aws cloudwatch get-metric-statistics \
      --namespace AWS/ApplicationELB \
      --metric-name UnHealthyHostCount \
      --dimensions Name=LoadBalancer,Value=app/your-alb-name/abc123 \
                   Name=TargetGroup,Value=targetgroup/your-tg-name/def456 \
      --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
      --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
      --period 60 \
      --statistics Maximum
Correlate spikes in UnHealthyHostCount with your ECS service events (visible in the console under the service's Events tab). If tasks are being stopped and replaced at the same time unhealthy counts spike, you've confirmed the flapping loop.
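One way to tell flapping from a single sustained outage is to count how many separate times the unhealthy count rose from zero. The sketch below works on the Datapoints list from the get-metric-statistics output; the function name and sample data are hypothetical. It sorts first because CloudWatch does not promise chronological order.

```python
def count_unhealthy_episodes(datapoints, statistic="Maximum"):
    """Count transitions from zero unhealthy hosts to one or more."""
    # ISO-8601 strings in one format (or datetime objects) sort chronologically.
    ordered = sorted(datapoints, key=lambda dp: dp["Timestamp"])
    episodes = 0
    previously_unhealthy = False
    for dp in ordered:
        unhealthy = dp[statistic] > 0
        if unhealthy and not previously_unhealthy:
            episodes += 1
        previously_unhealthy = unhealthy
    return episodes


# Hypothetical one-minute datapoints, deliberately out of order.
sample = [
    {"Timestamp": "2024-01-01T00:03:00Z", "Maximum": 0.0},
    {"Timestamp": "2024-01-01T00:01:00Z", "Maximum": 1.0},
    {"Timestamp": "2024-01-01T00:00:00Z", "Maximum": 0.0},
    {"Timestamp": "2024-01-01T00:04:00Z", "Maximum": 2.0},
    {"Timestamp": "2024-01-01T00:02:00Z", "Maximum": 1.0},
]
print(count_unhealthy_episodes(sample))
```

Several distinct episodes within an hour fit the start, fail, replace loop described above, while one long episode looks more like a genuine outage.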
While you're investigating container startup issues, it's also worth reviewing whether secrets retrieval is adding to your boot time. ECS task failures caused by Secrets Manager timeouts during startup can compound a flapping problem significantly.
Tuning health check thresholds the right way
Once you know the failure mode, you can adjust the right parameter. Don't just blindly increase every threshold; each change has a tradeoff.
Health check interval and timeout
The interval controls how often checks are sent (minimum 5 seconds, default 30). The timeout must be less than the interval. For most web services a 10-second timeout with a 30-second interval is a reasonable starting point. Reducing the interval makes the ALB detect failures faster, but it also means a brief slowdown reaches the unhealthy threshold sooner.
    aws elbv2 modify-target-group \
      --target-group-arn arn:aws:elasticloadbalancing:... \
      --health-check-interval-seconds 30 \
      --health-check-timeout-seconds 10 \
      --healthy-threshold-count 2 \
      --unhealthy-threshold-count 3
Healthy and unhealthy thresholds
The healthy threshold is the number of consecutive successful checks required to mark a target healthy. The unhealthy threshold is the number of consecutive failures required to mark it unhealthy. Defaults are typically 5 and 2 respectively.
Raising the unhealthy threshold to 3 or 4 gives your app more room to absorb a single slow response without being killed. Lowering the healthy threshold to 2 means new tasks get into rotation faster after startup, which is useful if your startup time is predictable and short.
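The tradeoff is easier to see as arithmetic. The sketch below uses a deliberately simplified model in which a state change takes roughly interval x threshold seconds; it ignores the timeout and where in the check cycle a failure begins. The class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheckConfig:
    interval: int             # seconds between checks
    timeout: int              # seconds before a check counts as failed
    healthy_threshold: int    # consecutive successes needed
    unhealthy_threshold: int  # consecutive failures needed

    def __post_init__(self):
        if self.timeout >= self.interval:
            raise ValueError("timeout must be less than interval")

    def approx_seconds_to_unhealthy(self):
        """Rough time for a failing target to be marked unhealthy."""
        return self.interval * self.unhealthy_threshold

    def approx_seconds_to_healthy(self):
        """Rough time for a ready target to be marked healthy."""
        return self.interval * self.healthy_threshold


defaults = HealthCheckConfig(interval=30, timeout=5,
                             healthy_threshold=5, unhealthy_threshold=2)
tuned = HealthCheckConfig(interval=30, timeout=10,
                          healthy_threshold=2, unhealthy_threshold=3)

for name, cfg in (("defaults", defaults), ("tuned", tuned)):
    print(name, cfg.approx_seconds_to_unhealthy(), cfg.approx_seconds_to_healthy())
```

Under this model the defaults take about 60 seconds to mark a target unhealthy and about 150 to mark it healthy, while the tuned values take about 90 and 60. The tuned values tolerate one more failed check at the cost of slower failure detection.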
Health check grace period
Set healthCheckGracePeriodSeconds to at least 1.5x your measured p95 startup time. If your app typically starts in 20 seconds but occasionally takes 35, that works out to about 53 seconds (1.5 x 35 = 52.5); rounding up to 60 leaves a little extra headroom. Update it on the ECS service:
    aws ecs update-service \
      --cluster your-cluster-name \
      --service your-service-name \
      --health-check-grace-period-seconds 60
Build a dedicated health endpoint
Your health check path should be lightweight and ready as early as possible in your application's startup sequence. A good health endpoint does three things: it confirms the HTTP server is accepting connections, it confirms any critical dependencies (like a database connection pool) are initialized, and it returns in under 100ms under any load. Don't use your homepage or an API endpoint that does real work.
    # FastAPI example: register this route before any slow initialization
    from fastapi import FastAPI
    app = FastAPI()
    @app.get("/health")
    async def health_check():
        return {"status": "ok"}
Register the health route before your startup hooks run. That way the ALB can confirm the process is alive even if the rest of the application is still warming up.
Deregistration delay and connection draining pitfalls
When ECS decides to stop a task, whether from a deployment, a scale-in, or a health check failure, it first tells the ALB to deregister the target. The ALB stops sending new requests to that target right away, then waits up to the deregistration delay (default 300 seconds) for in-flight requests to complete before it finishes deregistration.
Three hundred seconds is an eternity during a rolling deployment or when a flapping task needs to be replaced quickly. In practice, most applications can drain connections in under 30 seconds. Lower this to a value that matches your longest expected request duration:
    aws elbv2 modify-target-group-attributes \
      --target-group-arn arn:aws:elasticloadbalancing:... \
      --attributes Key=deregistration_delay.timeout_seconds,Value=30
The flip side: if you set the deregistration delay lower than your longest-running requests, those requests will be cut off mid-flight when a task is replaced. For services that handle long-running operations or large file uploads, keep the delay high enough to cover your p99 request duration.
Also check whether your container's stopTimeout in the task definition is high enough. ECS sends SIGTERM to the container and then waits stopTimeout seconds before sending SIGKILL. If stopTimeout is shorter than the deregistration delay, the process dies before draining completes.
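The two rules in this section can be written as a small sanity check to run against your own numbers before a deployment. This is only a sketch of those rules as stated here. The function name is hypothetical, and all three inputs are values you would look up or measure yourself.

```python
def drain_warnings(deregistration_delay, stop_timeout, p99_request_seconds):
    """Return warnings for drain settings that may cut requests short.

    All arguments are in seconds.
    """
    warnings = []
    if deregistration_delay < p99_request_seconds:
        warnings.append(
            "deregistration delay is shorter than the p99 request duration; "
            "long requests may be cut off mid-flight"
        )
    if stop_timeout < deregistration_delay:
        warnings.append(
            "stopTimeout is shorter than the deregistration delay; "
            "the container may be killed before draining completes"
        )
    return warnings


# Hypothetical values: 30 s drain, 30 s stopTimeout, 12 s p99 requests.
for message in drain_warnings(30, 30, 12):
    print("WARNING:", message)
```

An empty result only means the settings are consistent with each other; it says nothing about whether the application actually handles SIGTERM gracefully.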
Common pitfalls and gotchas
- Health check path requires authentication. If your app redirects unauthenticated requests to a login page, the ALB will receive a 302 and mark the target unhealthy. Always use a path that returns 200 without auth.
- Security group blocks ALB health check traffic. The task's security group must allow inbound traffic from the ALB's security group (or the ALB's IP range for non-VPC setups) on the container port. A missing inbound rule causes Target.Timeout failures.
- Changing health check settings doesn't apply immediately. Existing targets continue using old thresholds until the next check cycle. During a flapping incident, you may need to force new task registrations to see the effect of your changes.
- EC2 launch type host-port conflicts. If you're using the EC2 launch type with static port mappings and multiple tasks per instance, two tasks can fight over the same host port. Use dynamic port mapping ("hostPort": 0 in the task definition) to avoid this.
- Slow health checks masking an actual problem. Raising every threshold to maximum isn't a solution; it just delays detection of real failures. Always understand the root cause before increasing tolerance.
If you're managing automated service deployments and want to avoid health check failures during releases, it's worth connecting this to a reliable deployment versioning strategy. Automating semantic versioning in your release pipeline can help ensure that bad builds don't silently reach production where a flapping health check is the first sign something is wrong.
Wrapping up: next steps to stop the churn
Flapping health checks are almost always a symptom of a mismatch between your application's actual behavior and the load balancer's expectations. The fix is rarely a single setting change. It comes from understanding the full lifecycle from task start to first healthy check, then tuning each parameter with evidence in hand.
Here's what to do after reading this:
- Measure your actual startup time end-to-end (from task registration to first successful health check response) using ECS service events and CloudWatch, then set your grace period to 1.5x that value.
- Check the target group reason code for any unhealthy targets: Target.Timeout vs. Target.ResponseCodeMismatch point to completely different root causes.
- Build or improve your /health endpoint so it registers early in the startup sequence and responds in under 100ms with a plain 200 under all load conditions.
- Lower your deregistration delay to match your actual p99 request duration, then verify your container's stopTimeout is longer than that value.
- Set an alarm on UnHealthyHostCount > 0 in CloudWatch so you catch future flapping before it cascades into a full service outage.
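For the alarm in the last step, here is a sketch of the parameters you might pass to CloudWatch's PutMetricAlarm API. The alarm name, the two-period evaluation window, and the missing-data handling are assumptions to adjust. The dimension values reuse the placeholder names from the CLI query earlier.

```python
def unhealthy_host_alarm_params(load_balancer, target_group,
                                alarm_name="ecs-unhealthy-hosts"):
    """Build keyword arguments for a CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "UnHealthyHostCount",
        "Dimensions": [
            {"Name": "LoadBalancer", "Value": load_balancer},
            {"Name": "TargetGroup", "Value": target_group},
        ],
        "Statistic": "Maximum",
        "Period": 60,                   # one-minute resolution
        "EvaluationPeriods": 2,         # assumption: two bad minutes in a row
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # assumption: no data means no alarm
    }


params = unhealthy_host_alarm_params(
    "app/your-alb-name/abc123",
    "targetgroup/your-tg-name/def456",
)

# To create the alarm (requires boto3 and AWS credentials):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**params)
```

Requiring two consecutive periods keeps a single failed check during a normal deployment from paging anyone; set EvaluationPeriods to 1 if you would rather hear about every blip.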
Frequently Asked Questions
Why does the ALB keep marking my ECS tasks unhealthy even though the app works fine when I test it manually?
The most common reason is a startup timing gap: the ALB sends health checks as soon as the task registers, but your application isn't ready to respond yet. Setting a longer healthCheckGracePeriodSeconds on your ECS service gives the app time to initialize before health check failures can trigger task replacement.
How do I find out why the ALB is failing health checks on my ECS target group?
Open the EC2 console, navigate to your target group, and check the Health status column for each target. AWS shows a reason code like Target.Timeout or Target.ResponseCodeMismatch that tells you exactly what the load balancer observed. You can also query CloudWatch metrics like UnHealthyHostCount and HTTPCode_Target_5XX_Count to correlate failures with timestamps.
What is the right deregistration delay value for ECS services behind an ALB?
The right value is slightly longer than your application's p99 request duration. The default of 300 seconds is too long for most services and slows down deployments. For APIs with fast responses, 30 to 60 seconds is usually sufficient, but long-running upload or streaming endpoints may need a higher value.
Can a security group misconfiguration cause ALB health checks to fail on ECS tasks?
Yes. The task's security group must allow inbound TCP traffic from the ALB's security group on the container port. If that inbound rule is missing, health check requests are silently dropped and the ALB records Target.Timeout failures even though the application inside the container is running correctly.
How many consecutive failures does it take for an ALB to mark an ECS target unhealthy?
That is controlled by the unhealthy threshold count setting on the target group, which defaults to 2 consecutive failures. You can raise it to 3 or 4 to tolerate occasional slow responses without triggering task replacement, but avoid setting it so high that real failures take too long to detect.