Fixing AWS ELB Health Checks That Pass While Your App Serves Errors
Your CloudWatch alarms are firing, users are reporting errors, but the ELB console shows every target as healthy. This is one of the most disorienting situations in AWS operations: your safety net is telling you everything is fine while the house burns down.
The problem is almost never a bug in ELB itself. It's a mismatch between what your health check tests and what your application actually needs to serve real traffic. This article walks you through diagnosing that gap and closing it permanently.
What you'll learn
- Why a health check can return 200 OK while your app is broken
- How to design a health check endpoint that reflects real application state
- How to tune ELB thresholds so bad instances are evicted quickly
- Common misconfigurations that cause false-healthy readings
- How to verify your fix is working before the next incident
Prerequisites
This guide assumes you're using an Application Load Balancer (ALB) with EC2 instances or ECS tasks as targets. Most concepts apply to Network Load Balancers too, but the HTTP-specific sections are ALB-focused. You'll need IAM permissions to edit target groups and access your application's deployment pipeline.
Why Health Checks Lie
An ELB health check is a simple HTTP(S) request to a path you specify. The load balancer marks a target healthy if the response code matches the configured success codes, which default to 200 on an ALB and can be widened to a range such as 200-299. That's it. ELB has no idea what your app is actually doing.
Here's the trap: a web server can respond 200 OK to /health while simultaneously failing to connect to its database, unable to reach a downstream API, or stuck in a state where it processes requests but returns errors to real users. The health check endpoint and the actual application code are different paths, and if your health endpoint doesn't exercise the same dependencies, you get a false positive.
The second common cause is a health check pointed at a path that says little about application health. Some teams configure the health check to hit /, the application root, which redirects to a login page with a 301. If your success codes don't include 3xx, the instance flaps or stays unhealthy. Conversely, if they do include 3xx, a broken app that always redirects would appear healthy forever.
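To make the success-code behaviour concrete, here is a small Python sketch of how a matcher string could be evaluated against a response code. The function name `matches_success_codes` is hypothetical, and the parsing only approximates the matcher format (single codes, comma-separated lists, and ranges); it is an illustration, not AWS's implementation.

```python
def matches_success_codes(status: int, matcher: str) -> bool:
    """Return True if `status` falls inside a matcher such as "200", "200,302", or "200-399"."""
    for part in matcher.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            if int(low) <= status <= int(high):
                return True
        elif part and int(part) == status:
            return True
    return False


# A root path that answers with a 301 redirect:
print(matches_success_codes(301, "200"))      # strict matcher
print(matches_success_codes(301, "200-399"))  # wide matcher
# A /health endpoint that reports a dependency failure with 503:
print(matches_success_codes(503, "200-399"))
```

With the strict matcher the redirect counts as a failure; with the wide one it counts as a success, which is exactly how an always-redirecting app can look healthy.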
Anatomy of a Useful Health Check Endpoint
A good health check endpoint does three things: it responds fast, it checks real dependencies, and it returns a non-2xx code when something is genuinely broken. Aim for a response under 500 ms: slow health checks mask latency problems and can themselves cause timeout failures.
Here's a minimal example in Python (Flask) that checks a database connection before returning 200:
import time

from flask import Flask, jsonify
from sqlalchemy import text

from db import engine  # your existing SQLAlchemy engine

app = Flask(__name__)


@app.route("/health")
def health():
    checks = {}

    # Database check
    try:
        start = time.monotonic()
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
        checks["db"] = {"status": "ok", "latency_ms": round((time.monotonic() - start) * 1000)}
    except Exception as exc:
        checks["db"] = {"status": "error", "detail": str(exc)}

    all_ok = all(v["status"] == "ok" for v in checks.values())
    status_code = 200 if all_ok else 503
    return jsonify({"status": "ok" if all_ok else "degraded", "checks": checks}), status_code
The critical detail: return 503 when a dependency is down, not 200 with an error body. ELB only reads the HTTP status code. If you return 200 with {"status": "error"} in the JSON, ELB will happily mark the instance healthy.
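One way to see the endpoint the way the load balancer does is to probe it with a client that looks only at the status code. The sketch below uses Python's standard `http.client`, which does not follow redirects. The function name `probe` and its defaults are assumptions for illustration, and it handles plain HTTP only.

```python
import http.client
from urllib.parse import urlsplit


def probe(url: str, timeout: float = 4.0, success_code: int = 200) -> bool:
    """Approximate a health check probe: one GET, no redirects followed, body ignored."""
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.hostname, parts.port or 80, timeout=timeout)
    try:
        conn.request("GET", parts.path or "/")
        response = conn.getresponse()
        response.read()  # drain the body; its content never affects the result
        return response.status == success_code
    except (OSError, http.client.HTTPException):
        # Refused connections and timeouts count as failed checks.
        return False
    finally:
        conn.close()
```

Calling `probe("http://<instance-address>/health")` against an instance directly, rather than through the load balancer, shows whether a failing dependency actually changes the status code or only the JSON body.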
What dependencies to check
Include checks for anything your app cannot function without. A typical list looks like this:
- Primary database: a lightweight SELECT 1 or equivalent ping
- Cache layer (Redis, Memcached): a PING command
- Critical third-party APIs: only if your app is completely non-functional without them
- Application-level state: for example, whether required config has loaded or background workers are running
Don't check every optional integration. If a payment provider is down but your app can still serve read-only pages, a healthy status is correct. Health checks should reflect whether this instance can serve traffic, not whether the world is perfect.
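The split between required and optional dependencies can be sketched as follows. This is a framework-free illustration: `evaluate`, `db_ping`, and `payments_ping` are hypothetical names, and the two ping functions are placeholders for real checks.

```python
from typing import Callable, Dict, Tuple

Check = Callable[[], None]  # a check signals failure by raising an exception


def evaluate(critical: Dict[str, Check], optional: Dict[str, Check]) -> Tuple[int, dict]:
    """Run all checks; only a failed critical check turns the result into a 503."""
    report = {}
    status_code = 200
    for name, check in {**critical, **optional}.items():
        try:
            check()
            report[name] = "ok"
        except Exception as exc:
            report[name] = f"error: {type(exc).__name__}"
            if name in critical:
                status_code = 503
    return status_code, report


def db_ping() -> None:
    """Placeholder for a real `SELECT 1`."""


def payments_ping() -> None:
    """Placeholder for an optional provider that is currently down."""
    raise ConnectionError("payment provider unreachable")


print(evaluate({"db": db_ping}, {"payments": payments_ping}))
```

The optional failure still shows up in the report for anyone reading the body, but the status code stays 200, so the instance remains in rotation.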
What not to include
Avoid anything that writes to your database on every health check; ELB can hit your endpoint several times per minute per instance, and that adds up. Also avoid external HTTP calls with long timeouts; a slow health check that occasionally times out will cause unnecessary instance churn.
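One possible way to keep a slow dependency from stalling the whole endpoint is to run the checks against a shared time budget. The sketch below uses the standard `concurrent.futures` module; `run_with_budget`, the pool size, and the one-second default are assumptions, not recommendations from AWS.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, Dict

# One shared pool; a check that overruns keeps its worker busy until it returns.
_pool = ThreadPoolExecutor(max_workers=4)


def run_with_budget(checks: Dict[str, Callable[[], None]], budget_s: float = 1.0) -> Dict[str, str]:
    """Run checks concurrently and report 'timeout' for any that exceed the shared budget."""
    deadline = time.monotonic() + budget_s
    futures = {name: _pool.submit(check) for name, check in checks.items()}
    results = {}
    for name, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            future.result(timeout=remaining)
            results[name] = "ok"
        except FutureTimeout:
            results[name] = "timeout"
        except Exception as exc:
            results[name] = f"error: {type(exc).__name__}"
    return results
```

Keeping the budget well under the load balancer's health check timeout means the endpoint answers with a deliberate failure code instead of being recorded as a timed-out request. An overrunning check is not cancelled here, so each dependency client should still carry its own short timeout.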
Configuring the Target Group Health Check Settings
Once your endpoint is solid, the ELB settings control how quickly bad instances are detected and removed. Open your target group in the AWS console and review each setting with intention.
| Setting | Default | Recommended starting point |
|---|---|---|
| Health check path | / | /health (your new endpoint) |
| Healthy threshold | 5 | 2-3 |
| Unhealthy threshold | 2 | 2 |
| Timeout | 5s | 3-4s |
| Interval | 30s | 10-15s |
| Success codes | 200 | 200 (be explicit, avoid ranges) |
The interval and unhealthy threshold together determine how long a broken instance stays in rotation. With a 30s interval and an unhealthy threshold of 2, a failing instance can serve traffic for up to 60 seconds before ELB removes it. Drop the interval to 10s and you're down to 20 seconds of blast radius.
The healthy threshold controls how many consecutive successes are needed before a new or recovering instance receives traffic. A value of 2 means a restarted instance is back in rotation after 20 seconds at a 10s interval, which is fast enough for most deployments without letting flapping instances back in too quickly.
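The arithmetic behind these windows is simple enough to sketch in a few lines of Python. The function names are hypothetical, and the model is the same approximation used in the text: interval multiplied by threshold, ignoring the per-request timeout and where in the interval the failure begins.

```python
def removal_window_s(interval_s: int, unhealthy_threshold: int) -> int:
    """Approximate seconds a failing target keeps receiving traffic before it is marked unhealthy."""
    return interval_s * unhealthy_threshold


def recovery_window_s(interval_s: int, healthy_threshold: int) -> int:
    """Approximate seconds before a recovered target is marked healthy again."""
    return interval_s * healthy_threshold


for interval in (30, 10):
    print(interval, removal_window_s(interval, 2), recovery_window_s(interval, 2))
# prints:
# 30 60 60
# 10 20 20
```

Plugging in your own interval and thresholds gives a rough upper bound to compare against how long you can tolerate a bad instance serving traffic.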
Updating via AWS CLI
If you'd rather script the change than click through the console, here's the CLI equivalent:
aws elbv2 modify-target-group \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123 \
--health-check-path /health \
--health-check-interval-seconds 10 \
--health-check-timeout-seconds 4 \
--healthy-threshold-count 2 \
--unhealthy-threshold-count 2 \
--matcher HttpCode=200
Commit this to your Terraform or CloudFormation templates so these settings don't drift back to defaults on the next infrastructure update.
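A drift check is one way to notice when the live settings no longer match what you intended. The sketch below compares a target group description against expected values that mirror the CLI flags above. `find_drift`, `fetch_target_group`, and `EXPECTED` are hypothetical names; the field names follow the `describe_target_groups` response in boto3, which is assumed to be installed and configured only for the fetch step.

```python
from typing import Dict

# Desired values, mirroring the CLI flags above.
EXPECTED = {
    "HealthCheckPath": "/health",
    "HealthCheckIntervalSeconds": 10,
    "HealthCheckTimeoutSeconds": 4,
    "HealthyThresholdCount": 2,
    "UnhealthyThresholdCount": 2,
    "Matcher": {"HttpCode": "200"},
}


def find_drift(actual: Dict, expected: Dict = EXPECTED) -> Dict[str, tuple]:
    """Return {setting: (actual_value, expected_value)} for every setting that differs."""
    return {
        key: (actual.get(key), want)
        for key, want in expected.items()
        if actual.get(key) != want
    }


def fetch_target_group(arn: str) -> Dict:
    """Fetch the live settings. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the comparison logic has no AWS dependency

    client = boto3.client("elbv2")
    return client.describe_target_groups(TargetGroupArns=[arn])["TargetGroups"][0]
```

Running `find_drift(fetch_target_group(arn))` in a scheduled job or CI step returns an empty dict when everything matches, and a non-empty result is a cue to reconcile the templates before the next incident does it for you.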
The Graceful Shutdown Problem
A related scenario: an instance is healthy according to ELB, starts shutting down (during a deployment or scale-in event), and ELB keeps sending traffic to it for up to 30 seconds because it hasn't failed enough checks yet.
AWS solves this with connection draining (called