AWS ELB Health Checks Passing While App Serves Errors

Your CloudWatch alarms are firing, users are reporting errors, but the ELB console shows every target as healthy. This is one of the most disorienting situations in AWS operations — your safety net is telling you everything is fine while the house burns down.

The problem is almost never a bug in ELB itself. It's a mismatch between what your health check tests and what your application actually needs to serve real traffic. This article walks you through diagnosing that gap and closing it permanently.

What you'll learn

Why a health check can return 200 OK while your app is broken
How to design a health check endpoint that reflects real application state
How to tune ELB thresholds so bad instances are evicted quickly
Common misconfigurations that cause false-healthy readings
How to verify your fix is working before the next incident

Prerequisites

This guide assumes you're using an Application Load Balancer (ALB) with EC2 instances or ECS tasks as targets. Most concepts apply to Network Load Balancers too, but the HTTP-specific sections are ALB-focused. You'll need IAM permissions to edit target groups, plus access to your application's deployment pipeline.

Why Health Checks Lie

An ELB health check is a simple HTTP(S) request to a path you specify. The load balancer marks a target healthy if it receives a response code that matches the configured success codes: 200 by default on an ALB, though many teams widen this to 200-299. That's it. ELB has no idea what your app is actually doing.

Here's the trap: a web server can respond 200 OK to /health while it is failing to connect to its database, unable to reach a downstream API, or stuck in a state where it accepts requests but returns errors to real users. The health check endpoint and the actual application code are different paths, and if your health endpoint doesn't exercise the same dependencies, you get a false positive.

The second common cause is a health check pointed at the wrong path. Some teams configure the health check to hit /, the application root, which redirects to a login page with a 301. If your success codes don't include 3xx, the target is marked unhealthy even though the app is working. Conversely, if they do include 3xx, a broken app that always redirects appears healthy forever.
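The redirect case is easier to reason about with a small model of the success-code rule. The sketch below is not ELB's implementation; `parse_matcher` and `is_healthy` are hypothetical helpers that only mimic the documented matcher syntax (single codes, comma-separated lists, and ranges).

```python
def parse_matcher(matcher: str) -> set[int]:
    """Expand a success-code string such as "200,301-302" into a set of codes."""
    codes: set[int] = set()
    for part in matcher.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            codes.update(range(int(low), int(high) + 1))
        else:
            codes.add(int(part))
    return codes


def is_healthy(status_code: int, matcher: str) -> bool:
    """Return True when the response code falls inside the configured matcher."""
    return status_code in parse_matcher(matcher)


# A root path that redirects to a login page fails a strict matcher...
print(is_healthy(301, "200"))      # False
# ...while a matcher widened to 3xx accepts any redirect, broken app or not.
print(is_healthy(301, "200-399"))  # True
```

Neither setting tells you anything about the app itself, which is why a dedicated endpoint is the better fix.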

Anatomy of a Useful Health Check Endpoint

A good health check endpoint does three things: it responds fast, it checks real dependencies, and it returns a non-2xx code when something is genuinely broken. Aim for a response under 500ms — slow health checks mask latency problems and can themselves cause timeout failures.

Here's a minimal example in Python (Flask) that checks a database connection before returning 200:

import time
from flask import Flask, jsonify
from sqlalchemy import text
from db import engine  # your existing SQLAlchemy engine

app = Flask(__name__)

@app.route("/health")
def health():
    checks = {}

    # Database check
    try:
        start = time.monotonic()
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
        checks["db"] = {"status": "ok", "latency_ms": round((time.monotonic() - start) * 1000)}
    except Exception as exc:
        checks["db"] = {"status": "error", "detail": str(exc)}

    all_ok = all(v["status"] == "ok" for v in checks.values())
    status_code = 200 if all_ok else 503
    return jsonify({"status": "ok" if all_ok else "degraded", "checks": checks}), status_code

The critical detail: return 503 when a dependency is down, not 200 with an error body. ELB only reads the HTTP status code. If you return 200 with {"status": "error"} in the JSON, ELB will happily mark the instance healthy.
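To see the endpoint the way the load balancer does, look only at the status code and ignore the body. The following is a rough sketch using the standard library; `probe_status` and `would_pass` are illustrative names, and the 4-second timeout is an assumed value. It uses `http.client` because that module does not follow redirects, so a 301 is reported as a 301.

```python
import http.client
from urllib.parse import urlsplit


def probe_status(url: str, timeout_s: float = 4.0) -> int:
    """Send one GET and return the raw status code, without following redirects."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout_s)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout_s)
    try:
        conn.request("GET", parts.path or "/")
        return conn.getresponse().status
    finally:
        conn.close()


def would_pass(url: str, success_code: int = 200) -> bool:
    """True only when the endpoint answers with the expected success code."""
    try:
        return probe_status(url) == success_code
    except (OSError, http.client.HTTPException):
        # Refused connections, timeouts, and malformed responses count as failures.
        return False
```

Pointing `would_pass` at a local instance (for example `http://localhost:5000/health`, assuming Flask's default port) with the database stopped is a quick way to confirm the endpoint really returns a non-200 code.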

What dependencies to check

Include checks for anything your app cannot function without. A typical list looks like this:

Primary database — a lightweight SELECT 1 or equivalent ping
Cache layer (Redis, Memcached) — a PING command
Critical third-party APIs — only if your app is completely non-functional without them
Application-level state — for example, whether required config has loaded or background workers are running

Don't check every optional integration. If a payment provider is down but your app can still serve read-only pages, a healthy status is correct. Health checks should reflect whether this instance can serve traffic, not whether the world is perfect.
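One way to encode the distinction between required and optional dependencies is to tag each check and let only the required ones affect the status code. This is a framework-free sketch under that assumption; `Check`, `evaluate`, and the stub probes are hypothetical names, not part of Flask or any AWS library.

```python
from typing import Callable, NamedTuple


class Check(NamedTuple):
    name: str
    probe: Callable[[], None]  # should raise an exception on failure
    critical: bool             # True if the app cannot serve traffic without it


def evaluate(checks: list[Check]) -> tuple[int, dict]:
    """Run every check; only a failed critical check produces a 503."""
    results = {}
    status_code = 200
    for check in checks:
        try:
            check.probe()
            results[check.name] = {"status": "ok"}
        except Exception as exc:
            results[check.name] = {"status": "error", "detail": str(exc)}
            if check.critical:
                status_code = 503
    return status_code, results


def db_ping() -> None:
    """Stub standing in for a real SELECT 1."""


def payments_ping() -> None:
    """Stub simulating an optional provider that is down."""
    raise ConnectionError("payment provider unreachable")


status, report = evaluate([
    Check("db", db_ping, critical=True),
    Check("payments", payments_ping, critical=False),
])
print(status)  # 200
```

The failed optional check still shows up in the report body, so it stays visible to humans and dashboards without pulling the instance out of rotation.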

What not to include

Avoid anything that writes to your database on every health check — ELB can hit your endpoint several times per minute per instance, and that adds up. Also avoid external HTTP calls with long timeouts; a slow health check that occasionally times out will cause unnecessary instance churn.
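If a dependency check can occasionally hang, one option is to give each probe a time budget that sits well under the ELB timeout. The sketch below is one possible approach using the standard library; `run_with_budget` and the pool size of 4 are assumptions for illustration. Where the client library offers its own connect and read timeouts, those are usually the better tool because they also stop the underlying call.

```python
import concurrent.futures
import time

# Shared worker pool so each health check request does not create new threads.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def run_with_budget(probe, budget_s: float) -> dict:
    """Run probe() and report an error if it raises or exceeds budget_s seconds."""
    start = time.monotonic()
    future = _pool.submit(probe)
    try:
        future.result(timeout=budget_s)
    except concurrent.futures.TimeoutError:
        # The probe keeps running in its worker thread; we only stop waiting for it.
        return {"status": "error", "detail": f"timed out after {budget_s}s"}
    except Exception as exc:
        return {"status": "error", "detail": str(exc)}
    latency_ms = round((time.monotonic() - start) * 1000)
    return {"status": "ok", "latency_ms": latency_ms}
```

Because a timed-out probe still occupies a worker until it finishes, this bounds the response time of the endpoint rather than the work itself.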

Configuring the Target Group Health Check Settings

Once your endpoint is solid, the ELB settings control how quickly bad instances are detected and removed. Open your target group in the AWS console and review each setting with intention.

Setting	Default	Recommended starting point
Health check path	/	/health (your new endpoint)
Healthy threshold	5	2–3
Unhealthy threshold	2	2
Timeout	5s	3–4s
Interval	30s	10–15s
Success codes	200	200 (be explicit, avoid ranges)

The interval and unhealthy threshold together determine how long a broken instance stays in rotation. With a 30s interval and an unhealthy threshold of 2, a failing instance can keep serving traffic for roughly 60 seconds before ELB removes it. Drop the interval to 10s and that exposure window shrinks to about 20 seconds.

The healthy threshold controls how many consecutive successes are needed before a new or recovering instance receives traffic. A value of 2 means a restarted instance is back in rotation after 20 seconds at a 10s interval — fast enough for most deployments without letting flapping instances back in too quickly.
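The arithmetic behind both thresholds is the same multiplication, which makes it easy to compare candidate settings. This is a simplified model: `seconds_until_state_change` is a hypothetical helper, and it ignores the per-check timeout and where in the interval the failure begins, so treat the results as approximate.

```python
def seconds_until_state_change(interval_s: int, threshold: int) -> int:
    """Approximate time for a target to cross a healthy or unhealthy threshold."""
    return interval_s * threshold


# Unhealthy threshold of 2 at the default 30s interval
print(seconds_until_state_change(30, 2))  # 60
# Same threshold at a 10s interval
print(seconds_until_state_change(10, 2))  # 20
# Healthy threshold of 2 at a 10s interval: time before a restarted instance rejoins
print(seconds_until_state_change(10, 2))  # 20
```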

Updating via AWS CLI

If you'd rather script the change than click through the console, here's the CLI equivalent:

aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123 \
  --health-check-path /health \
  --health-check-interval-seconds 10 \
  --health-check-timeout-seconds 4 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 2 \
  --matcher HttpCode=200

Mirror these values in your Terraform or CloudFormation templates so the settings don't drift back to defaults on the next infrastructure update.
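If your tooling is in Python, the same change can be sketched with boto3's `modify_target_group` call. The helper names `health_check_settings` and `apply_health_check_settings` are illustrative, and the ARN you pass in is your own. The second function needs boto3 installed and AWS credentials configured, so it is defined here but not called.

```python
def health_check_settings(target_group_arn: str) -> dict:
    """Build the keyword arguments that mirror the CLI flags above."""
    return {
        "TargetGroupArn": target_group_arn,
        "HealthCheckPath": "/health",
        "HealthCheckIntervalSeconds": 10,
        "HealthCheckTimeoutSeconds": 4,
        "HealthyThresholdCount": 2,
        "UnhealthyThresholdCount": 2,
        "Matcher": {"HttpCode": "200"},
    }


def apply_health_check_settings(target_group_arn: str, region: str = "us-east-1") -> dict:
    """Apply the settings to a live target group (requires boto3 and credentials)."""
    import boto3  # imported lazily so the settings helper works without boto3

    client = boto3.client("elbv2", region_name=region)
    return client.modify_target_group(**health_check_settings(target_group_arn))
```

Keeping the settings in a plain dictionary makes them easy to review and to compare against what your infrastructure templates declare.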

The Graceful Shutdown Problem

A related scenario: an instance is healthy according to ELB, starts shutting down (during a deployment or scale-in event), and ELB keeps sending traffic to it until it has failed enough consecutive checks, which is about 20 seconds even with a 10s interval and an unhealthy threshold of 2.

AWS solves this with connection draining (called
