Fixing AWS CloudWatch Alarms That Fire Late or Not at All

Your service goes down at 2:14 AM. Your CloudWatch alarm triggers at 2:41 AM — or never. By the time your on-call engineer is paged, users have already filed support tickets and your error budget is gone. You set up the alarm weeks ago and assumed it worked. It didn't.

Late-firing and silent CloudWatch alarms are one of the most common and quietly damaging infrastructure problems on AWS. The fix is rarely a single setting; it's usually a combination of misconfigured evaluation windows, poor missing-data handling, and alarm topology that doesn't match how your metrics actually behave.

This article covers:

Why CloudWatch alarms delay or stay silent even when a breach is real
How evaluation periods and datapoints-to-alarm interact and where they bite you
How to handle missing data correctly so gaps don't mask outages
Common metric math pitfalls that produce misleading alarm states
How composite alarms reduce noise without hiding critical failures

How CloudWatch Alarm Evaluation Actually Works

Before you can fix a misfiring alarm, you need a clear mental model of how CloudWatch evaluates one. Every alarm is defined by three numbers: the period (how many seconds of data each data point represents), the evaluation periods (how many of the most recent data points CloudWatch examines), and datapoints-to-alarm (how many of those examined points must breach the threshold to flip the alarm into ALARM state).

A common setup is a 5-minute period with 3 evaluation periods and a datapoints-to-alarm of 3. That means CloudWatch waits for three consecutive 5-minute buckets to all breach the threshold — a full 15 minutes — before it fires. If your spike lasts 12 minutes, the alarm never triggers. This is the single most frequent reason alarms are late or silent.
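The M-of-N rule described above can be sketched in a few lines of Python. This is a simplified model, not CloudWatch's actual implementation: it assumes one breach flag per period and ignores missing data. The function name and the sample series are hypothetical.

```python
def first_alarm_index(breaches, evaluation_periods, datapoints_to_alarm):
    """Return the index of the first period at which the alarm fires, or None.

    breaches: one boolean per period, True if that period's data point
    breached the threshold.
    """
    for i in range(len(breaches)):
        # The evaluation window is the most recent `evaluation_periods` points.
        window = breaches[max(0, i - evaluation_periods + 1): i + 1]
        if sum(window) >= datapoints_to_alarm:
            return i
    return None


# Hypothetical series of 5-minute buckets: the spike breaches in two of them.
spike = [False, True, True, False, False]

print(first_alarm_index(spike, evaluation_periods=3, datapoints_to_alarm=3))  # None
print(first_alarm_index(spike, evaluation_periods=3, datapoints_to_alarm=2))  # 2
```

With 3-of-3 the two breaching buckets are never enough, so the alarm stays silent; with 2-of-3 it fires on the second breaching bucket.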

CloudWatch evaluates alarms approximately once per period. The evaluation is not instantaneous; there is typically a 1–2 minute delay before metric data published by a service is available for alarm evaluation. Factor this in when you're calculating your worst-case detection time.

Choosing the Right Period and Datapoints-to-Alarm

The instinct to use a high datapoints-to-alarm value is understandable — you want to avoid false positives. But it comes at the cost of detection speed. The right balance depends on what you're measuring.

For error rate spikes that indicate an outage, a 1-of-1 or 2-of-3 configuration on a 1-minute period is usually appropriate. For alarms on noisy metrics like CPU utilization where brief spikes are normal, 3-of-5 on a 5-minute period gives you noise filtering without hiding a sustained problem.

The general formula for worst-case detection time is:

worst_case_detection_seconds = period_seconds * evaluation_periods

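If you need to catch an outage within 5 minutes, your period multiplied by evaluation periods cannot exceed 300 seconds, and in practice it should stay below that to leave room for the 1–2 minute metric ingestion delay. A 60-second period with 3 evaluation periods gives you a 3-minute worst case before that delay, which is usually acceptable.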
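As a quick way to run that arithmetic across many alarms, here is a minimal sketch of the formula with an optional allowance for the 1 to 2 minute ingestion delay mentioned earlier. The function and parameter names are hypothetical, and the 120-second figure is just the upper end of that range.

```python
def worst_case_detection_seconds(period_seconds, evaluation_periods,
                                 ingestion_delay_seconds=0):
    """Upper bound on the time from first breach to the ALARM transition."""
    return period_seconds * evaluation_periods + ingestion_delay_seconds


# 5-minute period, 3 evaluation periods: 15 minutes before the alarm can fire.
print(worst_case_detection_seconds(300, 3))       # 900

# 1-minute period, 3 evaluation periods, plus a 2-minute ingestion allowance.
print(worst_case_detection_seconds(60, 3, 120))   # 300
```

Once the delay is included, the 60-second, 3-period configuration lands right at a 5-minute budget, with no slack left over.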

Missing Data: The Silent Killer

Missing data is where alarms get quietly broken in ways that are hard to notice until something goes wrong. When a Lambda function has no invocations, CloudWatch publishes no data points for its error metrics. When an EC2 instance is stopped, its memory metric disappears entirely. What happens to an alarm when its metric goes silent?

The answer depends on the alarm's TreatMissingData setting (--treat-missing-data in the CLI). The default is missing, which means that once every data point in the evaluation range is missing, the alarm moves to INSUFFICIENT_DATA. That state does not run your ALARM actions, so unless you route INSUFFICIENT_DATA notifications somewhere, nobody is paged, even if the reason data stopped is that your service crashed.

Here are the four options and when to use each:

missing (default): Moves the alarm to INSUFFICIENT_DATA once all data points in the evaluation range are missing. Use this only when gaps are expected and benign, such as a scheduled batch job that runs periodically.
notBreaching: Treats missing data as within threshold. Appropriate for metrics like request count where zero traffic is fine.
breaching: Treats missing data as a threshold violation. Use this for heartbeat-style metrics where any silence means something is wrong.
ignore: Maintains the current alarm state. Rarely the right choice for health metrics; an alarm that was OK when the data stopped stays OK for as long as the gap lasts.

The most dangerous misconfiguration is using notBreaching, ignore, or the default missing for a metric that you rely on as a proxy for service health. If your application stops emitting a custom metric because it has crashed, you want the alarm to fire, not to stay green or sit unnoticed in INSUFFICIENT_DATA.
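To make the difference between the four treatments concrete, here is a deliberately simplified model of a single alarm evaluation. It follows the documented behavior for a fully silent metric (missing moves to INSUFFICIENT_DATA, ignore keeps the current state), but it is only a sketch: the real evaluator also looks at data points beyond the evaluation window when some are missing. The function name and inputs are hypothetical.

```python
def next_state(current_state, window, treatment, datapoints_to_alarm):
    """Simplified model of one alarm evaluation.

    window: the evaluation window, oldest first; each entry is True
    (breaching), False (within threshold), or None (missing).
    """
    if treatment == "breaching":
        points = [True if p is None else p for p in window]
    elif treatment == "notBreaching":
        points = [False if p is None else p for p in window]
    else:
        # "missing" and "ignore" evaluate only the points that exist.
        points = [p for p in window if p is not None]
        if not points:
            # Nothing to evaluate: "ignore" keeps the state,
            # "missing" reports INSUFFICIENT_DATA.
            return current_state if treatment == "ignore" else "INSUFFICIENT_DATA"
    return "ALARM" if sum(points) >= datapoints_to_alarm else "OK"


# The service has crashed and stopped emitting its metric entirely.
silent_window = [None, None, None]
for treatment in ("missing", "notBreaching", "breaching", "ignore"):
    print(treatment, next_state("OK", silent_window, treatment, datapoints_to_alarm=2))
```

Only breaching turns a fully silent window into ALARM; the other three leave the alarm in a state that will not page anyone by default.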

Fixing the Configuration in AWS CLI and CloudFormation

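You can update an existing alarm's missing-data treatment from the CLI without deleting it first. put-metric-alarm overwrites the alarm that has the same name, so pass the full configuration again, not just the changed field. The Average statistic below is an assumption; use whichever statistic your alarm already has:

aws cloudwatch put-metric-alarm \
  --alarm-name "api-error-rate" \
  --metric-name "5xxErrorRate" \
  --namespace "MyApp/API" \
  --statistic Average \
  --period 60 \
  --evaluation-periods 3 \
  --datapoints-to-alarm 2 \
  --threshold 1.0 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data breaching \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:pagerduty-topic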

In CloudFormation, the equivalent block looks like this:

{
  "Type": "AWS::CloudWatch::Alarm",
  "Properties": {
    "AlarmName": "api-error-rate",
    "MetricName": "5xxErrorRate",
    "Namespace": "MyApp/API",
    "Statistic": "Average",
    "Period": 60,
    "EvaluationPeriods": 3,
    "DatapointsToAlarm": 2,
    "Threshold": 1.0,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    "TreatMissingData": "breaching",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:pagerduty-topic"]
  }
}

Note the DatapointsToAlarm field. Many teams omit it, which causes it to default to the same value as EvaluationPeriods, requiring every single point in the window to breach. Setting it explicitly gives you the M-of-N evaluation you actually want.

Metric Math Alarms and Their Gotchas

Metric math lets you combine multiple metrics into a single expression for alarming — for example, computing an error rate as errors divided by total requests. This is powerful, but it introduces new failure modes.

The most common problem: if the denominator metric is zero (no requests coming in), the division produces either a NaN or no data point at all, depending on how CloudWatch handles it. Your error rate alarm then enters the missing-data state, and if that's configured as notBreaching, you get a false sense of safety during a complete traffic drop.

Protect against this with the IF function in your metric math expression:

IF(requests > 0, errors / requests * 100, 0)

This returns zero when there are no requests, rather than producing a missing or undefined value. Combine this with a separate alarm on request count so you catch a traffic drop independently.
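The guard in that expression behaves like the following Python function, shown here only to make the branch explicit. The function name is hypothetical, and the inputs stand in for a single period's values of the two metrics.

```python
def error_rate_percent(errors, requests):
    """Mirror of IF(requests > 0, errors / requests * 100, 0)."""
    return errors / requests * 100 if requests > 0 else 0.0


print(error_rate_percent(5, 20))  # 25.0
print(error_rate_percent(0, 0))   # 0.0, instead of a division by zero
```

Because the zero-traffic case now reports a healthy 0, the separate request-count alarm is what tells you traffic has stopped.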

A second gotcha: when you use metric math, the period of the output metric is determined by the period of the input metrics. If your input metrics have different periods, CloudWatch will align them, but the result can be unexpected. Always confirm that all metrics in a math expression share the same resolution.

High-Resolution Metrics and Sub-Minute Alarms

Standard CloudWatch metrics have a 1-minute minimum granularity. If you need to detect a spike that lasts only 20 seconds, standard metrics will miss it entirely — the data gets aggregated into a 1-minute bucket, and the average smooths out the spike.
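The smoothing effect is easy to reproduce with made-up numbers. The sketch below assumes one latency sample per second, a 20-second spike, and a hypothetical 1000 ms threshold, then compares a single 1-minute average against 10-second averages of the same data.

```python
from statistics import mean

# Hypothetical latency samples, one per second for a single minute:
# a 20-second spike at 2000 ms, otherwise 100 ms.
samples = [100] * 20 + [2000] * 20 + [100] * 20
THRESHOLD_MS = 1000

one_minute_average = mean(samples)
ten_second_averages = [mean(samples[i:i + 10]) for i in range(0, 60, 10)]

print(one_minute_average > THRESHOLD_MS)        # False: the spike is averaged away
print(max(ten_second_averages) > THRESHOLD_MS)  # True: 10-second buckets catch it
```

The same data breaches the threshold at 10-second granularity and stays well under it at 1-minute granularity.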

High-resolution custom metrics let you publish data at sub-minute granularity, down to 1 second, by setting the StorageResolution parameter to 1 in the PutMetricData API call. The only other valid value is 60, which is the standard resolution. You can then set alarm periods of 10, 30, or 60 seconds on these metrics.

aws cloudwatch put-metric-data \
  --namespace "MyApp/HighRes" \
  --metric-name "RequestLatencyMs" \
  --value 850 \
  --storage-resolution 1

High-resolution alarms come with a cost: they are priced higher than standard alarms and they increase the volume of data stored. Use them selectively for metrics where sub-minute detection genuinely matters, such as payment processing latency or real-time game servers.

Composite Alarms: Reducing Noise Without Hiding Failures

Composite alarms let you build boolean logic across multiple child alarms. Instead of paging on every individual signal, you define conditions like "alert only when the error rate alarm AND the latency alarm are both in ALARM state." This dramatically reduces alert fatigue while keeping your detection sound.

A common pattern is a tiered composite: one composite alarm for "any signal of trouble" that triggers a low-priority notification, and a second for "multiple signals simultaneously" that triggers your on-call page.

{
  "Type": "AWS::CloudWatch::CompositeAlarm",
  "Properties": {
    "AlarmName": "api-critical-composite",
    "AlarmRule": "ALARM(api-error-rate) AND ALARM(api-latency-p99)",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:pagerduty-critical"]
  }
}

The critical thing to understand about composite alarms: they do not re-evaluate the underlying metrics themselves. They only observe the state of child alarms. This means if a child alarm is misconfigured and stays in OK state despite a real problem, the composite alarm will never fire. Fix the child alarms first; then layer composite alarms on top.
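The dependency on child state can be illustrated with a small sketch. It models the tiered pattern described above using the two child alarm names from the example; the function name and the any_signal and critical keys are hypothetical labels, not CloudWatch fields.

```python
def composite_states(child_states):
    """Evaluate two tiers of composite logic from child alarm states only."""
    error = child_states["api-error-rate"] == "ALARM"
    latency = child_states["api-latency-p99"] == "ALARM"
    return {
        "any_signal": error or latency,   # low-priority notification
        "critical": error and latency,    # on-call page
    }


# Real outage, but the latency alarm is misconfigured and stuck in OK.
print(composite_states({"api-error-rate": "ALARM", "api-latency-p99": "OK"}))
# {'any_signal': True, 'critical': False}
```

The composite never sees the metrics, so the stuck child keeps the page from going out even though the outage is real.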

Common Pitfalls to Audit Right Now

These are the configurations most likely to be silently broken in a mature AWS account:

Alarm actions pointing to deleted SNS topics. The alarm transitions state, but no notification fires. Check that each alarm action ARN resolves to an existing topic with at least one subscription.
Alarms left in INSUFFICIENT_DATA state. This happens when a metric has never published data or when the namespace or dimension values are wrong. An alarm in INSUFFICIENT_DATA is not protecting you. Check the alarm history to see if it has ever reached OK or ALARM.
Using account-level CloudWatch metrics for per-resource alarms. For example, alarming on the total Lambda error count for an account instead of per-function. One noisy function can mask another that's silently failing.
Forgetting to alarm on both sides of a metric. Latency too high is bad. But latency suddenly dropping to near zero often means traffic has stopped reaching your service entirely. Alarm on both conditions.
Treating the alarm state as the source of truth. Always cross-check alarm state against the actual metric graph during an incident. Alarm state lags reality by at least one evaluation window.
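Some of these checks can be automated offline. The sketch below assumes each alarm is a dict shaped like an entry in the MetricAlarms list returned by the DescribeAlarms API (AlarmName, StateValue, AlarmActions, Period, EvaluationPeriods); the function name and the 300-second budget are hypothetical. It does not verify that an SNS topic still exists, which needs a live API call.

```python
def audit_alarm(alarm, max_detection_seconds=300):
    """Return a list of warnings for one alarm description dict."""
    warnings = []
    if not alarm.get("AlarmActions"):
        warnings.append("no alarm actions configured")
    if alarm.get("StateValue") == "INSUFFICIENT_DATA":
        warnings.append("currently in INSUFFICIENT_DATA")
    # Period can be absent on some alarms, so default both factors to 0.
    window = alarm.get("Period", 0) * alarm.get("EvaluationPeriods", 0)
    if window > max_detection_seconds:
        warnings.append(f"evaluation window is {window}s")
    return warnings


# Hypothetical alarm description used only for illustration.
example = {
    "AlarmName": "api-error-rate",
    "StateValue": "INSUFFICIENT_DATA",
    "AlarmActions": [],
    "Period": 300,
    "EvaluationPeriods": 3,
}
for warning in audit_alarm(example):
    print(f"{example['AlarmName']}: {warning}")
```

Running this over every alarm in an account gives a short list to review by hand, starting with the ones that have no actions at all.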

Testing Your Alarms Before You Need Them

The only way to be confident an alarm works is to test it deliberately. AWS provides a set-alarm-state CLI command that temporarily forces an alarm into any state without the metric actually breaching. The forced state lasts only until the next evaluation, when the alarm returns to whatever its metric dictates. This lets you confirm that SNS topics, Lambda functions, and PagerDuty integrations are wired up correctly.

aws cloudwatch set-alarm-state \
  --alarm-name "api-error-rate" \
  --state-value ALARM \
  --state-reason "Manual test — verifying SNS integration"

Run this against every new alarm when you create it. Verify that the notification lands in the right channel within an acceptable time. Then reset the state back to OK:

aws cloudwatch set-alarm-state \
  --alarm-name "api-error-rate" \
  --state-value OK \
  --state-reason "Manual test complete"

This does not test whether the metric breach logic is correct, only that the notification path is functional. To test the metric logic, you need to actually generate load or errors against your service and watch the alarm evaluate in real time.

Next Steps

Tightening up your CloudWatch alarms is straightforward once you have a checklist to work through. Here are five concrete actions to take this week:

Audit every alarm's evaluation period and datapoints-to-alarm. Calculate the worst-case detection time for each one and decide if it's acceptable for the metric's role.
Review treat-missing-data settings. Any alarm protecting a service that can crash should use breaching, not the default missing.
Run set-alarm-state tests on all critical alarms. Confirm the notification path is live and landing in the right place before you're relying on it at 2 AM.
Add a request-count or heartbeat alarm alongside every rate-based alarm. This catches the case where the service stops receiving traffic entirely, which a pure error-rate alarm will miss.
Review composite alarm rules to ensure child alarms are healthy. A composite alarm built on a broken child is itself broken.
