Fixing AWS CloudWatch Alarms Stuck in INSUFFICIENT_DATA After Deployment
You deploy a new service or update an existing one, check your CloudWatch alarms, and every single one is sitting in INSUFFICIENT_DATA. No OK, no ALARM — just that maddening gray state that tells you nothing about whether your system is healthy. And it stays there.
This is rarely a CloudWatch bug. It means the alarm cannot find the metric data it was configured to evaluate. Something between your application and the alarm definition broke, and the fix is almost always traceable to one of a handful of root causes.
What INSUFFICIENT_DATA Actually Means
CloudWatch transitions an alarm to INSUFFICIENT_DATA when it does not have enough data points within the evaluation window to make a determination. This can happen at alarm creation (expected, briefly), but if it persists beyond the evaluation period after a deployment, something is wrong.
The alarm is not broken in the sense that CloudWatch is failing — it is simply looking for a metric that either does not exist at the namespace/dimension combination it expects, has not received data recently, or is gated behind a permission problem. The gray state is a symptom, not the disease.
What You'll Learn
- How to confirm whether your metric is actually being published post-deployment.
- How namespace and dimension mismatches silently cause this state.
- Which IAM permissions are required for custom metric ingestion.
- How evaluation period and datapoints-to-alarm settings interact with sparse metrics.
- Concrete CLI commands to diagnose and verify each fix.
Prerequisites
This guide assumes you are comfortable with the AWS CLI, have read access to CloudWatch in your account, and are working with either AWS-native metrics (EC2, ECS, RDS, etc.) or custom metrics published via the PutMetricData API. Code examples use the AWS CLI v2 and the AWS SDK for Python (boto3) where relevant.
Why Deployments Commonly Trigger This State
A fresh deployment is one of the most reliable ways to land in INSUFFICIENT_DATA because several things change at once. The metric source (your application, an EC2 instance, an ECS task) gets replaced or restarted, creating a gap in the metric stream. If your new deployment also changes the code that publishes custom metrics — even slightly — the namespace or dimension values can drift from what the alarm expects.
Blue-green or rolling deployments add another wrinkle: the old fleet stops publishing while the new one starts, and if the transition takes longer than the alarm's evaluation window, CloudWatch may see zero data points and switch to INSUFFICIENT_DATA before the new instances have had time to emit anything. This is expected during the transition, but if it does not self-heal within a few minutes of the deployment completing, you have a real problem.
If your deployment pipeline touches IAM roles or policies — which is common when rotating task roles in ECS or updating Lambda execution roles — a missing cloudwatch:PutMetricData permission can silently stop all custom metric ingestion from that point forward. If you have ever dealt with a CodeDeploy rollback that left your fleet in a split state, you already know how subtle these mid-deployment IAM issues can be.
Verify the Metric Is Actually Being Published
Start here before touching the alarm configuration. If the metric is not arriving at CloudWatch, fixing the alarm definition will not help.
Use the CLI to pull recent data points directly:
aws cloudwatch get-metric-statistics \
--namespace "YourApp/Metrics" \
--metric-name "RequestCount" \
--dimensions Name=ServiceName,Value=api-service \
--start-time $(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 60 \
--statistics Sum
If the Datapoints array comes back empty, the metric is not being published — or it is being published under a different namespace or dimension set. If it returns data, skip to the namespace/dimension section, because the alarm definition is likely misaligned with what is being published.
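If you would rather script this check, the sketch below inspects a response shaped like the output of boto3's get_metric_statistics and reports how stale the newest data point is. The helper name seconds_since_last_datapoint and the sample values are illustrative assumptions, not part of any AWS API; in real use you would pass in the dictionary returned by the boto3 call.

```python
from datetime import datetime, timezone
from typing import Optional


def seconds_since_last_datapoint(response: dict, now: datetime) -> Optional[float]:
    """Return the age in seconds of the newest datapoint, or None if there are none."""
    datapoints = response.get("Datapoints", [])
    if not datapoints:
        return None
    # Datapoints are not guaranteed to arrive sorted, so take the max explicitly.
    newest = max(dp["Timestamp"] for dp in datapoints)
    return (now - newest).total_seconds()


# Hand-written sample shaped like a get_metric_statistics response (hypothetical values).
sample = {
    "Label": "RequestCount",
    "Datapoints": [
        {"Timestamp": datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc), "Sum": 17.0, "Unit": "Count"},
        {"Timestamp": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc), "Sum": 42.0, "Unit": "Count"},
    ],
}
now = datetime(2024, 1, 1, 12, 7, tzinfo=timezone.utc)

print(seconds_since_last_datapoint(sample, now))              # 120.0
print(seconds_since_last_datapoint({"Datapoints": []}, now))  # None
```

A result of None corresponds to the empty Datapoints case described above. A large age suggests the metric stopped publishing around the time of the deployment.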
For AWS-managed metrics (like EC2 CPUUtilization), check that the instance or resource is actually running and has been running long enough to emit data. With detailed monitoring disabled, an EC2 instance emits metrics only every five minutes. If your alarm's period is shorter than that, most periods in the evaluation window will be empty.
Check Namespace and Dimension Mismatches
This is the single most common cause of persistent INSUFFICIENT_DATA after a deployment. CloudWatch alarms are bound to an exact namespace and dimension set. If your application publishes to MyApp/API but the alarm targets MyApp/Api (note the casing), CloudWatch treats these as completely different metrics.
Fetch the alarm's current configuration to see exactly what it is looking for:
aws cloudwatch describe-alarms \
--alarm-names "api-request-count-alarm" \
--query 'MetricAlarms[0].{Namespace:Namespace,MetricName:MetricName,Dimensions:Dimensions}'
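Then list the metrics that have actually received data recently. The --recently-active PT3H option limits the results to metrics with data points in the past three hours; without it, list-metrics can return metrics that stopped publishing up to two weeks ago:
aws cloudwatch list-metrics \
--namespace "YourApp/Metrics" \
--metric-name "RequestCount" \
--recently-active PT3H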
Compare the Dimensions array character-by-character. A dimension key of service_name versus ServiceName is a mismatch. So is a value of api-service-v2 when the alarm expects api-service.
If your deployment changed how the application constructs dimension values — for example, by including a version tag or environment name that was not there before — update the alarm to match the new dimension set, or update the application code to preserve the original dimension format.
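Comparing dimension sets by eye is error-prone, so here is a minimal sketch that does it mechanically. It assumes you have copied the Dimensions list from describe-alarms and one Dimensions list from list-metrics into Python. The function names (normalize, to_map, explain_mismatch) are hypothetical helpers written for this example.

```python
from typing import Dict, List


def normalize(name: str) -> str:
    """Fold case and drop separators so near-identical names compare equal."""
    return name.replace("_", "").replace("-", "").lower()


def to_map(dimensions: List[dict]) -> Dict[str, str]:
    """Turn [{"Name": ..., "Value": ...}, ...] into a name -> value mapping."""
    return {d["Name"]: d["Value"] for d in dimensions}


def explain_mismatch(alarm_dims: List[dict], published_dims: List[dict]) -> List[str]:
    """Describe every difference between the alarm's and the metric's dimensions."""
    alarm, published = to_map(alarm_dims), to_map(published_dims)
    near = {normalize(name): name for name in published}
    problems = []
    for name, value in alarm.items():
        if name in published:
            if published[name] != value:
                problems.append(
                    f"value differs for {name}: alarm={value!r}, published={published[name]!r}"
                )
        elif normalize(name) in near:
            problems.append(
                f"name differs only in spelling: alarm={name!r}, published={near[normalize(name)]!r}"
            )
        else:
            problems.append(f"alarm dimension {name!r} is not published")
    # An extra published dimension also makes it a different metric.
    alarm_names = {normalize(name) for name in alarm}
    for name in published:
        if normalize(name) not in alarm_names:
            problems.append(f"published dimension {name!r} is missing from the alarm")
    return problems


alarm_dims = [{"Name": "ServiceName", "Value": "api-service"}]
print(explain_mismatch(alarm_dims, [{"Name": "service_name", "Value": "api-service"}]))
print(explain_mismatch(alarm_dims, [{"Name": "ServiceName", "Value": "api-service-v2"}]))
```

An empty list means the two dimension sets are identical. Anything else is a candidate explanation for the alarm never seeing data.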
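You can update an alarm's metric target without recreating it. Be aware that put-metric-alarm replaces the alarm's entire configuration, so pass every setting you want to keep, including any --alarm-actions the alarm already has: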
aws cloudwatch put-metric-alarm \
--alarm-name "api-request-count-alarm" \
--namespace "YourApp/Metrics" \
--metric-name "RequestCount" \
--dimensions Name=ServiceName,Value=api-service \
--statistic Sum \
--period 60 \
--evaluation-periods 3 \
--datapoints-to-alarm 2 \
--threshold 1000 \
--comparison-operator GreaterThanThreshold \
--treat-missing-data notBreaching
Confirm the Evaluation Period and Datapoints-to-Alarm Settings
Even when the metric is publishing correctly, an overly strict evaluation configuration can cause alarms to linger in INSUFFICIENT_DATA when metrics are sparse or have gaps during deployment.
The key interaction to understand: EvaluationPeriods defines how many of the most recent periods CloudWatch looks at, and DatapointsToAlarm defines how many of those periods must be breaching for the alarm to go to ALARM. Neither setting guarantees that data exists. If your metric only emits data when there is traffic (a request-count metric, for example), and your deployment caused a brief traffic gap, all periods in the evaluation window might be empty.
The --treat-missing-data flag controls what happens in that scenario. The options are:
- missing (default): the alarm stays in INSUFFICIENT_DATA if it cannot find enough data points.
- notBreaching: missing data is treated as within threshold. Useful for request-rate or error-rate alarms during low-traffic windows.
- breaching: missing data triggers the alarm. Use this for alarms that should fire if the metric disappears entirely.
- ignore: the alarm's last known state is preserved. Useful when metric gaps are expected and you want to avoid state churn.
For most post-deployment scenarios, setting --treat-missing-data notBreaching on throughput or error-rate alarms is the right call. For alarms that monitor heartbeat-style metrics (a metric that should always be non-zero if the service is up), use breaching so a metric gap triggers the alarm rather than silencing it.
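To make the four options concrete, the sketch below encodes only the simplest case: every data point in the evaluation window is missing, which is the typical post-deployment gap. It is deliberately not a model of CloudWatch's full evaluation logic, which also looks at additional earlier data points when the window is only partly empty. The function name state_when_all_missing is an assumption made for this example.

```python
def state_when_all_missing(treat_missing_data: str, current_state: str) -> str:
    """Alarm state when every datapoint in the evaluation window is missing.

    Covers only the all-missing case; partially missing windows follow
    more involved rules that this sketch does not attempt to reproduce.
    """
    outcomes = {
        "missing": "INSUFFICIENT_DATA",
        "notBreaching": "OK",
        "breaching": "ALARM",
        "ignore": current_state,  # the alarm keeps whatever state it already had
    }
    if treat_missing_data not in outcomes:
        raise ValueError(f"unknown treat-missing-data value: {treat_missing_data!r}")
    return outcomes[treat_missing_data]


for mode in ("missing", "notBreaching", "breaching", "ignore"):
    print(mode, "->", state_when_all_missing(mode, current_state="OK"))
```

Note that ignore does not rescue an alarm that is already in INSUFFICIENT_DATA: it preserves that state just as it would preserve OK or ALARM.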
IAM Permissions Blocking Metric Ingestion
Custom metrics published via PutMetricData require the calling identity to have cloudwatch:PutMetricData in its IAM policy. If your deployment rotated the IAM role attached to your ECS task, Lambda function, or EC2 instance profile, and the new role is missing this permission, every PutMetricData call is rejected with an access-denied error. That error is easy to miss: metrics are often published from a background thread, an agent, or a wrapper library that catches the exception and logs a warning at most, so the application looks healthy while no data reaches CloudWatch.
Check the effective permissions for the role your service is using:
aws iam simulate-principal-policy \
--policy-source-arn arn:aws:iam::123456789012:role/YourTaskRole \
--action-names cloudwatch:PutMetricData \
--resource-arns "*"
If the result shows implicitDeny or explicitDeny, add the permission to the role's policy. The minimum required statement looks like this:
{
"Effect": "Allow",
"Action": "cloudwatch:PutMetricData",
"Resource": "*"
}
Note that PutMetricData does not support resource-level permissions — the resource must be *. Attempting to scope it to a specific namespace ARN will cause the policy to have no effect.
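If you still want to limit which namespaces a role can write to, IAM offers the cloudwatch:namespace condition key for this purpose. The sketch below builds a complete policy document that keeps Resource as * and restricts the namespace through a condition. The function name put_metric_data_policy is a hypothetical helper, and the namespace is the placeholder used throughout this article.

```python
import json


def put_metric_data_policy(namespace: str) -> dict:
    """Build a policy allowing PutMetricData only for one namespace."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "cloudwatch:PutMetricData",
                # Resource stays "*"; the namespace is restricted by the condition.
                "Resource": "*",
                "Condition": {
                    "StringEquals": {"cloudwatch:namespace": namespace}
                },
            }
        ],
    }


print(json.dumps(put_metric_data_policy("YourApp/Metrics"), indent=2))
```

Keep in mind that a namespace condition becomes one more thing that can drift during a deployment: if the application starts publishing to a different namespace, this policy will deny the calls.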
Also verify your application is not swallowing the error. Many CloudWatch SDK wrappers catch exceptions internally and log a warning without re-throwing. Add explicit error handling around your put_metric_data calls during debugging:
import boto3
from botocore.exceptions import ClientError

client = boto3.client("cloudwatch", region_name="us-east-1")

try:
    client.put_metric_data(
        Namespace="YourApp/Metrics",
        MetricData=[
            {
                "MetricName": "RequestCount",
                "Dimensions": [{"Name": "ServiceName", "Value": "api-service"}],
                "Value": 1.0,
                "Unit": "Count",
            }
        ],
    )
except ClientError as e:
    print(f"Failed to publish metric: {e.response['Error']['Code']} - {e.response['Error']['Message']}")
    raise
This is the same pattern worth applying when you are chasing issues like ECS task failures where API calls block silently during container startup — make the error surface early rather than hunting it downstream.
Alarm Configuration Pitfalls to Watch For
Region mismatch
CloudWatch alarms exist in a specific region and can only evaluate metrics in the same region. If your deployment moved a service to a different region (or if Terraform/CDK deployed the alarm to the wrong region), the metric and alarm will never meet. Always confirm both the alarm and the metric resource are in the same region using --region flags on your CLI commands.
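One quick way to check this is to read the region out of the alarm's ARN, which describe-alarms returns in the AlarmArn field. The sketch below parses the standard ARN layout and compares it with the region you expect the metric to live in. The function names and the account ID are illustrative assumptions.

```python
def arn_region(arn: str) -> str:
    """Extract the region field from an ARN (arn:partition:service:region:account:resource)."""
    parts = arn.split(":", 5)
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return parts[3]


def alarm_matches_region(alarm_arn: str, metric_region: str) -> bool:
    """True when the alarm lives in the region where the metric is published."""
    return arn_region(alarm_arn) == metric_region


# Hypothetical alarm ARN using the alarm name from the earlier examples.
alarm_arn = "arn:aws:cloudwatch:us-east-1:123456789012:alarm:api-request-count-alarm"

print(arn_region(alarm_arn))                         # us-east-1
print(alarm_matches_region(alarm_arn, "eu-west-1"))  # False
```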
Metric math alarms with one broken expression
If your alarm uses a metric math expression that references multiple metrics, the entire expression returns INSUFFICIENT_DATA if any one of the input metrics has no data. Identify which sub-metric is missing by querying each one independently with get-metric-statistics before assuming the expression itself is wrong.
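To find the inputs worth querying, you can pull them out of the alarm definition itself. In the describe-alarms output, a metric math alarm carries a Metrics list in which entries with a MetricStat are real metrics and entries with an Expression are math. The sketch below separates the two. The alarm shown is a hand-written, hypothetical example, including the ErrorCount metric name.

```python
from typing import List


def input_metrics(alarm: dict) -> List[dict]:
    """Return one query description per real metric referenced by a metric math alarm."""
    queries = []
    for entry in alarm.get("Metrics", []):
        stat = entry.get("MetricStat")
        if stat is None:
            continue  # an Expression entry, not a metric to query
        metric = stat["Metric"]
        queries.append(
            {
                "Id": entry["Id"],
                "Namespace": metric["Namespace"],
                "MetricName": metric["MetricName"],
                "Dimensions": metric.get("Dimensions", []),
                "Period": stat["Period"],
                "Stat": stat["Stat"],
            }
        )
    return queries


dims = [{"Name": "ServiceName", "Value": "api-service"}]
alarm = {
    "AlarmName": "api-error-rate-alarm",  # hypothetical alarm
    "Metrics": [
        {"Id": "errors", "ReturnData": False, "MetricStat": {
            "Metric": {"Namespace": "YourApp/Metrics", "MetricName": "ErrorCount", "Dimensions": dims},
            "Period": 60, "Stat": "Sum"}},
        {"Id": "requests", "ReturnData": False, "MetricStat": {
            "Metric": {"Namespace": "YourApp/Metrics", "MetricName": "RequestCount", "Dimensions": dims},
            "Period": 60, "Stat": "Sum"}},
        {"Id": "rate", "Expression": "100 * errors / requests", "ReturnData": True},
    ],
}

for query in input_metrics(alarm):
    print(query["Id"], query["Namespace"], query["MetricName"])
```

Each returned dictionary has what you need to run get-metric-statistics for that input on its own.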
EC2 detailed monitoring not enabled
Basic EC2 monitoring publishes metrics at five-minute intervals. If your alarm's period is 60 seconds with a three-period evaluation window, at most one of those periods can contain a data point, and often none will. Either enable detailed monitoring on the instance (which publishes at one-minute intervals) or increase the alarm's period to 300 seconds to match the basic monitoring cadence.
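The arithmetic behind this is worth spelling out. The sketch below computes an upper bound on how many periods in the evaluation window can contain a data point, given how often the metric is published. The function name max_periods_with_data is an assumption for this example, and the calculation only bounds the count; it does not model CloudWatch's evaluation.

```python
def max_periods_with_data(period: int, evaluation_periods: int, publish_interval: int) -> int:
    """Upper bound on periods in the evaluation window that can hold a datapoint.

    All arguments are in seconds except evaluation_periods, which is a count.
    """
    window = period * evaluation_periods
    # Ceiling division: the most datapoints that can land inside the window.
    max_datapoints = -(-window // publish_interval)
    return min(evaluation_periods, max_datapoints)


# 60-second alarm period against five-minute basic monitoring.
print(max_periods_with_data(period=60, evaluation_periods=3, publish_interval=300))   # 1
# Alarm period raised to 300 seconds to match the publishing cadence.
print(max_periods_with_data(period=300, evaluation_periods=3, publish_interval=300))  # 3
# Detailed monitoring: one-minute metrics with a 60-second period.
print(max_periods_with_data(period=60, evaluation_periods=3, publish_interval=60))    # 3
```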
Alarm created before the first metric data point
If Infrastructure-as-Code (CDK, Terraform, CloudFormation) creates the alarm before your application has published any metric data, which is typical, the alarm will be in INSUFFICIENT_DATA initially. This is expected and will self-resolve once data points arrive. Wait at least one full evaluation window (the period multiplied by the number of evaluation periods) before concluding there is a problem. For a three-period alarm with a 60-second period, that is a three-minute wait after the first data point appears.
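If you want to put a number on that wait in a deployment script, the following sketch computes the evaluation window and the time after which a lingering INSUFFICIENT_DATA state deserves investigation. It encodes the rule of thumb from this section rather than any CloudWatch guarantee, and both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def evaluation_window(period_seconds: int, evaluation_periods: int) -> timedelta:
    """Length of the window the alarm evaluates: period x evaluation periods."""
    return timedelta(seconds=period_seconds * evaluation_periods)


def investigate_after(first_datapoint: datetime, period_seconds: int, evaluation_periods: int) -> datetime:
    """Time after which a lingering INSUFFICIENT_DATA state is worth investigating."""
    return first_datapoint + evaluation_window(period_seconds, evaluation_periods)


first_seen = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(evaluation_window(60, 3))              # 0:03:00
print(investigate_after(first_seen, 60, 3))  # 2024-01-01 12:03:00+00:00
```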
Cross-account metric alarms
If you are monitoring metrics from a different AWS account, CloudWatch requires a cross-account sharing setup via CloudWatch Observability Access Manager. Missing this setup means the alarm simply cannot see the metric, and it will stay in INSUFFICIENT_DATA indefinitely. Check whether the source account has sharing enabled and that your account is listed as a monitoring account.
Common Pitfalls and Edge Cases
One subtle issue: CloudWatch retains metric data for different durations depending on resolution. High-resolution custom metrics (sub-60-second) are retained for three hours. One-minute data points are kept for 15 days, five-minute aggregates for 63 days, and one-hour aggregates for 15 months. If your alarm uses a very short period (10 or 30 seconds), and no data has been published in the last three hours, the alarm will have no data to evaluate. This is easy to miss after a long deployment window or a maintenance period.
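As a quick reference, the sketch below maps a period to the retention tier CloudWatch documents for data points at that granularity: three hours below 60 seconds, 15 days at one minute, 63 days at five minutes, and 455 days (15 months) at one hour. The function name retention_for_period is an assumption for this example.

```python
from datetime import timedelta


def retention_for_period(period_seconds: int) -> timedelta:
    """How long datapoints at the given period remain queryable."""
    if period_seconds < 60:
        return timedelta(hours=3)   # high-resolution datapoints
    if period_seconds < 300:
        return timedelta(days=15)   # one-minute datapoints
    if period_seconds < 3600:
        return timedelta(days=63)   # five-minute aggregates
    return timedelta(days=455)      # one-hour aggregates (15 months)


for period in (10, 60, 300, 3600):
    print(period, retention_for_period(period))
```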
Another common trap: composite alarms that depend on child alarms in INSUFFICIENT_DATA. A composite alarm treats INSUFFICIENT_DATA from a child alarm as neither ALARM nor OK, which can leave the composite in INSUFFICIENT_DATA too. Fix the child alarms first before investigating the composite.
If you are running ECS tasks behind an ALB with health-check flapping, be aware that task restarts create gaps in CloudWatch metric streams. Your ECS task-level metrics (like memory and CPU) will have missing data points every time a task is replaced, which can cause alarms watching those metrics to dip into INSUFFICIENT_DATA during rolling deployments. Setting treat-missing-data to ignore or notBreaching is usually the right move for those alarms.
Finally, check your CloudWatch alarm history — it often tells you exactly when the state changed and why:
aws cloudwatch describe-alarm-history \
--alarm-name "api-request-count-alarm" \
--history-item-type StateUpdate \
--start-date $(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%SZ)
The HistorySummary field in the output gives a short description of the transition, and the HistoryData field (a JSON string) carries the state reason, typically a human-readable explanation like "Threshold Crossed: no datapoints were received for N periods". That explanation helps you tell an evaluation-period problem from a metric publishing problem.
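When there are many history items, extracting the reason programmatically is quicker than reading raw JSON. The sketch below assumes each item's HistoryData is a JSON string containing a newState object with stateValue and stateReason keys. That shape matches what state-update history items commonly look like, but treat it as an assumption and check it against your own output. The sample item is hand-written.

```python
import json
from typing import Optional, Tuple


def state_reason(history_item: dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (new state, reason) from one describe-alarm-history item."""
    data = json.loads(history_item["HistoryData"])
    new_state = data.get("newState", {})
    return new_state.get("stateValue"), new_state.get("stateReason")


# Hand-written sample item; the HistoryData layout is an assumption.
item = {
    "AlarmName": "api-request-count-alarm",
    "HistoryItemType": "StateUpdate",
    "HistorySummary": "Alarm updated from OK to INSUFFICIENT_DATA",
    "HistoryData": json.dumps(
        {
            "version": "1.0",
            "oldState": {"stateValue": "OK"},
            "newState": {
                "stateValue": "INSUFFICIENT_DATA",
                "stateReason": "Insufficient Data: 3 datapoints were unknown.",
            },
        }
    ),
}

print(state_reason(item))
```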
If your setup involves VPC endpoints for CloudWatch, also verify routing is correct. A broken VPC endpoint can prevent PutMetricData calls from a private subnet from ever reaching the service. The pattern is similar to what you would see with VPC endpoint routing failures that silently break S3 access: the calls time out or fail, and if the application swallows those errors, it looks healthy while nothing reaches CloudWatch.
Next Steps
Work through this checklist in order to resolve INSUFFICIENT_DATA systematically:
- Run get-metric-statistics to confirm the metric is publishing and matches the exact namespace, metric name, and dimensions the alarm expects.
- Run iam simulate-principal-policy for the service's IAM role to confirm cloudwatch:PutMetricData is allowed.
- Review the alarm's EvaluationPeriods, DatapointsToAlarm, and TreatMissingData settings; adjust treat-missing-data to notBreaching for throughput alarms.
- Check describe-alarm-history to read CloudWatch's own explanation of the state transition.
- If the metric is correct and permissions are valid, enable detailed monitoring on EC2 instances or increase the alarm's period to match the metric's actual resolution.
Frequently Asked Questions
How long should I wait for a CloudWatch alarm to leave INSUFFICIENT_DATA after a new deployment?
Wait at least one full evaluation window (the period multiplied by the number of evaluation periods) after the first metric data point appears. For a three-period alarm with a 60-second period, that means roughly three to four minutes after your application starts publishing metrics. If it has not resolved by then, the problem is not a timing issue.
Can a CloudWatch alarm go to INSUFFICIENT_DATA even when the metric is publishing data?
Yes. The most common reason is a namespace or dimension mismatch — the alarm is looking for a slightly different namespace string or dimension key/value than what the application actually publishes. Use list-metrics and describe-alarms together to compare them character-by-character.
Does setting treat-missing-data to notBreaching hide real problems?
It depends on what the metric represents. For request-rate or throughput metrics that are zero during low-traffic windows, notBreaching is appropriate. For heartbeat-style metrics that should always have data if the service is healthy, use breaching so a gap in data triggers the alarm rather than silencing it.
Why would a custom metric published by an ECS task show no data in CloudWatch?
The most likely cause is a missing cloudwatch:PutMetricData permission on the ECS task role. The API rejects the call with an access-denied error, but a wrapper library or background publisher may catch it and log a warning at most, so the application appears healthy while no metrics reach CloudWatch. Use iam simulate-principal-policy to verify the permission and add explicit error handling around put_metric_data calls.
Do CloudWatch composite alarms stay in INSUFFICIENT_DATA if a child alarm is in INSUFFICIENT_DATA?
Yes. A composite alarm cannot evaluate its rule expression if any referenced child alarm is in INSUFFICIENT_DATA because that state is treated as neither ALARM nor OK. Fix the child alarms first; the composite alarm will resolve on its own once all children have a definitive state.