ECR Image Pull Throttling: Fix ECS Deployment Failures

Your ECS deployment is rolling out fine, then a wave of tasks fails with CannotPullContainerError and the whole release stalls. The container image hasn't changed, your ECR repository is intact, and your IAM permissions look correct. The culprit is often ECR's per-account pull rate limits — and once you hit them, every subsequent task in the same burst fails in a cascade.

This article walks you through why throttling happens, how to confirm it's the actual cause, and the concrete steps you can take to prevent it from happening again.

What you'll learn

How ECR enforces pull rate limits and where they bite ECS most
How to read CloudWatch metrics and ECS events to confirm throttling
How to use VPC endpoints to reduce latency and take the NAT gateway out of the pull path
How to configure image caching at the task and cluster level
Operational patterns that keep deployments stable under high concurrency

Prerequisites

You should be familiar with ECS task definitions, ECR repositories, and basic IAM concepts. You'll also want access to CloudWatch Logs and the AWS CLI. The examples use the AWS CLI v2 and assume you're working in a single-region setup first.

Why ECR Throttles Pull Requests

ECR is an authenticated, managed registry backed by S3. It enforces API rate limits per account and per Region on calls such as GetAuthorizationToken, BatchGetImage, and GetDownloadUrlForLayer; the layer bytes themselves are then downloaded from S3. When you deploy or scale dozens of ECS tasks simultaneously, each task independently authenticates and pulls the image. A deployment of 50 tasks across 50 instances can generate hundreds of API calls in a few seconds.
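To get a feel for the numbers, here is a rough back-of-the-envelope sketch. It assumes one GetAuthorizationToken call and one manifest request per task, plus one GetDownloadUrlForLayer call per layer that is not already on the host. Real clients may cache tokens or make extra calls, so treat the result as an order-of-magnitude estimate. The function and class names are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class PullEstimate:
    auth_calls: int
    manifest_calls: int
    layer_url_calls: int

    @property
    def total(self) -> int:
        return self.auth_calls + self.manifest_calls + self.layer_url_calls


def estimate_pull_calls(tasks: int, layers_per_image: int, cached_layers: int = 0) -> PullEstimate:
    """Rough estimate of ECR API calls for a burst of task launches.

    Assumption: one auth call and one manifest call per task, and one
    layer-URL call for every layer that is not already cached on the host.
    """
    if tasks < 0 or layers_per_image < 0 or cached_layers < 0:
        raise ValueError("inputs must be non-negative")
    missing_layers = max(layers_per_image - cached_layers, 0)
    return PullEstimate(
        auth_calls=tasks,
        manifest_calls=tasks,
        layer_url_calls=tasks * missing_layers,
    )


# 50 cold task launches of an 8-layer image.
estimate = estimate_pull_calls(tasks=50, layers_per_image=8)
print(estimate.total)  # 50 + 50 + 400 = 500
```

With every layer already cached on the host, the same burst drops to roughly 100 calls, which is why the caching advice later in this article matters.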

These limits are service quotas with default values, some of which can be raised on request. A sudden burst (a blue-green swap, an auto-scaling event, or a hotfix rollout) can exceed what the account has been granted. The registry then responds with throttling errors, typically HTTP 429 and sometimes 503, and ECS surfaces this as a CannotPullContainerError with a message referencing a throttle or timeout.

Confirming the Cause

Before you change anything, confirm that throttling is actually the problem. There are three places to look.

ECS Service Events

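Open the ECS console, navigate to your service, and check the Events tab for a run of failed task starts. Then open one of the stopped tasks and read its stopped reason. Look for a line like:

CannotPullContainerError: ... 429 Too Many Requests

An event such as "service my-service was unable to place a task because no container instance met all of its requirements" points to a placement or capacity problem, not to ECR throttling, so don't confuse the two.

The 429 code is the clearest signal. A generic timeout without a status code could still be throttling, but check network connectivity first.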
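If you collect these error strings from many tasks, a small classifier helps separate throttling from other pull failures. The sketch below is only an illustration: the marker patterns are assumptions, because the exact wording of pull errors varies by agent and container runtime version, and the sample strings are hypothetical.

```python
import re

# Assumed markers of throttling; adjust them to the messages you actually see.
THROTTLE_PATTERNS = [
    re.compile(r"\b429\b"),
    re.compile(r"too many requests", re.IGNORECASE),
    re.compile(r"toomanyrequests", re.IGNORECASE),
    re.compile(r"rate exceeded", re.IGNORECASE),
    re.compile(r"throttl", re.IGNORECASE),
]
TIMEOUT_PATTERN = re.compile(r"timeout|timed out|deadline exceeded", re.IGNORECASE)


def classify_pull_error(reason: str) -> str:
    """Classify an error string as throttled, timeout, other, or not-a-pull-error."""
    if "CannotPullContainerError" not in reason:
        return "not-a-pull-error"
    if any(pattern.search(reason) for pattern in THROTTLE_PATTERNS):
        return "throttled"
    if TIMEOUT_PATTERN.search(reason):
        return "timeout"
    return "other"


# Hypothetical sample messages for illustration.
samples = [
    "CannotPullContainerError: ... 429 Too Many Requests",
    "CannotPullContainerError: request timed out",
    "Essential container in task exited",
]
for sample in samples:
    print(classify_pull_error(sample))
```

A "timeout" result is deliberately kept separate from "throttled": it tells you to check network connectivity before blaming the registry.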

CloudWatch Metrics

ECR publishes RepositoryPullCount under the AWS/ECR namespace, which shows whether pull volume spiked at the time of the failures. API call volume is published separately as usage metrics (CallCount) in the AWS/Usage namespace, so you can compare it against your quotas. To see the throttling errors themselves, query CloudTrail for BatchGetImage events with an error code of ThrottlingException.

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=BatchGetImage \
  --start-time 2024-01-15T10:00:00Z \
  --end-time 2024-01-15T11:00:00Z \
  --query "Events[?contains(CloudTrailEvent, 'ThrottlingException')]"
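Once you have the events as JSON, it helps to see how the throttles cluster in time. The following sketch assumes you saved the command's output as a list of events, each carrying a CloudTrailEvent field that holds the full record as a JSON string with eventTime and errorCode keys. The sample records are made up for illustration.

```python
import json
from collections import Counter


def throttles_per_minute(events: list) -> Counter:
    """Count ThrottlingException records per minute (UTC)."""
    counts: Counter = Counter()
    for event in events:
        # CloudTrailEvent is a JSON document serialized as a string.
        detail = json.loads(event["CloudTrailEvent"])
        if detail.get("errorCode") != "ThrottlingException":
            continue
        # eventTime looks like 2024-01-15T10:03:12Z; keep it up to the minute.
        counts[detail["eventTime"][:16]] += 1
    return counts


# Made-up sample records standing in for real lookup-events output.
sample_events = [
    {"CloudTrailEvent": json.dumps({"eventName": "BatchGetImage",
                                    "eventTime": "2024-01-15T10:03:12Z",
                                    "errorCode": "ThrottlingException"})},
    {"CloudTrailEvent": json.dumps({"eventName": "BatchGetImage",
                                    "eventTime": "2024-01-15T10:03:40Z",
                                    "errorCode": "ThrottlingException"})},
    {"CloudTrailEvent": json.dumps({"eventName": "BatchGetImage",
                                    "eventTime": "2024-01-15T10:04:01Z"})},
]

per_minute = throttles_per_minute(sample_events)
for minute, count in sorted(per_minute.items()):
    print(minute, count)
```

If the throttled minutes line up with your deployment start times, you have good evidence that the burst itself is the trigger.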

Container Agent Logs

On EC2-backed clusters, the ECS container agent logs at /var/log/ecs/ecs-agent.log include the raw HTTP errors from ECR. SSH into a failing instance and grep for the task ID or the repository URI to find the exact response codes.

Use a VPC Endpoint for ECR

By default, ECS tasks on instances in a private subnet route ECR traffic through a NAT gateway and out to the public ECR endpoint. This path adds latency, costs money per GB, and funnels every pull through the NAT gateway's bandwidth and connection limits, so a large burst can time out in ways that look like throttling. A VPC endpoint short-circuits all of that.

You need two interface endpoints and one gateway endpoint:

com.amazonaws.REGION.ecr.api — for ECR API calls (auth, image manifest)
com.amazonaws.REGION.ecr.dkr — for Docker layer downloads
com.amazonaws.REGION.s3 — gateway endpoint for the S3 bucket backing ECR layers

# Create the ECR API endpoint
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.api \
  --subnet-ids subnet-0def456 subnet-0ghi789 \
  --security-group-ids sg-0jkl012

# Create the ECR Docker endpoint
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.dkr \
  --subnet-ids subnet-0def456 subnet-0ghi789 \
  --security-group-ids sg-0jkl012

# Create the S3 gateway endpoint
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0mno345

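VPC endpoints route traffic over the AWS private network, which tends to be faster and more reliable than the NAT path. They should not be expected to raise ECR's API quotas: the documented per-account limits apply whichever path a request takes. What they remove is the NAT bottleneck, so pull failures caused by saturated NAT capacity stop being mistaken for registry throttling.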
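A quick way to audit a VPC is to compare the endpoint service names it already has against the three listed above. This sketch assumes you pass in the list of service names yourself, for example from the output of aws ec2 describe-vpc-endpoints filtered down to the ServiceName field. The helper names are illustrative.

```python
def required_ecr_endpoints(region: str) -> set:
    """Service names needed for fully private ECR pulls in a Region."""
    prefix = f"com.amazonaws.{region}"
    return {f"{prefix}.ecr.api", f"{prefix}.ecr.dkr", f"{prefix}.s3"}


def missing_ecr_endpoints(region: str, existing_service_names: list) -> list:
    """Return the required service names that the VPC does not have yet."""
    return sorted(required_ecr_endpoints(region) - set(existing_service_names))


# Example: a VPC that has both interface endpoints but no S3 gateway endpoint.
existing = [
    "com.amazonaws.us-east-1.ecr.api",
    "com.amazonaws.us-east-1.ecr.dkr",
]
print(missing_ecr_endpoints("us-east-1", existing))
# ['com.amazonaws.us-east-1.s3']
```

An empty list means all three endpoints exist; it does not verify route tables, security groups, or private DNS, which you still need to check separately.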

Enable Image Caching on EC2 Launch Type

Every time a new EC2 container instance registers and starts pulling images, it downloads everything from scratch unless the image layers are already cached on the host's Docker daemon. You can reduce pull volume by keeping instances warm and avoiding unnecessary image churn.

Avoid Forcing a Fresh Pull

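ECS task definitions have no imagePullPolicy field (that is a Kubernetes setting); on EC2, pull behavior is controlled by the ECS container agent. By default the agent contacts the registry on every task start, and the Docker daemon's layer cache only saves it from downloading layers that are already on disk. Mutable tags such as :latest make it unsafe to trust a cached image, because the same tag can point at a different image after the next build. Switch to immutable image tags (a Git commit SHA or a build number) so that a cached image is guaranteed to match its tag. That guarantee is what lets you safely skip the registry using the agent setting described in the next section.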

{
  "containerDefinitions": [
    {
      "name": "my-app",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:a3f8c21"
    }
  ]
}
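You can lint task definitions for mutable tags before they reach production. The sketch below is a minimal check under stated assumptions: the set of tag names treated as mutable is a guess you should adapt, and an image reference without a tag is treated as :latest, which is Docker's default.

```python
from typing import Optional

# Assumed list of tag names that teams commonly reuse; adapt to your conventions.
MUTABLE_TAGS = {"latest", "stable", "main", "master", "dev"}


def image_tag(image: str) -> Optional[str]:
    """Return the tag of an image reference, or None if it is pinned by digest."""
    if "@" in image:
        return None
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return "latest"  # Docker's default when no tag is given
    return last_segment.rsplit(":", 1)[1]


def mutable_images(task_definition: dict) -> list:
    """List (container name, image) pairs that use a tag from MUTABLE_TAGS."""
    flagged = []
    for container in task_definition.get("containerDefinitions", []):
        tag = image_tag(container["image"])
        if tag is not None and tag in MUTABLE_TAGS:
            flagged.append((container["name"], container["image"]))
    return flagged


pinned = {"containerDefinitions": [
    {"name": "my-app",
     "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:a3f8c21"},
]}
floating = {"containerDefinitions": [
    {"name": "my-app",
     "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:latest"},
]}
print(mutable_images(pinned))    # []
print(mutable_images(floating))  # one flagged container
```

A check like this fits naturally into a CI step that runs before the task definition is registered.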

ECS Image Pull Behavior Setting

In newer versions of the ECS container agent, you can set ECS_IMAGE_PULL_BEHAVIOR in the agent's environment to control whether it prefers the cached image. Set it to prefer-cached to tell the agent to skip the registry pull entirely if the image is already on disk.

# Add to /etc/ecs/ecs.config on the EC2 instance
ECS_IMAGE_PULL_BEHAVIOR=prefer-cached

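This is particularly effective for stable base images. Be careful with it on mutable tags: if a tag is reused for a new build, instances that already hold the old image will keep running it. With immutable tags the problem goes away, because each new version arrives under a new tag that is not in the cache and is therefore pulled.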
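To see why this setting matters, here is a simplified model of the agent's four documented ECS_IMAGE_PULL_BEHAVIOR values. It is a sketch, not the agent's code: it treats "once" and "prefer-cached" the same way (pull only when the image is not cached) and ignores image cleanup and failed pulls.

```python
VALID_BEHAVIORS = {"default", "always", "once", "prefer-cached"}


def pulls_from_registry(behavior: str, image_cached: bool) -> bool:
    """Simplified model: does a task start contact the registry?"""
    if behavior not in VALID_BEHAVIORS:
        raise ValueError(f"unknown ECS_IMAGE_PULL_BEHAVIOR: {behavior!r}")
    if behavior in ("default", "always"):
        # Both attempt a remote pull on every task start.
        return True
    # "once" and "prefer-cached" reuse an image that is already on the host.
    return not image_cached


def registry_pulls_for(behavior: str, launches: int) -> int:
    """Count registry pulls for repeated launches of one image on one instance."""
    cached = False
    pulls = 0
    for _ in range(launches):
        if pulls_from_registry(behavior, cached):
            pulls += 1
        cached = True  # after the first launch the image is on disk
    return pulls


print(registry_pulls_for("default", 10))        # 10
print(registry_pulls_for("prefer-cached", 10))  # 1
```

Under this model, ten launches of the same image on one instance cost ten registry round trips with the default behavior and one with prefer-cached.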

Use Fargate and Pull Caching Correctly

Fargate tasks don't share a Docker layer cache between tasks because each task runs in its own isolated microVM. Every Fargate task launch pulls the full image from ECR. At scale, this is the primary driver of pull throttling on Fargate clusters.

Use Smaller Images

Fewer and smaller layers mean fewer API calls and less data to transfer. A distroless or Alpine-based image pulls faster and makes fewer GetDownloadUrlForLayer requests than a full Debian image with dozens of layers. Consolidate RUN instructions in your Dockerfile to reduce layer count.

# Before: many layers
RUN apt-get update
RUN apt-get install -y curl
RUN apt-get clean

# After: one layer
RUN apt-get update && apt-get install -y curl && apt-get clean
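If you want to track layer count across builds, a rough counter is easy to write. This sketch counts the Dockerfile instructions that add filesystem layers (RUN, COPY, and ADD) and handles backslash line continuations. It does not count layers inherited from the base image and ignores multi-stage subtleties, so treat the number as a relative indicator.

```python
LAYER_INSTRUCTIONS = {"RUN", "COPY", "ADD"}


def count_layer_instructions(dockerfile: str) -> int:
    """Count RUN, COPY and ADD instructions in Dockerfile text."""
    count = 0
    continued = False  # True while inside a backslash-continued instruction
    for raw_line in dockerfile.splitlines():
        line = raw_line.strip()
        if continued:
            continued = line.endswith("\\")
            continue
        if not line or line.startswith("#"):
            continue
        if line.split()[0].upper() in LAYER_INSTRUCTIONS:
            count += 1
        continued = line.endswith("\\")
    return count


before = """
RUN apt-get update
RUN apt-get install -y curl
RUN apt-get clean
"""
after = """
RUN apt-get update && apt-get install -y curl && apt-get clean
"""
print(count_layer_instructions(before))  # 3
print(count_layer_instructions(after))   # 1
```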

Stagger Deployments

If you're replacing a fleet of 100 tasks at once, the burst of concurrent pulls peaks exactly when you can least afford failures. Use ECS rolling update settings to limit how many new tasks start simultaneously.

{
  "deploymentConfiguration": {
    "minimumHealthyPercent": 75,
    "maximumPercent": 125,
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    }
  }
}

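Setting maximumPercent to 125 on a 100-task service caps the service at 125 running and pending tasks, which is 25 above the desired count. Because minimumHealthyPercent at 75 also lets ECS stop up to 25 old tasks early, as many as 50 new tasks can be starting at once. That is still far fewer than replacing all 100 in one step, and it spreads the image pulls over a longer window. To shrink the batch further, raise minimumHealthyPercent or lower maximumPercent.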
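The arithmetic behind these two settings is worth making explicit. The sketch below uses a simplified model: the upper limit on running and pending tasks is the desired count times maximumPercent rounded down, and the lower limit on healthy tasks is the desired count times minimumHealthyPercent rounded up. The gap between the two is an upper bound on how many new tasks can be pulling at once; the scheduler may start fewer.

```python
def rollout_bounds(desired: int, min_healthy_pct: int, max_pct: int) -> dict:
    """Upper bounds on concurrent task starts during a rolling update."""
    upper = desired * max_pct // 100                # rounded down
    lower = -(-desired * min_healthy_pct // 100)    # rounded up
    return {
        "surge_above_desired": upper - desired,
        "max_new_tasks_in_flight": upper - lower,
    }


# The 100-task example with minimumHealthyPercent 75 and maximumPercent 125.
print(rollout_bounds(100, 75, 125))
# {'surge_above_desired': 25, 'max_new_tasks_in_flight': 50}

# A 50-task service with the common 100 / 200 settings.
print(rollout_bounds(50, 100, 200))
# {'surge_above_desired': 50, 'max_new_tasks_in_flight': 50}
```

For the 100-task example, the surge above the desired count is 25, while up to 50 replacements can be in flight because 25 old tasks may be stopped early.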

IAM and Authentication Optimization

Each ECS task that needs to pull from ECR calls GetAuthorizationToken to get a temporary 12-hour credential. At high task launch rates, these token requests add to the API call volume. A few adjustments can reduce this overhead.

Use the ECS Task Execution Role Correctly

Make sure your task execution role has ecr:GetAuthorizationToken, ecr:BatchCheckLayerAvailability, ecr:GetDownloadUrlForLayer, and ecr:BatchGetImage permissions. Missing permissions cause retries, which compound the call volume problem.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ],
      "Resource": "*"
    }
  ]
}

Note that GetAuthorizationToken is an account-level action and cannot be scoped to a specific repository ARN — "Resource": "*" is required for that one action.
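If you prefer least privilege, you can split the policy into two statements: GetAuthorizationToken on "*" and the three image actions scoped to one repository. The sketch below generates such a policy. The account ID, Region, and repository name in the example are placeholders, and the helper name is illustrative.

```python
import json


def execution_role_ecr_policy(region: str, account_id: str, repository: str) -> dict:
    """Build an ECR pull policy with the image actions scoped to one repository."""
    repo_arn = f"arn:aws:ecr:{region}:{account_id}:repository/{repository}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Account-level action: cannot be scoped to a repository.
                "Effect": "Allow",
                "Action": "ecr:GetAuthorizationToken",
                "Resource": "*",
            },
            {
                "Effect": "Allow",
                "Action": [
                    "ecr:BatchCheckLayerAvailability",
                    "ecr:GetDownloadUrlForLayer",
                    "ecr:BatchGetImage",
                ],
                "Resource": repo_arn,
            },
        ],
    }


# Placeholder values for illustration only.
policy = execution_role_ecr_policy("us-east-1", "123456789012", "my-app")
print(json.dumps(policy, indent=2))
```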

Request a Limit Increase

If you're consistently hitting throttle limits during normal operations, request an increase for the affected ECR API rate quotas through Service Quotas, or open a support case if the quota you need is not listed as adjustable. Document your deployment frequency and task count when making the request; specific numbers make it easier for AWS to evaluate.

Common Pitfalls

Using :latest tags in production. A mutable tag can point at a different image after every build, so you can never safely reuse a cached copy and every task start has to go back to the registry. Use immutable tags tied to your build system.

Not enabling the S3 gateway endpoint alongside ECR interface endpoints. ECR layers are stored in S3. Without the S3 endpoint, layer downloads still exit through NAT even if your API calls go through the interface endpoint. All three endpoints are required for fully private pulls.

Setting maximumPercent too high. A value of 200 on a 50-task service means 50 new tasks can all start at once. That's 50 simultaneous full image pulls. Pull this back to 125–150 to spread the load.

Ignoring cross-region pulls. If your ECS cluster is in us-west-2 but you're pulling from an ECR repository in us-east-1, you're paying cross-region data transfer costs and adding latency on every pull. Replicate images to the region where your cluster runs.

Skipping the deployment circuit breaker. Without it, a throttled deployment will keep retrying indefinitely, generating even more pull requests and deepening the throttle hole. Enable the circuit breaker so ECS stops the deployment once failed task launches pass a threshold, which ECS derives from the service's desired count (you cannot set it directly).
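As a rough guide to when the circuit breaker trips, the sketch below assumes the formula described in the AWS documentation at the time of writing: half the desired count, rounded up, clamped between a minimum of 3 and a maximum of 200. Those bounds are assumptions that AWS may change, so they are exposed as parameters.

```python
def circuit_breaker_threshold(desired_count: int, minimum: int = 3, maximum: int = 200) -> int:
    """Estimate the failed-task threshold for the deployment circuit breaker.

    Assumption: threshold = ceil(0.5 * desired_count), clamped to
    [minimum, maximum]. Verify the bounds against current AWS documentation.
    """
    if desired_count < 0:
        raise ValueError("desired_count must be non-negative")
    half_rounded_up = (desired_count + 1) // 2
    return min(max(half_rounded_up, minimum), maximum)


for desired in (1, 25, 400, 800):
    print(desired, circuit_breaker_threshold(desired))
```

The practical point is that on a large service many failed pulls can accumulate before the deployment stops, which is another reason to keep launch batches small.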

Next Steps

Throttling is preventable with the right architecture. Here's where to start:

Check your CloudTrail logs for ThrottlingException on BatchGetImage to confirm the problem before making changes.
Create VPC interface and gateway endpoints for ECR and S3 in every VPC that runs ECS workloads.
Switch all production task definitions from :latest to immutable tags, such as a Git commit SHA or a build number.
Tune minimumHealthyPercent and maximumPercent in your deployment configuration to stagger task launches during rollouts.
If you run EC2-backed clusters with stable images, set ECS_IMAGE_PULL_BEHAVIOR=prefer-cached and measure how much pull traffic drops.
