Fixing ECS Task Failures When Secrets Manager Timeouts Block Container Startup
Your ECS service won't start. The task cycles through PROVISIONING → PENDING → STOPPED in seconds, and the only clue in the console is a cryptic ResourceInitializationError. Nine times out of ten, Secrets Manager is the bottleneck: the container agent tried to fetch your secrets before startup and timed out waiting for a response that never came.
This article walks through every common cause of that timeout, how to verify which one you're hitting, and the exact fix for each.
What You'll Learn
- How ECS injects secrets at startup and why a timeout stops the whole task
- Where to find the actual error message buried in ECS and CloudWatch logs
- How to fix networking issues that block Secrets Manager access (VPC endpoints, NAT)
- How to audit and repair execution role permissions in under five minutes
- How to handle throttling and malformed ARNs before they bite you in production
What Actually Happens at ECS Task Startup
Before your container process runs a single line of code, the ECS container agent performs a secrets injection step. When your task definition references a secret like this:
{
  "secrets": [
    {
      "name": "DB_PASSWORD",
      "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-password-AbCdEf"
    }
  ]
}
...ECS calls the Secrets Manager API on behalf of your task, resolves the secret value, and injects it as an environment variable into the container. This all happens during the PENDING phase, before your application code starts. If that API call fails, whether due to a network issue, a permissions error, or a malformed ARN, ECS marks the task as STOPPED immediately and your container never runs.
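From the application's point of view, an injected secret is just an environment variable. As a minimal sketch, assuming a Python application and the DB_PASSWORD name from the task definition above, a fail-fast reader might look like this (the helper name read_required_env is hypothetical, not part of any AWS SDK):

```python
import os


def read_required_env(name: str) -> str:
    """Return an injected environment variable, or fail fast with a clear message."""
    value = os.environ.get(name)
    if not value:
        # If ECS had failed to inject the secret, the task would never have
        # started, so reaching this branch usually means a misnamed variable.
        raise RuntimeError(f"environment variable {name} is missing or empty")
    return value
```

Calling read_required_env("DB_PASSWORD") once at startup surfaces a misnamed variable immediately, rather than at the first database connection.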
The timeout window is short (roughly 2 minutes for Fargate tasks). If Secrets Manager doesn't respond within that window, ECS gives up.
Reading the Right Logs: Where the Error Hides
The ECS console gives you a one-line error, but the useful detail is in two places.
Stopped task reason in the ECS console
Go to your cluster, click the stopped task, and expand the Stopped reason field. You'll see something like:
ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve secret from asm: service call has been retried 5 time(s): failed to fetch secret arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-password-AbCdEf: RequestError: send request failed caused by: Post "https://secretsmanager.us-east-1.amazonaws.com/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
That last part, context deadline exceeded or Client.Timeout exceeded, confirms a network-level timeout, not a permissions denial. A permissions error looks different; it says AccessDeniedException.
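If you triage these failures often, the distinction can be encoded in a few lines. The sketch below is a rough heuristic based only on the substrings discussed in this article; the function name and category labels are made up for illustration, and real stopped reasons may contain wording this does not cover.

```python
def classify_stopped_reason(reason: str) -> str:
    """Map an ECS stopped reason string to a likely failure category."""
    text = reason.lower()
    if "accessdeniedexception" in text:
        return "permissions"
    if "resourcenotfoundexception" in text:
        return "bad-arn"
    if "context deadline exceeded" in text or "client.timeout exceeded" in text:
        return "network"
    return "unknown"
```

Feeding it the stopped reason shown above yields "network", which points you at routing, VPC endpoints, and security groups rather than IAM.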
ECS agent logs on the container instance
On the EC2 launch type (not Fargate), the ECS agent writes its log to /var/log/ecs/ecs-agent.log on the instance. SSH in and grep for your task ID or the string asm to find the relevant lines. On Fargate, you're limited to the stopped task reason in the console, so read it carefully before the task record expires.
Root Cause 1: No VPC Endpoint and No NAT Gateway
This is the most common cause on Fargate tasks running in private subnets. Secrets Manager is a public AWS API endpoint. If your task runs in a private subnet with no NAT gateway and no VPC interface endpoint for Secrets Manager, the API call has nowhere to go.
How to verify
Check your subnet's route table. If the default route (0.0.0.0/0) points nowhere useful (no NAT gateway, no internet gateway), outbound traffic to AWS public endpoints is simply dropped.
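The same check can be expressed as code. This is a sketch that assumes you already have one route table as a dictionary shaped like an entry in the output of aws ec2 describe-route-tables (fields Routes, DestinationCidrBlock, NatGatewayId, GatewayId, State); the function name is hypothetical.

```python
from typing import Any, Dict


def has_default_egress(route_table: Dict[str, Any]) -> bool:
    """Return True if the table has an active 0.0.0.0/0 route to a NAT or internet gateway."""
    for route in route_table.get("Routes", []):
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        if route.get("State") != "active":
            # A "blackhole" route points at a target that no longer exists.
            continue
        gateway_id = route.get("GatewayId", "")
        if route.get("NatGatewayId") or gateway_id.startswith("igw-"):
            return True
    return False
```

A True result is necessary but not sufficient: a route to an internet gateway only helps if the task also has a public IP, and this sketch does not check that.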
Fix option A: Add a VPC endpoint (recommended for private subnets)
Create an Interface VPC Endpoint for com.amazonaws.us-east-1.secretsmanager in your VPC. Attach it to the subnets your ECS tasks use, and attach a security group that allows HTTPS (port 443) inbound from your task's security group.
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc1234 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.secretsmanager \
  --subnet-ids subnet-0aaa1111 subnet-0bbb2222 \
  --security-group-ids sg-0endpoint1234 \
  --private-dns-enabled
The --private-dns-enabled flag is critical. Without private DNS, your task resolves secretsmanager.us-east-1.amazonaws.com to a public IP and the traffic still tries to leave the VPC. With private DNS enabled, the hostname resolves to the endpoint's private IPs inside your VPC. Private DNS only works if the VPC itself has DNS resolution and DNS hostnames enabled.
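One way to confirm that private DNS is working is to resolve the hostname from inside the VPC and check whether every address is private. The sketch below uses only the Python standard library; the function names are hypothetical, and it has to run from a host or container in the same VPC to be meaningful.

```python
import ipaddress
import socket
from typing import Iterable, List


def all_private(addresses: Iterable[str]) -> bool:
    """Return True if there is at least one address and all of them are private."""
    addrs = list(addresses)
    return bool(addrs) and all(ipaddress.ip_address(a).is_private for a in addrs)


def resolve_ipv4(hostname: str) -> List[str]:
    """Resolve a hostname to its distinct IPv4 addresses."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})
```

Inside the VPC, all_private(resolve_ipv4("secretsmanager.us-east-1.amazonaws.com")) should be True once the endpoint and private DNS are in place. If it is False, the hostname is still resolving to public addresses.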
Fix option B: Add a NAT gateway
If your architecture already uses a NAT gateway for other outbound traffic, make sure your ECS task's subnet routes through it. A common mistake is putting the task in a subnet whose route table doesn't include the NAT gateway entry, even though one exists in the VPC.
Note that if you also pull images from ECR, you'll need VPC endpoints for ecr.api, ecr.dkr, and s3 (for the image layers) too. Alternatively, the same NAT gateway covers all of them at once.
Root Cause 2: Execution Role Missing Secrets Manager Permissions
The ECS execution role (not the task role) is what the container agent uses to fetch secrets. These are two different IAM roles, and it's easy to add permissions to the wrong one.
How to verify
In the ECS console, open your task definition and note the Task execution role ARN. Open that role in IAM and check its attached policies. You need at minimum:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue"
      ],
      "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/*"
    }
  ]
}
Scope the Resource to your specific secret ARN or a path prefix; don't use * in production. If the error message says AccessDeniedException rather than a timeout, this is your problem.
The gotcha: secret ARN suffix
Secrets Manager appends a random 6-character suffix to every secret ARN (e.g., -AbCdEf). If your IAM policy specifies the exact full ARN without a wildcard, and you delete and recreate the secret, the suffix changes and your policy no longer matches. Use a trailing wildcard on the suffix:
arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-password-*
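To see why the wildcard matters, here is a simplified sketch of IAM-style resource matching, where * matches any run of characters and ? matches exactly one. It is an illustration only: real IAM evaluation also considers explicit denies, conditions, and other policy types, and the helper names are hypothetical.

```python
import re


def iam_pattern_to_regex(pattern: str):
    """Translate an IAM resource pattern (* and ? wildcards) into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def resource_matches(pattern: str, arn: str) -> bool:
    """Return True if the whole ARN matches the IAM resource pattern."""
    return iam_pattern_to_regex(pattern).fullmatch(arn) is not None
```

With this, a policy pinned to the -AbCdEf ARN does not match a recreated secret whose ARN ends in a different suffix, while the db-password-* pattern matches both. Note that db-password-* would also match a secret named db-password-replica; a pattern ending in -?????? (six question marks) is tighter.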
Root Cause 3: The Secret ARN or Name Is Wrong
A typo in the valueFrom field produces a different error than a timeout, but it's worth checking if you've ruled out networking and permissions. The error usually reads ResourceNotFoundException.
Verify the exact ARN by running:
aws secretsmanager describe-secret \
  --secret-id prod/db-password \
  --region us-east-1
Copy the ARN field from the output verbatim into your task definition. Don't rely on constructing the ARN by hand; the suffix will always catch you out.
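A quick structural check can catch hand-built ARNs before they reach a task definition. The sketch below is a heuristic under stated assumptions: it expects a plain secret ARN ending in a hyphen plus six alphanumeric characters, it does not handle the optional JSON-key or version fields that ECS lets you append to valueFrom, and a secret whose own name happens to end in a hyphen plus six characters would pass even without the suffix. The names used are hypothetical.

```python
import re
from typing import Dict, Optional

# Partition, region, 12-digit account ID, secret name, 6-character suffix.
_SECRET_ARN = re.compile(
    r"^arn:(?P<partition>aws[a-z-]*):secretsmanager:"
    r"(?P<region>[a-z0-9-]+):(?P<account>\d{12}):secret:"
    r"(?P<name>.+)-(?P<suffix>[A-Za-z0-9]{6})$"
)


def parse_secret_arn(arn: str) -> Optional[Dict[str, str]]:
    """Split a Secrets Manager ARN into its parts, or return None if it looks malformed."""
    match = _SECRET_ARN.match(arn)
    return match.groupdict() if match else None
```

Comparing the parsed region against the region your task runs in is a cheap way to spot cross-region mistakes.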
You can also reference secrets by name instead of ARN in your task definition, but using the full ARN is more explicit and avoids cross-region ambiguity.
Root Cause 4: KMS Key Permissions Are Missing
If your secret is encrypted with a customer-managed KMS key (not the default AWS-managed key), the execution role also needs permission to use that key for decryption. Secrets Manager calls KMS under the hood when resolving the secret value.
Add the following to your execution role's policy:
{
  "Effect": "Allow",
  "Action": [
    "kms:Decrypt"
  ],
  "Resource": "arn:aws:kms:us-east-1:123456789012:key/your-key-id"
}
You may also need to update the KMS key policy itself to allow the execution role as a key user, depending on how the key policy is structured. Check the key policy in the KMS console under Key users.
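To decide whether this root cause can apply at all, look at the KmsKeyId field in the output of aws secretsmanager describe-secret. As a sketch, assuming the field is absent when the secret uses the AWS-managed aws/secretsmanager key, a check might look like this (the function name is hypothetical):

```python
def uses_customer_managed_key(secret_description: dict) -> bool:
    """Return True if a describe-secret result points at a customer-managed KMS key."""
    key = secret_description.get("KmsKeyId")
    if not key:
        # No key listed: assumed to mean the AWS-managed default key.
        return False
    # An explicit reference to the default alias is still the AWS-managed key.
    return not key.endswith("alias/aws/secretsmanager")
```

If this returns True, the execution role needs kms:Decrypt on that key in addition to secretsmanager:GetSecretValue.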
Root Cause 5: Secrets Manager Throttling
Secrets Manager has API rate limits. If you're running many tasks simultaneously (say, during a large-scale deployment or an autoscaling event), the burst of GetSecretValue calls can hit the service quota and cause retries that eat into the startup timeout window.
How to detect throttling
Open CloudWatch metrics for Secrets Manager and look at the ThrottledRequests metric in the AWS/SecretsManager namespace. A spike coinciding with your deployment confirms throttling.
Mitigation strategies
- Cache secrets at the application layer. Fetch the secret once at startup, not on every request. The AWS Secrets Manager caching clients for Python, Java, and Go handle this automatically.
- Stagger deployments. Instead of replacing all tasks at once, use a rolling update with a lower maximumPercent so tasks start in batches.
- Use Parameter Store for non-sensitive config. Configuration that changes rarely and isn't secret can live in SSM Parameter Store instead, which relieves pressure on Secrets Manager. Check Parameter Store's throughput quota first; its default limit is low unless you enable the higher-throughput setting.
- Request a quota increase. For sustained high-scale workloads, open a support case to raise the Secrets Manager API TPS limit for your account.
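To get a feel for how maximumPercent bounds the burst, here is a small arithmetic sketch. It assumes the ceiling on running tasks is the desired count multiplied by maximumPercent, rounded down, and ignores everything else the scheduler considers, so treat the result as an upper-bound estimate. The function name is hypothetical.

```python
def max_new_tasks_at_once(desired_count: int, maximum_percent: int, running_old: int) -> int:
    """Estimate how many new tasks a rolling update can start in one batch."""
    # Upper limit on running tasks during the deployment, rounded down.
    ceiling = desired_count * maximum_percent // 100
    return max(0, ceiling - running_old)
```

For a service with 20 desired tasks, all 20 still running, a maximumPercent of 200 allows up to 20 new tasks (and 20 bursts of secret fetches) at once, while 125 allows only 5.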
Common Pitfalls to Avoid
Mixing up the task role and the execution role
The task execution role is used by the ECS agent during startup (pulling images, fetching secrets). The task role is used by your running application code (calling S3, DynamoDB, etc.). Secrets Manager permissions for startup injection belong on the execution role. Putting them only on the task role won't help, because the agent doesn't use the task role during PENDING.
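A simple lint over the task definition can catch the most basic version of this mistake: containers that declare secrets while no execution role is set at all. The sketch assumes a dictionary shaped like the task definition JSON (executionRoleArn, containerDefinitions, secrets); the function name is hypothetical, and it does not inspect what the role's policies actually allow.

```python
from typing import List


def secrets_without_execution_role(task_definition: dict) -> List[str]:
    """Return names of containers that declare secrets when no execution role is set."""
    if task_definition.get("executionRoleArn"):
        return []
    return [
        container.get("name", "<unnamed>")
        for container in task_definition.get("containerDefinitions", [])
        if container.get("secrets")
    ]
```

An empty list means the structural requirement is met; you still need to audit the role's permissions as described under Root Cause 2.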
Forgetting the security group on your VPC endpoint
Interface VPC endpoints have their own security groups. If you create the endpoint but its security group doesn't allow inbound HTTPS from your task's security group, the packets are silently dropped. That produces a timeout, not a clear error message. Always explicitly allow inbound TCP 443 on the endpoint security group from the task security group.
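This rule can be checked mechanically. The sketch below assumes a dictionary shaped like one entry from aws ec2 describe-security-groups (IpPermissions, IpProtocol, FromPort, ToPort, UserIdGroupPairs) and only looks at rules that reference the task's security group by ID; CIDR-based rules are deliberately ignored. The function name is hypothetical.

```python
def allows_https_from(endpoint_sg: dict, task_sg_id: str) -> bool:
    """Return True if the endpoint SG has an inbound rule covering TCP 443 from the task SG."""
    for perm in endpoint_sg.get("IpPermissions", []):
        protocol = perm.get("IpProtocol")
        if protocol == "-1":
            covers_443 = True  # "-1" means all protocols and ports
        elif protocol == "tcp":
            covers_443 = perm.get("FromPort", 0) <= 443 <= perm.get("ToPort", 65535)
        else:
            covers_443 = False
        if not covers_443:
            continue
        pairs = perm.get("UserIdGroupPairs", [])
        if any(pair.get("GroupId") == task_sg_id for pair in pairs):
            return True
    return False
```

A False result for an endpoint you just created is a strong hint that the timeout comes from the security group rather than from routing.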
Using secrets vs environment in the task definition
Values listed under environment are stored in plaintext in the task definition and are visible to anyone with IAM access to ecs:DescribeTaskDefinition. Always use the secrets block for sensitive values. The injection mechanism is different, and only the secrets block triggers the Secrets Manager API call at startup.
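A rough lint can flag plaintext values that probably belong in the secrets block. The sketch below assumes a dictionary shaped like the task definition JSON and uses a name-based pattern that is purely a guess at what looks sensitive; adjust it to your own naming conventions. The function and pattern names are hypothetical.

```python
import re
from typing import List, Tuple

# Name fragments that often indicate a sensitive value (an assumption, not a standard).
_SENSITIVE = re.compile(r"(PASSWORD|SECRET|TOKEN|API_?KEY|PRIVATE_?KEY)", re.IGNORECASE)


def plaintext_sensitive_vars(task_definition: dict) -> List[Tuple[str, str]]:
    """Return (container name, variable name) pairs for suspicious environment entries."""
    findings = []
    for container in task_definition.get("containerDefinitions", []):
        for var in container.get("environment", []):
            if _SENSITIVE.search(var.get("name", "")):
                findings.append((container.get("name", "<unnamed>"), var["name"]))
    return findings
```

Anything it reports is a candidate for moving from environment to secrets, with the value stored in Secrets Manager.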
Neglecting to force a new deployment after fixing the role
IAM changes take effect immediately, but ECS tasks cache the execution role credentials for the duration of the task. If you update the execution role policy and then force a new deployment, your new tasks will pick up the updated permissions. Running tasks don't retroactively benefit from role changes β they need to be replaced.
For teams building reliable deployment pipelines, the same discipline applies to how your release process triggers these rollouts. Consistent, automated deployment steps reduce the chance of a stale task definition slipping through. It's the same principle behind automating semantic versioning in OSS releases: remove the manual step, remove the error.
Next Steps
Here's a concrete checklist to work through the next time an ECS task dies on startup with a Secrets Manager error:
- Read the full stopped task reason in the ECS console. Distinguish between context deadline exceeded (network), AccessDeniedException (permissions), and ResourceNotFoundException (bad ARN).
- Check your subnet's route table. If your task is in a private subnet, confirm you have either a NAT gateway route or a Secrets Manager VPC endpoint with private DNS enabled.
- Audit the execution role, not the task role, for secretsmanager:GetSecretValue and, if needed, kms:Decrypt permissions scoped to the correct resources.
- Verify the secret ARN verbatim using aws secretsmanager describe-secret and paste it directly into your task definition. Don't construct the ARN by hand.
- Monitor ThrottledRequests in CloudWatch if you're seeing intermittent failures during large deployments, and consider staggered rollouts or a secrets caching library to reduce API call volume.
Frequently Asked Questions
Why does my ECS Fargate task stop immediately with ResourceInitializationError on startup?
ResourceInitializationError during ECS task startup usually means the container agent failed to fetch a secret or pull the container image before the task started. The most common cause is that the task's private subnet has no network path to Secrets Manager: either no NAT gateway or no VPC interface endpoint for Secrets Manager with private DNS enabled.
How do I give an ECS task permission to read from Secrets Manager?
You need to add secretsmanager:GetSecretValue permission to the ECS task execution role, not the task role. The execution role is used by the ECS agent during the PENDING phase to inject secrets as environment variables before your container starts running.
What is the difference between the ECS task role and the task execution role?
The task execution role is used by the ECS agent itself during startup to pull container images and fetch secrets. The task role is assumed by your application code once the container is running, for example to call S3 or DynamoDB. Secrets injection permissions must go on the execution role.
Can a VPC endpoint for Secrets Manager fix ECS container startup timeouts?
Yes. If your ECS tasks run in a private subnet without a NAT gateway, creating an Interface VPC Endpoint for Secrets Manager with private DNS enabled gives the container agent a network path to the Secrets Manager API without requiring internet access. You also need to ensure the endpoint's security group allows HTTPS from your task's security group.
Why does Secrets Manager return a timeout instead of an access denied error in ECS?
A timeout means the request to the Secrets Manager API endpoint never received a response: the packets were dropped, or had no route, before AWS could evaluate permissions. An AccessDeniedException means the request reached Secrets Manager but the IAM policy denied it. Always resolve timeouts first by checking networking, then address permission errors separately.