Fixing AWS ECS Task Networking Failures in awsvpc Mode

June 02, 2026

Your ECS service is deployed, health checks pass, but requests randomly time out or drop, and CloudWatch shows nothing obviously wrong. The culprit is often the awsvpc network mode itself, which gives each task its own Elastic Network Interface (ENI) and brings a set of failure modes that differ from the classic bridge mode you may be more familiar with.

The good news is that these failures follow recognizable patterns. Once you know what to look for, most of them resolve in under an hour.

What you'll learn

  • How awsvpc mode works and why it behaves differently from bridge mode
  • The most common causes of dropped packets and connection timeouts
  • How to check ENI limits, security group rules, and route table configuration
  • How to debug DNS resolution failures inside a task
  • A repeatable checklist to clear networking issues before they reach production

Prerequisites

You should be comfortable with the AWS console and have the AWS CLI installed. The examples below use the standard aws CLI and assume that you can exec into a running task (ECS Exec must be enabled on the service or task, and the task role needs the SSM permissions it relies on). A basic understanding of VPC concepts (subnets, route tables, security groups) is assumed.

How awsvpc Mode Actually Works

In bridge mode, all containers on a host share the host's ENI. In awsvpc mode, ECS provisions a dedicated ENI for each task and attaches it directly to your VPC subnet. The task gets its own private IP address and its own security group membership, and its traffic follows the route table of the subnet the ENI lands in.

This design has real benefits: you can apply fine-grained security group rules per task, and there's no port-mapping gymnastics on the host. But it also means that every networking decision (routing, DNS, security group egress) now lives at the task level, not the instance level. Misconfiguring any one of those layers produces the same symptom: dropped packets.

ENI Limits: The Silent Capacity Problem

Every EC2 instance type has a hard cap on how many ENIs it can attach. When ECS tries to schedule a task and the instance has no free ENI slots, the task fails to start or gets stuck in PROVISIONING. That part is obvious. The subtle problem is when your instance is at the limit and a new task replaces a draining old one: there's a brief window where both ENIs need to exist simultaneously, and provisioning fails silently until the old one detaches.

Checking your ENI headroom

Run the following to see how many network interfaces are attached to a specific instance:

aws ec2 describe-instances \
  --instance-ids i-0abc123def456 \
  --query 'Reservations[].Instances[].NetworkInterfaces[].NetworkInterfaceId' \
  --output text | tr '\t' '\n' | wc -l

Compare that number against the ENI limit for your instance type. You can find per-instance limits in the AWS documentation under "IP addresses per network interface per instance type." A t3.medium supports 3 ENIs; a c5.xlarge supports 4. The instance's primary ENI counts against that limit, so without trunking those instances can host only 2 and 3 awsvpc tasks respectively. If you're running many tasks on a small instance type, you'll hit the wall quickly.
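
As a rough sketch of that comparison, the snippet below looks up the ENI limit for an instance type and subtracts the attached count. The function names eni_limit and eni_headroom are illustrative helpers, not part of any AWS tooling, and the arithmetic assumes ENI trunking is not in use.

```shell
# Maximum number of ENIs for an instance type, as reported by EC2.
# $1 = instance type, e.g. t3.medium
eni_limit() {
  aws ec2 describe-instance-types \
    --instance-types "$1" \
    --query 'InstanceTypes[0].NetworkInfo.MaximumNetworkInterfaces' \
    --output text
}

# Free ENI slots on an instance.
# $1 = ENI limit, $2 = ENIs currently attached (including the primary ENI).
# Prints 0 instead of a negative number if the inputs are inconsistent.
eni_headroom() {
  free=$(( $1 - $2 ))
  if [ "$free" -lt 0 ]; then
    free=0
  fi
  echo "$free"
}
```

For example, eni_headroom "$(eni_limit t3.medium)" 1 reports how many more awsvpc tasks fit on a t3.medium that has only its primary ENI attached.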

Solutions for ENI exhaustion

  • Switch to Fargate, which provisions each task's ENI for you, so per-instance ENI limits no longer apply.
  • Use larger instance types with higher ENI limits, or spread tasks across more instances.
  • Enable ENI trunking through the awsvpcTrunking account setting on a supported instance type. A single trunk ENI then carries many task (branch) ENIs, which dramatically raises effective task density per host.

Security Group Mismatches

This is one of the most common causes of dropped connections after a move to awsvpc. Because each task has its own security group attachment, a rule that was on the instance-level security group in bridge mode is no longer automatically inherited.

A typical scenario: your task needs to call an RDS instance. The RDS security group has an inbound rule that references the EC2 instance's security group by ID. After you migrate to awsvpc, traffic originates from the task's security group, not the instance's. RDS drops the connection at the security group level before TCP even completes the handshake.

Auditing security group rules

# Find the security group attached to a running task
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks <task-arn> \
  --query 'tasks[].attachments[].details' \
  --output json

Look for the networkInterfaceId in the output, then describe that ENI:

aws ec2 describe-network-interfaces \
  --network-interface-ids eni-0abc123 \
  --query 'NetworkInterfaces[].Groups'

You now have the actual security group ID(s) applied to your task. Cross-check these against every downstream resource your task connects to: RDS, ElastiCache, other ECS services, internal ALBs. Any rule that references the old EC2 instance security group ID needs a counterpart rule for the task security group.
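
One way to automate that cross-check is sketched below. It assumes the downstream resource grants access by referencing a security group rather than a CIDR range, and it ignores the protocol column for brevity. The function names inbound_rules and sg_allows are hypothetical helpers, not AWS commands.

```shell
# List the inbound rules of a security group as tab-separated rows:
# FromPort  ToPort  ReferencedGroupId   (missing values print as "None")
# $1 = security group ID of the downstream resource
inbound_rules() {
  aws ec2 describe-security-group-rules \
    --filters "Name=group-id,Values=$1" \
    --query 'SecurityGroupRules[?IsEgress==`false`].[FromPort,ToPort,ReferencedGroupInfo.GroupId]' \
    --output text
}

# Read those rows on stdin and print "allowed" or "blocked".
# $1 = task security group ID, $2 = destination port
# A FromPort of -1 (all-traffic rules) is treated as "every port".
sg_allows() {
  awk -v sg="$1" -v port="$2" '
    $3 == sg && ($1 == -1 || ($1 <= port && port <= $2)) { ok = 1 }
    END { print (ok ? "allowed" : "blocked") }'
}
```

With placeholder IDs, inbound_rules sg-0rds | sg_allows sg-0task 5432 tells you whether the database's security group admits the task's security group on the PostgreSQL port.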

Route Table and NAT Gateway Issues

Tasks in private subnets need a NAT Gateway (or NAT instance) to reach the internet or AWS service endpoints. If your task is deployed into a subnet whose route table lacks a route to a NAT Gateway, any outbound call to the internet or to a public AWS API endpoint silently drops.

Verifying the route table

# Get the subnet ID from the task ENI
aws ec2 describe-network-interfaces \
  --network-interface-ids eni-0abc123 \
  --query 'NetworkInterfaces[].SubnetId' \
  --output text

# Get the route table explicitly associated with that subnet
# (an empty result means the subnet falls back to the VPC's main route table)
aws ec2 describe-route-tables \
  --filters Name=association.subnet-id,Values=subnet-0xyz987 \
  --query 'RouteTables[].Routes'

In the output, look for a route with DestinationCidrBlock: 0.0.0.0/0 pointing to a NatGatewayId. If that route is missing, or if it points to an Internet Gateway (IGW) instead, which only works for tasks with public IPs, outbound traffic will be dropped.
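
The same check can be scripted. The sketch below is one possible approach, with hypothetical function names; it only inspects the IPv4 default route and only sees route tables that are explicitly associated with the subnet.

```shell
# Print routes of the subnet's explicitly associated route table as rows:
# DestinationCidrBlock  NatGatewayId  GatewayId   (missing values print as "None")
# $1 = subnet ID
subnet_routes() {
  aws ec2 describe-route-tables \
    --filters "Name=association.subnet-id,Values=$1" \
    --query 'RouteTables[].Routes[].[DestinationCidrBlock,NatGatewayId,GatewayId]' \
    --output text
}

# Read those rows on stdin and classify the 0.0.0.0/0 route:
# nat   = points to a NAT Gateway
# igw   = points to an Internet Gateway
# other = some other target (for example a NAT instance or transit gateway)
# none  = no default route at all
classify_default_route() {
  awk '
    $1 == "0.0.0.0/0" {
      if ($2 ~ /^nat-/) { kind = "nat" }
      else if ($3 ~ /^igw-/) { kind = "igw" }
      else { kind = "other" }
    }
    END { print (kind == "" ? "none" : kind) }'
}
```

Running subnet_routes subnet-0xyz987 | classify_default_route prints a single word, so it is easy to use as a guard in a deployment script.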

For AWS service calls (S3, ECR, Secrets Manager), consider adding VPC endpoints to avoid needing the NAT path entirely: a Gateway Endpoint for S3, and Interface Endpoints for ECR and Secrets Manager. This can also reduce NAT data processing costs.
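
As a starting point, the helper below prints the endpoint service names for the services mentioned in this article, together with the endpoint type each one uses. The function name is made up for illustration, and the list is not exhaustive; your tasks may need endpoints for other services too.

```shell
# Print "service-name endpoint-type" pairs for a region.
# $1 = AWS region, e.g. us-east-1
ecs_endpoint_services() {
  region=$1
  # ECR needs both its API endpoint and its Docker registry endpoint.
  for svc in ecr.api ecr.dkr logs secretsmanager; do
    echo "com.amazonaws.$region.$svc Interface"
  done
  # ECR stores image layers in S3, so image pulls also need S3 access.
  echo "com.amazonaws.$region.s3 Gateway"
}
```

Each line can then be fed to aws ec2 create-vpc-endpoint with the matching --vpc-endpoint-type, plus your own VPC, subnet, security group, or route table IDs.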

DNS Resolution Failures Inside the Task

ECS tasks in awsvpc mode use the VPC's DNS resolver, which lives at the base of the VPC's primary CIDR plus 2 (e.g., 10.0.0.2 for a 10.0.0.0/16 VPC). Two VPC attributes matter here: enableDnsSupport and enableDnsHostnames. If enableDnsSupport is off, the resolver does not answer at all, and your application tends to see connection timeouts rather than helpful "host not found" errors. If enableDnsHostnames is off, public names still resolve, but private hosted zones and the private DNS names of interface endpoints do not work as expected.
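
The "plus 2" rule is easy to compute when you want to compare it with what the task actually uses. This is a minimal sketch that assumes you pass the VPC's primary IPv4 CIDR; the function name is hypothetical.

```shell
# Print the expected VPC resolver address for a CIDR block.
# $1 = VPC primary IPv4 CIDR, e.g. 10.0.0.0/16
vpc_resolver_ip() {
  base=${1%/*}        # strip the prefix length: 10.0.0.0
  last=${base##*.}    # last octet of the network address: 0
  prefix=${base%.*}   # first three octets: 10.0.0
  echo "$prefix.$((last + 2))"
}
```

Inside the task, compare the result with the nameserver line in /etc/resolv.conf.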

Checking VPC DNS settings

aws ec2 describe-vpc-attribute \
  --vpc-id vpc-0abc123 \
  --attribute enableDnsSupport

aws ec2 describe-vpc-attribute \
  --vpc-id vpc-0abc123 \
  --attribute enableDnsHostnames

Both should report "Value": true. If they don't, enable them in the VPC console or with aws ec2 modify-vpc-attribute.

Testing DNS from inside a task

Use ECS Exec to open a shell in the running task and test resolution directly:

aws ecs execute-command \
  --cluster my-cluster \
  --task <task-arn> \
  --container my-container \
  --interactive \
  --command "/bin/sh"

# Inside the task
nslookup s3.amazonaws.com
curl -v https://s3.amazonaws.com 2>&1 | head -20

If nslookup fails, the issue is DNS. If it resolves but curl hangs, the issue is routing or security groups.
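
curl's exit code carries the same signal, which helps when the container image has no nslookup. The sketch below maps the documented curl exit codes 6, 7, and 28 to the layer that most likely failed; the function names and the wording of the labels are my own.

```shell
# Map a curl exit code to a short label for the likely failing layer.
explain_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;
    6)  echo "dns" ;;       # could not resolve host
    7)  echo "refused" ;;   # failed to connect: refused or unreachable
    28) echo "timeout" ;;   # timed out: often a security group or route drop
    *)  echo "other" ;;
  esac
}

# Probe a URL with short timeouts and print the label.
# $1 = URL to test
probe() {
  rc=0
  curl -sS -o /dev/null --connect-timeout 5 --max-time 10 "$1" || rc=$?
  explain_curl_exit "$rc"
}
```

A result of dns points at the VPC DNS settings, while timeout points at routing or security groups, since both drop packets without sending a rejection.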

MTU Mismatches and Jumbo Frame Drops

Within a VPC, AWS supports jumbo frames (a 9001-byte MTU), and the ENI attached to an ECS task uses an MTU of 9001 by default. The trouble starts when traffic leaves that environment: if it traverses a VPN, a Direct Connect connection with different MTU settings, or crosses to an on-premises network, packets larger than the path MTU get silently dropped when "Don't Fragment" is set.

Check the MTU inside your container and compare it against what your network path supports. Path MTU Discovery (PMTUD) normally handles the difference, but only if your security groups allow inbound ICMP type 3, code 4 (Destination Unreachable: Fragmentation Needed) messages. Many teams block all ICMP and then wonder why large payloads fail while small ones succeed.

# Inside the task container
ip link show eth0
# Look for the mtu value in the output

# Test with different payload sizes
ping -M do -s 1400 10.0.1.10
ping -M do -s 8900 10.0.1.10

If the large ping fails but the small one succeeds, you have an MTU mismatch. Lower the MTU on the task ENI or fix the ICMP rules to allow fragmentation-needed messages through.
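
To test a specific MTU exactly, the ping payload has to leave room for the headers. A small sketch, assuming IPv4 without IP options (20-byte IP header plus 8-byte ICMP header); the function name is illustrative.

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet.
# $1 = MTU in bytes
ping_payload_for_mtu() {
  echo $(( $1 - 28 ))   # 20 bytes IPv4 header + 8 bytes ICMP header
}
```

For a standard 1500-byte path, ping -M do -s "$(ping_payload_for_mtu 1500)" 10.0.1.10 sends the largest packet that should still pass; one byte more should fail if the path MTU really is 1500.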

Common Pitfalls

  • Forgetting task-level security groups during deployment automation. If your Terraform or CloudFormation template provisions a new security group for each service but doesn't update downstream rules, you'll see failures only after deploying to a new environment.
  • Mixing awsvpc and bridge tasks on the same cluster. They have different network models; don't assume a working setup in bridge mode means awsvpc will work without review.
  • Not enabling ECS Exec before you need it. Once a task is failing and you need to debug, enabling ECS Exec requires a new task deployment. Enable it by default on the service (the enableExecuteCommand setting) and give the task role the required SSM permissions so it's ready when you need it.
  • Relying only on CloudWatch metrics to detect drops. Packet drops at the security group level don't generate CloudWatch errors; they look like application timeouts. Add active health checks from inside the VPC.
  • Over-restricting egress rules. It's tempting to lock down outbound traffic entirely, but ECS tasks need to reach the ECS control plane, ECR, CloudWatch Logs, and Secrets Manager. Use VPC endpoints to reduce the attack surface without blocking these calls.

A Diagnostic Checklist

Before escalating or rebuilding your network setup, run through this checklist in order:

  1. Check that the task's ENI was successfully provisioned (task not stuck in PROVISIONING; no ENI limit hit).
  2. Verify the task's security group has the necessary inbound and outbound rules for all dependencies.
  3. Confirm that downstream resources (RDS, ElastiCache, other services) allow traffic from the task security group, not just the instance security group.
  4. Check the subnet route table for a valid 0.0.0.0/0 route to a NAT Gateway (for private subnets).
  5. Confirm VPC DNS settings (enableDnsSupport and enableDnsHostnames) are both enabled.
  6. Use ECS Exec to test DNS resolution and basic connectivity from inside the task.
  7. Check for MTU issues if you have VPN or Direct Connect in the path.

Next Steps

Once you've cleared the immediate incident, harden your setup so you don't repeat it:

  • Codify your security group rules in IaC (Terraform, CDK, or CloudFormation) and include automated tests that verify each service can reach its dependencies after deployment.
  • Enable VPC Flow Logs on the subnets where your ECS tasks run. Filter for REJECT actions: they'll show you exactly which source IP and destination port is being blocked.
  • Set up VPC Interface Endpoints for ECR, Secrets Manager, and CloudWatch Logs so private-subnet tasks don't depend on NAT for those calls.
  • Enable ENI trunking (the awsvpcTrunking account setting) if you run many tasks per host. On supported instance types this raises the per-instance task limit well beyond the standard ENI cap.
  • Add a synthetic monitor (AWS CloudWatch Synthetics or a simple Lambda ping) that runs from inside the VPC and alerts you before users notice a connectivity regression.
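
For the Flow Logs item, a filter along these lines can summarize rejected traffic once you have the records as text. It is a sketch that assumes the default flow log record format, in which the action is the 13th space-separated field; custom formats need different field numbers. The function name is hypothetical.

```shell
# Read default-format flow log records on stdin and print
# "count srcaddr dstaddr dstport" for every rejected flow, busiest first.
rejected_flows() {
  awk '
    $13 == "REJECT" { n[$4 " " $5 " " $7]++ }
    END { for (k in n) print n[k], k }' | sort -rn
}
```

Piping exported log records through rejected_flows gives a short list of source, destination, and port combinations to compare against your security group rules.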

