Fixing AWS EKS Node Group Scaling That Stalls on Pending Pods
Your pods are stuck in Pending state, the node group has headroom in its max count, and Cluster Autoscaler is supposedly running. Yet nothing happens. This is one of the most frustrating situations in EKS because the system looks healthy from the outside while quietly doing nothing.
This guide walks through every layer where scaling can break, from IAM permissions to taints to resource requests, so you can pinpoint the exact failure and fix it.
What you'll learn
- How to read Cluster Autoscaler logs to find the real failure reason
- How IAM, ASG tags, and launch template mismatches block scaling
- How resource requests and taints prevent pods from ever triggering scale-up
- How to validate your node group configuration end-to-end
- A repeatable checklist to diagnose future stalls quickly
Prerequisites
You need kubectl configured against your cluster, the AWS CLI authenticated with enough permissions to describe Auto Scaling Groups and IAM policies, and Cluster Autoscaler already deployed. The steps below apply to EKS with managed or self-managed node groups backed by EC2 Auto Scaling Groups (ASGs).
Step 1: Confirm the pods are actually unschedulable
Before blaming the autoscaler, confirm that the pods are stuck for a scheduling reason and not something else entirely.
kubectl get pods -A | grep Pending
kubectl describe pod <pod-name> -n <namespace>
Look at the Events section at the bottom of the describe output. You want to see something like:
Warning FailedScheduling 0/3 nodes are available: 3 Insufficient cpu.
If you see FailedScheduling with an insufficiency reason, the scheduler tried and failed to place the pod. That is the signal Cluster Autoscaler needs to act. If the event says something else, such as an image pull error or a missing secret, the pod is not pending for capacity reasons and autoscaling will never fix it.
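To review every scheduling failure at once instead of describing pods one by one, a small sketch like the following may help. The function names failed_scheduling_events and is_capacity_reason are illustrative helpers, not kubectl features, and the match on "Insufficient" and "Too many pods" is a heuristic based on common scheduler messages, not an exhaustive list.

```shell
# List namespace, pod, and scheduler message for every FailedScheduling event.
failed_scheduling_events() {
  kubectl get events --all-namespaces \
    --field-selector reason=FailedScheduling \
    -o custom-columns='NAMESPACE:.metadata.namespace,POD:.involvedObject.name,MESSAGE:.message' \
    --no-headers
}

# Heuristic: return 0 when a scheduler message describes a capacity shortfall,
# which is the kind of failure adding nodes can resolve.
is_capacity_reason() {
  case "$1" in
    *Insufficient*|*"Too many pods"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage:
#   failed_scheduling_events
#   is_capacity_reason "0/3 nodes are available: 3 Insufficient cpu." && echo "capacity problem"
```

Messages that fail the heuristic (node affinity mismatches, untolerated taints, unbound volumes) point to the later steps in this guide and not to a lack of capacity.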
Step 2: Read the Cluster Autoscaler logs
This is the most direct way to find out why scaling isn't happening. Cluster Autoscaler logs its decision-making in detail, but most people skip this step.
kubectl -n kube-system logs -l app=cluster-autoscaler --tail=200 | grep -E "scale-up|unschedulable|skipping|node group"
The lines to look for:
- Scale-up evaluation: confirms the autoscaler found the pending pods
- No candidates or skipping node group: tells you which node group was rejected and why
- Not ready or recent scale activity: indicates a cooldown or a node group in a bad state
A common log message that trips people up is:
Scale-up: node group eks-workers did not increase size, reason: max size reached
That one is obvious once you see it, but it's easy to miss when you're staring at the AWS console and the Max count looks fine. Double-check the actual ASG max in the EC2 console, not just the node group setting in EKS. They can drift if you've edited one and not the other.
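The drift check can also be done from the CLI. The sketch below assumes a managed node group (self-managed groups have no EKS-side scaling config to compare against); compare_max_sizes is a hypothetical helper name.

```shell
# Compare the max size stored on an EKS managed node group with the max size
# of the Auto Scaling Group behind it.
# Usage: compare_max_sizes <cluster-name> <nodegroup-name>
compare_max_sizes() {
  local cluster="$1" nodegroup="$2" eks_max asg_name asg_max

  eks_max=$(aws eks describe-nodegroup \
    --cluster-name "$cluster" --nodegroup-name "$nodegroup" \
    --query 'nodegroup.scalingConfig.maxSize' --output text) || return 2

  asg_name=$(aws eks describe-nodegroup \
    --cluster-name "$cluster" --nodegroup-name "$nodegroup" \
    --query 'nodegroup.resources.autoScalingGroups[0].name' --output text) || return 2

  asg_max=$(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$asg_name" \
    --query 'AutoScalingGroups[0].MaxSize' --output text) || return 2

  echo "EKS maxSize=$eks_max ASG=$asg_name MaxSize=$asg_max"
  # Exit status 0 when both limits agree, 1 when they have drifted.
  [ "$eks_max" = "$asg_max" ]
}
```

A non-zero exit status with two different numbers in the output means one side was edited without the other.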
Step 3: Verify IAM permissions for the autoscaler
Cluster Autoscaler calls the AWS Auto Scaling API to add capacity. If its IAM role is missing permissions, it fails silently from the pod's perspective.
The minimum required actions are:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeScalingActivities",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "ec2:DescribeImages",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeLaunchTemplateVersions",
        "eks:DescribeNodegroup"
      ],
      "Resource": "*"
    }
  ]
}
If you're using IRSA (IAM Roles for Service Accounts), confirm the service account annotation points to the right role ARN and that the role's trust policy allows the OIDC provider for your cluster. A mismatched OIDC issuer is a silent killer here.
kubectl -n kube-system describe sa cluster-autoscaler | grep -i annotations
The output should show eks.amazonaws.com/role-arn set to your role. If it's blank, the autoscaler is running with the node instance profile instead of its dedicated role, which usually doesn't have SetDesiredCapacity.
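The annotation and the trust policy can be cross-checked in one pass. This is a rough sketch that assumes the service account is named cluster-autoscaler in kube-system, as in the command above; check_irsa is a hypothetical helper, and the substring match on the issuer is a quick sanity check, not a full policy evaluation.

```shell
# Verify that the autoscaler's service account names an IAM role and that the
# role's trust policy mentions this cluster's OIDC issuer.
# Usage: check_irsa <cluster-name>
check_irsa() {
  local cluster="$1" role_arn role_name issuer trust

  role_arn=$(kubectl -n kube-system get sa cluster-autoscaler \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}')
  if [ -z "$role_arn" ]; then
    echo "no eks.amazonaws.com/role-arn annotation on the service account"
    return 1
  fi

  role_name="${role_arn##*/}"   # strip everything up to the last slash
  issuer=$(aws eks describe-cluster --name "$cluster" \
    --query 'cluster.identity.oidc.issuer' --output text)
  issuer="${issuer#https://}"   # trust policies reference the issuer without the scheme

  trust=$(aws iam get-role --role-name "$role_name" \
    --query 'Role.AssumeRolePolicyDocument' --output json)

  if printf '%s' "$trust" | grep -qF "$issuer"; then
    echo "trust policy for $role_name references $issuer"
  else
    echo "trust policy for $role_name does NOT reference $issuer"
    return 1
  fi
}
```

If the second message appears, the role was probably created for a different cluster or the cluster's OIDC provider was recreated.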
Step 4: Check ASG tags and node group discovery
Cluster Autoscaler discovers which ASGs it should manage through tags. If the tags are missing or wrong, the autoscaler ignores the node group entirely, and it won't log a loud error about it.
Every ASG managed by Cluster Autoscaler needs these two tags:
| Tag Key | Tag Value |
|---|---|
| k8s.io/cluster-autoscaler/enabled | true |
| k8s.io/cluster-autoscaler/<cluster-name> | owned |
EKS managed node groups usually add these automatically. Self-managed node groups often don't. Check with:
aws autoscaling describe-auto-scaling-groups \
--query "AutoScalingGroups[*].{Name:AutoScalingGroupName,Tags:Tags}" \
--output json | grep -A5 "cluster-autoscaler"
Also confirm the autoscaler deployment uses the right cluster name in its command arguments. With tag-based auto-discovery, the cluster name is part of the tag key passed to --node-group-auto-discovery, so a mismatched name means it queries for tags that don't exist on your ASGs.
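For a single ASG, a pass/fail check on the two tag keys is easier to read than grepping JSON. The sketch below uses a hypothetical helper name, check_asg_tags, and checks only for the presence of the keys.

```shell
# Report whether an ASG carries both Cluster Autoscaler discovery tags.
# Usage: check_asg_tags <asg-name> <cluster-name>
check_asg_tags() {
  local asg="$1" cluster="$2" keys key missing=0

  # One tag key per line for the given ASG.
  keys=$(aws autoscaling describe-tags \
    --filters "Name=auto-scaling-group,Values=$asg" \
    --query 'Tags[].Key' --output text | tr '\t' '\n')

  for key in "k8s.io/cluster-autoscaler/enabled" "k8s.io/cluster-autoscaler/$cluster"; do
    if printf '%s\n' "$keys" | grep -qxF "$key"; then
      echo "ok $key"
    else
      echo "MISSING $key"
      missing=1
    fi
  done
  return "$missing"
}
```

Use the same cluster name here that appears in the autoscaler's own arguments, since a typo on either side produces the same symptom.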
Step 5: Inspect the launch template for blocking issues
Even if the ASG receives the scale-up signal, new nodes can fail to join the cluster if the launch template has drifted. This shows up as the ASG increasing its desired count, instances launching, but nodes never appearing in kubectl get nodes.
Check the ASG activity history first:
aws autoscaling describe-scaling-activities \
--auto-scaling-group-name <asg-name> \
--max-items 10
If instances are launching but not registering, connect to a new node over SSH or SSM and check the bootstrap logs:
/var/log/cloud-init-output.log
/var/log/aws-routed-eni/plugin.log # for VPC CNI issues
Common launch template problems that block node registration:
- Outdated AMI that's incompatible with the current Kubernetes control plane version
- User data that references a bootstrap script path that no longer exists in the AMI
- Security group rules that block the kubelet from reaching the API server on port 443
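- Node IAM role missing the AmazonEKSWorkerNodePolicy managed policy, or not mapped in the aws-auth ConfigMap (or an EKS access entry), so the kubelet cannot authenticate to the cluster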
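To find out which instances launched but never registered, the ASG's instance list can be compared with the provider IDs of the nodes Kubernetes knows about. This is a sketch under the assumption that node provider IDs end in the EC2 instance ID, as they do on EKS; unregistered_instances is a hypothetical helper name.

```shell
# Print the instance IDs that belong to an ASG but have no matching Kubernetes node.
# Usage: unregistered_instances <asg-name>
unregistered_instances() {
  local asg="$1" provider_ids id

  # Provider IDs look like aws:///<availability-zone>/<instance-id>.
  provider_ids=$(kubectl get nodes -o jsonpath='{.items[*].spec.providerID}')

  for id in $(aws autoscaling describe-auto-scaling-groups \
      --auto-scaling-group-names "$asg" \
      --query 'AutoScalingGroups[0].Instances[].InstanceId' --output text); do
    case " $provider_ids " in
      *"/$id "*) ;;           # a node exists for this instance
      *) echo "$id" ;;        # launched, but never joined the cluster
    esac
  done
}
```

Any ID it prints is a good candidate for the SSM session and log inspection described above. An instance that launched within the last minute or two may simply still be bootstrapping.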
Step 6: Audit resource requests on your pods
This one catches a lot of people off guard. Cluster Autoscaler only scales up when it believes a node from the target group can actually fit the pending pod. If your pod requests more CPU or memory than any instance type in the node group can provide, the autoscaler concludes that scaling won't help and does nothing.
Check what your pod is requesting:
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'
Then check the allocatable resources on existing nodes of the same type:
kubectl get nodes -o custom-columns="NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory"
Remember that allocatable capacity is always less than the raw instance size, because system daemons and the kubelet reserve a portion. On a small instance type, the reservations can eat a significant chunk of total memory. If your pod requests 14Gi on a 16Gi node, it may never fit once reservations are subtracted.
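Because kubectl usually reports allocatable memory in Ki while requests are written in Mi or Gi, comparing them by eye is error-prone. The following sketch normalizes both to MiB; to_mib and fits_on_node are hypothetical helpers that handle only whole-number quantities with Ki, Mi, or Gi suffixes, and the comparison ignores whatever DaemonSet pods already consume on the node.

```shell
# Convert a Kubernetes memory quantity (integer with Ki, Mi, or Gi suffix) to whole MiB.
to_mib() {
  local q="$1"
  case "$q" in
    *Ki) echo $(( ${q%Ki} / 1024 )) ;;
    *Mi) echo $(( ${q%Mi} )) ;;
    *Gi) echo $(( ${q%Gi} * 1024 )) ;;
    *)   echo "unsupported quantity: $q" >&2; return 1 ;;
  esac
}

# Return 0 when a memory request is no larger than a node's allocatable memory.
# Usage: fits_on_node <request> <allocatable>
fits_on_node() {
  local request allocatable
  request=$(to_mib "$1") || return 2
  allocatable=$(to_mib "$2") || return 2
  [ "$request" -le "$allocatable" ]
}

# Example usage, passing the MEM value printed by the kubectl command above:
#   fits_on_node 14Gi <allocatable-memory> || echo "request exceeds allocatable memory"
```

If the request fails this check against every instance type in the group, no amount of scaling within that group will place the pod.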
Step 7: Check for taints and tolerations mismatches
If your node group applies a taint and your pods don't have a matching toleration, the autoscaler skips that node group as a scale-up candidate. This is intentional behavior, but it often surprises teams that added a taint for a different workload and forgot about it.
# Check node taints on existing nodes in the group
kubectl get nodes -o custom-columns="NAME:.metadata.name,TAINTS:.spec.taints"
# Check what tolerations the pending pod has
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.tolerations}'
If you use managed node groups, taints can also be set at the node group level in EKS and will be applied to every node the group provisions. Check the node group configuration in the AWS console or with the CLI:
aws eks describe-nodegroup \
--cluster-name <cluster-name> \
--nodegroup-name <nodegroup-name> \
--query "nodegroup.taints"
Common pitfalls
Scaling cooldown hiding the problem
Cluster Autoscaler applies several timing controls, all configurable via flags. After a scale-up it delays scale-down (--scale-down-delay-after-add, 10 minutes by default), and after a failed scale-up it backs off from that node group (--initial-node-group-backoff-duration, 5 minutes by default) before trying it again. Check the logs for messages about backoff or recent scale activity. This is not a bug, but it can make the system look stuck when it's just pausing.
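To see which timing and discovery flags your deployment actually overrides, the container's command line can be printed one argument per line. The sketch assumes the deployment is named cluster-autoscaler in kube-system, which depends on how it was installed; autoscaler_flags is a hypothetical helper, and the flag prefixes it filters for are a selection, not the full list.

```shell
# Print the timing, expander, and discovery flags set on the autoscaler deployment.
autoscaler_flags() {
  kubectl -n kube-system get deployment cluster-autoscaler \
    -o jsonpath='{range .spec.template.spec.containers[0].command[*]}{@}{"\n"}{end}{range .spec.template.spec.containers[0].args[*]}{@}{"\n"}{end}' \
    | grep -E -e '^--(scale-down-delay|initial-node-group-backoff|max-node-provision-time|expander|node-group-auto-discovery)' \
    || echo "none of the listed flags are set; built-in defaults apply"
}
```

A flag that does not appear in the output is running at its built-in default.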
Multiple node groups with conflicting priorities
If you have several node groups, Cluster Autoscaler uses expander logic to pick which one to grow. The default random expander picks one at random, and if that group has an issue, the autoscaler may keep retrying it instead of trying a healthy group. Switch to the least-waste or priority expander if you have heterogeneous groups.
--expander=least-waste
Spot instance interruption leaving ASG in a bad state
If your node group uses Spot instances and a large interruption event hit, the ASG may be hitting capacity limits in those particular Availability Zones. Use mixed instance policies with multiple instance types and multiple AZs to reduce this risk. The autoscaler will log AWS: not enough capacity if this is happening.
Version skew between Cluster Autoscaler and Kubernetes
The Cluster Autoscaler version must match your Kubernetes minor version. Running a v1.26 autoscaler against a v1.29 cluster is unsupported and can cause unpredictable behavior. Check the Cluster Autoscaler release page for the compatibility matrix and update accordingly.
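A quick way to spot skew is to compare the autoscaler's image tag with the control plane version. The sketch below assumes the deployment is named cluster-autoscaler in kube-system and that its image is referenced by a version tag, not a digest; minor_of and check_version_skew are hypothetical helper names.

```shell
# Reduce a version string such as v1.29.3-eks-a18cd3a to "1.29".
minor_of() {
  printf '%s\n' "$1" | sed -E 's/^v?([0-9]+)\.([0-9]+).*/\1.\2/'
}

# Compare the Cluster Autoscaler image tag with the API server version.
# Exit status 0 when the minor versions match, 1 when they differ.
check_version_skew() {
  local image server ca_minor k8s_minor

  image=$(kubectl -n kube-system get deployment cluster-autoscaler \
    -o jsonpath='{.spec.template.spec.containers[0].image}')
  # The /version endpoint returns JSON; pull out the gitVersion field.
  server=$(kubectl get --raw /version \
    | sed -nE 's/.*"gitVersion": *"([^"]+)".*/\1/p')

  ca_minor=$(minor_of "${image##*:}")   # text after the last colon is the tag
  k8s_minor=$(minor_of "$server")

  echo "cluster-autoscaler $ca_minor, control plane $k8s_minor"
  [ "$ca_minor" = "$k8s_minor" ]
}
```

Running it after every cluster upgrade catches the common case where the control plane moved forward and the autoscaler image did not.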
Next steps
Once your scaling is unblocked, a few actions will keep you from hitting the same wall again:
- Set up a CloudWatch alarm on the cluster-autoscaler pod's log group, filtered for "skipping node group", so you get alerted before pods sit pending for long periods.
- Pin your Cluster Autoscaler version to match your EKS Kubernetes minor version and update both together during cluster upgrades.
- Add resource requests to every workload. Pods without requests are invisible to the autoscaler's bin-packing logic and can cause unpredictable scheduling decisions.
- Test your scaling path quarterly by intentionally deploying a resource-heavy job and watching the autoscaler logs in real time to confirm the end-to-end path still works.
- Review ASG limits and IAM policies after any infrastructure-as-code changes, since Terraform or CloudFormation updates sometimes reset tags or cap values silently.