Fixing AWS CodeDeploy Rollbacks That Stall and Leave Your Fleet Split
Your deployment failed, CodeDeploy kicked off a rollback, and now it's sitting at 47% complete β not progressing, not failing cleanly. Half your instances are running the previous revision, half are stuck on the broken one, and your monitoring dashboard looks like a Jackson Pollock. This is one of the most disorienting failure modes in AWS, and the default documentation won't get you out of it quickly.
This guide walks you through why rollbacks stall, how to identify the specific cause in your environment, and the steps to recover your fleet without making things worse.
What You'll Learn
- Why CodeDeploy rollbacks stall instead of completing
- How to read deployment logs and event timelines to find the root cause
- How to manually recover a split fleet when automation has frozen
- Configuration changes that prevent this pattern from repeating
- When to skip the rollback entirely and redeploy forward instead
Prerequisites
You should have the AWS CLI installed and configured with permissions to describe and stop deployments, and access to the EC2 instances in the affected deployment group β either via SSM Session Manager or SSH. The examples here assume you're using an EC2/on-premises compute platform, not Lambda or ECS.
Why Rollbacks Stall in the First Place
CodeDeploy rollbacks are not magic. A rollback is just another deployment β it creates a new deployment ID, runs the same lifecycle hooks in the same order, and has the same timeout constraints as any forward deployment. If the conditions that caused your original deployment to fail are still present on the instances, the rollback will fail too.
The most common causes of a stalled rollback are:
- A lifecycle hook script hanging β
BeforeInstall,AfterInstall, orApplicationStartscripts that don't exit cleanly, often because a process they're waiting on never responds. - The CodeDeploy agent itself is stopped or crashed on one or more instances, so those instances never acknowledge the rollback deployment.
- Instance health check failures β if you're behind an ALB or Classic Load Balancer, an instance that fails health checks during deregistration gets stuck in a draining state, blocking the lifecycle.
- Deployment group minimum healthy hosts constraint β CodeDeploy won't deploy to more instances if doing so would drop healthy capacity below your configured threshold, leaving the rollback frozen mid-way.
- S3 or network connectivity issues on specific instances that prevent the rollback artifact from being fetched.
Step 1: Identify the Exact Failure Point
Before touching anything, get a clear picture of where the rollback has stopped. Run this CLI command to see the per-instance status:
aws deploy list-deployment-instances \
--deployment-id d-YOURROLLBACKID \
--instance-status-filter Failed Pending InProgress \
--query 'instancesList' \
--output table
For each instance that's in InProgress or Failed state, pull the lifecycle event detail:
aws deploy get-deployment-instance \
--deployment-id d-YOURROLLBACKID \
--instance-id i-YOURINSTANCEID
The output includes a lifecycleEvents array. Look for the event with status InProgress or Failed and note the diagnostics block β the message field usually tells you exactly which script timed out or which error occurred.
Step 2: Get the Agent Logs Off the Instance
The AWS console gives you a summary, but the agent logs on the instance give you the full picture. If you have SSM access:
aws ssm start-session --target i-YOURINSTANCEID
Then on the instance:
# CodeDeploy agent log
sudo tail -n 200 /var/log/aws/codedeploy-agent/codedeploy-agent.log
# Deployment-specific script output
sudo find /opt/codedeploy-agent/deployment-root -name '*.log' \
-newer /tmp -exec ls -lt {} +
The deployment-root directory contains a folder per deployment ID, and inside that, per-lifecycle-event log files. The script stdout and stderr for whichever hook is hanging will be there. A log that ends mid-output without an exit code is a strong signal the script is still running (or was killed by the timeout without writing a final line).
Step 3: Check the CodeDeploy Agent Status
On instances that show Pending in the deployment for longer than a few minutes, the agent is likely not running. Check it:
# Amazon Linux 2 / AL2023
sudo systemctl status codedeploy-agent
# If stopped:
sudo systemctl start codedeploy-agent
sudo systemctl enable codedeploy-agent
If the agent crashes repeatedly, check /var/log/aws/codedeploy-agent/codedeploy-agent.log for Ruby runtime errors or credential failures. A common cause is an expired instance profile or a missing IAM permission β the agent needs s3:GetObject on your artifact bucket and the standard codedeploy:* permissions for the instance role.
Step 4: Stop the Stalled Rollback and Recover Manually
If the rollback deployment has been stuck for more than 15 minutes and you've identified the root cause, stop waiting for it and take control. First, stop the stalled deployment:
aws deploy stop-deployment \
--deployment-id d-YOURROLLBACKID \
--auto-rollback-enabled
Note: passing --auto-rollback-enabled here tells CodeDeploy not to trigger yet another automatic rollback (which would fail for the same reason). Verify the deployment status flips to Stopped.
Now you have a choice: fix the underlying issue and re-trigger the rollback, or deploy a known-good revision forward. In most production outage scenarios, deploying forward is faster and more predictable than debugging rollback hooks under pressure.
Option A: Re-trigger the Rollback
After fixing the root cause (restarting agents, clearing stuck processes, fixing the hanging hook script), create a new deployment manually pointing to the previous revision:
aws deploy create-deployment \
--application-name YourAppName \
--deployment-group-name YourDeploymentGroup \
--s3-location bucket=your-bucket,bundleType=zip,key=revisions/previous-good-revision.zip \
--deployment-config-name CodeDeployDefault.OneAtATime \
--description "Manual rollback after stall"
Using OneAtATime is deliberate β it gives you visibility into each instance and limits the blast radius if the same issue reappears.
Option B: Deploy Forward
If the previous revision is what caused the hook failures in the first place (for example, a BeforeInstall script that assumes a file exists that was removed in the broken release), rolling back to it will fail again. In that case, create a patch release that reverts the breaking change at the application level and deploy it forward as a new revision.
Common Pitfalls That Make This Worse
Minimum Healthy Hosts Set Too High
If your deployment group is configured with a high minimum healthy percentage β say 90% β and a handful of instances are stuck, CodeDeploy will refuse to continue the rollback because doing so would drop healthy capacity below that threshold. Temporarily lowering this value in the deployment group settings can unblock a stalled rollback, but do it carefully and restore it afterward.
Lifecycle Hook Scripts Without Exit Codes
A script that exits with code 0 on success but doesn't explicitly exit on a caught exception will appear to CodeDeploy as hanging. Every lifecycle hook script should have an explicit exit 1 on failure paths:
#!/bin/bash
set -e
# Your setup steps here
systemctl restart myapp || { echo "Failed to restart myapp"; exit 1; }
exit 0
Using set -e at the top means any unhandled command failure causes the script to exit non-zero immediately, which CodeDeploy will correctly interpret as a lifecycle failure rather than a timeout.
Load Balancer Deregistration Delays
If your instances have a long connection-draining timeout configured on the load balancer (sometimes set to 5 or even 10 minutes for long-running requests), each instance swap during a rollback takes that long just for deregistration. Multiply that by your fleet size and an "AllAtOnce" rollback becomes impractical. Set the BlockTraffic timeout in your appspec.yml to match or exceed your deregistration delay, or reduce the draining timeout for non-critical traffic.
Confusing the Rollback Deployment ID with the Original
When CodeDeploy triggers an automatic rollback, it creates a new deployment ID prefixed with d-. Teams often check the original failing deployment's status and wonder why it's still showing Failed β they're not watching the rollback deployment at all. Filter by deployment group and sort by creation time to find the correct ID:
aws deploy list-deployments \
--application-name YourAppName \
--deployment-group-name YourDeploymentGroup \
--create-time-range start=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
--query 'deployments' \
--output table
Preventing Split-Fleet Scenarios Going Forward
The real goal is never being in this situation again. A few configuration habits that help:
- Use
CodeDeployDefault.OneAtATimefor critical deployments. It's slower, but a failure stops at one instance and the fleet stays largely healthy. - Set explicit script timeouts in
appspec.ymlusing thetimeoutfield on each hook. Default timeouts are generous β a stuck script will hold up a deployment for a long time before CodeDeploy gives up. Tighter timeouts mean faster failures and faster rollbacks. - Monitor agent health as a separate check. Add a CloudWatch alarm or a simple SSM Run Command scheduled script that verifies the CodeDeploy agent is running on all instances in your deployment group. Agent failures are silent until a deployment starts.
- Test lifecycle hook scripts locally before committing them. A script that works on your laptop but assumes a specific user, path, or environment variable will fail on EC2 in unpredictable ways.
- Keep a known-good revision tagged in S3. When you need to recover fast, having a stable reference artifact you can point a manual deployment at saves the time needed to figure out which S3 key was the last good build.
Wrapping Up
A split fleet from a stalled rollback is recoverable β but only if you move methodically. Grabbing the rollback deployment ID, reading per-instance lifecycle event logs, checking agent status, and then deciding whether to re-trigger the rollback or deploy forward is the sequence that gets you back to a consistent state fastest.
Here are the concrete next steps to take after you've recovered:
- Add
set -eand explicitexit 1calls to every lifecycle hook script in your appspec. - Review your deployment group's minimum healthy hosts percentage and verify it's appropriate for your fleet size.
- Set up a CloudWatch alarm on the
DeploymentSuccessandDeploymentFailuremetrics for each deployment group so rollbacks are visible immediately in your alerting system. - Document your manual recovery runbook β the exact CLI commands β so any on-call engineer can execute it without hunting through AWS docs at 2 AM.
- Run a chaos exercise: stop the CodeDeploy agent on one instance intentionally and verify your monitoring catches it before the next real deployment does.
π€ Share this article
Sign in to saveRelated Articles
Comments (0)
No comments yet. Be the first!