Fixing AWS CodeDeploy Rollbacks That Stall and Leave Your Fleet Split

Modern deployment systems promise something every engineering team wants:

Safe Deployments
↓
Automatic Recovery

AWS CodeDeploy is designed around this principle.

A deployment fails.

CodeDeploy detects the issue.

Rollback starts automatically.

Production returns to a known-good version.

In theory, the process is simple.

In reality, many teams eventually encounter a much more frustrating scenario:

Deployment Fails
↓
Rollback Starts
↓
Rollback Stalls
↓
Fleet Becomes Split

Some servers run:

Version N

while others run:

Version N-1

The deployment appears neither successful nor fully reverted.

Monitoring becomes confusing.

Debugging becomes difficult.

Customer behavior becomes inconsistent.

The situation is especially dangerous when:

Stateful services are involved
Database schema changes exist
Feature flags are incomplete
Multiple regions are deployed simultaneously

A partially rolled-back fleet can be more damaging than a complete deployment failure.

In this guide, we'll explore why AWS CodeDeploy rollbacks stall, how split fleets emerge, and how to design deployment workflows that recover predictably under failure conditions.

What You Will Learn From This Article

After reading this guide, you'll understand:

How CodeDeploy rollback works.
Why rollback operations stall.
Common causes of split fleets.
Lifecycle hook failures.
Auto Scaling interactions.
Blue/Green deployment considerations.
Monitoring and recovery strategies.
Best practices for production deployments.

Understanding CodeDeploy Rollbacks

A typical deployment workflow:

Deploy New Revision
↓
Health Checks
↓
Deployment Success

If something fails:

Deployment Failure
↓
Automatic Rollback
↓
Previous Revision Restored

The rollback itself is treated as a deployment operation.

This is important because:

Rollback
=
Another Deployment

which means it can fail too.

What Is a Split Fleet?

A split fleet occurs when:

Some Instances
↓
Old Version

while:

Other Instances
↓
New Version

Example:

Instance A → v1.4
Instance B → v1.4
Instance C → v1.5
Instance D → v1.5

Production now runs multiple application versions simultaneously.

Why Split Fleets Are Dangerous

Split fleets often create:

Inconsistent Behavior

Users receive different responses.

Session Problems

Requests hit incompatible servers.

API Compatibility Issues

Services disagree on formats.

Database Conflicts

Application versions expect different schemas.

These failures can be subtle and difficult to detect.

How Rollbacks Work Internally

CodeDeploy performs:

Identify Last Successful Revision
↓
Redeploy Previous Revision
↓
Run Lifecycle Hooks
↓
Verify Health

The rollback must complete successfully on every target instance.

Any interruption may create a split fleet.

Common Cause #1

Lifecycle Hook Failures

CodeDeploy executes scripts such as:

ApplicationStop

BeforeInstall

AfterInstall

ApplicationStart

ValidateService

During rollback these hooks run again.

A failing hook can stop rollback progress.

Example

Suppose:

systemctl stop app

fails because:

Service Already Stopped

Rollback may become stuck even though deployment failure occurred elsewhere.

Common Cause #2

Non-Idempotent Scripts

Many deployment scripts assume:

Run Once

Reality:

Deploy
↓
Rollback
↓
Deploy Again

can occur within minutes.

Scripts must safely execute multiple times.

Bad Example

mv config.old config

First run succeeds.

Second run fails because:

config.old

no longer exists.

Rollback stalls.

Better Approach

if [ -f config.old ]; then
    mv config.old config
fi

Idempotent operations improve rollback reliability.

Common Cause #3

Auto Scaling Group Interference

Many deployments target:

Auto Scaling Groups

During rollback:

Instance Terminates
↓
Replacement Launches

CodeDeploy now faces a changing target set.

Result:

Rollback Progress
↓
Inconsistent State

Common Cause #4

Instance Health Check Failures

Rollback requires healthy instances.

Example:

Rollback Complete
↓
Health Check Fails

CodeDeploy may continue waiting.

Eventually:

Deployment Timeout

occurs.

The fleet remains partially reverted.

Common Cause #5

Application Startup Delays

Many teams underestimate startup time.

Example:

Application Boot
=
90 Seconds

while deployment timeout is:

60 Seconds

Rollback appears broken when the application simply needs more time.

Common Cause #6

Database Migration Dependencies

Dangerous sequence:

Deploy New App
↓
Run Migration
↓
Deployment Fails
↓
Rollback App

Problem:

Database Schema
Still Changed

Old application version may not function.

Rollback cannot fully restore service health.

Why Backward Compatibility Matters

Safe deployments follow:

Old App
Compatible With
New Schema

and:

New App
Compatible With
Old Schema

This allows successful rollbacks.

Common Cause #7

Load Balancer Deregistration Delays

Workflow:

Instance Removed
From Load Balancer
↓
Deployment Begins

If deregistration takes longer than expected:

Traffic
↓
Still Reaches Updating Instance

Deployment failures become more likely.

Diagnosing a Stalled Rollback

Start with:

Deployment Events

Review:

CodeDeploy Deployment Timeline

Look for the last successful step.

Lifecycle Hook Logs

Examine:

/opt/codedeploy-agent/deployment-root/

Logs often reveal:

Script failures
Permission errors
Missing files

CodeDeploy Agent Logs

Check:

/var/log/aws/codedeploy-agent/

These logs frequently identify the root cause.

Detecting Split Fleets

Verify:

Application Version

On every instance.

Example:

cat VERSION

Load Balancer Targets

Confirm which instances receive traffic.

Deployment Revision

Compare deployed bundles across servers.

Do not assume rollback completed everywhere.

Recovery Strategy

When a split fleet exists:

Pause Further Deployments

First establish a known-good state.

Then:

Identify Revision
↓
Redeploy Explicitly

Avoid stacking new deployments on top of a partial rollback.

Blue/Green Deployments Reduce Risk

Traditional deployment:

Update Existing Fleet

Blue/Green deployment:

Old Fleet
↓
New Fleet
↓
Traffic Switch

Benefits:

Faster rollback
Reduced split-fleet risk
Easier validation

This is often the safest production approach.

Monitoring Signals to Watch

Track:

Deployment Success Rate

Rollback Frequency

Lifecycle Hook Failures

Instance Health

Load Balancer Health

Deployment Duration

Unexpected changes often indicate deployment system issues.

Building Rollback-Friendly Deployments

Principles:

Idempotent Scripts

Safe to run repeatedly.

Backward-Compatible Schemas

Support both versions.

Externalized Configuration

Avoid version coupling.

Health Checks

Accurately reflect application readiness.

These dramatically improve recovery reliability.

Real-World Example

A SaaS platform deploys:

Version 3.2

Deployment fails on:

ValidateService

Automatic rollback starts.

Problem:

AfterInstall Script
Assumes Fresh Deployment

Rollback fails on half the fleet.

Final state:

4 Instances → v3.2
4 Instances → v3.1

Customer traffic becomes inconsistent.

Root cause:

Non-Idempotent Lifecycle Script

Fixing the script eliminates future rollback stalls.

Best Practices Checklist

For reliable CodeDeploy rollbacks:

✅ Make lifecycle scripts idempotent

✅ Test rollback scenarios regularly

✅ Monitor deployment health aggressively

✅ Use backward-compatible database migrations

✅ Validate Auto Scaling interactions

✅ Verify load balancer health checks

✅ Use Blue/Green deployments when possible

✅ Review CodeDeploy logs after failures

✅ Keep deployment artifacts immutable

✅ Practice recovery procedures before incidents

Common Mistakes to Avoid

Avoid:

❌ Assuming rollback cannot fail

❌ Using destructive deployment scripts

❌ Deploying incompatible schema changes

❌ Ignoring CodeDeploy agent logs

❌ Running untested rollback procedures

❌ Allowing Auto Scaling to interfere unexpectedly

❌ Treating deployment success as the only metric

Why Rollbacks Need Testing Too

Many teams test:

Deployment Success

but never test:

Deployment Failure
↓
Rollback Recovery

The most critical part of a deployment system is often the recovery path.

A rollback that works only in theory is not a rollback strategy.

Wrapping Summary

AWS CodeDeploy provides powerful deployment automation and rollback capabilities, but rollbacks are not guaranteed to succeed automatically. Because a rollback is itself a deployment operation, it can fail due to lifecycle hook issues, non-idempotent scripts, Auto Scaling interactions, health check failures, startup delays, or incompatible database changes. When this happens, organizations may find themselves operating a split fleet where different instances run different application versions.

The key to preventing rollback stalls is designing deployments with recovery in mind. Idempotent scripts, backward-compatible schema changes, reliable health checks, proper monitoring, and Blue/Green deployment strategies significantly reduce risk. Just as importantly, teams should routinely test rollback scenarios rather than assuming they will work during a production incident.

The most resilient deployment systems are not the ones that never fail. They are the ones that recover predictably when failures occur.

AWS CodeDeploy Rollbacks That Stall and Leave Your Fleet Split Fixing

Inconsistent Behavior

Session Problems

API Compatibility Issues

Database Conflicts

Lifecycle Hook Failures

Non-Idempotent Scripts

Auto Scaling Group Interference

Instance Health Check Failures

Application Startup Delays

Database Migration Dependencies

Load Balancer Deregistration Delays

Deployment Events

Lifecycle Hook Logs

CodeDeploy Agent Logs

Application Version

Load Balancer Targets

Deployment Revision

Deployment Success Rate

Rollback Frequency

Lifecycle Hook Failures

Instance Health

Load Balancer Health

Deployment Duration

Idempotent Scripts

Backward-Compatible Schemas

Externalized Configuration

Health Checks

Related Articles

Fixing AWS CloudFront Cache Invalidations That Still Serve Stale Content

Sentry vs Highlight.io for Error Monitoring: Pricing, Session Limits, and Real Noise

Fixing Silent Failures When Nginx Truncates Upstream Responses

Comments (0)

Leave a Comment

AWS CodeDeploy Rollbacks That Stall and Leave Your Fleet Split Fixing

Inconsistent Behavior

Session Problems

API Compatibility Issues

Database Conflicts

Lifecycle Hook Failures

Non-Idempotent Scripts

Auto Scaling Group Interference

Instance Health Check Failures

Application Startup Delays

Database Migration Dependencies

Load Balancer Deregistration Delays

Deployment Events

Lifecycle Hook Logs

CodeDeploy Agent Logs

Application Version

Load Balancer Targets

Deployment Revision

Deployment Success Rate

Rollback Frequency

Lifecycle Hook Failures

Instance Health

Load Balancer Health

Deployment Duration

Idempotent Scripts

Backward-Compatible Schemas

Externalized Configuration

Health Checks

Related Articles

Fixing AWS CloudFront Cache Invalidations That Still Serve Stale Content

Sentry vs Highlight.io for Error Monitoring: Pricing, Session Limits, and Real Noise

Fixing Silent Failures When Nginx Truncates Upstream Responses

Comments (0)

Leave a Comment

Stay ahead of the curve