Zero-downtime deploys on ECS with CodeDeploy
Blue/green deployments on ECS that roll back automatically when health checks fail.
For a long time our "deploy" was a rolling ECS update that occasionally served 502s for thirty seconds while old and new tasks fought over connections. It worked until a bad release shipped a config bug and we had no fast way back, the old task definition was already gone. Moving to blue/green deployments with CodeDeploy fixed both: no error blip during the cutover, and an instant rollback that just flips traffic back.
Here's how the pieces fit and the gotchas that cost me a few late nights.
Rolling vs blue/green on ECS
ECS's native rolling update replaces tasks in place, it drains some old tasks, starts new ones, and repeats. It's simple and free, but during the overlap you're serving two versions, and rollback means redeploying the previous image (slow). CodeDeploy's blue/green stands up a completely separate "green" task set behind a second target group, validates it, then shifts the load balancer's traffic from blue to green. Old tasks stick around for a configurable bake window so rollback is just a traffic flip.
What you need wired up
- An ECS service configured with deployment controller type
CODE_DEPLOY. - An Application Load Balancer with two target groups (blue and green).
- A production listener, and optionally a test listener for pre-cutover validation.
- A CodeDeploy application and deployment group referencing all of the above.
{
"version": 1,
"Resources": [{
"TargetService": {
"Type": "AWS::ECS::Service",
"Properties": {
"TaskDefinition": "<TASK_DEFINITION>",
"LoadBalancerInfo": {
"ContainerName": "web",
"ContainerPort": 8080
}
}
}
}]
}
That appspec.yaml/JSON tells CodeDeploy which task definition and container/port to register with the green target group. The <TASK_DEFINITION> placeholder gets substituted at deploy time.
Choose a traffic-shifting strategy
CodeDeploy gives you three shift styles, and the choice is a real trade between speed and safety.
| Strategy | Behavior | When I use it |
|---|---|---|
| All-at-once | 100% flips instantly | Internal tools, low risk |
| Canary | X% now, rest after N minutes | Default for prod services |
| Linear | +X% every N minutes | High-traffic, want gradual exposure |
I default to CodeDeployDefault.ECSCanary10Percent5Minutes, 10% of traffic to green for 5 minutes, then the rest. That window is enough for alarms and lifecycle hooks to catch a bad release before it sees full load.
Validate before you fully cut over
The bake window is only useful if something is watching. Two mechanisms do that. Lifecycle hooks let you run a Lambda at AfterAllowTestTraffic to hit the green task set through the test listener and fail the deploy if smoke tests don't pass. And CloudWatch alarms on the deployment group auto-rollback if 5xx rate or latency crosses a threshold during the shift.
resource "aws_codedeploy_deployment_group" "web" {
app_name = aws_codedeploy_app.web.name
deployment_group_name = "web-bluegreen"
service_role_arn = aws_iam_role.codedeploy.arn
deployment_config_name = "CodeDeployDefault.ECSCanary10Percent5Minutes"
auto_rollback_configuration {
enabled = true
events = ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"]
}
alarm_configuration {
enabled = true
alarms = [aws_cloudwatch_metric_alarm.web_5xx.alarm_name]
}
blue_green_deployment_config {
terminate_blue_instances_on_deployment_success {
action = "TERMINATE"
termination_wait_time_in_minutes = 15
}
}
}
The 15-minute termination wait is your free insurance. Blue tasks stay alive after cutover, so if a problem only surfaces under full load, rollback is a traffic flip back to blue, seconds, not a redeploy.
Don't forget connection draining
Zero-downtime also depends on graceful shutdown. Set the target group's deregistration delay long enough to drain in-flight requests, and make sure your container handles SIGTERM by finishing current work before exiting. I learned this when blue/green cutovers were clean but the eventual blue-task termination still cut a handful of long-running requests, the deregistration delay was too short.
Takeaways
- Blue/green with CodeDeploy eliminates the version-overlap error blip and makes rollback a fast traffic flip.
- You need two target groups and an ECS service set to the
CODE_DEPLOYcontroller. - Default to a canary shift with CloudWatch alarms and lifecycle-hook smoke tests for auto-rollback.
- Keep blue tasks alive with a termination wait, and tune deregistration delay plus
SIGTERMhandling so in-flight requests drain.