Zero-downtime deploys on ECS with CodeDeploy, The Cloud Ledger

For a long time our "deploy" was a rolling ECS update that occasionally served 502s for thirty seconds while old and new tasks fought over connections. It worked until a bad release shipped a config bug and we had no fast way back: the old tasks were already gone, so rolling back meant a full redeploy. Moving to blue/green deployments with CodeDeploy fixed both problems: no error blip during the cutover, and an instant rollback that just flips traffic back.

Here's how the pieces fit and the gotchas that cost me a few late nights.

Rolling vs blue/green on ECS

ECS's native rolling update replaces tasks in place: it drains some old tasks, starts new ones, and repeats. It's simple and costs nothing extra, but during the overlap you're serving two versions, and rollback means redeploying the previous version, which is slow. CodeDeploy's blue/green stands up a completely separate "green" task set behind a second target group, validates it, then shifts the load balancer's traffic from blue to green. The old tasks stick around for a configurable wait after cutover, so rollback is just a traffic flip.

What you need wired up

An ECS service configured with deployment controller type CODE_DEPLOY.
An Application Load Balancer with two target groups (blue and green).
A production listener, and optionally a test listener for pre-cutover validation.
A CodeDeploy application and deployment group referencing all of the above.

{
  "version": 0.0,
  "Resources": [{
    "TargetService": {
      "Type": "AWS::ECS::Service",
      "Properties": {
        "TaskDefinition": "<TASK_DEFINITION>",
        "LoadBalancerInfo": {
          "ContainerName": "web",
          "ContainerPort": 8080
        }
      }
    }
  }]
}

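That AppSpec file (JSON here; YAML works too) tells CodeDeploy which task definition to deploy and which container and port to register with the green target group. The <TASK_DEFINITION> placeholder is filled in at deploy time by whatever drives the deployment, for example CodePipeline's ECS blue/green deploy action; CodeDeploy itself needs a real task definition ARN in that field.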
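If you drive deployments from your own script rather than a pipeline, the placeholder substitution can be sketched in a few lines of Python. This is a minimal illustration under assumptions: fill_appspec is a hypothetical helper name, and the ARN in the usage example is made up.

```python
import json

PLACEHOLDER = "<TASK_DEFINITION>"


def fill_appspec(template_text, task_definition_arn):
    """Replace the task definition placeholder and return the parsed AppSpec."""
    if PLACEHOLDER not in template_text:
        raise ValueError("template has no <TASK_DEFINITION> placeholder")
    rendered = template_text.replace(PLACEHOLDER, task_definition_arn)
    # Parsing confirms the result is still valid JSON before it is used.
    return json.loads(rendered)


if __name__ == "__main__":
    template = """
    {
      "version": 0.0,
      "Resources": [{
        "TargetService": {
          "Type": "AWS::ECS::Service",
          "Properties": {
            "TaskDefinition": "<TASK_DEFINITION>",
            "LoadBalancerInfo": {"ContainerName": "web", "ContainerPort": 8080}
          }
        }
      }]
    }
    """
    # Made-up ARN, for illustration only.
    arn = "arn:aws:ecs:us-east-1:123456789012:task-definition/web:42"
    print(json.dumps(fill_appspec(template, arn), indent=2))
```

Failing loudly when the placeholder is missing is a deliberate choice here: a template that silently deploys a hard-coded task definition is harder to debug than one that refuses to render.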

Choose a traffic-shifting strategy

CodeDeploy gives you three shift styles, and the choice is a real trade between speed and safety.

Strategy	Behavior	When I use it
All-at-once	100% flips instantly	Internal tools, low risk
Canary	X% now, rest after N minutes	Default for prod services
Linear	+X% every N minutes	High-traffic, want gradual exposure

I default to CodeDeployDefault.ECSCanary10Percent5Minutes: 10% of traffic goes to green for 5 minutes, then the rest follows. That window is enough for alarms and lifecycle hooks to catch a bad release before it sees full load.
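To make the difference between canary and linear concrete, here is a small Python model of how much traffic green receives over time. It is only an illustration of the arithmetic, not how CodeDeploy is implemented, and the function names are hypothetical.

```python
def canary_schedule(first_percent, wait_minutes):
    """Canary: one small step now, then everything after the wait."""
    return [(0, first_percent), (wait_minutes, 100)]


def linear_schedule(step_percent, interval_minutes):
    """Linear: equal steps at a fixed interval until green has 100%."""
    if step_percent <= 0:
        raise ValueError("step_percent must be positive")
    steps = []
    percent, minute = 0, 0
    while percent < 100:
        percent = min(100, percent + step_percent)
        steps.append((minute, percent))
        minute += interval_minutes
    return steps


if __name__ == "__main__":
    # Each tuple is (minutes since the shift started, percent of traffic on green).
    print(canary_schedule(10, 5))
    print(linear_schedule(10, 1))
```

With a 10%/5-minute canary, the whole shift takes 5 minutes and only one tenth of traffic is exposed during the watch window. A linear shift of 10% every minute takes 9 minutes to reach 100%, but exposure keeps growing while you watch.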

Validate before you fully cut over

The bake window is only useful if something is watching. Two mechanisms do that. Lifecycle hooks let you run a Lambda function at AfterAllowTestTraffic that hits the green task set through the test listener and fails the deploy if smoke tests don't pass. CloudWatch alarms attached to the deployment group trigger an automatic rollback if the 5xx rate or latency crosses a threshold during the shift.

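# Assumed names, not from my real config: aws_ecs_cluster.main, aws_ecs_service.web,
# aws_lb_listener.prod/test, aws_lb_target_group.blue/green.
resource "aws_codedeploy_deployment_group" "web" {
  app_name               = aws_codedeploy_app.web.name # app needs compute_platform = "ECS"
  deployment_group_name  = "web-bluegreen"
  service_role_arn       = aws_iam_role.codedeploy.arn
  deployment_config_name = "CodeDeployDefault.ECSCanary10Percent5Minutes"

  deployment_style {
    deployment_type   = "BLUE_GREEN"
    deployment_option = "WITH_TRAFFIC_CONTROL"
  }

  ecs_service {
    cluster_name = aws_ecs_cluster.main.name
    service_name = aws_ecs_service.web.name
  }

  auto_rollback_configuration {
    enabled = true
    events  = ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"]
  }

  alarm_configuration {
    enabled = true
    alarms  = [aws_cloudwatch_metric_alarm.web_5xx.alarm_name]
  }

  blue_green_deployment_config {
    deployment_ready_option { action_on_timeout = "CONTINUE_DEPLOYMENT" }
    terminate_blue_instances_on_deployment_success {
      action                           = "TERMINATE"
      termination_wait_time_in_minutes = 15
    }
  }

  load_balancer_info {
    target_group_pair_info {
      prod_traffic_route { listener_arns = [aws_lb_listener.prod.arn] }
      test_traffic_route { listener_arns = [aws_lb_listener.test.arn] } # optional
      target_group { name = aws_lb_target_group.blue.name }
      target_group { name = aws_lb_target_group.green.name }
    }
  }
}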

The 15-minute termination wait is cheap insurance. Blue tasks stay alive after cutover, so if a problem only surfaces under full load, rollback is a traffic flip back to blue that takes seconds, not a redeploy.
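Going back to the lifecycle hook mentioned earlier in this section, the Lambda side might look roughly like the sketch below. Treat it as an assumption-laden example rather than my production function: the TEST_LISTENER_URL environment variable and the smoke_test helper are hypothetical, and the function has to be listed under AfterAllowTestTraffic in the AppSpec's Hooks section for CodeDeploy to invoke it. The one real API call is boto3's put_lifecycle_event_hook_execution_status, which reports the verdict back to CodeDeploy.

```python
import os
import urllib.request


def smoke_test(url, timeout=5.0):
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError, HTTPError, and socket timeouts derive from OSError
        return False


def handler(event, context, client=None, check=smoke_test):
    """AfterAllowTestTraffic hook: smoke-test green, then report the result."""
    if client is None:
        import boto3  # imported lazily so the module loads without AWS access
        client = boto3.client("codedeploy")

    # Hypothetical variable: the URL of the test listener in front of green.
    url = os.environ["TEST_LISTENER_URL"]
    status = "Succeeded" if check(url) else "Failed"

    # CodeDeploy waits for this callback before continuing or rolling back.
    client.put_lifecycle_event_hook_execution_status(
        deploymentId=event["DeploymentId"],
        lifecycleEventHookExecutionId=event["LifecycleEventHookExecutionId"],
        status=status,
    )
    return status
```

The client and check parameters exist only so the handler can be exercised locally with fakes; Lambda itself calls handler(event, context) and the defaults take over.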

Don't forget connection draining

Zero-downtime also depends on graceful shutdown. Set the target group's deregistration delay long enough to drain in-flight requests, and make sure your container handles SIGTERM by finishing current work before exiting. I learned this when blue/green cutovers were clean but the eventual blue-task termination still cut a handful of long-running requests because the deregistration delay was too short.
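On the container side, the SIGTERM handling can be as simple as a flag that the work loop checks between units of work. The sketch below assumes a Python service with a plain job loop; GracefulShutdown and serve are hypothetical names, and a real web server would normally rely on its framework's own shutdown hook instead. ECS sends SIGTERM first and follows with SIGKILL once the task's stop timeout expires, so the drain has to finish inside that window.

```python
import signal


class GracefulShutdown:
    """Records that SIGTERM arrived so the work loop can stop cleanly."""

    def __init__(self):
        self.stop_requested = False

    def install(self):
        # Signal handlers can only be registered from the main thread.
        signal.signal(signal.SIGTERM, self.request_stop)

    def request_stop(self, signum, frame):
        # Only set a flag; the loop decides when it is safe to exit.
        self.stop_requested = True


def serve(shutdown, jobs, handle):
    """Process jobs in order; after a stop request, finish the current job and take no more."""
    finished = []
    for job in jobs:
        if shutdown.stop_requested:
            break
        finished.append(handle(job))
    return finished
```

In a real service you would call install() once at startup and exit the process after serve returns, so the in-flight request completes before the container goes away.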

Takeaways

Blue/green with CodeDeploy eliminates the version-overlap error blip and makes rollback a fast traffic flip.
You need two target groups and an ECS service set to the CODE_DEPLOY controller.
Default to a canary shift with CloudWatch alarms and lifecycle-hook smoke tests for auto-rollback.
Keep blue tasks alive with a termination wait, and tune deregistration delay plus SIGTERM handling so in-flight requests drain.