The scariest deploys I've run were the all-or-nothing kind: merge, deploy to 100% of traffic, and pray the dashboards stay flat. When something broke, the only lever was a full rollback, and by then the damage was done. Progressive delivery flips that. You decouple deploy from release, ship code dark, and turn it on for a sliver of traffic while watching the metrics that matter.

On AWS, you can assemble a solid progressive delivery setup from managed pieces. Here's the stack I use and how the parts fit.

Deploy is not release

The core idea: getting code onto servers (deploy) should be boring and frequent. Deciding who experiences a new behavior (release) should be a runtime control you can dial up or roll back in seconds, without a redeploy. Feature flags are what make that separation real.

Two complementary mechanisms cover most needs:

  • Feature flags for per-user / per-cohort logic, managed by AWS AppConfig (the service AWS points CloudWatch Evidently users to now that Evidently is being retired).
  • Traffic shifting at the infrastructure layer: canary and linear deployments via CodeDeploy for Lambda and ECS.

Feature flags with AppConfig

AppConfig serves flag configuration to your app and, importantly, validates config changes and rolls them out through a deployment strategy (gradual rollout, bake time, alarm-based rollback), the same canary discipline you apply to code. You fetch the flag at runtime; flipping it is a config deployment, not a code release.

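import json

import boto3

appconfig = boto3.client("appconfigdata")

# Start a session once. Each token is single-use, so keep the next one from every poll.
session = appconfig.start_configuration_session(
    ApplicationIdentifier="checkout",
    EnvironmentIdentifier="prod",
    ConfigurationProfileIdentifier="feature-flags",
)
token = session["InitialConfigurationToken"]

resp = appconfig.get_latest_configuration(ConfigurationToken=token)
token = resp["NextPollConfigurationToken"]  # pass this to the next poll
# The body is empty when nothing has changed since the previous poll.
flags = json.loads(resp["Configuration"].read() or b"{}")

def new_pricing_enabled(user_id):
    # user_id is unused here: this reads the flag as a global on/off switch.
    flag = flags.get("new-pricing-engine", {})
    return flag.get("enabled", False)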

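Because the flag lives in config, a bad feature is disabled by a config rollback, with no pipeline run required. How fast that reaches your app depends on the deployment strategy and how often clients poll; with an all-at-once strategy and a short poll interval, it's a matter of seconds.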
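For a flag flip to reach a long-running process, the app has to keep polling. Here is a minimal sketch of a cache that does that, assuming the client is a boto3 "appconfigdata" client (or anything with the same two methods). The FlagCache name, the 15-second default interval, and the "enabled" key layout are illustrative assumptions, not part of any AWS SDK.

```python
import json
import time


class FlagCache:
    """Caches AppConfig flag data and refreshes it with the rolling poll token."""

    def __init__(self, client, application, environment, profile, min_interval=15.0):
        self._client = client
        self._min_interval = min_interval  # assumed refresh interval, in seconds
        self._last_poll = None
        self._flags = {}
        session = client.start_configuration_session(
            ApplicationIdentifier=application,
            EnvironmentIdentifier=environment,
            ConfigurationProfileIdentifier=profile,
        )
        self._token = session["InitialConfigurationToken"]

    def _refresh(self):
        resp = self._client.get_latest_configuration(ConfigurationToken=self._token)
        # Each token is single-use; the response carries the one for the next poll.
        self._token = resp["NextPollConfigurationToken"]
        body = resp["Configuration"].read()
        if body:  # an empty body means "no change since the last poll"
            self._flags = json.loads(body)

    def enabled(self, name, now=None):
        """Return the flag's on/off state, polling first if the cache is stale."""
        now = time.monotonic() if now is None else now
        if self._last_poll is None or now - self._last_poll >= self._min_interval:
            self._refresh()
            self._last_poll = now
        return bool(self._flags.get(name, {}).get("enabled", False))
```

In an app you would build it once, for example FlagCache(boto3.client("appconfigdata"), "checkout", "prod", "feature-flags"), and call cache.enabled("new-pricing-engine") on the request path. Injecting the client keeps the class easy to test with a fake.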

Canary traffic shifting with CodeDeploy

For the infrastructure layer, CodeDeploy shifts traffic to a new ECS task set or Lambda version on a schedule, then automatically rolls back if a CloudWatch alarm fires during the bake. Here's an ECS deployment configured for a 10%-then-rest canary with alarm-based rollback in Terraform:

resource "aws_codedeploy_deployment_group" "checkout" {
  app_name               = aws_codedeploy_app.checkout.name
  deployment_group_name  = "checkout-prod"
  service_role_arn       = aws_iam_role.codedeploy.arn
  deployment_config_name = "CodeDeployDefault.ECSCanary10Percent5Minutes"

  auto_rollback_configuration {
    enabled = true
    events  = ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"]
  }

  alarm_configuration {
    enabled = true
    alarms  = [aws_cloudwatch_metric_alarm.checkout_5xx.alarm_name]
  }

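  blue_green_deployment_config {
    deployment_ready_option {
      action_on_timeout = "CONTINUE_DEPLOYMENT"
    }
    terminate_blue_instances_on_deployment_success {
      action                           = "TERMINATE"
      termination_wait_time_in_minutes = 5 # assumed value
    }
  }

  # ECS requires blue/green with traffic control; the resources referenced below are assumed names.
  deployment_style {
    deployment_option = "WITH_TRAFFIC_CONTROL"
    deployment_type   = "BLUE_GREEN"
  }
  ecs_service {
    cluster_name = aws_ecs_cluster.main.name
    service_name = aws_ecs_service.checkout.name
  }
  load_balancer_info {
    target_group_pair_info {
      prod_traffic_route { listener_arns = [aws_lb_listener.checkout.arn] }
      target_group { name = aws_lb_target_group.blue.name }
      target_group { name = aws_lb_target_group.green.name }
    }
  }
}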

With ECSCanary10Percent5Minutes, 10% of traffic goes to the new version, bakes for 5 minutes against the 5xx alarm, and only then shifts the remaining 90%. If the alarm trips during the bake, CodeDeploy reverts to the old task set automatically.
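To make the shape of that schedule concrete, here is a small sketch that models how much traffic the new version carries over time for a canary config and, for comparison, a linear one such as CodeDeployDefault.ECSLinear10PercentEvery1Minutes. The function names are made up for illustration, and the model assumes the first increment lands at minute zero and that no alarm fires.

```python
def canary_percent(minutes_elapsed, first_step_percent=10, bake_minutes=5):
    """Share of traffic on the new version for a two-step canary."""
    if minutes_elapsed < 0:
        return 0
    # Hold at the first step for the whole bake, then shift everything.
    return first_step_percent if minutes_elapsed < bake_minutes else 100


def linear_percent(minutes_elapsed, step_percent=10, interval_minutes=1):
    """Share of traffic on the new version for an equal-increment linear shift."""
    if minutes_elapsed < 0:
        return 0
    # Assumption: the first increment is applied at t=0, then one per interval.
    steps = 1 + int(minutes_elapsed // interval_minutes)
    return min(100, steps * step_percent)


if __name__ == "__main__":
    for minute in range(0, 11):
        print(minute, canary_percent(minute), linear_percent(minute))
```

The canary jumps from 10% straight to 100% once the bake ends, while the linear config keeps adding exposure in small steps, which gives the alarm more chances to catch a problem that only shows up under load.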

The point of progressive delivery isn't to ship slower. It's to shrink the blast radius of a mistake from "everyone" to "10% for five minutes," so that being wrong is cheap.

Closing the loop with automated analysis

A canary is only as good as the metric watching it. I define the rollback alarm on the signal that actually represents a bad release, not just CPU. For checkout that's 5xx responses and conversion drop. The 5xx alarm counts target errors per minute at the load balancer:

resource "aws_cloudwatch_metric_alarm" "checkout_5xx" {
  alarm_name          = "checkout-5xx-canary"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "HTTPCode_Target_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  statistic           = "Sum"
  threshold           = 5
  dimensions = { LoadBalancer = aws_lb.checkout.arn_suffix }
}
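It helps to be precise about when the alarm above would actually fire. The sketch below is a simplified model of its settings (Sum per 60-second period, GreaterThanThreshold, threshold 5, one evaluation period). It ignores missing-data handling and is not how CloudWatch is implemented; the function names are illustrative.

```python
def alarm_breaches(sums_per_period, threshold=5, evaluation_periods=1):
    """True if the most recent `evaluation_periods` sums are all above threshold."""
    if len(sums_per_period) < evaluation_periods:
        return False
    recent = sums_per_period[-evaluation_periods:]
    # GreaterThanThreshold is strict: exactly `threshold` errors does not breach.
    return all(value > threshold for value in recent)


def first_breach(sums_per_period, threshold=5, evaluation_periods=1):
    """Index of the first period at which the alarm would breach, or None."""
    for end in range(evaluation_periods, len(sums_per_period) + 1):
        if alarm_breaches(sums_per_period[:end], threshold, evaluation_periods):
            return end - 1
    return None
```

With these settings, a single minute with six or more 5xx responses is enough to trip the rollback, while a minute with exactly five is not. Raising evaluation_periods trades faster rollback for fewer false alarms.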

Putting it together

  1. Merge to main with the new behavior behind a flag, default off; CI builds an immutable artifact.
  2. CodeDeploy shifts 10% of traffic to the new artifact and bakes it against the alarm before shifting the rest.
  3. Enable the flag for internal users, then 5%, then a cohort, via AppConfig.
  4. If anything degrades, flip the flag off or let the alarm roll back the deploy.

Two independent safety nets, deploy-time canary and runtime flag, mean a bad change rarely reaches more than a fraction of users before it's contained.
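The "internal users, then 5%" stage of the flag rollout can be sketched as a deterministic bucketing function evaluated in the app. This is one common approach, not something the AppConfig API does for you in this form: the rollout_percent attribute and the internal_users set are hypothetical names. If you use AppConfig's multi-variant flags, targeting may already be handled on that side.

```python
import hashlib


def bucket(user_id, flag_name):
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag, flag_name, user_id, internal_users=frozenset()):
    """Evaluate one flag dict; `rollout_percent` is a hypothetical attribute."""
    if not flag.get("enabled", False):
        return False  # the kill switch wins over everything else
    if user_id in internal_users:
        return True  # internal users see the feature first
    return bucket(user_id, flag_name) < flag.get("rollout_percent", 0)
```

Hashing the flag name together with the user ID keeps each user's experience stable across requests and avoids the same users always landing in the first slice of every rollout. Raising rollout_percent only adds users; nobody who already has the feature loses it.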

Takeaways

  • Separate deploy from release: ship code dark and control exposure with runtime feature flags.
  • Use AWS AppConfig for per-cohort flags so disabling a feature is a fast config rollback, not a redeploy.
  • Use CodeDeploy canary configs (e.g. ECSCanary10Percent5Minutes) with alarm-based auto-rollback at the infrastructure layer.
  • Wire rollback alarms to the metric that represents a bad release (5xx, conversion), not generic resource metrics.