7. Advanced deployment patterns

Rolling deployments with a circuit breaker are the right default for most services. Blue-green (instant cutover, instant rollback) and canary (gradual rollout with metric gates) are for the cases where that default isn't enough.

The default ECS rolling update is fine for most deploys: stop one old task, start one new one, wait for the health check, repeat. The capacity dip during the swap is brief, and the deployment circuit breaker (covered later in this page) rolls back automatically when health checks don't come back. Two other patterns -- blue-green and canary -- exist for the cases where even that small capacity dip or that small window of "is the new version healthy?" matters.

This page covers all three. The circuit-breaker work is hands-on: it's a one-command update to the existing ECS service. Blue-green and canary are described in enough detail that you can decide when they're worth the extra moving parts, but we don't build the full infrastructure here. Both require duplicate target groups, weighted listener rules, and a deploy script with metric gates, which is more than fits cleanly into the chapter's "one worked example" shape.

Blue-green deployments

Blue-green deployments run two complete production environments: "blue" (current version) and "green" (new version). You deploy the new version to green, test it thoroughly, then switch traffic from blue to green. If problems occur, switch traffic back to blue instantly. The entire infrastructure runs both versions briefly, then only the new version continues.

ECS implements blue-green through target groups. Create two target groups (blue and green) behind the same load balancer. The blue target group routes to the current tasks. Deploy the new tasks into the green target group and test green thoroughly, then modify the load balancer listener to route traffic to green instead of blue. The old blue tasks keep running, ready for instant rollback, until you're confident in green; then terminate them.
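The cutover step comes down to changing which target group the listener forwards to. Here is a minimal sketch of that payload in Python, assuming two target groups already exist. The ARNs are placeholders and `forward_all_to` is a hypothetical helper name, not part of any AWS SDK.

```python
import json

# Placeholder ARNs (assumptions): substitute the target groups from your own account.
BLUE_TG = "arn:aws:elasticloadbalancing:REGION:ACCOUNT_ID:targetgroup/news-api-blue/EXAMPLE"
GREEN_TG = "arn:aws:elasticloadbalancing:REGION:ACCOUNT_ID:targetgroup/news-api-green/EXAMPLE"


def forward_all_to(target_group_arn):
    """Build a listener action list that forwards every request to one target group."""
    return [{"Type": "forward", "TargetGroupArn": target_group_arn}]


cutover = forward_all_to(GREEN_TG)   # switch traffic to the new version
rollback = forward_all_to(BLUE_TG)   # switch back if green misbehaves

print(json.dumps(cutover, indent=2))
```

The printed JSON has the shape that `aws elbv2 modify-listener` accepts for `--default-actions`. Rolling back is the same call with the blue ARN, which is why it takes seconds.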

Advantages

Instant rollback: Switch load balancer traffic back to blue target group in seconds, not minutes
Thorough testing: The green environment runs on production infrastructure, so you can exercise the new version there before it takes over user traffic
Minimal risk: Problems found during testing affect only the green environment, which is not yet serving users; blue remains available

Disadvantages

Double infrastructure cost: Both blue and green run simultaneously during deployment
Database complexity: Both versions share the same database, so schema changes require careful compatibility planning
Configuration overhead: Managing two target groups, two task definitions, and load balancer rules adds complexity

Blue-green is worth the extra complexity when even a brief capacity dip is unacceptable and the budget supports running two copies of the stack during deploys. Checkout flows, payment processing, and healthcare APIs are the canonical examples.

Canary deployments

Canary deployments gradually roll out new versions to small percentages of traffic. Instead of switching 100% of users to the new version at once, route 5% of traffic to the new version while monitoring error rates and latency. If metrics look good, increase to 25%, then 50%, then 100%. If metrics degrade, halt the rollout and investigate.

The name is from the coal-mine canary: a small, sensitive sample exposed first to detect problems before the rest of the population is. A canary deployment does the same thing with traffic: a small fraction of real users hit the new version, and the metrics from that fraction decide whether the rollout proceeds.

Implementing canaries with ECS: Use weighted target groups on your Application Load Balancer. Configure the listener rules to route 95% of traffic to the stable version and 5% to the canary version. Monitor CloudWatch metrics for the canary target group. If error rates, latency, or other metrics exceed their thresholds, roll back automatically. If the metrics look good, gradually increase the canary weight to 25%, then 50%, then 100%.
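The ramp-and-gate loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production deploy script. `weighted_forward` and `run_canary` are hypothetical helpers, the target group names are placeholders, and the error rates are simulated rather than read from CloudWatch. A real script would also wait between setting the weights and reading the metric.

```python
def weighted_forward(stable_arn, canary_arn, canary_pct):
    """Build a forward action that splits traffic between two target groups."""
    if not 0 <= canary_pct <= 100:
        raise ValueError("canary_pct must be between 0 and 100")
    return [{
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": stable_arn, "Weight": 100 - canary_pct},
                {"TargetGroupArn": canary_arn, "Weight": canary_pct},
            ]
        },
    }]


def run_canary(stages, apply_weights, error_rate, max_error_rate):
    """Walk through the stages; return True if the rollout finished, False if rolled back."""
    for pct in stages:
        apply_weights(pct)
        # A real script would sleep here so the metric reflects the new weights.
        if error_rate() > max_error_rate:
            apply_weights(0)  # send all traffic back to the stable version
            return False
    return True


# Simulated run: the third reading breaches the 1% threshold.
readings = iter([0.001, 0.002, 0.05])
applied = []


def record_weights(pct):
    applied.append(weighted_forward("stable-tg-arn", "canary-tg-arn", pct))


completed = run_canary([5, 25, 50, 100], record_weights, lambda: next(readings), 0.01)
canary_weights = [a[0]["ForwardConfig"]["TargetGroups"][1]["Weight"] for a in applied]
print(completed, canary_weights)  # False [5, 25, 50, 0]
```

Passing the metric reader and the weight setter in as functions keeps the gate logic testable without touching AWS. In a real deploy they would wrap a CloudWatch query and a listener update.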

Advantages

Early detection: Problems affect only 5% of users initially, not everyone
Gradual confidence building: Each successful stage increases confidence in the new version
Real-world validation: Canary traffic comes from real users under actual production conditions, not from synthetic tests

Disadvantages

Slow rollout: Full deployment takes hours instead of minutes (deliberate, but requires patience)
Some users affected by bugs: The 5% canary traffic experiences problems before rollback occurs
Metric interpretation complexity: Low canary traffic means small sample sizes, so 5% of traffic might not trigger rare bugs

Canary deployments make most sense for high-traffic applications where the 5% sample is large enough to reveal a problem within minutes. Recommendation engines, search ranking changes, and feed algorithms are good fits because the regression signal is statistical and only shows up at volume.
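To get a feel for "large enough", here is a rough back-of-the-envelope calculation in Python. It assumes each request independently hits the bug with a fixed probability, which is a simplification. The traffic numbers are invented for illustration, and `detection_probability` is a hypothetical helper.

```python
def detection_probability(requests_per_min, canary_fraction, bug_rate, minutes):
    """Chance that at least one canary request hits the bug within the window."""
    canary_requests = requests_per_min * canary_fraction * minutes
    return 1 - (1 - bug_rate) ** canary_requests


# A bug that affects 1 in 10,000 requests, observed for 10 minutes at 5% canary traffic.
low_traffic = detection_probability(1_000, 0.05, 1 / 10_000, 10)
high_traffic = detection_probability(100_000, 0.05, 1 / 10_000, 10)

print(f"1,000 req/min:   {low_traffic:.1%}")
print(f"100,000 req/min: {high_traffic:.1%}")
```

At 1,000 requests per minute the canary sees about 500 requests in ten minutes and will probably miss the bug entirely. At 100,000 requests per minute it sees about 50,000 and will almost certainly catch it.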

ECS deployment circuit breakers

ECS deployment circuit breakers automatically roll back deployments when tasks fail to start or fail health checks repeatedly. Without circuit breakers, failed deployments continue retrying until you manually intervene, wasting time and potentially causing extended outages.

Enable the circuit breaker for your service

Terminal

aws ecs update-service \
    --cluster news-api-cluster \
    --service news-api-service \
    --deployment-configuration '{
        "deploymentCircuitBreaker": {
            "enable": true,
            "rollback": true
        },
        "minimumHealthyPercent": 100,
        "maximumPercent": 200
    }'

What this configuration does

enable: true turns on the circuit breaker, so ECS monitors deployment health.

rollback: true automatically rolls the service back to the last deployment that completed successfully when the new deployment fails repeatedly.

minimumHealthyPercent: 100 never drops below 100% of the desired task count during deployment. If you want 2 tasks, ECS maintains at least 2 healthy tasks throughout the deployment.

maximumPercent: 200 allows up to 200% of the desired task count during deployment. With 2 tasks desired, ECS can temporarily run up to 4 tasks (2 old, 2 new), so capacity never drops.
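The arithmetic behind those two percentages is easy to check for other task counts. The sketch below uses a hypothetical helper, `task_bounds`. It relies on ECS rounding the minimum up and the maximum down to the nearest whole task.

```python
def task_bounds(desired, minimum_healthy_percent, maximum_percent):
    """Return (fewest healthy tasks, most running tasks) allowed during a deployment."""
    # Integer arithmetic: round the lower bound up and the upper bound down.
    lower = -(-desired * minimum_healthy_percent // 100)
    upper = desired * maximum_percent // 100
    return lower, upper


print(task_bounds(2, 100, 200))  # (2, 4): the configuration above
print(task_bounds(4, 50, 150))   # (2, 6): trades a capacity dip for less headroom
print(task_bounds(3, 50, 150))   # (2, 4): 1.5 rounds up, 4.5 rounds down
```

When the lower bound equals the desired count, ECS has to start new tasks before it stops old ones. That ordering is what removes the capacity dip.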

These settings implement zero-downtime rolling deployments: new tasks start before old tasks stop, the circuit breaker rolls back if new tasks fail, and you never drop below desired capacity. This configuration provides most of blue-green's benefits (automatic rollback, no capacity drop) without the complexity of managing separate target groups. The rollback is itself a rolling deployment, so it takes minutes rather than the seconds of a blue-green listener switch.
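Once the circuit breaker is on, each deployment reported by `aws ecs describe-services` carries a `rolloutState` of `IN_PROGRESS`, `COMPLETED`, or `FAILED`, and `FAILED` is what the circuit breaker sets when it gives up on a deployment. Here is a small sketch of reading that state in Python. The `service` dictionary is a hand-written sample that keeps only the relevant fields, not real output, and `rollout_summary` is a hypothetical helper.

```python
def rollout_summary(service):
    """Return (rolloutState, reason) for the PRIMARY deployment of one service entry."""
    for deployment in service.get("deployments", []):
        if deployment.get("status") == "PRIMARY":
            return deployment.get("rolloutState"), deployment.get("rolloutStateReason", "")
    return None, ""


# Hand-written sample shaped like one entry of the "services" list (illustrative only).
service = {
    "serviceName": "news-api-service",
    "deployments": [
        {
            "status": "PRIMARY",
            "rolloutState": "IN_PROGRESS",
            "rolloutStateReason": "sample reason text",
        },
    ],
}

state, reason = rollout_summary(service)
print(state, "-", reason)
```

A deploy script could poll this value and exit non-zero on `FAILED`, so that a rolled-back deployment also fails the pipeline step that triggered it.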

Deployment strategy decision matrix

Choose deployment strategies based on your application's risk tolerance, traffic patterns, and operational requirements:

Use rolling deployments with circuit breakers for most applications.

They provide good safety (automatic rollback), zero downtime (tasks overlap), and simple configuration. This should be your default.

Use blue-green deployments for critical systems.

When instant rollback is essential and infrastructure cost matters less than reliability: financial systems, healthcare applications, e-commerce checkouts, and anywhere else downtime is unacceptable and the budget supports double infrastructure during deployment.

Use canary deployments for high-traffic applications.

Where the regression signal needs real user traffic to surface. Recommendation algorithms, search ranking changes, and UI redesigns are the cases where tests pass but the change is bad anyway, and you only find out from how real users behave under it.

Combine strategies for maximum safety.

Run a canary on top of blue-green infrastructure: 5% of traffic on the green target group, monitor, ramp gradually, then cut blue out. You get blue-green's instant rollback and the canary's metric-gated validation, at the cost of running two stacks plus weighted listener rules. For mission-critical systems the combination is worth the complexity; for everything else it's overengineering.

Next, in section 8, we cover the human side: severity ladders, the runbook structure that turns intuition into a checklist, CloudWatch Logs Insights queries for systematic debugging, and the blameless postmortem template.