Chapter 29: Production Operations and Automation

1. Production operations and automation

Chapter 28 put the API on AWS. This chapter is the next layer: the operational scaffolding that keeps it running for real users without manual intervention. CI/CD via GitHub Actions so deploys happen on git push; CloudWatch dashboards and alarms so you see what's happening; auto-scaling so capacity follows load; cost controls so the bill stays predictable; deployment patterns (blue-green, canary) for safer rollouts; runbooks and post-incident reviews for when things break; and SecOps habits (secrets rotation, image scanning, security-group audits) that compound over time.

What you'll learn

The six dimensions of production operations (deployment, observability, capacity, cost, incidents, security) and where each sits on the day-to-day cadence
Building a GitHub Actions CI/CD pipeline that tests, builds an image, pushes to ECR, and triggers an ECS deploy on every merge to main
The four golden signals (latency, traffic, errors, saturation) and how to wire each to a CloudWatch dashboard plus an alarm with a real action
ECS auto-scaling via target-tracking policies, plus the load-test pattern that proves the policy works before production traffic does
AWS cost optimisation: log-retention defaults, ECR lifecycle policies, billing alarms, and the tagging discipline that makes cost-by-team queries possible
Deployment patterns beyond rolling updates: blue-green for instant cutover, canary for gradual rollout, ECS circuit breakers for automatic rollback
Incident response: severity classification, runbook structure, CloudWatch Logs Insights queries for systematic debugging, blameless postmortems
SecOps habits: ECR image scanning, non-root container users, read-only filesystems, secrets in Secrets Manager (not env vars), security-group audits
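To make the four golden signals concrete before the monitoring section wires them to CloudWatch, here is a minimal sketch that reduces one window of request records to the four numbers. The Request record, the golden_signals function, the choice of p95 for latency, and the use of CPU percentage as the saturation measure are all illustrative assumptions, not part of the chapter's code or of any AWS API.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record shape for this sketch)."""
    latency_ms: float
    status: int


def golden_signals(requests, window_seconds, cpu_percent):
    """Summarise one time window into the four golden signals."""
    saturation = cpu_percent / 100
    if not requests:
        return {"latency_p95_ms": 0.0, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": saturation}

    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank p95: ceil(0.95 * n), done in integer arithmetic.
    rank = (95 * len(latencies) + 99) // 100
    errors = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p95_ms": latencies[rank - 1],           # latency
        "traffic_rps": len(requests) / window_seconds,   # traffic
        "error_rate": errors / len(requests),            # errors (5xx only)
        "saturation": saturation,                        # saturation
    }


# Twenty requests over ten seconds; every tenth one fails with a 500.
sample = [Request(latency_ms=10.0 * i, status=500 if i % 10 == 0 else 200)
          for i in range(1, 21)]
signals = golden_signals(sample, window_seconds=10, cpu_percent=65.0)
print(signals)
```

In the deployed service these values would come from load balancer and ECS metrics rather than from in-process records; the sketch only shows what each signal measures.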

What you'll build

.github/workflows/deploy.yml — CI/CD pipeline (test, build, push to ECR, deploy to ECS)
github-actions-policy.json — least-privilege IAM policy for the deploy IAM user
dashboard-config.json — CloudWatch dashboard wiring the four golden signals to one view
scaling-policy.json + memory-scaling-policy.json + request-scaling-policy.json — target-tracking auto-scaling policies (CPU, memory, request count)
Cost guardrails: log-retention shortened, ECR lifecycle policy, billing alarms via CloudWatch
Hardening: ECR scan-on-push enabled, Dockerfile with non-root user, task-definition with read-only filesystem and Secrets Manager references
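As a preview of what the scaling-policy files might contain, the sketch below builds target-tracking configurations as Python dictionaries and prints one as JSON. The field names follow the Application Auto Scaling target-tracking configuration format; the helper name, the target values (70 and 75), and the cooldowns are assumptions chosen for illustration, not the values the chapter settles on.

```python
import json


def target_tracking_policy(metric_type, target_value,
                           scale_out_cooldown=60, scale_in_cooldown=300):
    """Build a target-tracking configuration (hypothetical helper).

    Cooldowns are in seconds. The defaults are placeholders for this sketch.
    """
    return {
        "TargetValue": float(target_value),
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": metric_type,
        },
        "ScaleOutCooldown": scale_out_cooldown,
        "ScaleInCooldown": scale_in_cooldown,
    }


# CPU and memory policies use ECS service-level predefined metrics.
cpu_policy = target_tracking_policy("ECSServiceAverageCPUUtilization", 70)
memory_policy = target_tracking_policy("ECSServiceAverageMemoryUtilization", 75)

print(json.dumps(cpu_policy, indent=2))
```

A request-count policy would use the ALBRequestCountPerTarget metric type, which additionally needs a ResourceLabel identifying the load balancer and target group, so it is left out of this sketch.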

How operations differs from development

The first 28 chapters of the book have the same shape: write code, run it, see the output. Operations runs on a different rhythm. The code is mostly already written; the work is keeping it healthy as traffic, dependencies, and the world around it change. Three properties make ops feel different from the dev work that preceded it:

It runs in the background, not in response to a feature request. Nobody asks for a dashboard or a billing alarm; you build them so that not being paged at 3 a.m. stays the default state.
Failures cluster around boundaries, not bugs. Most production incidents trace to capacity (the database is at 90% CPU), dependencies (a third-party API is returning 503), or configuration drift (a recent deploy quietly changed an env var), rather than to algorithmic bugs in your code.
The cheapest fix is the one you don't have to wake up for. Alarms with real auto-remediation, auto-scaling that absorbs spikes, runbooks that turn intuition into a checklist, deployment patterns that roll back automatically when error rates spike: these are what take ops from heroic effort to background process.
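The idea of rolling back automatically when error rates spike can be sketched as a small decision function. This is a generic illustration under assumed thresholds (a 5% error ceiling and a 50-request minimum sample), not the ECS deployment circuit breaker's own logic; the function name and parameters are hypothetical.

```python
def should_roll_back(status_codes, max_error_rate=0.05, min_requests=50):
    """Decide whether a fresh deploy looks unhealthy enough to revert.

    status_codes: HTTP status codes observed since the deploy started.
    Returns False while the sample is too small to judge.
    """
    if len(status_codes) < min_requests:
        return False
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes) > max_error_rate


healthy = [200] * 98 + [500] * 2    # 2% server errors
failing = [200] * 80 + [503] * 20   # 20% server errors

print(should_roll_back(healthy))  # False
print(should_roll_back(failing))  # True
```

The minimum-sample guard matters: without it, a single early failure out of three requests would trigger a rollback on noise rather than on a real regression.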

The shape of the chapter

Each section takes one operational dimension and grounds it in the AWS deployment from Chapter 28. CI/CD wires GitHub Actions to the ECR + ECS deploy. Monitoring builds the CloudWatch dashboard plus three alarms. Auto-scaling configures the ECS service to grow and shrink with load. Cost optimisation puts guardrails in place so the AWS bill cannot run away unnoticed. Deployment patterns introduce blue-green and canary as upgrades over the rolling default. Incident response covers the human side: severity, runbooks, postmortems. SecOps finishes the chapter by hardening the running infrastructure.

How the chapter is laid out

Understanding production operations. The dev-vs-ops contrast, the six operational dimensions, and how ops work shows up in your week.
CI/CD with GitHub Actions. Workflow file, secrets management, the complete deploy pipeline, end-to-end test.
Monitoring and observability. The four golden signals, dashboard JSON, three meaningful alarms, log queries.
Auto-scaling. Target-tracking policies, load-testing the policy, watching tasks scale in and out.
Cost optimisation. Service-by-service spend, the high-leverage knobs, billing alarms.
Advanced deployment patterns. Blue-green, canary, ECS deployment circuit breakers, and when each is worth the complexity.
Incident response. Severity ladders, runbook structure, Logs Insights queries, blameless postmortems.
Security operations. Image scanning, non-root containers, secrets in Secrets Manager, security-group audits.
Review. Patterns to take forward, a quiz on the load-bearing decisions, and a forward pointer to Chapter 30.

Next, in section 2, we walk through the six operational dimensions explicitly so the rest of the chapter has a mental map of where each piece fits.