Chapter 29: Production Operations and Automation
1. Production operations and automation
Chapter 28 put the API on AWS. This chapter is the next layer: the operational scaffolding that keeps it running for real users without manual intervention. CI/CD via GitHub Actions so deploys happen on git push; CloudWatch dashboards and alarms so you see what's happening; auto-scaling so capacity follows load; cost controls so the bill stays predictable; deployment patterns (blue-green, canary) for safer rollouts; runbooks and post-incident reviews for when things break; and SecOps habits (secrets rotation, image scanning, security-group audits) that compound over time.
- The six dimensions of production operations (deployment, observability, capacity, cost, incidents, security) and where each sits on the day-to-day cadence
- Building a GitHub Actions CI/CD pipeline that tests, builds an image, pushes to ECR, and triggers an ECS deploy on every merge to main
- The four golden signals (latency, traffic, errors, saturation) and how to wire each to a CloudWatch dashboard plus an alarm with a real action
- ECS auto-scaling via target-tracking policies, plus the load-test pattern that proves the policy works before production traffic does
- AWS cost optimisation: log-retention defaults, ECR lifecycle policies, billing alarms, and the tagging discipline that makes cost-by-team queries possible
- Deployment patterns beyond rolling updates: blue-green for instant cutover, canary for gradual rollout, ECS circuit breakers for automatic rollback
- Incident response: severity classification, runbook structure, CloudWatch Logs Insights queries for systematic debugging, blameless postmortems
- SecOps habits: ECR image scanning, non-root container users, read-only filesystems, secrets in Secrets Manager (not env vars), security-group audits
- .github/workflows/deploy.yml: CI/CD pipeline (test, build, push to ECR, deploy to ECS)
- github-actions-policy.json: least-privilege IAM policy for the deploy IAM user
- dashboard-config.json: CloudWatch dashboard wiring the four golden signals to one view
- scaling-policy.json + memory-scaling-policy.json + request-scaling-policy.json: target-tracking auto-scaling policies (CPU, memory, request count)
- Cost guardrails: log-retention shortened, ECR lifecycle policy, billing alarms via CloudWatch
- Hardening: ECR scan-on-push enabled, Dockerfile with non-root user, task-definition with read-only filesystem and Secrets Manager references
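The three scaling-policy files in the list above share one shape, so a minimal sketch can generate all of them from a single helper. The field names follow the target-tracking configuration that Application Auto Scaling accepts. The target values, the cooldowns, and the ALB resource label are illustrative assumptions rather than figures from this chapter, and the mapping of file names to metrics simply follows the order given in the list.

```python
import json


def target_tracking_policy(metric_type, target_value, resource_label=None):
    """Build a target-tracking configuration for an ECS service."""
    spec = {"PredefinedMetricType": metric_type}
    if resource_label is not None:
        # Only ALBRequestCountPerTarget needs a label naming the target group.
        spec["ResourceLabel"] = resource_label
    return {
        "TargetValue": float(target_value),
        "PredefinedMetricSpecification": spec,
        "ScaleOutCooldown": 60,   # assumption: add tasks quickly
        "ScaleInCooldown": 300,   # assumption: remove tasks slowly
    }


# Hypothetical placeholder; the real label comes from your own ALB and target group.
ALB_LABEL = "app/example-alb/0123456789abcdef/targetgroup/example-tg/fedcba9876543210"

POLICIES = {
    "scaling-policy.json": target_tracking_policy(
        "ECSServiceAverageCPUUtilization", 70
    ),
    "memory-scaling-policy.json": target_tracking_policy(
        "ECSServiceAverageMemoryUtilization", 75
    ),
    "request-scaling-policy.json": target_tracking_policy(
        "ALBRequestCountPerTarget", 1000, resource_label=ALB_LABEL
    ),
}


def render(policies=POLICIES):
    """Return {file name: JSON text} without touching the filesystem."""
    return {name: json.dumps(body, indent=2) for name, body in policies.items()}
```

Each rendered document is the kind of file you would hand to `aws application-autoscaling put-scaling-policy` through its `--target-tracking-scaling-policy-configuration file://...` option, alongside the service namespace, resource ID, and scalable dimension for your ECS service.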
How operations differs from development
The first 28 chapters of the book have the same shape: write code, run it, see the output. Operations runs on a different rhythm. The code is mostly already written; the work is keeping it healthy as traffic, dependencies, and the world around it change. Three properties make ops feel different from the dev work that preceded it:
- It runs in the background, not in response to a feature request. Nobody asks for a dashboard or a billing alarm; you build them so the absence-of-a-page-at-3am stays the default state.
- Failures cluster around boundaries, not bugs. Most production incidents trace to capacity (the database is at 90% CPU), dependencies (a third-party API is returning 503), or configuration drift (a recent deploy quietly changed an env var), not algorithmic bugs in your code.
- The cheapest fix is the one you don't have to wake up for. Alarms with real auto-remediation, auto-scaling that absorbs spikes, runbooks that turn intuition into a checklist, deployment patterns that roll back automatically when error rates spike: these are what take ops from heroic effort to background process.
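As a rough illustration of "roll back automatically", here is a sketch of the deployment configuration an ECS service can carry. The field names match the ECS `deploymentConfiguration` structure; the helper name, the percentages, and the alarm name in the usage line are hypothetical. Note the split: the circuit breaker reacts to tasks that fail to start or stay healthy, while rollback on an error-rate spike relies on the optional CloudWatch alarm list.

```python
def deployment_configuration(rollback_alarm_names=None):
    """Build an ECS deploymentConfiguration with automatic rollback enabled."""
    config = {
        # Stop and roll back a deployment whose tasks never become healthy.
        "deploymentCircuitBreaker": {"enable": True, "rollback": True},
        # Assumption: keep full capacity during a rolling update.
        "minimumHealthyPercent": 100,
        "maximumPercent": 200,
    }
    if rollback_alarm_names:
        # Roll back when any named CloudWatch alarm fires mid-deployment.
        config["alarms"] = {
            "alarmNames": list(rollback_alarm_names),
            "enable": True,
            "rollback": True,
        }
    return config


# Hypothetical alarm name, for illustration only.
example = deployment_configuration(["api-errors-high"])
```

With boto3 installed, the returned dictionary could be passed as the `deploymentConfiguration` argument of `ecs.update_service(...)`.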
The shape of the chapter
Each sub-page takes one operational dimension and grounds it in the AWS deployment from Chapter 28. CI/CD wires GitHub Actions to the ECR + ECS deploy. Monitoring builds the CloudWatch dashboard plus three alarms. Auto-scaling configures the ECS service to grow and shrink with load. Cost optimisation puts guardrails on a runaway AWS bill. Deployment patterns introduce blue-green and canary as upgrades over the rolling default. Incident response covers the human side: severity, runbooks, postmortems. SecOps finishes by hardening the running infrastructure.
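To make the monitoring piece concrete, here is a minimal sketch that maps each golden signal to a CloudWatch metric and builds the arguments for one alarm. It assumes the Chapter 28 service sits behind an Application Load Balancer; the alarm naming scheme, period, evaluation window, and every threshold are illustrative assumptions, and `alarm_kwargs` is a hypothetical helper rather than part of any AWS SDK.

```python
# Golden signal -> (CloudWatch namespace, metric name, statistic).
GOLDEN_SIGNALS = {
    "latency": ("AWS/ApplicationELB", "TargetResponseTime", "Average"),
    "traffic": ("AWS/ApplicationELB", "RequestCount", "Sum"),
    "errors": ("AWS/ApplicationELB", "HTTPCode_Target_5XX_Count", "Sum"),
    "saturation": ("AWS/ECS", "CPUUtilization", "Average"),
}


def alarm_kwargs(signal, threshold, dimensions, sns_topic_arn):
    """Build keyword arguments shaped for CloudWatch put_metric_alarm."""
    namespace, metric, statistic = GOLDEN_SIGNALS[signal]
    return {
        "AlarmName": f"api-{signal}-high",  # assumption: naming scheme
        "Namespace": namespace,
        "MetricName": metric,
        "Statistic": statistic,
        "Dimensions": dimensions,
        "Period": 60,                # one-minute datapoints
        "EvaluationPeriods": 3,      # three breaching minutes in a row
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # Quiet periods with no requests should not page anyone.
        "TreatMissingData": "notBreaching",
        # The "real action": notify an SNS topic when the alarm fires.
        "AlarmActions": [sns_topic_arn],
    }
```

With boto3 installed, the result could be passed along as `boto3.client("cloudwatch").put_metric_alarm(**alarm_kwargs(...))`, with `dimensions` naming your own load balancer or ECS cluster and service.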
How the chapter is laid out
- Understanding production operations. The dev-vs-ops contrast, the six operational dimensions, and how ops work shows up in your week.
- CI/CD with GitHub Actions. Workflow file, secrets management, the complete deploy pipeline, end-to-end test.
- Monitoring and observability. The four golden signals, dashboard JSON, three meaningful alarms, log queries.
- Auto-scaling. Target-tracking policies, load-testing the policy, watching tasks scale in and out.
- Cost optimisation. Service-by-service spend, the high-leverage knobs, billing alarms.
- Advanced deployment patterns. Blue-green, canary, ECS deployment circuit breakers, and when each is worth the complexity.
- Incident response. Severity ladders, runbook structure, Logs Insights queries, blameless postmortems.
- Security operations. Image scanning, non-root containers, secrets in Secrets Manager, security-group audits.
- Review. Patterns to take forward, a quiz on the load-bearing decisions, and a forward pointer to Chapter 30.
Next, in section 2, we walk through the six operational dimensions explicitly so the rest of the chapter has a mental map of where each piece fits.