Chapter 29: Production Operations and Automation

1. Production operations and automation

Chapter 28 put the API on AWS. This chapter adds the next layer: the operational scaffolding that keeps it running for real users without manual intervention. It covers CI/CD via GitHub Actions, so deploys happen on every merge to main; CloudWatch dashboards and alarms, so you can see what's happening; auto-scaling, so capacity follows load; cost controls, so the bill stays predictable; deployment patterns (blue-green, canary) for safer rollouts; runbooks and post-incident reviews for when things break; and SecOps habits (secrets rotation, image scanning, security-group audits) that compound over time.

What you'll learn
  • The six dimensions of production operations (deployment, observability, capacity, cost, incidents, security) and where each sits on the day-to-day cadence
  • Building a GitHub Actions CI/CD pipeline that tests, builds an image, pushes to ECR, and triggers an ECS deploy on every merge to main
  • The four golden signals (latency, traffic, errors, saturation) and how to wire each to a CloudWatch dashboard plus an alarm with a real action
  • ECS auto-scaling via target-tracking policies, plus the load-test pattern that proves the policy works before production traffic does
  • AWS cost optimisation: log-retention defaults, ECR lifecycle policies, billing alarms, and the tagging discipline that makes cost-by-team queries possible
  • Deployment patterns beyond rolling updates: blue-green for instant cutover, canary for gradual rollout, ECS circuit breakers for automatic rollback
  • Incident response: severity classification, runbook structure, CloudWatch Logs Insights queries for systematic debugging, blameless postmortems
  • SecOps habits: ECR image scanning, non-root container users, read-only filesystems, secrets in Secrets Manager (not env vars), security-group audits
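
As a rough illustration of the four golden signals named above, the sketch below derives them from a handful of request records. The Request type, the golden_signals helper, the sample data, and the choice of p95 latency and CPU as the saturation measure are all assumptions made for this example; in the chapter itself the signals come from CloudWatch metrics rather than hand-rolled code.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Request:
    """Hypothetical request record: latency in milliseconds and HTTP status."""
    latency_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, cpu_percent):
    """Summarise one observation window as the four golden signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    latencies = [r.latency_ms for r in requests]
    server_errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": percentile(latencies, 95),   # latency
        "traffic_rps": len(requests) / window_seconds,  # traffic
        "error_rate": server_errors / len(requests),    # errors
        "saturation_cpu_percent": cpu_percent,          # saturation (assumed: CPU)
    }


# Illustrative data only: four requests observed over a two-second window.
sample = [Request(40, 200), Request(55, 200), Request(900, 503), Request(60, 200)]
signals = golden_signals(sample, window_seconds=2, cpu_percent=63.0)
print(signals)
```

One slow, failing request is enough to dominate both the latency and the error signal in a window this small, which is why alarms are usually evaluated over longer periods.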

What you'll build
  • .github/workflows/deploy.yml — CI/CD pipeline (test, build, push to ECR, deploy to ECS)
  • github-actions-policy.json — least-privilege IAM policy for the deploy IAM user
  • dashboard-config.json — CloudWatch dashboard wiring the four golden signals to one view
  • scaling-policy.json + memory-scaling-policy.json + request-scaling-policy.json — target-tracking auto-scaling policies (CPU, memory, request count)
  • Cost guardrails: log-retention shortened, ECR lifecycle policy, billing alarms via CloudWatch
  • Hardening: ECR scan-on-push enabled, Dockerfile with non-root user, task-definition with read-only filesystem and Secrets Manager references
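
To make the scaling-policy files in the list above more concrete, here is a minimal sketch that generates the general shape of a target-tracking configuration for Application Auto Scaling. The target_tracking_policy helper, the target values (70 and 80), and the cooldowns are assumptions chosen for illustration, not the values the chapter settles on.

```python
import json


def target_tracking_policy(metric_type, target_value,
                           scale_out_cooldown=60, scale_in_cooldown=300,
                           resource_label=None):
    """Build a target-tracking configuration as a plain dict.

    Hypothetical helper: the defaults here are illustrative assumptions.
    """
    spec = {"PredefinedMetricType": metric_type}
    if resource_label is not None:
        # Only request-count tracking needs a label for the load balancer target.
        spec["ResourceLabel"] = resource_label
    return {
        "TargetValue": float(target_value),
        "PredefinedMetricSpecification": spec,
        "ScaleOutCooldown": scale_out_cooldown,  # seconds before another scale-out
        "ScaleInCooldown": scale_in_cooldown,    # seconds before another scale-in
    }


# Assumed targets: hold average CPU near 70% and average memory near 80%.
cpu_policy = target_tracking_policy("ECSServiceAverageCPUUtilization", 70)
memory_policy = target_tracking_policy("ECSServiceAverageMemoryUtilization", 80)

print(json.dumps(cpu_policy, indent=2))
```

The request-count variant uses the ALBRequestCountPerTarget metric type and additionally needs a ResourceLabel identifying the load balancer and target group, which is why the helper accepts one. A longer scale-in cooldown than scale-out cooldown is a common choice: adding capacity quickly is cheap, while removing it too eagerly causes flapping.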

How operations differs from development

The first 28 chapters of the book have the same shape: write code, run it, see the output. Operations runs on a different rhythm. The code is mostly already written; the work is keeping it healthy as traffic, dependencies, and the world around it change. Three properties make ops feel different from the dev work that preceded it:

  • It runs in the background, not in response to a feature request. Nobody asks for a dashboard or a billing alarm; you build them so that a night without a 3 a.m. page stays the default state.
  • Failures cluster around boundaries, not bugs. Most production incidents trace to capacity (the database is at 90% CPU), dependencies (a third-party API is returning 503), or configuration drift (a recent deploy quietly changed an env var), not algorithmic bugs in your code.
  • The cheapest fix is the one you don't have to wake up for. Alarms with real auto-remediation, auto-scaling that absorbs spikes, runbooks that turn intuition into a checklist, and deployment patterns that roll back automatically when error rates spike: these are what take ops from heroic effort to background process.

The shape of the chapter

Each sub-page takes one operational dimension and grounds it in the AWS deployment from Chapter 28. CI/CD wires GitHub Actions to the ECR + ECS deploy. Monitoring builds the CloudWatch dashboard plus three alarms. Auto-scaling configures the ECS service to grow and shrink with load. Cost optimisation puts guardrails on runaway AWS spend. Deployment patterns introduce blue-green and canary as upgrades over the rolling default. Incidents covers the human side: severity, runbooks, postmortems. SecOps finishes the chapter by hardening the running infrastructure.

How the chapter is laid out

  • Understanding production operations. The dev-vs-ops contrast, the six operational dimensions, and how ops work shows up in your week.
  • CI/CD with GitHub Actions. Workflow file, secrets management, the complete deploy pipeline, end-to-end test.
  • Monitoring and observability. The four golden signals, dashboard JSON, three meaningful alarms, log queries.
  • Auto-scaling. Target-tracking policies, load-testing the policy, and watching tasks scale in and out.
  • Cost optimisation. Service-by-service spend, the high-leverage knobs, billing alarms.
  • Advanced deployment patterns. Blue-green, canary, ECS deployment circuit breakers, and when each is worth the complexity.
  • Incident response. Severity ladders, runbook structure, Logs Insights queries, blameless postmortems.
  • Security operations. Image scanning, non-root containers, secrets in Secrets Manager, security-group audits.
  • Review. Patterns to take forward, a quiz on the load-bearing decisions, and a forward pointer to Chapter 30.

Next, in section 2, we walk through the six operational dimensions explicitly so the rest of the chapter has a mental map of where each piece fits.