10. Chapter review

Your API is now production-operable: it deploys automatically, is monitored in real time, scales with load, keeps cost under control, is prepared for incidents, and is hardened against attack. This page summarises the patterns and locks them in with a quiz.

What we covered

The chapter added the operational layer on top of the Chapter 28 deployment: a GitHub Actions pipeline that takes the manual ECR push and ECS update from Chapter 28 and runs them on every merge to main; a CloudWatch dashboard built around the four golden signals plus three alarms with documented response paths; target-tracking auto-scaling on the ECS service with min 2, max 10, and asymmetric cooldowns; cost guardrails (log retention, ECR lifecycle, billing alarms, resource tagging); ECS deployment circuit breakers for automatic rollback; a runbook structure for incident response with Logs Insights queries as the primary debugging tool; and hardening (ECR scan-on-push, non-root container user, read-only root filesystem, secrets via Parameter Store / Secrets Manager rather than the task definition's environment).

The connective tissue is that every piece has the same artefact shape: a JSON or YAML file committed to the repo. The CI/CD workflow, the dashboard config, the scaling policy, the lifecycle policy, the hardened task definition -- all of them can be reviewed, diffed, and rolled back the same way the application code can. That's what makes the operational layer maintainable over months, not just demonstrable once.

Patterns to take forward

  • Deploys run from a git push, not a terminal. The pipeline -- test, build, tag with the commit SHA, push to ECR, register a task definition revision, update the service, wait for stability -- runs identically every time, gated by the test result. The manual sequence from Chapter 28 still works; it just isn't the path the team has to take.
  • Four metrics, not forty. Latency (P50/P90/P99), traffic, 4xx/5xx error counts, and saturation (CPU, memory, connection pools). Anything beyond those is noise on the main dashboard. The three alarms in this chapter (5xx count, P99 latency, CPU) fire on conditions that warrant pager attention; every other graph is something to look at after the fact.
  • Auto-scaling is asymmetric on purpose. Scale-out cooldown 60s so the policy reacts to load before users notice; scale-in cooldown 300s so a transient lull doesn't strand the service short of capacity when traffic returns. The chapter's load test verifies the asymmetry actually plays out the way the config claims.
  • Cost guardrails are configuration, not vigilance. CloudWatch log retention capped (not "forever"), ECR lifecycle policy keeping the last 10 images, billing alarm at a sensible monthly threshold, every resource tagged with Environment and Project at creation. The bill stays explainable because the guardrails enforce the explanation.
  • Rolling updates with a circuit breaker as the default; blue-green and canary when the failure mode warrants the complexity. The circuit-breaker setting (enable: true, rollback: true with minimumHealthyPercent: 100, maximumPercent: 200) gives you most of blue-green's "auto-rollback on bad deploy" guarantee for the cost of one CLI update.
  • Every alarm has a runbook. The runbook says what triggered it, where to look in CloudWatch, the Logs Insights query to run, and the rollback or remediation steps. Alarms without runbooks get deleted, because they're creating noise rather than directing a response.
  • Postmortems are blameless and concrete. The template is timeline, root cause, impact, what went well, what went poorly, action items with owners and dates. The output is improvements to the system, not a record of who got blamed.
  • Secrets live in secrets[], not environment[]. Parameter Store or Secrets Manager ARNs in the task definition, resolved at task start by the execution role, recorded in CloudTrail. The plaintext value never appears in describe-task-definition output, which is the difference between "leaks on a console screenshot" and "doesn't."
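
The rolling-update setting in the list above can be sketched in Python to show what it permits during a deploy. This is a minimal sketch, not the chapter's own file: the dictionary mirrors the deploymentConfiguration shape of the ECS UpdateService API, and rollout_bounds is a hypothetical helper that assumes ECS rounds the healthy floor up and the task ceiling down.

```python
import math


def deployment_configuration():
    """Rolling update with circuit breaker, using the values quoted in the text."""
    return {
        "deploymentCircuitBreaker": {"enable": True, "rollback": True},
        "minimumHealthyPercent": 100,
        "maximumPercent": 200,
    }


def rollout_bounds(desired_count, config):
    """Return (fewest healthy tasks, most running tasks) allowed mid-deploy.

    Assumption: the floor is rounded up and the ceiling is rounded down.
    """
    floor_tasks = math.ceil(desired_count * config["minimumHealthyPercent"] / 100)
    ceiling_tasks = math.floor(desired_count * config["maximumPercent"] / 100)
    return floor_tasks, ceiling_tasks


if __name__ == "__main__":
    # With 2 desired tasks the service never drops below 2 and may run up to 4.
    print(rollout_bounds(2, deployment_configuration()))
```

With these values the old tasks stay up until their replacements are healthy, which is why the circuit breaker can roll back without a capacity gap.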

Capstone: bring the operational layer together

The chapter introduced each operational dimension on its own page. The capstone is to wire all of them into the Chapter 28 deployment in one focused session and end up with a fully operable news API. A reasonable order, with rough time budgets:

  • CI/CD pipeline (30-45 min). Land .github/workflows/deploy.yml, add the deploy IAM user credentials to GitHub Secrets, push a commit, and confirm the workflow runs through test, build, push to ECR, and ECS deploy.
  • CloudWatch dashboard and alarms (30 min). Apply dashboard-config.json for the four golden signals; add alarms for 5xx error count, P99 latency, and CPU saturation; shorten log retention to 7 days.
  • Auto-scaling (20 min). Register the ECS service as a scalable target (min 2, max 10), apply scaling-policy.json at 70% CPU target tracking, run a small load test to watch tasks scale out and back in.
  • Cost guardrails (15 min). Add a billing alarm at a sensible monthly threshold, apply the ECR lifecycle policy from the cost page, tag every resource with Environment and Project.
  • Deployment safety (15 min). Enable the ECS deployment circuit breaker with rollback; set minimumHealthyPercent: 100, maximumPercent: 200; smoke-test by deploying intentionally broken code and watching the automatic rollback.
  • Security hardening (20 min). Enable ECR scan-on-push, move database credentials to Secrets Manager, audit security-group rules for any 0.0.0.0/0 on non-web ports.

Roughly two to three hours end to end. The output is a deployment whose operational properties are honestly defensible: deploys are automated, the system reports its own health, capacity follows demand, secrets are managed, and rollback is automatic.
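
For the auto-scaling step, the two payloads involved can be sketched as plain Python dictionaries. This is an illustration under assumptions rather than the chapter's scaling-policy.json: the field names follow the Application Auto Scaling RegisterScalableTarget and PutScalingPolicy request shapes, the cooldowns are the 60s/300s values from this chapter, and the cluster and service names are hypothetical.

```python
import json


def scalable_target(cluster, service, min_tasks=2, max_tasks=10):
    """Request body for registering the ECS service as a scalable target."""
    return {
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "MinCapacity": min_tasks,
        "MaxCapacity": max_tasks,
    }


def target_tracking_configuration(target_cpu=70.0):
    """Target-tracking settings: hold average CPU near target_cpu percent."""
    return {
        "TargetValue": target_cpu,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,   # seconds; react quickly to load
        "ScaleInCooldown": 300,   # seconds; wait until a lull is real
    }


if __name__ == "__main__":
    # "news-cluster" and "news-api" are placeholder names, not from the chapter.
    print(json.dumps(scalable_target("news-cluster", "news-api"), indent=2))
    print(json.dumps(target_tracking_configuration(), indent=2))
```

Keeping both payloads as committed files is what lets a change to the target or the cooldowns go through the same review as application code.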

Quiz

Test your understanding with these questions. If you can answer confidently, you've mastered the material:

Why implement CI/CD pipelines with GitHub Actions instead of deploying manually with AWS CLI commands?

The manual sequence works the first few times and then quietly stops working: someone forgets to run tests, someone builds with the wrong env var, someone deploys to staging when they meant prod. Every manual step is a step that can be skipped. The CI/CD pipeline replaces all of that with a fixed sequence (test, build, push, register task definition, update service, wait for stability) that runs identically every time and is gated on the test result, not on whoever is at a terminal. The workflow file lives in the repo, so the deploy process evolves under the same review discipline as the code it ships.

Explain the Golden Signals framework for monitoring. Why monitor these four metrics specifically?

The four signals (latency, traffic, errors, saturation) each correspond to a different failure mode. Latency surfaces slowness in the request path (slow queries, slow third-party APIs, contention on the host). Traffic surfaces usage shape -- both growth and unexpected drops, which often mean clients are giving up. Errors catch bugs, misconfigurations, and dependency failures. Saturation is the leading indicator: a connection pool at 95% usage usually shows up a couple of minutes before the latency and error signals move. The framework's value isn't comprehensiveness; it's that four metrics is few enough that the on-call engineer can keep them in their head, and they cover most of what actually breaks in production.

How does target-tracking auto-scaling decide when to add or remove ECS tasks, and why use different cooldown periods for scale-out versus scale-in?

The policy holds the named metric at the target (70% CPU here): when the service average rises above it, Application Auto Scaling raises the ECS service's desired task count until the metric falls back; when the average drops below it, tasks are removed. Cooldowns prevent the policy from acting on its own previous action. After scaling out, the 60-second cooldown lets new tasks start and pass health checks before another scaling decision is made; without it, the policy launches too many tasks. After scaling in, the longer 300-second cooldown waits until the lull is real, because if traffic resumes and tasks were just terminated, the policy has put the service in a capacity hole. The asymmetry maps to the asymmetric cost of the two failures: scaling out too much is money; scaling in too soon is downtime.
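
The decision logic above can be approximated with a toy model. This is a simplified sketch, not how Application Auto Scaling is implemented: it assumes the desired count scales in proportion to how far the metric sits from the target, and it treats the cooldowns as simple timers. The class and method names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class TargetTracker:
    """Toy target-tracking policy with asymmetric cooldowns (seconds)."""
    target: float = 70.0
    min_tasks: int = 2
    max_tasks: int = 10
    scale_out_cooldown: int = 60
    scale_in_cooldown: int = 300
    tasks: int = 2
    last_scale_out: float = -math.inf
    last_scale_in: float = -math.inf

    def step(self, now, avg_cpu):
        """Feed one metric sample at time `now`; return the resulting task count."""
        # Proportional estimate of the task count that would bring CPU to target.
        wanted = math.ceil(self.tasks * avg_cpu / self.target)
        wanted = max(self.min_tasks, min(self.max_tasks, wanted))
        if wanted > self.tasks:
            if now - self.last_scale_out >= self.scale_out_cooldown:
                self.tasks, self.last_scale_out = wanted, now
        elif wanted < self.tasks:
            # Scale-in waits out the longer cooldown after any scaling activity.
            last_activity = max(self.last_scale_out, self.last_scale_in)
            if now - last_activity >= self.scale_in_cooldown:
                self.tasks, self.last_scale_in = wanted, now
        return self.tasks


if __name__ == "__main__":
    tracker = TargetTracker()
    for now, cpu in [(0, 95), (30, 95), (60, 95), (120, 30), (360, 30)]:
        print(now, cpu, tracker.step(now, cpu))
```

In this run the service grows at t=0 and t=60 but ignores the sample at t=30, and it holds its size through the lull at t=120 before shrinking at t=360, which is the asymmetry the answer describes.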

What's the difference between blue-green deployments and rolling deployments, and when would you choose one over the other?

Rolling: start a new task, wait for it to pass the health check, stop an old one, repeat (with minimumHealthyPercent: 100 and maximumPercent: 200, new tasks come up before old ones are stopped). Old and new versions overlap during the rollout, which is fine for backwards-compatible changes and uncomfortable for ones that aren't. Rollback means redeploying the previous task definition, which is also a rolling update -- slow if things are really broken. Blue-green: a parallel green stack is brought up alongside the running blue stack, validated on a separate target group, then traffic cuts over at the ALB listener level. Rollback is one listener-rule update back to blue, in seconds. The cost is running two stacks for the duration of the deploy. The rule of thumb: rolling-with-circuit-breaker for everything by default; blue-green for the deploys where mixed-version traffic or a slow rollback would actually hurt.

You notice your AWS bill increased from $120 to $180 over two months. How would you investigate what drove the cost increase?

Open Cost Explorer, filter to the relevant window, group by service. That tells you which service line moved. If it's ECS, check task-count history and any CPU/memory allocation changes. If it's RDS, look for instance-class changes and storage growth. If it's data transfer, look at ALB traffic and any S3 movement. Cost allocation tags let you split the spike across environments (was it dev or prod?) or projects. Common causes: auto-scaling stayed scaled out longer than expected, a test resource was never deleted, log retention quietly accumulated months of data, or Free Tier expired. Once you've named the driver, the decision is whether it's growth (fine) or waste (delete it).
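
The "group by service, see which line moved" step amounts to a per-service difference between two months. A small sketch, assuming the monthly totals have already been exported into dictionaries; the service labels and dollar figures below are made up to match the $120 to $180 scenario and are not real billing data.

```python
def cost_deltas(before, after):
    """Return (service, change in dollars) pairs, largest increase first."""
    services = set(before) | set(after)
    deltas = {
        service: round(after.get(service, 0.0) - before.get(service, 0.0), 2)
        for service in services
    }
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical per-service totals for the two months being compared.
    month_one = {"ECS": 55.0, "RDS": 40.0, "CloudWatch": 10.0, "Data Transfer": 15.0}
    month_two = {"ECS": 85.0, "RDS": 41.0, "CloudWatch": 32.0,
                 "Data Transfer": 15.0, "ECR": 7.0}
    for service, change in cost_deltas(month_one, month_two):
        print(f"{service:15s} {change:+.2f}")
```

In this invented example ECS and CloudWatch account for most of the increase, which would point the investigation at task-count history and log retention first.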

Your API starts returning 500 errors at 14:30 UTC. Walk through your incident response process using CloudWatch.

Confirm first: open the dashboard, compare error rate, latency, and traffic against baseline. Then go to Logs Insights and run fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 50 against the window around 14:30. Look for the dominant error message; the usual suspects are connection-pool exhaustion, external API timeouts, and OOM. Cross-reference Container Insights for CPU, memory, and task count to rule out saturation. Check ECS service events for task failures. Once the root cause is named (say, the connection pool maxed out at 10), apply the immediate fix (raise the pool size), push it through the CI/CD pipeline from section 3, and confirm the error graph drops. Postmortem captures the timeline, the cause, the fix, and the prevention work (in this example: pool-utilisation metric and alarm).
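
"Look for the dominant error message" can also be done mechanically once the query results are exported. A minimal sketch, assuming the log messages are available as plain strings; the sample lines are invented, and collapsing digits is just one heuristic for grouping near-identical messages.

```python
import re
from collections import Counter


def dominant_errors(messages, top=3):
    """Count ERROR messages, treating lines that differ only in numbers as one."""
    counts = Counter()
    for message in messages:
        if "ERROR" not in message:
            continue
        # Replace digit runs so "after 5003ms" and "after 5127ms" group together.
        counts[re.sub(r"\d+", "N", message)] += 1
    return counts.most_common(top)


if __name__ == "__main__":
    # Hypothetical log lines from the window around the incident.
    sample = [
        "ERROR connection pool exhausted (10/10)",
        "INFO request ok",
        "ERROR connection pool exhausted (10/10)",
        "ERROR upstream timeout after 5003ms",
        "ERROR connection pool exhausted (10/10)",
    ]
    for message, count in dominant_errors(sample):
        print(count, message)
```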

Why store secrets in AWS Secrets Manager instead of environment variables in ECS task definitions?

Anything in the task definition's environment field shows up in plaintext in describe-task-definition output, which is visible to anyone with ECS read permission. Secrets Manager (or Parameter Store) keeps the value encrypted at rest with KMS, gates access through IAM, and logs every retrieval in CloudTrail. When the task definition references the secret as an ARN under secrets[] instead, ECS fetches the value at task start via the execution role and injects it as an env var inside the container -- the container code sees the same env var it always did, but describe-task-definition only reveals the ARN. The other operational win is that rotating the secret doesn't require a new task definition revision: any task that starts after the rotation picks up the new value automatically, while tasks already running keep the old value until they are replaced (for example, by forcing a new deployment).
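
A quick audit for the environment-versus-secrets distinction could look like the sketch below. It assumes a container definition in the shape ECS task definitions use (environment entries with name/value, secrets entries with name/valueFrom); the name-matching heuristic, the sample values, and the ARN are all hypothetical.

```python
# Substrings that suggest a variable holds a credential (heuristic, not exhaustive).
SUSPECT_MARKERS = ("PASSWORD", "SECRET", "TOKEN", "KEY")


def plaintext_secret_names(container_definition):
    """Return names under environment[] that look like they belong in secrets[]."""
    return [
        entry["name"]
        for entry in container_definition.get("environment", [])
        if any(marker in entry["name"].upper() for marker in SUSPECT_MARKERS)
    ]


if __name__ == "__main__":
    # Hypothetical container definition with one misplaced credential.
    container = {
        "name": "news-api",
        "environment": [
            {"name": "LOG_LEVEL", "value": "info"},
            {"name": "DB_PASSWORD", "value": "hunter2"},
        ],
        "secrets": [
            {
                "name": "NEWS_API_KEY",
                "valueFrom": "arn:aws:ssm:us-east-1:123456789012:parameter/news-api/key",
            }
        ],
    }
    print(plaintext_secret_names(container))
```

Only DB_PASSWORD is flagged here: NEWS_API_KEY is already referenced by ARN, so its value never appears in the task definition.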

What turns a one-shot deployment into something that runs reliably for months?

The difference is whether the operational layer exists at all. A one-shot deploy gets the code running and then nothing watches it; six months later, drift, secrets leakage, untracked spend, and silent regressions accumulate. The operational layer the chapter added has seven pieces, and each one is doing a specific job: CI/CD removes human error from the deploy step; the dashboard plus three alarms surface problems before users notice; auto-scaling sizes capacity to load; cost guardrails keep the bill explainable; deployment circuit breakers roll back automatically when a deploy goes bad; runbooks make incidents resolvable by anyone on call; secrets management plus security-group audits keep the credential and network surface tight. None of those is heroic on its own; together they're what makes the deployment maintainable.

Your CloudWatch dashboard shows latency spiking from 200ms to 2 seconds, but CPU and memory utilization remain at 40%. What would you investigate first?

Low CPU/memory with high latency points away from the compute tier and toward something downstream. Start with database query time: fields @timestamp, @message | filter @message like /query took/ | parse @message /query took (?<duration>\d+)ms/ | filter duration > 1000 | sort duration desc. If query times look normal, check the external APIs the news endpoint calls (News API, Guardian) for degradation. Then look at the Redis hit rate; a sudden drop in hit rate means every request is taking the slow uncached path, which is the same shape as a latency spike with healthy CPU. The general rule: latency without saturation is almost always a dependency problem, not a capacity problem.
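
The same filter can be applied to exported log lines in Python when the Logs Insights console isn't to hand. A minimal sketch: it assumes the application logs lines containing "query took <n>ms", as the query above does, and the sample lines are invented. Note that Python spells the named group (?P<duration>...) rather than (?<duration>...).

```python
import re

QUERY_TOOK = re.compile(r"query took (?P<duration>\d+)ms")


def slow_queries(messages, threshold_ms=1000):
    """Return (duration_ms, message) pairs above the threshold, slowest first."""
    hits = []
    for message in messages:
        match = QUERY_TOOK.search(message)
        if match and int(match["duration"]) > threshold_ms:
            hits.append((int(match["duration"]), message))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    # Hypothetical log lines.
    sample = [
        "articles query took 120ms",
        "articles query took 2300ms",
        "cache miss for key headlines",
        "sources query took 1450ms",
    ]
    for duration, message in slow_queries(sample):
        print(duration, message)
```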

Your ECS service is configured for auto-scaling from 2 to 10 tasks, but it never scales beyond 5 tasks even when CPU hits 95%. What could prevent scaling to maximum capacity?

Several things can cap scaling below the configured maximum: (1) Task placement: if the service has too few subnets or availability zones configured, ECS can't place additional tasks even though the service allows 10. (2) ENI and IP capacity: each Fargate task needs its own elastic network interface, so an exhausted subnet CIDR block stops new tasks from launching. (3) Service quotas: AWS accounts have default per-region limits on concurrent Fargate tasks (often 100-500); check the current values under Service Quotas in the AWS Console. (4) Target-tracking behaviour: the policy evaluates metrics roughly every 60 seconds and acts conservatively, so if five tasks are enough to bring average CPU back under the 70% target, scaling stops there despite brief spikes to 95%. To diagnose, check ECS service events for task placement failures, review CloudWatch metrics to see whether CPU stays above the target long enough to trigger further scaling, and verify that the VPC subnets have IP addresses available. The order is deliberate: configuration limits first, then capacity constraints, then metric interpretation.
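
The IP-exhaustion case is easy to estimate by hand. The sketch below assumes each task consumes one private IP in its subnet and that AWS reserves five addresses in every subnet; the CIDR blocks and the in-use count are hypothetical, and the live figure is better read from each subnet's available-address count in the VPC console.

```python
import ipaddress

# AWS reserves the first four addresses and the last one in every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr):
    """Addresses a subnet can hand out after AWS's reserved ones."""
    total = ipaddress.ip_network(cidr).num_addresses
    return max(0, total - AWS_RESERVED_PER_SUBNET)


def task_headroom(subnet_cidrs, addresses_in_use):
    """How many more one-IP tasks fit across the given subnets."""
    return sum(usable_ips(cidr) for cidr in subnet_cidrs) - addresses_in_use


if __name__ == "__main__":
    # Two hypothetical /28 subnets with 17 addresses already taken by
    # running tasks, load balancer nodes, and other interfaces.
    print(task_headroom(["10.0.1.0/28", "10.0.2.0/28"], addresses_in_use=17))
```

Two /28 subnets give 22 usable addresses in total, so with 17 in use only five more tasks fit, which is one way a service can stall well short of its configured maximum.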

Looking ahead

Chapter 28 made the deployment work; Chapter 29 made it operable. Chapter 30 is the capstone: every piece from Chapters 1-29 gets pulled into one deployed news-aggregator project, with this chapter's CI/CD pipeline and CloudWatch monitoring as the operational substrate. The capstone is structured as three phases (core, production, extension) so you can decide how far to take it; the operational layer from this chapter ships in phase two alongside the AWS deploy.

Next, in Chapter 30, we lay out the capstone's architecture, the phase plan, and the evaluation criteria for calling the project done.