8. Chapter review
Your API is live on AWS behind HTTPS, with managed Postgres and Redis, running from an image stored in ECR. This page summarises the deployment patterns and locks them in with a quiz.
The chapter started with a Compose stack on a laptop and ended with the same components running on AWS managed services: image in ECR, container on Fargate, Postgres on RDS, Redis on ElastiCache, traffic fronted by an ALB. The mapping is one-to-one; the engineering work was in the connective tissue: IAM scope for who can do what, security groups for what can talk to what, and a task definition that pins the deployment shape in version control.
The names are AWS-specific but the patterns aren't. A container registry feeding a container orchestrator behind a load balancer, with managed data services off to one side, is the shape of almost every cloud deployment regardless of provider. The next chapter takes this same stack and adds the operational layer on top: CI/CD, monitoring, scaling, cost.
Patterns to take forward
- Root account stays locked. MFA on the root login on day one, daily work through an IAM user with bounded permissions, billing budget wired up before anything else is provisioned. The pattern bounds the blast radius of any single credential leak.
- Image tagging is a deploy contract. Push all three tags on every release: :latest for the convenience handle, the semver tag for what the task definition pins to, and the commit-SHA tag for forensics. Task definitions never pin to :latest in production because re-pushing the tag silently changes what the next task starts.
- Managed services for stateful tier, containers for stateless. RDS and ElastiCache run the data layer because nobody wants to write their own backup-and-failover code on a weekend. The API itself is stateless and disposable, which is what lets ECS replace tasks at will.
- Security group sources, not CIDR ranges. The RDS inbound rule's source is the ECS service's security group, not 0.0.0.0/0 or even the VPC CIDR. That's how you keep the database reachable only by tasks in the service, not by anything that happens to share the VPC.
- The task definition is the deployment. Committed to the repo, versioned through revisions, registered with the service. A rollback is a one-line change pointing the service at an older revision; redeploying is registering a new revision and updating the service. There's no out-of-band state to track.
- Secrets out of environment, into secrets. Anything in the task definition's environment is visible to anyone with ECS read permission via describe-task-definition. secrets takes a Parameter Store ARN and resolves at task start, keeping the value out of the description.
- ALB target groups, not direct container access. Tasks have ephemeral IPs and there's no place to attach a TLS certificate to a Fargate task. The ALB supplies the stable hostname, the health-check loop that decides which tasks get traffic, and the HTTPS listener.
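The tagging contract from the list above can be sketched in a few lines. This is an illustration only: the function names, the seven-character short SHA, and the repository URI in the usage line are assumptions, not something the chapter prescribes.

```python
import re

# Plain MAJOR.MINOR.PATCH; pre-release suffixes are left out to keep the sketch small.
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")


def release_tags(repo_uri: str, version: str, commit_sha: str) -> list[str]:
    """Return the three image references to push for one release."""
    if not SEMVER.match(version):
        raise ValueError(f"not a semver version: {version!r}")
    short_sha = commit_sha[:7]  # assumption: a 7-character short SHA is enough
    return [
        f"{repo_uri}:latest",       # convenience handle
        f"{repo_uri}:{version}",    # what the task definition pins to
        f"{repo_uri}:{short_sha}",  # forensics
    ]


def assert_pinned(image: str) -> None:
    """Reject a task-definition image that is untagged or uses :latest."""
    _, sep, tag = image.rpartition(":")
    # A "/" in the tag means the colon belonged to a registry port, not a tag.
    if not sep or "/" in tag or tag == "latest":
        raise ValueError(f"image must pin a specific tag: {image}")
```

Calling release_tags with a hypothetical URI such as "123456789012.dkr.ecr.us-east-1.amazonaws.com/news-api" yields the three references to push; running assert_pinned on the task definition's image before registering it is one way to catch an accidental :latest.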
Quiz
Test your understanding with these questions. If you can answer confidently, you've mastered the material:
Why use managed databases (RDS, ElastiCache) instead of running PostgreSQL and Redis in ECS containers?
RDS and ElastiCache do the things a containerised database makes you write yourself: nightly snapshots, point-in-time recovery, automatic failover to a standby in another Availability Zone when Multi-AZ is enabled, online storage scaling, and patch windows. Running Postgres in a container means owning all of that, with no built-in automation when something fails at 3am. The cost gap is small at chapter scale; the operational gap is large.
Explain the ECS hierarchy: Cluster -> Service -> Task Definition -> Task. How do these components work together?
A cluster is a namespace. A task definition is the blueprint for a task: one or more containers, each with its image, CPU/memory, env, and log driver. It is versioned through revisions. A service is the control loop that keeps a target number of tasks running against a specific task definition revision. A task is one running instance of that blueprint. The chapter's deployment: cluster news-api-cluster holds service news-api-service, which runs news-api-task:1 at desired-count 2. If a task dies, the service starts a replacement; if you register news-api-task:2 and update the service, ECS replaces the old tasks with new ones in a rolling deployment.
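The control loop described above can be modelled in a few lines of Python. This is a toy model rather than the ECS scheduler: the Service class and its methods are hypothetical, tasks are just integers, and the real rollout order depends on the service's deployment configuration.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Service:
    """Toy stand-in for an ECS service: keeps N tasks on one revision."""
    task_definition: str                       # family:revision
    desired_count: int
    tasks: dict = field(default_factory=dict)  # task id -> family:revision
    _next_id: count = field(default_factory=lambda: count(1), repr=False)

    def _start_task(self) -> int:
        task_id = next(self._next_id)
        self.tasks[task_id] = self.task_definition
        return task_id

    def reconcile(self) -> None:
        """Start tasks until the running count matches desired_count."""
        while len(self.tasks) < self.desired_count:
            self._start_task()

    def task_died(self, task_id: int) -> None:
        """A task exited; the service notices and starts a replacement."""
        del self.tasks[task_id]
        self.reconcile()

    def update(self, new_task_definition: str) -> None:
        """Point the service at a new revision and replace old tasks."""
        self.task_definition = new_task_definition
        for task_id, revision in list(self.tasks.items()):
            if revision != new_task_definition:
                self._start_task()       # bring up a task on the new revision
                del self.tasks[task_id]  # then retire the old one


service = Service("news-api-task:1", desired_count=2)
service.reconcile()                  # two tasks on revision 1
service.task_died(1)                 # task 1 is replaced by task 3
service.update("news-api-task:2")    # both tasks move to revision 2
print(service.tasks)  # {4: 'news-api-task:2', 5: 'news-api-task:2'}
```

The point of the sketch is that the service, not the operator, owns the running count: both failure recovery and deployment reduce to "make reality match the declared revision and count".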
What problems do Application Load Balancers solve that direct container access doesn't?
Five things, all of which would otherwise be the application's problem. (1) A stable public hostname, since task IPs change on every restart. (2) Traffic distribution across all healthy tasks without the client having to know how many there are. (3) Health-check-driven failover; unhealthy tasks come out of the pool and replacements come in. (4) TLS termination on the ALB rather than per task, so certificate renewals don't redeploy the app. (5) Connection draining on deploys, which is what keeps rolling updates from dropping in-flight requests.
How do security groups implement least-privilege access control in your AWS infrastructure?
Security groups act as virtual firewalls with default-deny policies: all inbound traffic is blocked unless explicitly allowed. You create inbound rules specifying protocol, port, and source to allow only necessary traffic. For example: RDS security group allows inbound PostgreSQL (port 5432) only from the ECS security group, blocking all other access. ALB security group allows HTTP/HTTPS from the internet (0.0.0.0/0), but ECS security group only allows traffic from the ALB security group, not direct public access. This implements defense in depth: even if one layer is compromised, other layers maintain security boundaries.
Why does the task definition include environment variables for database connection strings instead of hardcoding them in application code?
The same image runs in dev (localhost Postgres), staging (staging RDS), and production (production RDS) without a rebuild, because the connection string is supplied at task start instead of being baked into the binary. Hard-coding the URL would force a per-environment image and a rebuild on every password rotation. The actual production refinement here, as the chapter's warning notes, is to put the password in secrets (Parameter Store) rather than environment, so the value never appears in describe-task-definition output.
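On the application side, this amounts to reading the connection strings from the process environment at startup. The sketch below assumes the variables are named DATABASE_URL and REDIS_URL; those names and the Settings class are illustrative, not taken from the chapter's code. Values supplied through either environment or secrets in the task definition arrive in the container as ordinary environment variables, so the application code is the same in both cases.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str
    redis_url: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from the environment, failing fast if anything is missing."""
    required = ("DATABASE_URL", "REDIS_URL")  # assumed variable names
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError(
            "missing required environment variables: " + ", ".join(missing)
        )
    return Settings(database_url=env["DATABASE_URL"], redis_url=env["REDIS_URL"])
```

Failing at startup when a variable is missing means a misconfigured task never becomes healthy, which surfaces the mistake at deploy time rather than on the first database query.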
What is the purpose of health check grace periods in ECS services, and why are they necessary?
Health check grace periods (e.g., 60 seconds) allow containers time to start up before health checks affect service stability. Applications need time to initialize: loading configuration, connecting to databases, warming caches, and becoming ready to serve traffic. Without grace periods, containers might fail health checks during legitimate startup, causing ECS to unnecessarily restart them in a failure loop. The grace period says "don't treat health check failures as problems for the first 60 seconds." After the grace period, health checks determine whether containers stay in the load balancer target group. This prevents false positives during startup while maintaining real health monitoring once containers are expected to be ready.
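The rule in that answer fits in one small function. This is a simplified model of the decision, not ECS's implementation; the function name is hypothetical and the 60-second default mirrors the example figure above. On a real service the setting is the healthCheckGracePeriodSeconds field.

```python
def should_replace(
    task_age_seconds: float,
    health_check_passed: bool,
    grace_period_seconds: float = 60.0,
) -> bool:
    """Decide whether a failed health check should count against a task."""
    if health_check_passed:
        return False
    # Failures during the grace period are ignored: the app may still be booting.
    return task_age_seconds >= grace_period_seconds
```

A task that fails a check 20 seconds after launch is left alone; the same failure at 90 seconds marks it for replacement.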
Why create separate security groups for the ALB, ECS containers, and databases instead of using one security group for everything?
Each tier opens to the tier immediately above it, and nothing else. The ALB security group accepts 80/443 from 0.0.0.0/0 because that's the only thing that should be reachable from the internet. The ECS security group accepts 8000 only from the ALB security group. The RDS security group accepts 5432 only from the ECS security group; same for ElastiCache on 6379. A compromised container can still hit the databases (it's supposed to) but can't bypass the ALB to talk to other services, and a compromised box outside the VPC can't reach the database at all.
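The tiering can be checked with a small model of inbound rules. The sketch uses hypothetical security group names (alb-sg, ecs-sg, rds-sg, redis-sg) and treats a caller as either the string "internet" or the name of the security group it belongs to; real security groups match on more than this.

```python
from dataclasses import dataclass

ANYWHERE = "0.0.0.0/0"


@dataclass(frozen=True)
class Rule:
    port: int
    source: str  # ANYWHERE or the name of a security group


# Inbound rules per security group, following the ports in the answer above.
INBOUND = {
    "alb-sg": [Rule(80, ANYWHERE), Rule(443, ANYWHERE)],
    "ecs-sg": [Rule(8000, "alb-sg")],
    "rds-sg": [Rule(5432, "ecs-sg")],
    "redis-sg": [Rule(6379, "ecs-sg")],
}


def allowed(caller: str, target_sg: str, port: int) -> bool:
    """Default deny: traffic passes only if some inbound rule matches it."""
    for rule in INBOUND.get(target_sg, []):
        if rule.port == port and rule.source in (ANYWHERE, caller):
            return True
    return False


print(allowed("internet", "alb-sg", 443))   # True
print(allowed("internet", "ecs-sg", 8000))  # False
print(allowed("ecs-sg", "rds-sg", 5432))    # True
print(allowed("alb-sg", "rds-sg", 5432))    # False
```

Collapsing everything into one shared group would make every one of those checks pass for any member, which is exactly the boundary the separate groups exist to keep.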
What happens when an ECS task fails its health checks? Describe the automatic recovery process.
When a task fails health checks after the grace period: (1) The ALB marks the target as unhealthy and stops routing new requests to it, but allows existing connections to complete (connection draining). (2) ECS detects the unhealthy task and stops it. (3) ECS immediately starts a replacement task to maintain desired count. (4) The new task registers with the ALB target group and begins health checks. (5) After passing the healthy threshold (e.g., 2 consecutive checks), the ALB routes traffic to the new task. This entire process happens automatically without manual intervention. Users might experience slightly elevated latency during failover, but no downtime if multiple tasks are running.
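The threshold behaviour in steps (1) and (5) is a small state machine over consecutive check results. The sketch below is a simplified model under stated assumptions: both thresholds default to 2 (the answer's example value for the healthy threshold), and only the initial, healthy, and unhealthy states are modelled.

```python
def target_states(checks, healthy_threshold=2, unhealthy_threshold=2):
    """Return the target's state after each health check result.

    checks is an iterable of booleans (True means the check passed).
    A target flips state only after enough consecutive results in a row.
    """
    state = "initial"
    passes = failures = 0
    states = []
    for ok in checks:
        if ok:
            passes, failures = passes + 1, 0
            if passes >= healthy_threshold:
                state = "healthy"
        else:
            passes, failures = 0, failures + 1
            if failures >= unhealthy_threshold:
                state = "unhealthy"
        states.append(state)
    return states


history = target_states([True, True, False, True, False, False, True, True])
print(history)
# ['initial', 'healthy', 'healthy', 'healthy', 'healthy',
#  'unhealthy', 'unhealthy', 'healthy']
```

A single failed check between two passes does not take the target out of rotation; it takes a run of failures to do that, and a run of passes to bring it back.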
Looking ahead
The deployment you just finished is operated by hand. Shipping a code change means: authenticate Docker with ECR, build, tag, push, register a new task-definition revision, update the service, watch the rollout, verify health. The sequence is fine the first time and tedious by the third. It's also fragile, because there's no record of who ran what when, and every step is a place a teammate can skip something.
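Written down, the manual sequence looks roughly like the list the following helper produces. It only builds the shell command strings and does not run anything. The account ID, region, repository name, and task-definition file name in the usage are placeholders, and the final health verification step is left out because it depends on the application's own endpoint.

```python
def deploy_commands(registry, repo, tag, region, cluster, service,
                    task_family, task_def_file):
    """Return the shell commands for one manual deploy, in order."""
    image = f"{registry}/{repo}:{tag}"
    return [
        # Authenticate Docker with ECR.
        f"aws ecr get-login-password --region {region}"
        f" | docker login --username AWS --password-stdin {registry}",
        # Build, tag, push.
        f"docker build -t {repo}:{tag} .",
        f"docker tag {repo}:{tag} {image}",
        f"docker push {image}",
        # Register a new task-definition revision from the committed JSON file.
        f"aws ecs register-task-definition --region {region}"
        f" --cli-input-json file://{task_def_file}",
        # Update the service; a bare family name selects its latest active revision.
        f"aws ecs update-service --region {region} --cluster {cluster}"
        f" --service {service} --task-definition {task_family}",
        # Watch the rollout until the service reaches a steady state.
        f"aws ecs wait services-stable --region {region} --cluster {cluster}"
        f" --services {service}",
    ]


commands = deploy_commands(
    registry="123456789012.dkr.ecr.us-east-1.amazonaws.com",  # placeholder account
    repo="news-api",                                           # assumed repo name
    tag="1.4.2",
    region="us-east-1",
    cluster="news-api-cluster",
    service="news-api-service",
    task_family="news-api-task",
    task_def_file="task-definition.json",                      # assumed file name
)
for line in commands:
    print(line)
```

Seven commands, each depending on the one before it, with no record of who ran them: this is the sequence a pipeline takes over.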
Chapter 29 adds the operational layer on top of this stack:
- CI/CD with GitHub Actions -- a workflow that builds the image, tags it with the commit SHA, pushes to ECR, and updates the service on every merge to main, so the deploy sequence above runs from a git push instead of a terminal.
- CloudWatch monitoring -- dashboards and alarms on the four golden signals (latency, traffic, errors, saturation), and Logs Insights queries for searching across the task log streams.
- Auto-scaling -- target-tracking policies on the ECS service so the desired-count moves with load instead of being pinned at 2.
- Cost analysis -- service-by-service breakdown of where the Fargate, RDS, ALB, and ElastiCache hours actually land on the bill, and which knobs move the bill the most.
- Deployment patterns -- rolling, blue/green, and canary, with the trade-offs between them.
Next, in Chapter 29, we automate the manual ECR push and ECS update done here, add the monitoring and alarms that let the deployment tell you when something is wrong, and turn the fixed two-task service into one that scales with traffic.