5. Auto-scaling
Auto-scaling matches capacity to demand. ECS Fargate scales tasks based on CPU, memory, request count, or any CloudWatch metric you point it at.
The Chapter 28 service pins desired-count to 2. That's wrong in both directions: under quiet hours the tasks sit idle (you pay for capacity nobody is using), and under a real spike two tasks aren't enough (latency climbs and tail requests time out). Auto-scaling is the policy layer that moves desired-count up and down automatically based on a CloudWatch metric.
The ECS auto-scaling approach used here is a target-tracking policy: a thermostat for the service. You pick a metric and a target value (for example, "maintain average CPU at 70%"); Application Auto Scaling computes how many tasks satisfy the target and adjusts desired-count to get there. The actual savings vary with traffic shape, so there's no honest universal "X% cheaper" claim, but the basic guarantee holds: capacity follows load rather than being provisioned for the peak.
How ECS service auto-scaling works
ECS Service Auto Scaling adjusts your ECS service's desired task count based on CloudWatch metrics. When CPU utilization rises above your target, ECS launches more tasks. When CPU falls below target, ECS terminates tasks. This happens automatically without manual intervention.
Target-tracking scaling
You declare the target ("average CPU = 70%") and AWS does the arithmetic. CPU rises to 85%, the policy launches more tasks until the average drops back to 70%; CPU falls to 40%, the policy terminates tasks until the average comes back up.
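The arithmetic can be approximated with a proportional rule: scale the current task count by the ratio of observed to target utilization, round up, and clamp to the registered bounds. AWS does not publish the exact algorithm, so treat this as a sketch for building intuition rather than the service's implementation. The function name desired_tasks and its arguments are illustrative.

```shell
# Sketch only: proportional estimate of the task count a target-tracking
# policy would aim for. Not the actual Application Auto Scaling algorithm.
# Usage: desired_tasks CURRENT_TASKS OBSERVED_PCT TARGET_PCT MIN MAX
desired_tasks() {
  local current="$1" observed="$2" target="$3" min="$4" max="$5"
  # Integer ceiling of current * observed / target
  local want=$(( (current * observed + target - 1) / target ))
  if [ "$want" -lt "$min" ]; then want="$min"; fi
  if [ "$want" -gt "$max" ]; then want="$max"; fi
  echo "$want"
}

desired_tasks 2 85 70 2 10   # 2 tasks at 85% CPU, target 70% -> prints 3
desired_tasks 6 40 70 2 10   # 6 tasks at 40% CPU, target 70% -> prints 4
```

The clamp is why the min-capacity and max-capacity values you register matter: the estimate never leaves that range, however far the metric drifts from the target.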
Scale-out vs scale-in
The two directions are deliberately asymmetric. Scale-out is fast (new tasks within 60-90 seconds of detecting high load) because the cost of being undersized is users seeing timeouts. Scale-in is slow (5-15 minutes of sustained low load before terminating anything) because the cost of being prematurely undersized when traffic resumes is the same problem, just delayed.
Cooldown periods
The cooldown is the wait between consecutive scaling decisions. After scale-out, the policy holds for a minute or so to give the new tasks time to start, pass health checks, and absorb some traffic before re-evaluating. After scale-in, it holds for longer (the chapter uses 300 seconds) to confirm load really is down. Without cooldowns, the policy overshoots in one direction and then the other.
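A cooldown is essentially a time gate on the next action in the same direction. The sketch below models that gate with epoch seconds. It is a simplified illustration, since the real service tracks this state internally, and the function name and timestamps are hypothetical.

```shell
# Simplified model of a cooldown gate.
# Usage: cooldown_elapsed NOW_EPOCH LAST_ACTION_EPOCH COOLDOWN_SECONDS
# Exit status 0 means the cooldown has passed and another action may proceed.
cooldown_elapsed() {
  local now="$1" last="$2" cooldown="$3"
  [ $(( now - last )) -ge "$cooldown" ]
}

last_scale_in=1000            # hypothetical epoch second of the last scale-in
if cooldown_elapsed 1120 "$last_scale_in" 300; then
  echo "scale-in allowed"
else
  echo "still cooling down"   # 120 s elapsed is less than the 300 s cooldown
fi
```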
Configuring target-tracking auto-scaling
You'll configure auto-scaling to maintain 70% average CPU utilization across all tasks. This gives you headroom for normal traffic variability while triggering scaling before performance degrades.
Register your ECS service as a scalable target:
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--resource-id service/news-api-cluster/news-api-service \
--scalable-dimension ecs:service:DesiredCount \
--min-capacity 2 \
--max-capacity 10 \
--region us-east-1
This registers your ECS service with Application Auto Scaling, setting minimum capacity (2 tasks always running) and maximum capacity (never exceed 10 tasks). These limits prevent scaling from going too low (always maintain minimum availability) or too high (control maximum cost).
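To confirm the registration took effect, you can read the scalable target back. The wrapper function below is a hypothetical convenience around the describe-scalable-targets command; the Min and Max values it reports should match the 2 and 10 you registered.

```shell
# Hypothetical helper: wraps a read-only CLI call so you can rerun it after changes.
show_scalable_target() {
  aws application-autoscaling describe-scalable-targets \
    --service-namespace ecs \
    --resource-ids service/news-api-cluster/news-api-service \
    --query 'ScalableTargets[0].{Min:MinCapacity,Max:MaxCapacity}' \
    --output table \
    --region us-east-1
}
# Call it once the registration above has succeeded:
#   show_scalable_target
```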
Create the target-tracking policy:
aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--resource-id service/news-api-cluster/news-api-service \
--scalable-dimension ecs:service:DesiredCount \
--policy-name news-api-cpu-scaling \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration file://scaling-policy.json \
--region us-east-1
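Scaling policy configuration (save this as scaling-policy.json in the directory where you run put-scaling-policy, before running the command):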
{
"TargetValue": 70.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ECSServiceAverageCPUUtilization"
},
"ScaleOutCooldown": 60,
"ScaleInCooldown": 300
}
What this configuration means:
TargetValue: 70.0: Maintain 70% average CPU utilization across all tasks. When average CPU stays above 70%, the policy scales out. When it stays well below 70%, the policy scales in.
ScaleOutCooldown: 60: After scaling out, wait 60 seconds before scaling out again. This gives new tasks time to start and begin handling traffic. Without a cooldown, the policy might overshoot by launching too many tasks.
ScaleInCooldown: 300: After scaling in, wait 5 minutes before scaling in again. This longer cooldown prevents removing capacity too eagerly during temporary load drops. If traffic spikes again quickly, you don't want to have just terminated tasks. A scale-out can still happen during this window if load returns.
Verify scaling configuration:
aws application-autoscaling describe-scaling-policies \
--service-namespace ecs \
--resource-id service/news-api-cluster/news-api-service \
--region us-east-1
# Output shows your scaling policy configuration
Load-testing to trigger auto-scaling
The best way to validate auto-scaling works is to trigger it with real load. You'll use Apache Bench (ab) to generate HTTP requests, watch CPU utilization rise, and observe ECS launching additional tasks automatically.
Generate sustained load:
# Install Apache Bench if needed
sudo apt-get install apache2-utils # Ubuntu/Debian
brew install httpd # macOS, only if ab is not already installed (check with: which ab)
# Generate load: 10000 requests, 50 concurrent connections
# The /articles endpoint requires an API key (covered in Chapter 26)
ab -n 10000 -c 50 -H "Authorization: Bearer YOUR_KEY" \
http://your-alb-url.amazonaws.com/articles
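# How long this runs depends on response latency. If it finishes in seconds,
# cap the run by time instead: -t 300 -n 1000000 (-n after -t, since -t alone implies -n 50000)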
Watch scaling in real-time:
In another terminal window, monitor your ECS service:
# Watch task count change every 10 seconds
watch -n 10 'aws ecs describe-services \
--cluster news-api-cluster \
--services news-api-service \
--query "services[0].{Running:runningCount,Desired:desiredCount}" \
--output table'
# Initially shows:
# Running: 2 Desired: 2
# After ~90 seconds of high load:
# Running: 4 Desired: 4
# After sustained load:
# Running: 6 Desired: 6
Observe CPU metrics in CloudWatch:
Open your CloudWatch dashboard. You'll see CPU utilization spike to 85-95% when load testing begins. After 60-90 seconds, desired task count increases and new tasks start. As new tasks come online and handle traffic, average CPU across all tasks falls back toward 70%. If the load is heavy enough and you sustain it, the service can climb all the way to your maximum of 10 tasks.
Stop load testing and watch scale-in:
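Stop the Apache Bench process. CPU utilization drops within a minute or two. The policy waits for a sustained stretch of low utilization before its first scale-in, and the 300-second scale-in cooldown then spaces out each further reduction. Over the following 15 minutes or so, task count gradually returns to the minimum of 2 tasks.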
That asymmetry is the whole point: rapid scale-out so users never see capacity-driven slowness, gradual scale-in so a quick lull doesn't strand the service short of capacity when traffic resumes.
Auto-scaling only works as well as the container's CPU/memory allocation lets it. If the allocation is far above what the container actually needs (4 GB allocated, 1 GB used), the policy sees low utilization and never scales out. If it's too low (0.25 vCPU allocated for a workload that needs 0.5 vCPU), CPU pegs at 100% under any load and the policy chases it forever. Look at a week of real usage: if CPU averages under 30%, shrink the allocation; if it regularly hits 80%, raise it. Right-sizing comes first, then auto-scaling on top.
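The right-sizing rule of thumb can be written down as a small decision function. This is a sketch with assumptions: it takes a week's average and peak CPU as whole-number percentages, treats a peak at or above 80% as "regularly hits 80%" (a simplification), and lets the raise case win when both conditions hold. The function name is hypothetical.

```shell
# Sketch of the right-sizing rule of thumb.
# Usage: rightsizing_hint AVG_CPU_PCT PEAK_CPU_PCT   (integers from a week of data)
rightsizing_hint() {
  local avg="$1" peak="$2"
  if [ "$peak" -ge 80 ]; then
    echo "raise allocation"      # too small: an undersized task hurts users
  elif [ "$avg" -lt 30 ]; then
    echo "shrink allocation"     # paying for headroom that is never used
  else
    echo "keep allocation"
  fi
}

rightsizing_hint 22 45   # prints "shrink allocation"
rightsizing_hint 55 92   # prints "raise allocation"
```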
Hands-on: observe auto-scaling behavior
Now that auto-scaling is configured, test it systematically to understand its behavior and verify it works correctly.
Watch your infrastructure scale from 2 tasks to 6-8 tasks under load, then scale back down to 2 tasks after load stops. You'll observe scaling latency (how long scale-out takes), cooldown behavior (why scale-in is slower), and cost implications (additional tasks only run when needed).
Step 1: Establish baseline
# Record starting state
echo "=== Baseline State ==="
echo "Time: $(date)"
aws ecs describe-services \
--cluster news-api-cluster \
--services news-api-service \
--query 'services[0].{Running:runningCount,Desired:desiredCount}' \
--output table
aws cloudwatch get-metric-statistics \
--namespace AWS/ECS \
--metric-name CPUUtilization \
--dimensions Name=ServiceName,Value=news-api-service Name=ClusterName,Value=news-api-cluster \
--start-time $(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average \
--query 'Datapoints[0].Average'
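# Note: -d '5 minutes ago' requires GNU date (Linux). On macOS, use: date -u -v-5M +%Y-%m-%dT%H:%M:%S
# If the query prints None, no datapoint has landed yet; widen the window and retry
# Expected: 2 running tasks, CPU around 20-40%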
Step 2: Start sustained load
# Get your ALB URL
ALB_URL=$(aws elbv2 describe-load-balancers \
--names news-api-alb \
--query 'LoadBalancers[0].DNSName' \
--output text)
echo "Starting load test against: $ALB_URL"
echo "This will run for ~5 minutes generating sustained traffic"
# Generate load: 10000 requests, 50 concurrent
ab -n 10000 -c 50 -H "Authorization: Bearer YOUR_KEY" \
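"http://$ALB_URL/articles"
# Keep this window open and running. If ab finishes in well under 5 minutes,
# rerun with a time cap: -t 300 -n 1000000 (-n after -t, since -t alone implies -n 50000)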
Step 3: Monitor scaling in real-time
Open a second terminal window and run this monitoring script:
# Watch task count change every 10 seconds
watch -n 10 'echo "=== $(date) ===" && \
aws ecs describe-services \
--cluster news-api-cluster \
--services news-api-service \
--query "services[0].{Running:runningCount,Desired:desiredCount,Deployments:deployments[0].status}" \
--output table'
# You should see:
# T+0:00 - Running: 2, Desired: 2
# T+1:30 - Running: 2, Desired: 4 (scaling decision made!)
# T+2:00 - Running: 3, Desired: 4 (new tasks starting)
# T+2:30 - Running: 4, Desired: 4 (scale-out complete)
# ... may continue scaling to 6-8 tasks depending on load
Step 4: Observe dashboard metrics
While load test runs, open your CloudWatch dashboard:
- Watch CPU utilization spike from 30% to 85-95%
- After 60-90 seconds, CPU should start dropping as new tasks come online
- Traffic widget shows sustained request volume
- Latency might spike slightly during scale-out (normal)
Step 5: Stop load and observe scale-in
After load test completes (or press Ctrl+C to stop it early):
# Continue watching task count
# Scale-in is MUCH slower than scale-out
# Expected timeline:
# T+0:00 - Load stops, CPU drops to 10-20%
# T+5:00 - Still at 6-8 tasks (policy still waiting for sustained low CPU)
# T+8:00 - Desired count drops to 4 (first scale-in decision)
# T+13:00 - Desired count drops to 3
# T+18:00 - Back to minimum 2 tasks
# This gradual scale-in prevents thrashing if load resumes
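To fill in a timeline like the one above from your own run, it helps to stamp each change with elapsed time. A minimal sketch, assuming you capture epoch seconds with date +%s when the load stops; the helper name fmt_elapsed is hypothetical.

```shell
# Usage: fmt_elapsed START_EPOCH END_EPOCH   -> prints elapsed time as T+M:SS
fmt_elapsed() {
  local secs=$(( $2 - $1 ))
  printf 'T+%d:%02d\n' $(( secs / 60 )) $(( secs % 60 ))
}

# In a live run: LOAD_STOPPED=$(date +%s) when the load stops, then
#   fmt_elapsed "$LOAD_STOPPED" "$(date +%s)"   each time the desired count changes.
fmt_elapsed 1700000000 1700000480   # prints "T+8:00"
```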
Step 6: Record observations
Document what you observed:
- Scale-out time: How long from load start to first new task? (Should be 60-120 seconds)
- Maximum tasks: How many tasks ran at peak? (Likely 4-8 depending on load intensity)
- Scale-in time: How long from load stop to return to minimum? (Should be 15-20 minutes)
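- Cost impact: 6 extra tasks for 5 minutes = 0.5 task-hours of extra compute. Multiply by your task's hourly Fargate rate to get the dollar figure (the chapter's estimate is ~$0.15; the exact amount depends on task size and current rates), and remember the extra tasks keep billing until scale-in removes them.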
You've watched the service add tasks under load and shed them once the load went away. The shape of the response is the deliverable: scale-out in ~90 seconds, scale-in over ~15 minutes, no thrashing. The bill matches that shape -- you pay for the extra tasks only while they were running.
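The cost arithmetic is easy to redo for your own numbers. The sketch below converts extra tasks and minutes into task-hours and prices them from per-vCPU and per-GB hourly rates. The task size and both rates are placeholder assumptions, not current AWS prices, so substitute the values from the Fargate pricing page for your region. The function name is hypothetical.

```shell
# Usage: extra_cost EXTRA_TASKS MINUTES VCPU_PER_TASK GB_PER_TASK VCPU_HOUR_RATE GB_HOUR_RATE
extra_cost() {
  awk -v n="$1" -v m="$2" -v cpu="$3" -v gb="$4" -v rc="$5" -v rg="$6" 'BEGIN {
    task_hours = n * m / 60
    printf "%.2f task-hours, $%.4f\n", task_hours, task_hours * (cpu * rc + gb * rg)
  }'
}

# Placeholder inputs: 6 extra tasks for 5 minutes, 0.5 vCPU and 1 GB per task,
# assumed rates of $0.04 per vCPU-hour and $0.004 per GB-hour.
extra_cost 6 5 0.5 1 0.04 0.004   # prints "0.50 task-hours, $0.0120"
```

Pass the full time the extra tasks were running, including the scale-in window, for a figure that matches the bill more closely.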
Q1: Why does scale-out happen in 60-90 seconds but scale-in takes 15-20 minutes?
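A1: The policy is deliberately asymmetric. AWS prioritizes availability over cost: adding capacity quickly prevents outages, while removing capacity slowly prevents thrashing if load resumes. The policy reacts quickly to high CPU but waits for sustained low CPU before its first scale-in. The cooldowns reinforce this: the 60-second scale-out cooldown gives new tasks time to start, and the 300-second scale-in cooldown spaces out each further reduction so that load has truly subsided before more capacity is removed.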
Q2: What would happen if you set minimum capacity to 1 task instead of 2?
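A2: A single task is a single point of failure. If it crashes, fails a health check, or is stopped during a deployment before its replacement is healthy, your API has zero healthy tasks until a new one starts and passes health checks. Zero tasks means downtime. A minimum of 2 keeps at least one healthy task serving while the other is replaced.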
Q3: How would you prevent auto-scaling from adding too many tasks during a load test?
A3: Set max-capacity to a reasonable limit (like 10 tasks) to control maximum cost. Auto-scaling respects this limit: it won't scale beyond max even if CPU stays high.
Advanced scaling strategies
CPU-based scaling works well for compute-bound applications, but other metrics might be more appropriate for your workload patterns.
Memory-based scaling:
Applications handling large request payloads or caching data in memory might exhaust memory before CPU. Target-tracking policies can use memory utilization instead of CPU:
{
"TargetValue": 75.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ECSServiceAverageMemoryUtilization"
},
"ScaleOutCooldown": 60,
"ScaleInCooldown": 300
}
Request-count-based scaling:
For applications where load correlates directly with request count (simple CRUD APIs), scale based on requests per task:
{
"TargetValue": 1000.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ALBRequestCountPerTarget",
"ResourceLabel": "app/news-api-alb/50dc6c495c0c9188/targetgroup/news-api-tg/0123456789abcdef"
},
"ScaleOutCooldown": 60,
"ScaleInCooldown": 300
}
This policy keeps the per-task request rate near 1000 per minute. When traffic exceeds that, the policy adds tasks so the per-task load comes back down. Unlike the CPU and memory metrics, ALBRequestCountPerTarget is not self-describing: you must supply a ResourceLabel naming the load balancer and target group to measure, or put-scaling-policy rejects the request. The value has the form app/<alb-name>/<alb-id>/targetgroup/<tg-name>/<tg-id>, which you read off the tail of the ALB and target-group ARNs created in Chapter 28; substitute your own IDs for the placeholders above.
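Reading the label off the ARNs by hand is error-prone, so here is a minimal sketch that builds it with shell parameter expansion. The function name resource_label is hypothetical, and the ARNs in the example use a made-up account ID together with the placeholder IDs from the policy above.

```shell
# Usage: resource_label ALB_ARN TARGET_GROUP_ARN
resource_label() {
  local alb_arn="$1" tg_arn="$2"
  # Keep everything after "loadbalancer/"  -> app/<alb-name>/<alb-id>
  local alb_part="${alb_arn#*:loadbalancer/}"
  # Keep everything after the last colon   -> targetgroup/<tg-name>/<tg-id>
  local tg_part="${tg_arn##*:}"
  echo "${alb_part}/${tg_part}"
}

# Placeholder ARNs (account ID and resource IDs are made up):
resource_label \
  "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/news-api-alb/50dc6c495c0c9188" \
  "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/news-api-tg/0123456789abcdef"
# prints app/news-api-alb/50dc6c495c0c9188/targetgroup/news-api-tg/0123456789abcdef
```

In a live setup, the two ARNs would come from aws elbv2 describe-load-balancers (the LoadBalancerArn field) and aws elbv2 describe-target-groups (the TargetGroupArn field).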
Custom metric scaling:
For workloads where neither CPU nor request count is the right signal -- queue depth, active WebSocket connections, cache eviction rate -- a custom CloudWatch metric published from the application drives the scaling policy directly. The setup is more work (the metric has to be emitted from the code path that knows about it), but the policy ends up tracking the signal that actually predicts when load is hurting users.
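A custom-metric policy swaps PredefinedMetricSpecification for CustomizedMetricSpecification. The sketch below writes one to a file. The namespace, metric name, dimension, and target value are all hypothetical placeholders for whatever your application actually publishes. Target tracking works best with a metric that falls as tasks are added, which is why the example uses a per-task figure rather than a raw total.

```shell
# Hypothetical custom-metric policy: namespace, metric name, dimension, and
# target value are placeholders, not values from this deployment.
cat > custom-scaling-policy.json <<'EOF'
{
  "TargetValue": 100.0,
  "CustomizedMetricSpecification": {
    "Namespace": "NewsApi",
    "MetricName": "QueueDepthPerTask",
    "Dimensions": [
      { "Name": "ServiceName", "Value": "news-api-service" }
    ],
    "Statistic": "Average"
  },
  "ScaleOutCooldown": 60,
  "ScaleInCooldown": 300
}
EOF
```

You would pass this file to put-scaling-policy the same way as scaling-policy.json, under a different policy name.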
Next, in section 6, we look at where the AWS bill actually goes for this deployment, the high-leverage knobs (log retention, ECR lifecycle, billing alarms), and the tagging discipline that lets you ask "what does the news API cost this month?" without guessing.