8. Incident response and debugging

Incidents will happen. The question is how prepared you are. Runbooks save time at 3am; postmortems make the next one less likely.

Incidents are the things that aren't supposed to happen but do: an app crash, a connection pool that fills up, a third-party API that starts returning 503s, a traffic spike that lands faster than the auto-scaler can react. The question is never "will it happen" but "how quickly do we detect it, contain it, and learn from it."

A response has four stages: detect (the alarm fires), triage (decide how bad it is and who needs to know), diagnose (find the root cause), and resolve (fix it and restore service). Good monitoring shortens the first stage; good logging shortens the third; rollback automation shortens the fourth. The clock from "detect" to "resolve" is the time to recovery for that incident; averaged across incidents it becomes MTTR (mean time to recovery), one of the most useful operational metrics to track over time.
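As a rough illustration of how MTTR falls out of incident records, here is a minimal sketch. The incident timestamps are hypothetical, and it assumes recovery time is measured from detection to resolution, as described above.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected_at, resolved_at) pairs.
incidents = [
    (datetime(2024, 1, 5, 14, 35), datetime(2024, 1, 5, 15, 0)),   # 25 min
    (datetime(2024, 2, 11, 3, 10), datetime(2024, 2, 11, 3, 50)),  # 40 min
    (datetime(2024, 3, 2, 9, 0), datetime(2024, 3, 2, 9, 55)),     # 55 min
]


def mean_time_to_recovery(records):
    """Average the detect-to-resolve duration across incidents."""
    if not records:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in records),
        timedelta(),
    )
    return total / len(records)


mttr = mean_time_to_recovery(incidents)
print(f"MTTR: {mttr.total_seconds() / 60:.0f} minutes")  # MTTR: 40 minutes
```

Tracking this number per quarter, rather than per incident, is what shows whether monitoring, logging, and rollback work is paying off.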

Incident severity classification

Not all incidents are equal. Classify them by severity to guide response urgency:

SEV1 (Critical): Complete service outage

All users affected, core functionality unavailable. Drop everything and fix immediately. Examples: database crashed, all containers failing health checks, load balancer unreachable. Response time: minutes. On-call engineer immediately paged.

SEV2 (Major): Significant degradation

Some users affected, key features broken, but service remains partially available. Examples: one API endpoint returning 500 errors, database connection pool exhausted causing intermittent failures, auto-scaling failing to add capacity during a traffic spike. Response time: within 1 hour. On-call engineer notified.

SEV3 (Minor): Limited impact

Few users affected, non-critical features broken, workarounds available. Examples: logging service down (app still works but debugging is harder), cache service degraded (slower but functional), monitoring dashboard unavailable. Response time: within 4 hours during business hours. Team notified, addressed during normal work hours.

SEV4 (Informational): No user impact but attention warranted

Examples: SSL certificate expiring in 30 days, disk usage at 70%, API rate limit warnings. Response time: non-urgent, addressed during normal planning. Tracked but not immediately actionable.

Severity classification guides response: SEV1 means wake people up at 3am. SEV4 means create a ticket for next sprint. Clear severity levels prevent both over-reaction (treating SEV4 like SEV1 causes burnout) and under-reaction (treating SEV1 like SEV4 angers users).
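The four levels can also be written down as data, so that alert routing and ticket templates share one definition. The sketch below is illustrative only: the `classify` inputs ("all", "some", "few", "none") are hypothetical labels, and real triage will involve more judgment than two parameters.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = "Critical"
    SEV2 = "Major"
    SEV3 = "Minor"
    SEV4 = "Informational"


@dataclass(frozen=True)
class ResponsePolicy:
    response_time: str
    notify: str


# Response targets taken from the severity descriptions above.
POLICIES = {
    Severity.SEV1: ResponsePolicy("minutes", "page on-call engineer immediately"),
    Severity.SEV2: ResponsePolicy("within 1 hour", "notify on-call engineer"),
    Severity.SEV3: ResponsePolicy("within 4 hours during business hours", "notify team"),
    Severity.SEV4: ResponsePolicy("normal planning", "track in a ticket"),
}


def classify(users_affected: str, core_functionality_down: bool = False) -> Severity:
    """Map a rough impact estimate to a severity level.

    users_affected is one of "all", "some", "few", "none" (hypothetical labels).
    """
    if users_affected == "none":
        return Severity.SEV4
    if users_affected == "all" and core_functionality_down:
        return Severity.SEV1
    if users_affected == "few":
        return Severity.SEV3
    return Severity.SEV2


severity = classify("some")
print(severity.name, "->", POLICIES[severity].response_time)  # SEV2 -> within 1 hour
```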

Systematic debugging with logs

When incidents occur, CloudWatch Logs Insights is your primary debugging tool. Container logs contain error messages, stack traces, request IDs, everything needed to understand what went wrong.

Example incident: Users reporting 500 errors on the /articles endpoint

Step 1 - Confirm the problem: Query for 500 errors in the time window users reported:

CloudWatch Logs Insights
fields @timestamp, @message
| filter @message like /500/
| filter @message like /articles/
| sort @timestamp desc
| limit 50

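Results show error messages, timestamps, and affected endpoints. You confirm the problem exists and identify the time range. Note that /500/ matches any log line containing those digits (a 1500 ms latency, for example), so tighten the pattern to your log format if the results are noisy.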
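The time window is not part of the query text: in the console it comes from the time-range picker, and through the API it is passed as start and end times. If you want to run the same query from a script, a minimal sketch using the boto3 CloudWatch Logs client's `start_query` and `get_query_results` calls might look like this. The log group name in the usage comment is hypothetical, and the client is passed in as a parameter so the function can be exercised without AWS credentials.

```python
import time

ERRORS_ON_ARTICLES = """
fields @timestamp, @message
| filter @message like /500/
| filter @message like /articles/
| sort @timestamp desc
| limit 50
"""


def run_insights_query(logs_client, log_group, query, start_epoch, end_epoch,
                       poll_seconds=1.0, max_polls=60):
    """Start a Logs Insights query and poll until it completes.

    logs_client is expected to behave like boto3.client("logs").
    start_epoch and end_epoch are Unix timestamps in seconds.
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_epoch,
        endTime=end_epoch,
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each result row is a list of {"field": ..., "value": ...} pairs.
            return [
                {cell["field"]: cell["value"] for cell in row}
                for row in response["results"]
            ]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not complete in time")


# Usage (requires AWS credentials; "/ecs/example-app" is a hypothetical log group):
# import boto3
# now = int(time.time())
# rows = run_insights_query(boto3.client("logs"), "/ecs/example-app",
#                           ERRORS_ON_ARTICLES, now - 3600, now)
```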

Step 2 - Find error details: Look for stack traces or error messages:

CloudWatch Logs Insights
fields @timestamp, @message
| filter @message like /ERROR/ or @message like /Exception/
| filter @message like /articles/
| sort @timestamp desc

You find: "DatabaseError: connection pool exhausted, max_connections=10". The problem is clear: all database connections are in use, so new requests can't get one.

Step 3 - Understand root cause: Why is the connection pool exhausted? Query for connection-related messages:

CloudWatch Logs Insights
fields @timestamp, @message
| filter @message like /database connection/
| stats count() by bin(5m)

Results show connection acquisition spiked dramatically at 14:30, correlating with the incident start time. Cross-reference with CloudWatch metrics: CPU and task count look normal, but the request rate doubled at 14:30. Conclusion: the traffic spike exhausted database connections because the connection pool was sized for normal traffic (10 connections), not peak traffic (which needs 20+ connections).
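The jump from 10 to 20+ connections follows from Little's law: the average number of connections in use equals the request rate multiplied by how long each request holds a connection. The sketch below uses hypothetical traffic figures chosen to reproduce the numbers in this example, and assumes every request holds one connection for the same average time.

```python
import math


def connections_needed(requests_per_second: float, hold_time_ms: float) -> int:
    """Little's law: average connections in use = arrival rate * hold time."""
    return math.ceil(requests_per_second * hold_time_ms / 1000)


# Hypothetical figures: 100 req/s normally, doubling to 200 req/s at peak,
# with each request holding a connection for 100 ms on average.
normal = connections_needed(100, 100)
peak = connections_needed(200, 100)
print(normal, peak)  # 10 20
```

This is an average, not a ceiling. Bursts and slow queries push the instantaneous need higher, which is why the peak figure is "20+" rather than exactly 20.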

Step 4 - Resolve: Immediate fix: increase the database connection pool size, keeping the total across all tasks below the database's own connection limit. Long-term fix: implement connection pooling that scales with task count, or migrate database queries to async to reduce connection hold time.
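Before raising the pool size, it is worth checking that the database can accept the extra connections. The sketch below assumes each task keeps its own pool, so the total is pool size times task count. The database limit, task count, and reserved headroom are all hypothetical values.

```python
def max_pool_size_per_task(db_max_connections: int, max_tasks: int,
                           reserved: int = 5) -> int:
    """Largest per-task pool that keeps the total under the database limit.

    reserved leaves headroom for admin sessions and migrations (assumed value).
    """
    usable = db_max_connections - reserved
    if max_tasks <= 0 or usable < max_tasks:
        raise ValueError("not enough database connections for this many tasks")
    return usable // max_tasks


# Hypothetical setup: the database allows 100 connections, auto-scaling tops out at 4 tasks.
ceiling = max_pool_size_per_task(db_max_connections=100, max_tasks=4)
print(ceiling)        # 23
print(20 <= ceiling)  # True: a pool of 20 per task fits
```

If the per-task ceiling comes out lower than the pool size you need at peak, raising the pool alone will only move the failure from the application to the database.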

The same four-step shape works for almost any incident: confirm the problem in the logs, find the error details, trace back to the cause, apply the fix. Logs Insights is what makes each step fast; a query runs in seconds and surfaces patterns across thousands of lines that no amount of tailing would catch.

Building incident response runbooks

Runbooks document step-by-step procedures for common incidents. When alarms trigger at 3am, you don't want to figure out debugging procedures from scratch. Runbooks provide checklists: "High CPU alarm triggered -> Check CloudWatch dashboard -> If p99 latency elevated, check for slow database queries -> If slow queries found, check query plan and indexes."

Effective runbook structure

  • Trigger: What alarm or symptom indicates this incident
  • Impact: How this affects users (helps determine severity)
  • Investigation steps: Specific CloudWatch queries, dashboard links, metrics to check
  • Common causes: Known culprits for this symptom (database connection exhaustion, Redis cache failure, external API timeout)
  • Resolution procedures: Steps that restore service (restart containers, increase connection pool, roll back deployment)
  • Prevention: Long-term fixes preventing recurrence

Create runbooks for your most common alarms: high error rate, high latency, high CPU, failed health checks. Include exact CloudWatch Logs Insights queries, AWS CLI commands for checking status, and rollback procedures. Runbooks reduce MTTR by eliminating investigation guesswork.
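One way to keep runbooks consistent is to store them as structured data and render the checklist from it. The following is a minimal sketch whose fields mirror the structure above; the class name and the example content are illustrative, not a prescribed format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Runbook:
    trigger: str
    impact: str
    investigation_steps: List[str]
    common_causes: List[str]
    resolution_procedures: List[str]
    prevention: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the runbook as a plain-text checklist."""
        lines = [f"Trigger: {self.trigger}", f"Impact: {self.impact}"]
        sections = [
            ("Investigation steps", self.investigation_steps),
            ("Common causes", self.common_causes),
            ("Resolution procedures", self.resolution_procedures),
            ("Prevention", self.prevention),
        ]
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"  {n}. {item}" for n, item in enumerate(items, start=1))
        return "\n".join(lines)


high_error_rate = Runbook(
    trigger="High error rate alarm",
    impact="Some requests fail with 500 errors",
    investigation_steps=[
        "Run the 500-error Logs Insights query for the alarm window",
        "Search the same window for ERROR or Exception messages",
        "Compare request rate, CPU, and task count on the dashboard",
    ],
    common_causes=[
        "Database connection exhaustion",
        "Redis cache failure",
        "External API timeout",
    ],
    resolution_procedures=[
        "Increase connection pool",
        "Restart containers",
        "Roll back deployment",
    ],
    prevention=["Size the connection pool for peak traffic"],
)

print(high_error_rate.render())
```

Keeping the data separate from the rendering makes it easy to review runbooks in version control and to check that every alarm has one.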

Blameless post-incident reviews

After resolving incidents, conduct post-incident reviews (also called postmortems) documenting what happened, why it happened, and how to prevent recurrence. These reviews are blameless: their purpose is improving systems, not punishing individuals. Blame creates cultures where engineers hide problems instead of surfacing them.

Post-incident review template

Timeline: Chronological sequence of events with timestamps. "14:30 - High traffic spike. 14:32 - 500 error rate increased to 15%. 14:35 - CloudWatch alarm triggered. 14:40 - On-call engineer began investigation. 14:50 - Root cause identified (connection pool exhaustion). 14:55 - Increased pool size to 20. 15:00 - Error rate returned to normal."

Root cause: The underlying reason, not just the symptom. "Database connection pool sized for normal traffic (10 connections) couldn't handle traffic spike requiring 20+ concurrent connections."

Impact: User effect and duration. "15% of requests failed for about 28 minutes (14:32 to 15:00), affecting approximately 500 users."

What went well: Things that worked correctly. "Alarm triggered within 5 minutes. Logs clearly showed connection pool exhaustion. Immediate fix applied successfully."

What went poorly: Things that didn't work. "Connection pool wasn't sized for peak traffic. No automated scaling of connection pool with task count."

Action items: Concrete improvements with owners and deadlines. "Implement dynamic connection pool sizing (Owner: Alice, Due: 2 weeks). Add connection pool utilization to monitoring dashboard (Owner: Bob, Due: 1 week). Update runbook with connection pool debugging steps (Owner: Carol, Due: 3 days)."
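A timeline with timestamps also lets you measure where the time went, which helps decide whether the action items should target detection, diagnosis, or the fix itself. The sketch below uses the times from the example timeline above; the dictionary keys and phase names are illustrative, and it assumes all events fall on the same day.

```python
from datetime import datetime

# Times from the example timeline; key names are illustrative.
timeline = {
    "spike": "14:30",
    "errors_start": "14:32",
    "alarm": "14:35",
    "investigation": "14:40",
    "root_cause": "14:50",
    "fix_applied": "14:55",
    "recovered": "15:00",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


phases = {
    "first errors to alarm": minutes_between(timeline["errors_start"], timeline["alarm"]),
    "alarm to investigation": minutes_between(timeline["alarm"], timeline["investigation"]),
    "investigation to root cause": minutes_between(timeline["investigation"], timeline["root_cause"]),
    "root cause to fix applied": minutes_between(timeline["root_cause"], timeline["fix_applied"]),
    "alarm to recovery": minutes_between(timeline["alarm"], timeline["recovered"]),
}

for name, minutes in phases.items():
    print(f"{name}: {minutes} min")
```

In this example diagnosis is the longest single phase, which supports the action item about adding connection pool debugging steps to the runbook.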

Share post-incident reviews widely. Over time the archive becomes institutional memory: patterns emerge across what used to look like unrelated incidents, common failure modes get documented in one place, and new team members learn from past mistakes without having to repeat them.

Next, in section 9, we close the chapter by hardening the running deployment: ECR scan-on-push, a non-root container user, a read-only root filesystem on the task, and secrets in Secrets Manager rather than the task definition's environment.