Chapter 9: Production Error Handling

1. From working code to reliable software

In this chapter you'll turn the Chapter 8 weather dashboard into production code: software that catches typos before they cause a crash, retries past transient network failures, and shows users a recoverable message instead of a stack trace. By the end you'll have an error categorizer, a retry loop with exponential backoff and jitter, and a three-part message system that turns failures into actionable guidance.

The Chapter 8 dashboard works beautifully when the network behaves and users type carefully. The moment you share it with anyone, the bug reports start: "it crashed when I typed 'Lndon'", "the app froze for 30 seconds", "I got something called a KeyError, what is that?" Production code isn't code that never fails; it's code that fails in ways users can recover from.

Terminal

Traceback (most recent call last):
  File "/home/user/weather_dashboard.py", line 156, in get_weather_for_city
    latitude, longitude, location_name = self.find_location(city_name)
  File "/home/user/weather_dashboard.py", line 34, in find_location
    location = data["results"][0]
KeyError: 'results'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/weather_dashboard.py", line 203, in <module>
    dashboard.run()
  File "/home/user/weather_dashboard.py", line 178, in run
    weather_data = self.get_weather_for_city(city_name)
  File "/home/user/weather_dashboard.py", line 162, in get_weather_for_city
    raise ConnectionError(f"Failed to get weather data: {str(e)}")
ConnectionError: Failed to get weather data: 'results'

That is the unhandled version: a user types a city name the geocoding API doesn't recognize, a KeyError fires inside your parse code, your own except clause raises a ConnectionError in its place, and the dashboard dies. The user sees gibberish. You see a support email with no useful detail. Nothing about the problem is actually a bug in your code. Every line executed correctly. The dashboard just never considered the possibility that the response might look different from the happy path.
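The chain in that traceback can be reproduced without touching the network. The sketch below is a simplified stand-in for the Chapter 8 methods, not the dashboard's actual code: the function signatures and the `no_match` payload are assumptions made for illustration, and the only detail taken from the traceback is that the `results` key is missing.

```python
def find_location(data: dict) -> tuple:
    """Parse a geocoding response, assuming the happy-path shape."""
    location = data["results"][0]  # KeyError when "results" is absent
    return location["latitude"], location["longitude"], location["name"]


def get_weather_for_city(data: dict) -> tuple:
    try:
        return find_location(data)
    except Exception as e:
        # Mirrors the traceback: any failure is relabelled as a network problem.
        raise ConnectionError(f"Failed to get weather data: {str(e)}")


# Hypothetical response body for an unrecognized city: no "results" key.
no_match = {"generationtime_ms": 0.5}

try:
    get_weather_for_city(no_match)
except ConnectionError as err:
    message = str(err)
    original = err.__context__  # the exception that actually happened
    print(message)
    print(type(original).__name__)
```

The printed message matches the last line of the traceback, and the second line shows the real cause was a KeyError. The user typed a city the API didn't recognize, yet the message blames the connection, which is the mislabelling the rest of this chapter removes.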

Production error handling isn't about preventing failures. It's about making them predictable, recoverable, and user-friendly. Networks time out. APIs go down. Users make typos. Treat these as inevitable events your application must handle gracefully, not bugs to eliminate.

Some of what you're about to build is already inside the APIClient class from Chapter 7: the retry loop, the 429 handling with Retry-After, and the timeout recovery all live inside _make_request. This chapter opens the hood. You'll build those patterns from scratch so you understand what APIClient is doing behind the tuple return, and so you can apply them standalone when you're working against a library that raises exceptions directly (like requests in this chapter's worked example). The categorization and message-composition layers are new. They sit above any base class, including APIClient, and they're what turn a raw exception into something a user can act on.
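As a preview of the retry pattern, here is a minimal sketch of how a wait time might be chosen between attempts. It is not APIClient's implementation: the function name `backoff_delay`, the default `base` and `cap` values, and the "full jitter" variant (a random wait between zero and the exponential ceiling) are all assumptions made for illustration.

```python
import random
from typing import Callable, Optional


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 30.0,
    retry_after: Optional[float] = None,
    rng: Callable[[], float] = random.random,
) -> float:
    """Return how many seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # The server said how long to wait (e.g. on a 429), so honor it.
        return max(0.0, retry_after)
    # Exponential growth: base, 2*base, 4*base, ... limited by cap.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick a random point in [0, ceiling) so that many
    # clients retrying at once do not all hit the API at the same moment.
    return ceiling * rng()


for attempt in range(4):
    print(attempt, round(backoff_delay(attempt), 2))
```

Passing `rng` as a parameter is a design choice for testability: a test can supply a fixed function instead of patching the `random` module.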

What you'll learn

Spot the four failure modes every networked application has to handle
Write three-part error messages that tell users what happened and what to do, then show an example
Categorize any exception into one of four actionable types, regardless of which library raised it
Use input validation before network calls so typos don't look like data problems
Retry transient failures with exponential backoff and jitter, and honor Retry-After when the API sends one
Log technical detail for developers while showing friendly prose to users, from the same error handler
Test every error path without touching the network, using pytest plus unittest.mock

What you'll build

message_templates.py — three-part message templates for every error category
error_categorizer.py — the categorize_error function that maps any exception to a category
input_validation.py — fail-fast checks that run before a single network byte is sent
retry_handler.py — exponential backoff with jitter, plus Retry-After support
error_logger.py, error_handler.py — dual-audience logging that serves users and developers from one place
test_weather_dashboard.py and friends — a full reliability test suite that never hits a real API
weather_dashboard.py — the Chapter 8 app, rebuilt with every piece wired together

The four error categories

Your dashboard makes two network calls for every query: one to the geocoding API to find coordinates, another to the weather API to fetch conditions. Each call can go wrong in many ways, but the handling code needs four stable categories. A category is the label the application uses to decide what happens next.

Category	What happens	Current behavior (no handling)
`user_input`	Empty string, typos, nonsense text	`KeyError` or `ConnectionError` with a stack trace
`transient`	Timeout, Wi-Fi drops, DNS issues, rate limits, service overload	`requests.exceptions.Timeout` after a 10-second hang
`not_found`	City doesn't exist, no search results	`KeyError` when the code reaches for a `results` key that isn't in the response
`unknown`	Unexpected exceptions, parsing surprises, cases the application has not learned yet	Generic exception handling or a raw stack trace

Each category needs a different response. user_input needs immediate feedback with an example. transient problems might resolve if you wait and retry. not_found needs the user to try different input, not a retry. unknown needs a safe generic message for the user and enough logging for you to investigate. Professional error handling distinguishes between these cases; hobby code treats them all the same way, which is to say, it doesn't treat them at all.
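One way to picture "a category decides what happens next" is a small lookup table. This is a hedged sketch rather than the chapter's error_categorizer.py: the `ErrorCategory` enum, the `NEXT_STEP` table, and the `next_step` helper are hypothetical names, and the policy values simply restate the paragraph above.

```python
from enum import Enum


class ErrorCategory(Enum):
    USER_INPUT = "user_input"
    TRANSIENT = "transient"
    NOT_FOUND = "not_found"
    UNKNOWN = "unknown"


# Hypothetical policy table: what the application does for each category.
NEXT_STEP = {
    ErrorCategory.USER_INPUT: {"retry": False, "action": "show feedback with an example"},
    ErrorCategory.TRANSIENT: {"retry": True, "action": "wait, then try again"},
    ErrorCategory.NOT_FOUND: {"retry": False, "action": "ask for different input"},
    ErrorCategory.UNKNOWN: {"retry": False, "action": "show a generic message and log details"},
}


def next_step(category_name: str) -> dict:
    """Look up the policy for a category label, falling back to unknown."""
    try:
        category = ErrorCategory(category_name)
    except ValueError:
        # A label the application has not learned yet is treated as unknown.
        category = ErrorCategory.UNKNOWN
    return NEXT_STEP[category]


for name in ("user_input", "transient", "not_found", "something_else"):
    step = next_step(name)
    print(f"{name}: retry={step['retry']}, {step['action']}")
```

Only transient is marked retryable here. Retrying a typo or a city that doesn't exist would repeat the same failure more slowly.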

Failures are events, not bugs

This shift in framing does more work than it looks like. A city name that isn't found isn't a crash; it's a chance to suggest "did you mean London?" A network timeout isn't the end of a session; it's a temporary hiccup that retry logic handles without the user ever seeing it. Once you stop reading errors as "something I got wrong in my code" and start reading them as "the world being the world", the shape of your application changes. Every external call gets wrapped. Every user input gets validated. Every failure produces two outputs: one for the user's screen, one for the developer's logs.
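The "two outputs" idea can be sketched with the standard logging module. The names below (`report_failure`, `USER_MESSAGES`, the logger name) and the message wording are placeholders assumed for illustration; the chapter's own error_logger.py and error_handler.py may look quite different.

```python
import logging

logger = logging.getLogger("weather_dashboard")

# Placeholder user-facing text, keyed by category.
USER_MESSAGES = {
    "user_input": "That city name looks incomplete. Check the spelling, for example: London",
    "transient": "The weather service is slow to respond. Please try again in a moment.",
    "not_found": "We couldn't find that city. Try a nearby larger city, for example: Paris",
    "unknown": "Something unexpected went wrong. Please try again.",
}


def report_failure(exc: Exception, category: str) -> str:
    """Log technical detail for developers; return friendly text for the user."""
    # Output 1: the developer's log gets the exception type, detail, and traceback.
    logger.error(
        "category=%s type=%s detail=%s",
        category, type(exc).__name__, exc,
        exc_info=exc,
    )
    # Output 2: the user's screen gets prose with no exception names in it.
    return USER_MESSAGES.get(category, USER_MESSAGES["unknown"])


try:
    {}["results"]
except KeyError as exc:
    shown_to_user = report_failure(exc, "not_found")
    print(shown_to_user)
```

Both outputs come from one call, so the log entry and the on-screen message cannot drift apart as the handler evolves.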

The rest of the chapter covers five patterns that do exactly this, applied in order: user messages (section 2), categorization (section 3), retry (section 4), logging (section 5), and testing (section 6). Section 7 wires all five into your weather dashboard; section 8 is a review. Each pattern stands on its own, so you can borrow pieces for other projects without taking the whole bundle.