Chapter 23: Asynchronous APIs and performance optimization

1. When async is the right call

Twenty-two chapters of API work, almost all of it sync: one request at a time, the function blocks, you get the answer. That model is the right call when the request count is one (a single API call, a sequential workflow where each step needs the previous answer, a server endpoint handling one request at a time). It is the wrong call when you have ten or fifty independent calls to make and waiting on each one in turn means waiting on all of them in series. Async is the tool for that second shape, and only that shape; by the end of this chapter you will have rebuilt Chapter 11's news aggregator with async fan-out, so the cost of "many independent calls" drops from the sum of their latencies to roughly the latency of the slowest single one.

Chapter 22 closed with a background worker that drained a webhook queue serially (one event, a two-second job, the next event). The same worker shape with async I/O can drain that queue in parallel: ten downstream calls overlap in the wall-clock time that a single one used to take. That is the architectural composition Chapter 22's review pointed at, and it is what makes this chapter's tooling (asyncio plus httpx.AsyncClient) load-bearing rather than ornamental: the same patterns scale outbound aggregation past whatever a single sequential loop can handle.
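
As a rough illustration of that worker shape, here is a minimal sketch of several async workers draining one queue. Everything in it is an assumption made for illustration: handle_event, worker, and drain are hypothetical names, the events are plain dicts, and asyncio.sleep stands in for the two-second downstream call. It is not Chapter 22's worker.

```python
import asyncio


async def handle_event(event: dict) -> str:
    # Stand-in for the downstream call; a real handler would await an HTTP request.
    await asyncio.sleep(0.05)
    return f"processed {event['id']}"


async def worker(queue: asyncio.Queue, results: list) -> None:
    # Each worker pulls events until it is cancelled.
    while True:
        event = await queue.get()
        try:
            results.append(await handle_event(event))
        finally:
            queue.task_done()  # lets queue.join() know this event is finished


async def drain(events: list, concurrency: int = 10) -> list:
    queue = asyncio.Queue()
    for event in events:
        queue.put_nowait(event)

    results = []
    workers = [
        asyncio.create_task(worker(queue, results)) for _ in range(concurrency)
    ]
    await queue.join()  # returns once every queued event has been handled
    for task in workers:
        task.cancel()  # the workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


if __name__ == "__main__":
    handled = asyncio.run(drain([{"id": i} for i in range(20)]))
    print(len(handled), "events handled")
```

With ten workers, up to ten of the simulated downstream calls are in flight at once, so the queue empties in a fraction of the time a one-at-a-time loop would need.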

What you'll learn

Articulate when sync is the right call and when async is, and answer the interview-shaped version of that question with a tradeoff rather than a slogan
Read and write coroutines: async def, await, the event loop, asyncio.run(), and the silent-bug shape of a missing await
Swap requests for httpx.AsyncClient with the smallest possible delta -- same parameters, same raise_for_status(), same response object
Spot the canonical async footgun: a sync call inside async def blocks the event loop and silently defeats the concurrency
Fan out concurrent requests with asyncio.TaskGroup (Python 3.11+) as the structured-concurrency primary, plus asyncio.gather for the older codebases you will inherit
Bound the parallelism with a semaphore so a 50-source aggregator respects rate limits instead of bursting straight through them
Handle partial failure: per-request timeouts, per-batch timeouts, and the difference between TaskGroup fail-fast and gather(..., return_exceptions=True) collect-everything
Test async code with pytest-asyncio so the patterns survive into production

What you'll build

first_async.py -- the five-minute proof: three concurrent requests, wall-clock time roughly equal to the slowest one
sync_three_calls.py -- the sequential baseline, run once so the cost ratio is concrete
sync_get.py + async_get.py -- the same fetch, side by side, in requests and httpx.AsyncClient
missing_await.py -- the silent-bug demo: coroutine returned, nothing ran, no error
concurrent_taskgroup.py + concurrent_gather.py -- the same fan-out in the modern primitive and the legacy one
with_semaphore.py + with_timeouts.py + rate_limiter.py -- production-shape concurrency control
graceful_degradation.py + async_retry.py -- partial-failure patterns that keep the aggregator alive when one source dies
news_fetchers.py + aggregator.py -- the keystone Async News Aggregator, end to end against NewsAPI, the Guardian, and Hacker News, plus benchmark_aggregator.py, a keyless harness that times the sync-vs-async difference

Sync versus async, by cost profile

Async is not "the modern way" or "what real backend engineers use." Most API integration work in this book has been sync and stays sync in production: a single request, a sequential workflow, a request handler that does one thing per call. Sync is simpler to read, simpler to debug, simpler to test. The decision is not stylistic; it is about the request count and the failure mode you can tolerate.

Three cases where sync requests stays the right call:

Single-request scripts. One call, one answer, done. Async adds an event loop and a context manager for zero benefit.
Sequential workflows. Each step needs the previous step's answer (Chapter 14's OAuth code-for-token exchange, Chapter 21's poll-until-OCR-finished). Concurrency does not help when the dependency chain is linear.
Server endpoints handling one request at a time. Flask + a sync ORM + a single downstream call is the right shape for most CRUD endpoints; the framework handles concurrency between requests for you.

Three cases where sync starts hurting and async earns its complexity:

Outbound aggregation across many endpoints. A dashboard that pulls from NewsAPI, the Guardian, and Hacker News one after the other waits for the sum of three latencies; the user waits the same sum. Concurrent fetch turns that into the slowest single latency.
Fan-out where each call is independent. Fifty product-detail lookups for a comparison view, ten weather-station readings for a region map, twenty webhooks that need to fire when a deployment lands. The calls do not need each other; they can overlap.
Latency-bounded paths that need fast multi-source assembly. Anywhere a user is waiting for a page that is composed from several services and the slowest one defines the experience.
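
The "sum of latencies versus slowest latency" claim can be checked without touching the network. The sketch below is an illustration under stated assumptions: the latencies are invented, fake_fetch is a hypothetical stand-in that sleeps instead of calling a real API, and the source names are only labels.

```python
import asyncio
import time

# Invented latencies in seconds; real providers will differ.
LATENCIES = {"newsapi": 0.10, "guardian": 0.20, "hackernews": 0.30}


async def fake_fetch(source: str, delay: float) -> str:
    await asyncio.sleep(delay)  # simulated network wait
    return source


async def fetch_in_series() -> list:
    # Each await finishes before the next call starts: total is about the sum.
    return [await fake_fetch(s, d) for s, d in LATENCIES.items()]


async def fetch_concurrently() -> list:
    # All three waits overlap: total is about the slowest single call.
    return list(
        await asyncio.gather(*(fake_fetch(s, d) for s, d in LATENCIES.items()))
    )


def timed(coro_fn):
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start


if __name__ == "__main__":
    _, series_seconds = timed(fetch_in_series)
    _, concurrent_seconds = timed(fetch_concurrently)
    print(f"series: {series_seconds:.2f}s, concurrent: {concurrent_seconds:.2f}s")
```

Both versions return the same results in the same order; only the wall-clock time changes, which is the whole argument for async in these three cases.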

And the inverse: cases where async is the wrong call. CPU-bound work (image resizing, hash computation, anything that keeps the interpreter busy) does not benefit from an event loop -- a coroutine that never awaits holds the single-threaded loop until it finishes, so the work still runs in series. Reach for multiple processes instead; because of the GIL, threads help only when the heavy code releases it. Code that is already fast enough does not earn the complexity. A team that has never debugged a coroutine should not learn on a production payment flow.
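
A small experiment makes the CPU-bound case visible. In this sketch (hypothetical function names, a toy computation as the "work"), two jobs are handed to asyncio.gather and each one logs when it starts and ends. The I/O-style jobs interleave; the CPU-style jobs do not, because a coroutine only yields control at an await.

```python
import asyncio


async def cpu_bound(name: str, log: list) -> None:
    log.append(f"{name} start")
    sum(i * i for i in range(200_000))  # pure computation, never awaits
    log.append(f"{name} end")


async def io_bound(name: str, log: list) -> None:
    log.append(f"{name} start")
    await asyncio.sleep(0.01)  # yields to the event loop while "waiting"
    log.append(f"{name} end")


async def run_pair(job) -> list:
    log = []
    await asyncio.gather(job("a", log), job("b", log))
    return log


if __name__ == "__main__":
    print("cpu:", asyncio.run(run_pair(cpu_bound)))
    print("io: ", asyncio.run(run_pair(io_bound)))
```

In the CPU-bound run, job "b" cannot start until job "a" has finished, so gather buys nothing. For work like this, a process pool such as concurrent.futures.ProcessPoolExecutor is the usual direction to look.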

The interview-shaped version of the question

A common backend prompt: "When would you reach for async over sync requests?" The losing answer is "always, it is the modern way." The strong answer is shaped by cost profile:

"Sync is the default. I reach for async when I have many independent I/O-bound calls and the wall-clock cost of waiting on them in series is the bottleneck. The canonical case is outbound aggregation across multiple APIs -- a dashboard pulling from three or four sources, or a fan-out across 50 endpoints. The mechanics are asyncio plus httpx.AsyncClient; the structured-concurrency primitive is asyncio.TaskGroup. I bound the parallelism with a semaphore so I do not burst through rate limits, and I prefer per-request timeouts over per-batch ones so one slow source does not poison the whole call. The thing I would not do is reach for async on a single-request script or a CPU-bound job; both pay the complexity tax for no win."

The keystone in Section 6 hits the relevant beats for a three-source aggregator: plain asyncio.gather for the concurrent fan-out, httpx.AsyncClient for the transport, per-request timeouts so one slow source cannot define the wall-clock time, and structured-response coroutines (each returns a result record instead of raising) so one dead source does not kill the others. The semaphore-bounded fan-out and the return_exceptions=True shape live in Sections 4 and 5, where the request count and the inner-coroutine shape genuinely earn them; the keystone uses the simpler primitives that fit three independent providers. The chapter's argument is "match the primitive to the cost profile", and the keystone is the concrete instance of that.
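
To make that shape concrete before Section 6, here is a hedged sketch of gather plus coroutines that catch their own failures. It is not the keystone's code: fetch_source and safe_fetch are hypothetical names, the result dict layout is an assumption, and the simulated delays and failure replace real httpx.AsyncClient calls so the example runs offline.

```python
import asyncio


async def fetch_source(name: str, delay: float, fail: bool = False) -> list:
    # Stand-in for a provider call: sleeps, then returns or raises.
    await asyncio.sleep(delay)
    if fail:
        raise ConnectionError(f"{name} unreachable")
    return [f"{name} headline"]


async def safe_fetch(name: str, delay: float, fail: bool = False, timeout: float = 0.2) -> dict:
    # Always returns a record, so one failure never propagates to gather.
    try:
        articles = await asyncio.wait_for(fetch_source(name, delay, fail), timeout)
        return {"source": name, "ok": True, "articles": articles, "error": None}
    except asyncio.TimeoutError:
        return {"source": name, "ok": False, "articles": [], "error": "timeout"}
    except Exception as exc:
        return {"source": name, "ok": False, "articles": [], "error": str(exc)}


async def aggregate() -> list:
    # One healthy source, one dead source, one source slower than the timeout.
    return await asyncio.gather(
        safe_fetch("newsapi", 0.01),
        safe_fetch("guardian", 0.01, fail=True),
        safe_fetch("hackernews", 1.0),
    )


if __name__ == "__main__":
    for record in asyncio.run(aggregate()):
        print(record)
```

Since every coroutine returns a record, plain gather is enough here: there is no exception for return_exceptions=True to collect, and the caller decides what to do with the sources marked as failed.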