4. Retry with backoff and jitter
Networks stutter. APIs time out. Services go down for thirty seconds and come back. The categorizer from the last page already knows which failures are transient; the piece missing is the logic that actually retries them. Done well, the user never notices the first attempt failed. Done poorly, your client hammers a recovering service back into the ground. This page is the well-done version: exponential backoff, jitter, a cap on attempts, and honoring Retry-After when the server tells you exactly how long to wait.
The core principle
Only retry failures that are likely temporary. Never retry user-input errors (permanent problems) or not-found errors (the resource doesn't exist). The categorization system from the last page already encodes this logic: only the transient category triggers retry. Selective retry prevents wasted attempts on failures that won't resolve, while giving temporary issues the time they need.
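As a rough illustration of that split, here is a minimal sketch of a retryability check. It is not the categorizer from the last page: the name is_probably_transient and the set of status codes are assumptions chosen for illustration. The codes reflect a common, though not universal, convention for which HTTP responses are worth retrying.

```python
# Assumption: HTTP status codes that are commonly treated as temporary.
# 408 = request timeout, 429 = rate limited, 5xx = server-side trouble.
TRANSIENT_STATUS_CODES = {408, 429, 500, 502, 503, 504}


def is_probably_transient(status_code=None, exception=None):
    """Return True when a failure looks temporary and is worth retrying.

    Hypothetical helper for illustration only. It checks Python's built-in
    ConnectionError and TimeoutError; an HTTP client library usually raises
    its own exception types, which would need their own checks.
    """
    if isinstance(exception, (ConnectionError, TimeoutError)):
        return True
    return status_code in TRANSIENT_STATUS_CODES


print(is_probably_transient(status_code=503))  # True: service unavailable
print(is_probably_transient(status_code=404))  # False: not found is permanent
print(is_probably_transient(status_code=400))  # False: bad input is permanent
```

Anything that fails this check should be reported to the user immediately rather than retried.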
Exponential backoff: give services breathing room
When a service is struggling, hitting it again immediately makes the problem worse. Exponential backoff increases the wait between retries, giving the service time to recover.
The wait time doubles after each failure: 1s, 2s, 4s, 8s. This exponential growth gives struggling services progressively more time to recover while preventing your application from hammering a failing service.
import time

def calculate_backoff_delay(attempt, base_delay=1.0):
    """Calculate exponential backoff delay for retry attempt."""
    return base_delay * (2 ** attempt)

# Usage
for attempt in range(4):
    delay = calculate_backoff_delay(attempt)
    print(f"Attempt {attempt + 1}: wait {delay}s before retry")

# Output:
# Attempt 1: wait 1.0s before retry
# Attempt 2: wait 2.0s before retry
# Attempt 3: wait 4.0s before retry
# Attempt 4: wait 8.0s before retry
This pattern is widely used in production systems because it balances quick recovery (early retries happen fast) with service protection (later retries give more recovery time).
Jitter: prevent thundering herds
Exponential backoff alone has a critical flaw: if many users hit an error simultaneously, they'll all retry simultaneously. This creates synchronized retry waves (thundering herds) that can extend outages or even cause cascading failures.
Without jitter:
12:00:00 - API goes down, 1000 users all fail simultaneously
12:00:01 - ALL 1000 users retry at exactly the same time (thundering herd!)
12:00:03 - ALL 1000 users retry again at exactly the same time
12:00:07 - ALL 1000 users retry again at exactly the same time
Every retry hits the service with the full load of all failed users at once. This synchronized load can overwhelm the service even after it's recovered, extending the outage.
With jitter:
12:00:00 - API goes down, 1000 users all fail simultaneously
12:00:01 - Users retry after waits spread across 1.0-1.5 seconds (distributed load)
12:00:03 - Users retry after waits spread across 2.0-2.5 seconds (distributed load)
12:00:07 - Users retry after waits spread across 4.0-4.5 seconds (distributed load)
Jitter adds random variation to wait times, spreading retries over time. Instead of 1000 simultaneous requests, you get a distributed load the service can handle.
Why the big clouds all do this.
Every major cloud SDK (AWS, Google Cloud, Azure) ships retry logic with both exponential backoff and jitter, because their networks have felt what synchronized retries do to a recovering service: the client herd hits the service the instant it comes back up, pushes it back down, and the outage extends. The fix is a single line: wait = base_delay * (2 ** attempt) + random(0, base_delay * 0.5). That's all jitter is. One random component, added to the exponential wait, and the thundering herd becomes a gentle stream.
import time
import random

def calculate_backoff_with_jitter(attempt, base_delay=1.0):
    """Calculate exponential backoff with jitter."""
    exponential_delay = base_delay * (2 ** attempt)
    jitter = random.uniform(0, base_delay * 0.5)
    return exponential_delay + jitter

# Usage - notice the random variation
for attempt in range(4):
    delay = calculate_backoff_with_jitter(attempt)
    print(f"Attempt {attempt + 1}: wait {delay:.2f}s before retry")

# Output (random variation each time):
# Attempt 1: wait 1.23s before retry
# Attempt 2: wait 2.41s before retry
# Attempt 3: wait 4.17s before retry
# Attempt 4: wait 8.38s before retry
The jitter (a random extra wait between 0 and 0.5 × base_delay, added on top of the exponential delay) is just enough to desynchronize retries without significantly affecting recovery time. This simple addition breaks up thundering herds.
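To make the desynchronizing effect concrete, here is a small simulation sketch. It assumes 1000 clients that all failed at the same instant and compares how many first retries land in the busiest 50 ms window with and without jitter. The helper names (backoff_delay, first_retry_times, busiest_bucket) and the 50 ms window are illustrative choices; the delay formula is the one used above.

```python
import random


def backoff_delay(attempt, base_delay=1.0, jitter=True, rng=random):
    """Exponential delay, optionally plus up to 0.5 * base_delay of jitter."""
    delay = base_delay * (2 ** attempt)
    if jitter:
        delay += rng.uniform(0, base_delay * 0.5)
    return delay


def first_retry_times(clients=1000, jitter=True, seed=42):
    """Wait before the first retry for each client that failed at t=0."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    return [backoff_delay(0, jitter=jitter, rng=rng) for _ in range(clients)]


def busiest_bucket(times, bucket_width=0.05):
    """Count the retries that land in the most crowded time window."""
    counts = {}
    for t in times:
        bucket = int(t / bucket_width)
        counts[bucket] = counts.get(bucket, 0) + 1
    return max(counts.values())


without = first_retry_times(jitter=False)
with_jitter = first_retry_times(jitter=True)

print("Busiest 50 ms window without jitter:", busiest_bucket(without))
print("Busiest 50 ms window with jitter:", busiest_bucket(with_jitter))
```

Without jitter, every client lands in the same window, so the service sees all 1000 requests at once. With jitter, the same requests are spread across roughly half a second, and the busiest window holds only a fraction of them.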
Complete retry implementation
Here's a complete retry function that combines categorization, exponential backoff, jitter, and max attempts:
import time
import random
import requests
import logging

def retry_with_backoff(func, *args, max_attempts=3, base_delay=1.0, **kwargs):
    """
    Retry function with exponential backoff and jitter.
    Only retries transient failures. Honors Retry-After on 429 responses.
    """
    attempt = 0
    while attempt < max_attempts:
        try:
            result = func(*args, **kwargs)
            return result, None
        except Exception as e:
            # Categorize the error
            category, error_type, context = categorize_error(e)

            # Only retry transient errors
            if category != "transient":
                # Not retryable - return error immediately
                message = compose_error_message(category, error_type, **context)
                return None, message

            attempt += 1

            # Reached max attempts?
            if attempt >= max_attempts:
                message = compose_error_message(category, error_type,
                                                attempts=attempt, **context)
                return None, message

            # Honor Retry-After when the server provides one (429 only).
            # Otherwise, use exponential backoff with jitter.
            if error_type == "rate_limit" and context.get("retry_seconds"):
                delay = float(context["retry_seconds"])
            else:
                exponential_delay = base_delay * (2 ** (attempt - 1))
                jitter = random.uniform(0, base_delay * 0.5)
                delay = exponential_delay + jitter

            # Log for developers
            logging.warning(
                f"Attempt {attempt} failed: {error_type}. "
                f"Retrying in {delay:.1f}s..."
            )

            # Wait before retry
            time.sleep(delay)

    # Reached only if max_attempts <= 0 (the loop body never ran)
    return None, "Maximum retry attempts exceeded"
This function encapsulates the complete retry logic: it categorizes errors, only retries transient failures, uses exponential backoff with jitter, respects max attempts, and logs retry attempts for debugging.
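One practical question when choosing max_attempts is how long a user might end up waiting. The sketch below estimates the worst-case total sleep for the backoff schedule used above: the loop sleeps after every failed attempt except the last, and the jitter is at most 0.5 × base_delay. The helper name worst_case_wait is hypothetical, and the estimate excludes any server-specified Retry-After waits and the time the requests themselves take.

```python
def worst_case_wait(max_attempts=3, base_delay=1.0):
    """Upper bound on total backoff sleep, excluding Retry-After waits."""
    total = 0.0
    # A sleep follows every failed attempt except the final one.
    for attempt in range(1, max_attempts):
        exponential_delay = base_delay * (2 ** (attempt - 1))
        max_jitter = base_delay * 0.5
        total += exponential_delay + max_jitter
    return total


for attempts in (3, 4, 5):
    print(f"max_attempts={attempts}: up to {worst_case_wait(attempts):.1f}s of waiting")

# Output:
# max_attempts=3: up to 4.0s of waiting
# max_attempts=4: up to 8.5s of waiting
# max_attempts=5: up to 17.0s of waiting
```

Each extra attempt roughly doubles the budget, which is one reason interactive tools tend to keep max_attempts small.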
Retry in action: what the user sees
From the user's perspective, retry logic makes temporary failures nearly invisible:
Enter city name: Tokyo
Looking up coordinates for 'Tokyo'...
Connection issue. Retrying in 1.3 seconds...
Found: Tokyo, Tokyo, Japan
Fetching weather data...
Temperature: 18°C
Conditions: Partly cloudy
The network stuttered, the application waited 1.3 seconds (1s base + random jitter), retried automatically, and succeeded. The user barely noticed - just a brief pause and a helpful status message. This is production-grade error handling: failures happen, but users can still complete their tasks. A permanent failure, such as a misspelled city name, looks different:
Enter city name: Lndon
Looking up coordinates for 'Lndon'...
We couldn't find weather data for "Lndon".
Please check the spelling or try a nearby city.
Examples: London, Dublin, Manchester
No retry happened - the error was categorized as "not_found", which isn't transient. The user received immediate, actionable feedback without waiting through pointless retry attempts.
Special case: respecting rate limits
When APIs return 429 (Too Many Requests) with a Retry-After header, honor it exactly. This is different from standard exponential backoff - the API is telling you precisely when to retry:
import logging
import time

def handle_rate_limit(response):
    """Handle 429 rate limit with Retry-After header."""
    if response.status_code == 429:
        retry_after = response.headers.get('Retry-After')
        if retry_after:
            try:
                # Retry-After is usually a whole number of seconds
                wait_seconds = int(retry_after)
            except ValueError:
                # It can also be an HTTP-date, which is not parsed here -
                # fall back to exponential backoff
                return True  # Should retry with standard backoff
            logging.info(f"Rate limited. Waiting {wait_seconds}s as instructed.")
            time.sleep(wait_seconds)
            return True  # Should retry
        else:
            # No Retry-After header - fall back to exponential backoff
            return True  # Should retry with standard backoff
    return False  # Not a rate limit
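The HTTP specification allows Retry-After to carry either a number of seconds or an HTTP-date. A minimal sketch of a parser that accepts both forms follows. The function name parse_retry_after and its now parameter are assumptions made for illustration; the date parsing relies on email.utils.parsedate_to_datetime from the standard library.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None):
    """Return the wait in seconds, or None if the header is missing or unparseable."""
    if value is None:
        return None
    value = value.strip()

    # Form 1: delay in seconds, e.g. "120"
    if value.isdecimal():
        return float(int(value))

    # Form 2: HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
    try:
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # Caller should fall back to exponential backoff
    if retry_at.tzinfo is None:
        # Assumption: treat a date without a usable zone as UTC
        retry_at = retry_at.replace(tzinfo=timezone.utc)

    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", never a negative wait
    return max(0.0, (retry_at - now).total_seconds())


fixed_now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
print(parse_retry_after("120"))                                           # 120.0
print(parse_retry_after("Wed, 21 Oct 2015 07:28:00 GMT", now=fixed_now))  # 60.0
print(parse_retry_after("soon"))                                          # None
```

Returning None for anything unparseable keeps the decision with the caller, which can then fall back to exponential backoff with jitter.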
Being a good API citizen
Honoring Retry-After headers isn't just polite; it prevents your application from being banned or throttled more aggressively. APIs rate-limit to protect their infrastructure, and respecting those limits is what tells the service you're a well-behaved client. Ignoring them risks getting your API key revoked or your IP blocked entirely, at which point the cost of the next fix is far higher than a 60-second wait.
Blocking vs non-blocking retry
time.sleep() blocks the whole process
The retry logic shown here uses time.sleep(), which blocks the calling thread during each wait; in a single-threaded program, that means the entire process. That's fine for CLI tools, scripts, and low-traffic applications, where the only thing waiting is the one request you're retrying. In a web application handling multiple users at once, it's a problem: a worker can't serve other requests while it's sleeping, so one slow retry turns into latency for every user queued behind it.
For high-concurrency applications, use a non-blocking approach instead:
- Async/await with asyncio. The pattern for Python web frameworks like FastAPI or aiohttp; await asyncio.sleep() yields control back to the event loop during the wait.
- Task queues (Celery, RQ). Move the retry work onto a background worker, so the request handler returns immediately and the retries happen out of band.
- Message queues with dead-letter queues. The durable version of the above for multi-service architectures, where a failed message is routed to a separate queue for inspection or delayed retry.
Blocking retry is the right shape for CLI tools and low-traffic applications — which is what the Weather Dashboard is. When you eventually ship a web service with concurrent users, swap time.sleep() for asyncio.sleep() (or delegate retries to a queue) and the rest of the pipeline stays the same.
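For a sense of what the asyncio variant might look like, here is a minimal sketch under stated assumptions. The name retry_async and the is_transient callback are hypothetical stand-ins (the callback plays the role of the categorizer), and unlike retry_with_backoff this version re-raises the final exception instead of returning a (result, message) pair.

```python
import asyncio
import random


async def retry_async(func, *args, max_attempts=3, base_delay=1.0,
                      is_transient=lambda exc: True, **kwargs):
    """Await func with exponential backoff and jitter, without blocking the event loop."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await func(*args, **kwargs)
        except Exception as exc:
            # Give up on permanent errors and on the final attempt
            if not is_transient(exc) or attempt == max_attempts:
                raise
            exponential_delay = base_delay * (2 ** (attempt - 1))
            jitter = random.uniform(0, base_delay * 0.5)
            # Other tasks keep running while this one waits
            await asyncio.sleep(exponential_delay + jitter)


async def main():
    calls = {"count": 0}

    async def flaky():
        # Simulated call that fails twice, then succeeds
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("simulated network stutter")
        return "ok"

    result = await retry_async(flaky, base_delay=0.01)
    print(result, "after", calls["count"], "attempts")


asyncio.run(main())
# Output:
# ok after 3 attempts
```

The backoff arithmetic is unchanged from the blocking version; the only structural difference is that the wait is awaited rather than slept.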