7. Rate limiting

Rate limits stop one client from consuming all of your API's capacity. This section builds an in-memory limiter for a single-process tutorial deployment; section 10 then explains why real multi-worker traffic needs a shared store such as Redis.

Why rate limiting matters

Without a limit, one caller can monopolise the API. A script that loops at a thousand requests per second exhausts database connections and degrades response times for everyone else; a misconfigured client retrying every failure burns through your external quotas. Rate limiting is the cap that keeps a single misbehaving caller from becoming a denial of service for the rest.

The cap also has to be legible. Documenting the limit ("100 requests per hour for the basic tier, 1000 per hour for the premium tier") and surfacing current usage in response headers lets a caller pace itself rather than discover the cap by hitting it. When a caller does hit it, the 429 response carries a Retry-After header that says exactly when to try again.

Fixed-window rate limiting

Fixed-window rate limiting divides time into fixed windows (e.g., one hour) and counts requests per window. If a user makes 100 requests in the current hour, request 101 gets blocked until the hour resets. This is simple to implement and reason about.
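To make the window arithmetic concrete, here is a minimal sketch of how a timestamp maps to its one-hour window. The helper name hour_window and the sample timestamps are illustrative assumptions, not part of the chapter's code.

```python
from datetime import datetime, timedelta, timezone


def hour_window(now: datetime) -> tuple[datetime, datetime]:
    """Return (start, reset) of the one-hour window that contains `now`."""
    start = now.replace(minute=0, second=0, microsecond=0)
    return start, start + timedelta(hours=1)


# Two requests made two minutes apart can still land in different windows.
before = datetime(2026, 5, 25, 15, 59, tzinfo=timezone.utc)
after = datetime(2026, 5, 25, 16, 1, tzinfo=timezone.utc)

for ts in (before, after):
    start, reset = hour_window(ts)
    print(ts.isoformat(), "->", start.isoformat(), "to", reset.isoformat())
```

Every request whose timestamp truncates to the same start shares one counter; the counter is effectively discarded the moment the clock passes reset.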

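More sophisticated algorithms exist (sliding window, token bucket), but a fixed window provides good protection with minimal complexity. Its main weakness is the window boundary: a caller can spend one window's budget just before the reset and the next window's budget just after it, briefly doubling the effective rate. Many public APIs accept that trade-off, including ones you've worked with throughout this book.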
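The trade-off between fixed windows and the more sophisticated algorithms shows up at the window boundary. The sketch below is a simplified stand-in for a fixed-window limiter, not the chapter's implementation; LIMIT, count_allowed, and the burst timestamps are assumptions chosen for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

LIMIT = 100  # hypothetical hourly budget


def count_allowed(timestamps: list[datetime]) -> int:
    """Count how many requests a fixed one-hour window limiter lets through."""
    per_window = Counter()
    allowed = 0
    for ts in timestamps:
        window = ts.replace(minute=0, second=0, microsecond=0)
        if per_window[window] < LIMIT:
            per_window[window] += 1
            allowed += 1
    return allowed


boundary = datetime(2026, 5, 25, 16, 0, tzinfo=timezone.utc)
# 100 requests in the last minute of one window, then 100 in the first
# minute of the next window.
burst = [boundary - timedelta(seconds=60 - i * 0.5) for i in range(100)]
burst += [boundary + timedelta(seconds=i * 0.5) for i in range(100)]

print(count_allowed(burst))  # 200
```

All 200 requests pass within roughly two minutes, because each half is counted against a different window. A sliding window or token bucket would reject part of that burst, at the cost of more bookkeeping.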

rate_limit.py

from datetime import datetime, timedelta, timezone
from database import APIKey as DBAPIKey

# Rate limit tiers (requests per hour)
RATE_LIMITS = {
    "basic": 100,
    "premium": 1000,
    "unlimited": float('inf')
}

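# In-memory request log for development.
# It is per-process and unlocked, so concurrent requests can race between
# the count and the append. In production, replace this with Redis or
# another shared store.
REQUEST_LOGS: dict[int, list[datetime]] = {}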


def check_rate_limit(api_key: DBAPIKey) -> tuple[bool, dict]:
    """
    Check and record usage for an API key.
    Returns (allowed: bool, info: dict with details)
    """
    tier = api_key.rate_limit_tier
    # Unknown tiers fall back to the basic budget.
    limit = RATE_LIMITS.get(tier, RATE_LIMITS["basic"])
    
    # Calculate current window (start of current hour)
    now = datetime.now(timezone.utc)
    window_start = now.replace(minute=0, second=0, microsecond=0)
    
    # Keep only requests in the current window
    key_requests = REQUEST_LOGS.setdefault(api_key.id, [])
    key_requests[:] = [ts for ts in key_requests if ts >= window_start]
    request_count = len(key_requests)
    
    # Calculate when window resets
    window_reset = window_start + timedelta(hours=1)
    seconds_until_reset = int((window_reset - now).total_seconds())
    
    allowed = request_count < limit
    if allowed and limit != float('inf'):
        key_requests.append(now)
        request_count += 1
    
    info = {
        "limit": limit,
        "remaining": max(0, limit - request_count),
        "reset": window_reset.isoformat(),
        "reset_in_seconds": seconds_until_reset
    }
    
    return allowed, info

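The tiers define the hourly budget for each kind of key. The in-memory request log keeps one timestamp list per API key, prunes entries that fall before the start of the current hour, and counts what remains. Keys on the unlimited tier are never recorded, so their limit and remaining values are both float('inf').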
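The pruning step is easier to follow in isolation. This minimal sketch uses a made-up key id and hand-picked timestamps (both assumptions) to show what the slice assignment does to the stored list.

```python
from datetime import datetime, timedelta, timezone

logs: dict[int, list[datetime]] = {}
now = datetime(2026, 5, 25, 15, 30, tzinfo=timezone.utc)
window_start = now.replace(minute=0, second=0, microsecond=0)

# Hypothetical key id 1 with one stale timestamp (14:30) and one current (15:25).
logs[1] = [now - timedelta(hours=1), now - timedelta(minutes=5)]

key_requests = logs.setdefault(1, [])
# Slice assignment replaces the contents in place, so the list stored in
# `logs` is pruned too; a plain `key_requests = [...]` would only rebind
# the local name and leave the stale entry in the dict.
key_requests[:] = [ts for ts in key_requests if ts >= window_start]

print(len(logs[1]))             # 1
print(key_requests is logs[1])  # True
```

Because the list is pruned on every check, it never holds more than one window's worth of timestamps per key.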

If the request is within budget, the limiter records the current timestamp and returns quota metadata. The endpoint uses that metadata for response headers so users know their limit, remaining requests, and reset time.

The X-RateLimit-* headers expose quota state on every response, so a caller doesn't need a separate endpoint to check where they stand. A well-written client reads X-RateLimit-Remaining and slows down before the limit lands; an unsophisticated one at least gets a 429 with a Retry-After rather than mysterious failures.
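On the client side, that pacing decision might look like the sketch below. The function seconds_to_wait and the constants LOW_WATER and PAUSE_SECONDS are hypothetical; a real client would tune them and read the headers from its HTTP library's response object.

```python
LOW_WATER = 5        # hypothetical: start slowing down below this many requests
PAUSE_SECONDS = 1.0  # hypothetical pause once the budget runs low


def seconds_to_wait(status_code: int, headers: dict[str, str]) -> float:
    """Decide how long to pause before the next request."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}

    if status_code == 429:
        # This chapter's server sends Retry-After as a number of seconds.
        return float(lowered.get("retry-after", "1"))

    remaining = lowered.get("x-ratelimit-remaining")
    # float() also accepts "inf", which str(float('inf')) produces.
    if remaining is not None and float(remaining) < LOW_WATER:
        return PAUSE_SECONDS
    return 0.0


print(seconds_to_wait(200, {"X-RateLimit-Remaining": "99"}))  # 0.0
print(seconds_to_wait(200, {"X-RateLimit-Remaining": "3"}))   # 1.0
print(seconds_to_wait(429, {"Retry-After": "1847"}))          # 1847.0
```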

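The header naming follows the X-RateLimit-* convention used by GitHub and, with a slightly different spelling (x-rate-limit-*), by Twitter; clients already written against those APIs will recognise the shape. One difference is worth documenting for callers: those APIs report the reset time in UTC epoch seconds, whereas this limiter sends an ISO 8601 timestamp.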

Applying the rate limiter to endpoints

Integrate rate limiting into the authentication dependency so every protected endpoint checks limits automatically.

rate_limit.py (extended)

# Add to rate_limit.py's existing imports:
from fastapi import Depends, HTTPException, Response
from auth import require_api_key


def require_api_key_with_rate_limit(
    response: Response,
    api_key: DBAPIKey = Depends(require_api_key),
) -> DBAPIKey:
    """
    Compose on require_api_key: layer the rate-limit check on top of the
    auth dependency. require_api_key runs first; if it raises 401 the body
    below never executes. No header parsing or hash lookup is duplicated.
    """
    allowed, info = check_rate_limit(api_key)

    response.headers["X-RateLimit-Limit"] = str(info["limit"])
    response.headers["X-RateLimit-Remaining"] = str(info["remaining"])
    response.headers["X-RateLimit-Reset"] = info["reset"]

    if not allowed:
        # Headers set on `response` are not carried over when an exception is
        # raised, so the 429 repeats the ones a client needs.
        raise HTTPException(
            status_code=429,
            detail=f"Rate limit exceeded. Resets in {info['reset_in_seconds']} seconds",
            headers={
                "Retry-After": str(info['reset_in_seconds']),
                "X-RateLimit-Limit": str(info["limit"]),
                "X-RateLimit-Reset": info["reset"],
            },
        )
    return api_key

The rate-limit dependency lives in rate_limit.py and stacks on top of the auth dependency rather than re-running its checks. api_key: DBAPIKey = Depends(require_api_key) tells FastAPI to resolve the dependency imported from auth.py first; only if that returns a validated key does the rate-limit body run. The imports flow one way (main.py imports rate_limit.py, which imports auth.py), so main.py can compose routes without creating a circular import.

The new dependency also receives the FastAPI Response object so it can attach X-RateLimit-* headers before the route returns. When the key is over budget, it raises 429 before the route body runs; the Retry-After header gives clients a machine-readable wait time.

Now wire the new dependency into the route:

main.py (updated)

# In main.py, replace the protected-route dependency import:
from fastapi import Query
from database import APIKey as DBAPIKey
from rate_limit import require_api_key_with_rate_limit


@app.get("/articles", response_model=ArticleListResponse)
def list_articles(
    category: str | None = None,
    source: str | None = None,
    limit: int = Query(20, ge=1, le=100),
    api_key: DBAPIKey = Depends(require_api_key_with_rate_limit),
    db: Session = Depends(get_db)
):
    query = db.query(DBArticle)

    if category:
        query = query.filter(DBArticle.category == category)
    if source:
        query = query.filter(DBArticle.source == source)

    articles = query.limit(limit).all()
    return {"articles": articles, "count": len(articles)}

Test rate limiting:

Terminal

# Check headers on normal request
curl -i -H "Authorization: Bearer YOUR_KEY" http://localhost:8000/articles
# HTTP/1.1 200 OK
# X-RateLimit-Limit: 100
# X-RateLimit-Remaining: 99
# X-RateLimit-Reset: 2026-05-25T16:00:00+00:00

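# Script to hit rate limit (basic tier = 100 requests/hour)
# -s hides the progress meter, -o /dev/null drops the body, -w prints the status code
for i in {1..101}; do
  curl -s -o /dev/null -w "%{http_code}\n" -H "Authorization: Bearer YOUR_KEY" http://localhost:8000/articles
done

# The header check above already used one request, so (assuming every request
# falls in the same hour) the loop prints 200 ninety-nine times, then 429 twice.
# Repeat the curl -i command to see the full 429 response:
# HTTP/1.1 429 Too Many Requests
# Retry-After: 1847
# X-RateLimit-Limit: 100
# X-RateLimit-Reset: 2026-05-25T16:00:00+00:00
# {"detail":"Rate limit exceeded. Resets in 1847 seconds"}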

The 429 carries both the human-readable detail and the machine-readable Retry-After, so a client can show a useful message to the user while a retry loop reads the header and waits the right amount of time.
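A retry loop that honours Retry-After can be sketched as follows. The names call_with_retry and Reply are hypothetical, and the transport is a fake that replays canned responses so the example runs without a server; a real client would wrap its HTTP library's request call instead.

```python
import time
from typing import Callable

# A reply is modelled as (status_code, headers, body) for illustration only.
Reply = tuple[int, dict[str, str], str]


def call_with_retry(
    send: Callable[[], Reply],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Reply:
    """Call send(); on a 429, wait Retry-After seconds and try again."""
    reply = send()
    for _ in range(max_attempts - 1):
        status, headers, _body = reply
        if status != 429:
            break
        sleep(float(headers.get("Retry-After", "1")))
        reply = send()
    return reply


# Fake transport: one 429, then a 200.
replies = iter([
    (429, {"Retry-After": "2"}, '{"detail":"Rate limit exceeded. Resets in 2 seconds"}'),
    (200, {"X-RateLimit-Remaining": "99"}, '{"articles": [], "count": 0}'),
])
waits: list[float] = []  # records requested waits instead of really sleeping
final = call_with_retry(lambda: next(replies), sleep=waits.append)

print(final[0], waits)  # 200 [2.0]
```

Passing sleep as a parameter keeps the loop testable; in production the default time.sleep applies, and max_attempts bounds how long a caller can be held up.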

Next, in section 8, the auth and rate-limit dependencies meet the actual work of the API: the /articles endpoint that checks the cache, fans out to NewsAPI and the Guardian on a miss, and normalises both into one response shape.