4. Source-specific normalizers

A normalizer takes a raw API response and returns a list of Article objects in canonical format, paired with a meta dict carrying per-response bookkeeping like total-available counts. You write three of them: one complete walkthrough for NewsAPI, then deltas for Guardian and HackerNews so you see what changes per source rather than three near-identical files.

The five steps every normalizer runs

Each normalizer handles one API's quirks (field name mappings, nested structures, timestamp conversions, optional field extraction) using Chapter 10's defensive helpers. The shape stays the same; only the field access changes.

  1. Extract items: find the article array using extract_items_and_meta()
  2. Loop defensively: process each item with try/except; skip bad items rather than crashing
  3. Validate required fields: get title, URL, and timestamp; continue on any missing or empty value (source_name gets a default in step 4, so it's never blank by construction)
  4. Map fields: translate source field names to canonical names, including any format conversions (timestamps, source-name lookups, URL fallbacks)
  5. Construct Article: create the canonical object with the extracted values

One thing worth flagging up front: step 1's extract_items_and_meta normalizes each API's total-results key (totalResults / response.total / nbHits, per the comparison table on the previous page) into a canonical meta["total"] alongside the items. Every normalizer below reads that uniform key without branching on source, and the aggregator on the next page uses it for the "X of Y matching" display.
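
To make that normalization concrete, here is a minimal sketch of how a helper like extract_items_and_meta could produce the canonical meta["total"]. It is not Chapter 10's actual implementation: the name extract_items_and_meta_sketch, the DEFAULT_CONTAINERS and TOTAL_KEYS lists, and the fallback behavior are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional, Tuple

# Assumed key names, taken from the three APIs discussed on this page.
DEFAULT_CONTAINERS = ["articles", "results", "hits"]
TOTAL_KEYS = ["totalResults", "total", "nbHits"]


def extract_items_and_meta_sketch(
    response: Any, container_hints: Optional[List[str]] = None
) -> Tuple[List[Any], Dict[str, Any]]:
    """Hypothetical stand-in for Chapter 10's extract_items_and_meta."""
    if not isinstance(response, dict):
        return [], {}

    # Take the first hinted key whose value is actually a list.
    items: List[Any] = []
    for key in (container_hints or DEFAULT_CONTAINERS):
        candidate = response.get(key)
        if isinstance(candidate, list):
            items = candidate
            break

    # Copy whichever total-results key is present into the canonical name.
    meta: Dict[str, Any] = {}
    for key in TOTAL_KEYS:
        value = response.get(key)
        if isinstance(value, int) and not isinstance(value, bool):
            meta["total"] = value
            break

    return items, meta
```

With a helper shaped like this, a NewsAPI payload carrying totalResults and a HackerNews payload carrying nbHits both come back with the same meta["total"] key, which is what lets the normalizers skip any per-source branching.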

Read the NewsAPI walkthrough carefully; skim the Guardian and HackerNews sections for what they do differently.

NewsAPI normalizer (complete)

NewsAPI has the most straightforward structure of the three, which makes it the right place to demonstrate the complete pattern. Save this at the project root as normalize_newsapi.py:

normalize_newsapi.py
from typing import List, Tuple
from api_helpers import safe_get, extract_items_and_meta
from models import Article


def normalize_newsapi(response: dict) -> Tuple[List[Article], dict]:
    """
    Transform NewsAPI response into canonical Article format.

    Returns a tuple of (articles, meta). The meta dict carries per-response
    bookkeeping; the aggregator uses meta["total"] for the "X of Y matching"
    display, while the dataclass holds only per-article fields.

    NewsAPI structure:
      - articles in 'articles' array at root
      - source nested as object with 'name' field
      - timestamp in 'publishedAt' (ISO 8601)
      - optional: author, description, urlToImage, content
    """
    # Step 1: extract items + meta using Chapter 10 helper
    items, meta = extract_items_and_meta(response)

    articles = []

    # Step 2: loop defensively
    for item in items:
        try:
            # Step 3: extract and validate required fields; fail fast on missing data
            title = item.get("title", "").strip()
            url = item.get("url", "").strip()
            published_at = item.get("publishedAt", "").strip()
            if not title or not url or not published_at:
                continue

            # Step 4: map remaining fields (source name, optionals, format conversions)
            source_name = safe_get(item, "source.name", "Unknown Source")

            description = item.get("description")
            if description:
                description = description.strip()

            author = item.get("author")
            if author:
                author = author.strip()

            image_url = item.get("urlToImage")
            if image_url:
                image_url = image_url.strip()

            content = item.get("content")
            content_preview = None
            if content:
                content_preview = content[:200] + "..." if len(content) > 200 else content

            # Step 5: construct the canonical Article
            article = Article(
                title=title,
                url=url,
                published_at=published_at,
                source_name=source_name,
                description=description,
                author=author,
                image_url=image_url,
                content_preview=content_preview,
            )

            articles.append(article)

        except (AttributeError, ValueError, KeyError, TypeError) as e:
            # Log and skip malformed items; don't crash the entire batch.
            # AttributeError covers null or non-string values that reach .strip().
            print(f"Warning: skipping malformed NewsAPI item: {e}")
            continue

    return articles, meta

Five defensive techniques to notice:

  • Early validation: check required fields before processing optionals, so an item with missing data is skipped before any further work
  • Continue on failure: skip individual bad items rather than failing the entire batch
  • String cleaning: .strip() removes whitespace that APIs sometimes include
  • Safe nested access: safe_get(item, "source.name") handles a missing source object
  • Content truncation: the preview field caps long text so display layouts stay predictable
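
The string-cleaning and truncation techniques repeat for every optional field, so one option is to factor them into two small helpers. The sketch below is an illustration, not part of the chapter's code: clean_text and make_preview are hypothetical names. One behavioral difference to be aware of is that clean_text returns None for empty or whitespace-only strings, whereas the inline version above leaves an empty string in place.

```python
from typing import Any, Optional


def clean_text(value: Any) -> Optional[str]:
    """Return a stripped string, or None for missing, empty, or non-string input."""
    if not isinstance(value, str):
        return None
    stripped = value.strip()
    return stripped or None


def make_preview(value: Any, limit: int = 200) -> Optional[str]:
    """Cap text at `limit` characters, appending '...' only when it was cut."""
    if not isinstance(value, str) or not value:
        return None
    return value[:limit] + "..." if len(value) > limit else value


print(clean_text("  Sarah Chen  "))   # Sarah Chen
print(make_preview("x" * 250)[-5:])   # xx...
```

Because both helpers check the type first, a null or numeric value in the response yields None instead of raising on .strip() or slicing.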

Test it against real data. Save this at the project root as test_newsapi_normalizer.py:

test_newsapi_normalizer.py
import os
import requests
from dotenv import load_dotenv
from normalize_newsapi import normalize_newsapi

load_dotenv()

NEWSAPI_KEY = os.environ.get("NEWSAPI_KEY")

response = requests.get(
    "https://newsapi.org/v2/everything",
    params={"q": "python", "pageSize": 5, "apiKey": NEWSAPI_KEY},
    timeout=10,
).json()

articles, meta = normalize_newsapi(response)

total = meta.get("total")
suffix = f" of {total} matching" if total else ""
print(f"NewsAPI: normalized {len(articles)}{suffix} articles\n")
for i, article in enumerate(articles[:3], 1):
    print(f"{i}. {article.title}")
    print(f"   Source: {article.source_name}")
    print(f"   Published: {article.format_timestamp()}")
    if article.author:
        print(f"   By: {article.author}")
    print()

Run it from the project root:

Terminal
$ python test_newsapi_normalizer.py
NewsAPI: normalized 5 of 12847 matching articles

1. Python 3.13 Performance Improvements
   Source: TechCrunch
   Published: January 15, 2025 at 02:30 PM
   By: Sarah Chen

2. Machine Learning with Python: 2025 Guide
   Source: Medium
   Published: January 14, 2025 at 11:45 AM

3. Django 5.0 Released
   Source: Django Project
   Published: January 13, 2025 at 09:00 AM
   By: Django Team

Guardian normalizer: where it diverges

The Guardian normalizer follows the same five-step pattern. Steps 2 (loop) and 5 (construct) stay byte-for-byte identical to NewsAPI, and so does the validation check in Step 3; only the keys that feed it change. The Guardian-specific work all sits in Step 1 (extract items through the response wrapper) and Step 4 (map the "web" prefixes, drill into the optional fields sub-object, use sectionName instead of a nested source).

Field extraction: double nesting and name mapping

Guardian wraps everything in a response object and uses "web" prefixes. Extract through the extra layer and rename to canonical fields:

normalize_guardian.py (extraction step)
from api_helpers import extract_items_and_meta, safe_get

# Guardian wraps everything in 'response'; return an empty (articles, meta) pair on bad input.
if not isinstance(response, dict) or "response" not in response:
    return [], {}

response_obj = response["response"]
items, meta = extract_items_and_meta(response_obj, container_hints=["results"])

# Inside the loop: extract Guardian's field names
title = item.get("webTitle", "").strip()                    # not 'title'
url = item.get("webUrl", "").strip()                        # not 'url'
published_at = item.get("webPublicationDate", "").strip()   # not 'publishedAt'

# Source: use section name as identifier
source_name = item.get("sectionName", "The Guardian")

Same concept, different labels. webTitle, webUrl, and webPublicationDate stand in for NewsAPI's simple names.

Validation: identical to NewsAPI

Validation logic stays unchanged. This is the invariant part of the pattern; every normalizer rejects items missing the same required canonical fields:

normalize_guardian.py (validation step)
# Skip items missing required fields; identical pattern to NewsAPI
if not title or not url or not published_at:
    continue

Optional fields: drilling into the nested object

Guardian buries optional metadata in a nested fields object. Reach for safe_get to drill into it with a single call per field; the helper handles the missing-fields-object case for free, so you don't need an explicit guard around each lookup:

normalize_guardian.py (optional fields)
# safe_get drills into the optional 'fields' sub-object in one call.
description = safe_get(item, "fields.trailText")
if description:
    description = description.strip()

author = safe_get(item, "fields.byline")
if author:
    author = author.strip()

image_url = safe_get(item, "fields.thumbnail")
if image_url:
    image_url = image_url.strip()

body = safe_get(item, "fields.bodyText")
content_preview = None
if body:
    content_preview = body[:200] + "..." if len(body) > 200 else body

The dotted path ("fields.trailText") is what makes safe_get earn its keep here: same tool that handled NewsAPI's source.name, same tool here, no per-API guard logic to maintain. The rest (strip whitespace, truncate content) matches NewsAPI exactly.
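
If you want to see why one dotted-path call can replace a chain of guards, here is a minimal sketch of a safe_get-style helper. It is an assumption about how such a helper could work, not Chapter 10's actual code: the name safe_get_sketch is hypothetical, and treating an explicit null as missing is a design choice made here.

```python
from typing import Any


def safe_get_sketch(data: Any, path: str, default: Any = None) -> Any:
    """Walk a dotted path through nested dicts; return default if any hop fails."""
    current = data
    for key in path.split("."):
        # Stop as soon as the current level is not a dict or lacks the key.
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    # Assumption: an explicit null at the end of the path counts as missing.
    return default if current is None else current


guardian_item = {"webTitle": "Example", "fields": {"byline": "A. Writer"}}
print(safe_get_sketch(guardian_item, "fields.byline"))      # A. Writer
print(safe_get_sketch(guardian_item, "fields.trailText"))   # None
print(safe_get_sketch({"webTitle": "No fields"}, "fields.byline"))  # None
```

The missing-fields-object case and the missing-key case both fall out of the same isinstance check, which is why no per-API guard is needed.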

Construction: identical dataclass call

Once fields are extracted and transformed, construction is byte-for-byte identical to NewsAPI:

normalize_guardian.py (construction step)
article = Article(
    title=title,
    url=url,
    published_at=published_at,
    source_name=source_name,
    description=description,
    author=author,
    image_url=image_url,
    content_preview=content_preview,
)

articles.append(article)

Compare Guardian's phases to NewsAPI's: extraction changed (different field names, extra nesting), validation stayed identical, optional-field transformation changed (different locations), construction stayed identical. That's how professional normalizers earn their reuse: customize what varies, keep what doesn't.

Complete normalize_guardian.py

The fragments above teach the deltas; this is the assembled file you save at the project root. test_all_normalizers.py, later on this page, imports from it.

normalize_guardian.py
from typing import List, Tuple
from api_helpers import extract_items_and_meta, safe_get
from models import Article


def normalize_guardian(response: dict) -> Tuple[List[Article], dict]:
    """
    Transform Guardian API response into canonical Article format.

    Returns (articles, meta), same shape as normalize_newsapi. The meta dict
    surfaces Guardian's response.total alongside the items.

    Guardian structure:
      - articles in response.results (extra 'response' wrapper)
      - 'web' prefix on field names (webTitle, webUrl, webPublicationDate)
      - section name in 'sectionName'
      - optional metadata nested in 'fields' (trailText, byline, thumbnail, bodyText)
    """
    if not isinstance(response, dict) or "response" not in response:
        return [], {}

    response_obj = response["response"]
    items, meta = extract_items_and_meta(response_obj, container_hints=["results"])

    articles = []
    for item in items:
        try:
            title = item.get("webTitle", "").strip()
            url = item.get("webUrl", "").strip()
            published_at = item.get("webPublicationDate", "").strip()
            if not title or not url or not published_at:
                continue

            source_name = item.get("sectionName", "The Guardian")

            description = safe_get(item, "fields.trailText")
            if description:
                description = description.strip()

            author = safe_get(item, "fields.byline")
            if author:
                author = author.strip()

            image_url = safe_get(item, "fields.thumbnail")
            if image_url:
                image_url = image_url.strip()

            body = safe_get(item, "fields.bodyText")
            content_preview = None
            if body:
                content_preview = body[:200] + "..." if len(body) > 200 else body

            article = Article(
                title=title,
                url=url,
                published_at=published_at,
                source_name=source_name,
                description=description,
                author=author,
                image_url=image_url,
                content_preview=content_preview,
            )
            articles.append(article)

        except (AttributeError, ValueError, KeyError, TypeError) as e:
            # AttributeError covers null or non-string values that reach .strip().
            print(f"Warning: skipping malformed Guardian item: {e}")
            continue

    return articles, meta

HackerNews normalizer: where it diverges

HackerNews presents different challenges from both NewsAPI and Guardian: minimal metadata, dual-format timestamps (Unix integer or ISO string), and missing URLs for some post types. The five-step pattern stays the same; the field access logic changes.

Field extraction: URL fallback for Ask HN posts

Some HackerNews items (Ask HN posts and some Show HN posts) have no external URL. When it's missing, construct one from the objectID so the link points at the HN discussion page. Of the three APIs, this is the only one that needs URL construction:

normalize_hackernews.py (extraction step)
from api_helpers import extract_items_and_meta, try_fields

# Standard pattern: HackerNews uses 'hits' as the container
items, meta = extract_items_and_meta(response, container_hints=["hits"])

# Inside the loop: HackerNews uses simple field names
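# 'or ""' guards against a key that is present but null, which .get()'s default does not cover
title = (item.get("title") or "").strip()
url = (item.get("url") or "").strip()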

# HackerNews items without URLs are Ask HN or Show HN posts
# Construct URL from objectID when missing
if not url:
    object_id = item.get("objectID", "")
    if object_id:
        url = f"https://news.ycombinator.com/item?id={object_id}"
    # If still no URL after construction, validation below will skip the item

# Source is constant for HackerNews
source_name = "HackerNews"

Same reasoning as the timestamp conversion below: the canonical URL doesn't exist until the objectID fallback has had a chance to reconstruct it, so validation has to wait until after that step.

Optional fields and Unix timestamp conversion

The Algolia API exposes the same publication time twice: as created_at_i (Unix integer) and as created_at (ISO string). Reach for Chapter 10's try_fields to take whichever is present and let the type tell you which conversion path to take. The canonical model wants an ISO 8601 string, so the integer branch converts and the string branch passes through. The optional fields are sparse; most items have just a title, URL, and author username:

normalize_hackernews.py (transformation step)
from datetime import datetime, timezone

# try_fields returns the first present value; the type tells us which branch.
timestamp = try_fields(item, ["created_at_i", "created_at"], default="")
if isinstance(timestamp, int):
    try:
        dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)
        published_at = dt.isoformat().replace("+00:00", "Z")
    except (ValueError, OverflowError, OSError):
        # Out-of-range Unix value (fromtimestamp can raise any of these);
        # fall through to validation, which skips the item because
        # published_at is empty.
        published_at = ""
else:
    # Already an ISO string (or "" if neither field was present).
    published_at = timestamp

# HackerNews has minimal metadata; most fields are sparse
description = item.get("story_text")
if description:
    description = description.strip()

author = item.get("author")  # Username string
if author:
    author = author.strip()

image_url = None  # No images in HackerNews
content_preview = None

Validation and construction: still identical

The validation check itself is byte-for-byte identical to NewsAPI and Guardian (same if not title or not url or not published_at: continue), but it has to run after the timestamp resolution above, not before. NewsAPI and Guardian's published_at arrives from the API ready to validate; HackerNews's doesn't exist until try_fields and the conversion produce it. The check is the same; only its position in source order shifts. Once fields are normalized, the final Article(...) call is identical across all three.
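
To see the timestamp resolution described above in isolation, here is a self-contained sketch with a stand-in for try_fields. Both try_fields_sketch and resolve_published_at are hypothetical names, and the stand-in's rule (skip None and empty strings, return the first remaining value) is an assumption about the Chapter 10 helper rather than its documented behavior.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def try_fields_sketch(item: Dict[str, Any], names: List[str], default: Any = None) -> Any:
    """Return the first listed field whose value is not None or empty; else default."""
    for name in names:
        value = item.get(name)
        if value not in (None, ""):
            return value
    return default


def resolve_published_at(item: Dict[str, Any]) -> str:
    """Produce an ISO 8601 UTC string from either timestamp format, or '' on failure."""
    timestamp = try_fields_sketch(item, ["created_at_i", "created_at"], default="")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(timestamp, int) and not isinstance(timestamp, bool):
        try:
            dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)
        except (ValueError, OverflowError, OSError):
            return ""  # out-of-range Unix value
        return dt.isoformat().replace("+00:00", "Z")
    # Already a string (or "" if neither field was present).
    return timestamp if isinstance(timestamp, str) else ""


print(resolve_published_at({"created_at_i": 1700000000}))            # 2023-11-14T22:13:20Z
print(resolve_published_at({"created_at": "2023-11-14T22:13:20Z"}))  # 2023-11-14T22:13:20Z
print(repr(resolve_published_at({})))                                # ''
```

Both input formats converge on the same canonical string, and every failure path returns an empty string, so the shared validation check can reject the item without any HackerNews-specific logic.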

Complete normalize_hackernews.py

The fragments above teach the deltas; this is the assembled file you save at the project root. test_all_normalizers.py, later on this page, imports from it.

normalize_hackernews.py
from datetime import datetime, timezone
from typing import List, Tuple
from api_helpers import extract_items_and_meta, try_fields
from models import Article


def normalize_hackernews(response: dict) -> Tuple[List[Article], dict]:
    """
    Transform HackerNews (Algolia) response into canonical Article format.

    Returns (articles, meta), same shape as the other normalizers. The meta
    dict surfaces HackerNews's nbHits alongside the items.

    HackerNews structure:
      - items in 'hits' array
      - simple field names (title, url, author)
      - dual-format timestamp: created_at_i (Unix int) or created_at (ISO string)
      - URL missing for Ask HN / Show HN posts; constructed from objectID
      - sparse optional metadata
    """
    items, meta = extract_items_and_meta(response, container_hints=["hits"])

    articles = []
    for item in items:
        try:
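            # 'or ""' guards against a key that is present but null,
            # which .get()'s default does not cover.
            title = (item.get("title") or "").strip()
            url = (item.get("url") or "").strip()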

            # Ask HN / Show HN posts have no URL; construct from objectID.
            if not url:
                object_id = item.get("objectID", "")
                if object_id:
                    url = f"https://news.ycombinator.com/item?id={object_id}"

            source_name = "HackerNews"

            # try_fields returns the first present value; the type tells us
            # which conversion path to take.
            timestamp = try_fields(item, ["created_at_i", "created_at"], default="")
            if isinstance(timestamp, int):
                try:
                    dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)
                    published_at = dt.isoformat().replace("+00:00", "Z")
                except (ValueError, OverflowError, OSError):
                    published_at = ""
            else:
                published_at = timestamp

            # Validation runs after timestamp resolution because published_at
            # doesn't exist until try_fields produces it.
            if not title or not url or not published_at:
                continue

            description = item.get("story_text")
            if description:
                description = description.strip()

            author = item.get("author")
            if author:
                author = author.strip()

            image_url = None
            content_preview = None

            article = Article(
                title=title,
                url=url,
                published_at=published_at,
                source_name=source_name,
                description=description,
                author=author,
                image_url=image_url,
                content_preview=content_preview,
            )
            articles.append(article)

        except (AttributeError, ValueError, KeyError, TypeError) as e:
            # AttributeError covers null or non-string values that reach .strip().
            print(f"Warning: skipping malformed HackerNews item: {e}")
            continue

    return articles, meta

What stays constant across all three

Steps 2 (loop), 3 (validate), and 5 (construct) are the same across NewsAPI, Guardian, and HackerNews; the one wrinkle is that HackerNews runs its validation check after timestamp resolution rather than before. Only Step 1 (extract items) and Step 4 (map fields) change per API. That's the leverage of the canonical model: the constant part is the bulk of every normalizer; the variable part is just the API-specific mapping logic.

  • NewsAPI: simple fields, nested source object
  • Guardian: "web" prefixes, extra fields object, double nesting
  • HackerNews: URL construction, dual-format timestamps resolved via try_fields, sparse metadata

Same pattern, different quirks. When you add a fourth API, you'll follow this exact five-step pattern and the bulk of the existing code stays unchanged.
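
As a rough illustration of what a fourth normalizer could look like, the sketch below follows the five steps for an invented source. Everything about "ExampleFeed" is hypothetical (the entries, count, headline, link, posted, feed, and summary keys), and ArticleStub stands in for models.Article so the block runs on its own. Step 1 is inlined rather than calling the Chapter 10 helper for the same reason.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class ArticleStub:
    """Stand-in for models.Article so this sketch is self-contained."""
    title: str
    url: str
    published_at: str
    source_name: str
    description: Optional[str] = None


def normalize_examplefeed(response: Any) -> Tuple[List[ArticleStub], Dict[str, Any]]:
    """Normalize a made-up payload shaped like {"count": N, "entries": [...]}."""
    # Step 1: extract items + meta (inlined; the real code would use the helper)
    if not isinstance(response, dict):
        return [], {}
    entries = response.get("entries")
    items = entries if isinstance(entries, list) else []
    count = response.get("count")
    meta = {"total": count} if isinstance(count, int) else {}

    articles: List[ArticleStub] = []
    # Step 2: loop defensively
    for item in items:
        try:
            # Step 3: validate required fields
            title = (item.get("headline") or "").strip()
            url = (item.get("link") or "").strip()
            published_at = (item.get("posted") or "").strip()
            if not title or not url or not published_at:
                continue

            # Step 4: map the remaining fields
            source_name = item.get("feed") or "ExampleFeed"
            description = item.get("summary")
            if description:
                description = description.strip()

            # Step 5: construct the canonical object
            articles.append(
                ArticleStub(title, url, published_at, source_name, description)
            )
        except (AttributeError, ValueError, KeyError, TypeError) as e:
            # AttributeError covers non-dict items and non-string field values.
            print(f"Warning: skipping malformed ExampleFeed item: {e}")
            continue

    return articles, meta
```

Only the key names in Steps 1, 3, and 4 are specific to the invented source; the loop, the validation check, and the construction call carry over unchanged.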

Pressure-test all three together

Verify all three normalizers produce articles in the same canonical format. Save this at the project root as test_all_normalizers.py:

test_all_normalizers.py
import os
import requests
from dotenv import load_dotenv
from models import Article
from normalize_newsapi import normalize_newsapi
from normalize_guardian import normalize_guardian
from normalize_hackernews import normalize_hackernews

load_dotenv()

NEWSAPI_KEY = os.environ.get("NEWSAPI_KEY")
GUARDIAN_KEY = os.environ.get("GUARDIAN_KEY")


def _summary(name, articles, meta):
    total = meta.get("total")
    suffix = f" of {total} matching" if total else ""
    print(f"{name}: {len(articles)}{suffix} articles normalized")


def test_all_normalizers():
    """Test all three normalizers produce canonical output."""
    print("=== TESTING ALL NORMALIZERS ===\n")

    newsapi_response = requests.get(
        "https://newsapi.org/v2/everything",
        params={"q": "python", "pageSize": 2, "apiKey": NEWSAPI_KEY},
        timeout=10,
    ).json()
    newsapi_articles, newsapi_meta = normalize_newsapi(newsapi_response)
    _summary("NewsAPI", newsapi_articles, newsapi_meta)

    guardian_response = requests.get(
        "https://content.guardianapis.com/search",
        params={"q": "python", "page-size": 2, "show-fields": "all",
                "api-key": GUARDIAN_KEY},
        timeout=10,
    ).json()
    guardian_articles, guardian_meta = normalize_guardian(guardian_response)
    _summary("Guardian", guardian_articles, guardian_meta)

    hn_response = requests.get(
        "https://hn.algolia.com/api/v1/search?query=python&hitsPerPage=2",
        timeout=10,
    ).json()
    hn_articles, hn_meta = normalize_hackernews(hn_response)
    _summary("HackerNews", hn_articles, hn_meta)

    all_articles = newsapi_articles + guardian_articles + hn_articles
    print(f"\nTotal: {len(all_articles)} articles")
    print(f"All are Article instances: {all(isinstance(a, Article) for a in all_articles)}")
    print(
        "All have required fields: "
        f"{all(a.title and a.url and a.published_at and a.source_name for a in all_articles)}"
    )


if __name__ == "__main__":
    test_all_normalizers()

Run it from the project root:

Terminal
$ python test_all_normalizers.py
=== TESTING ALL NORMALIZERS ===

NewsAPI: 2 of 12847 matching articles normalized
Guardian: 2 of 8453 matching articles normalized
HackerNews: 2 of 583201 matching articles normalized

Total: 6 articles
All are Article instances: True
All have required fields: True

All three normalizers transform their respective API formats into identical canonical structures, and the meta dict surfaces each API's total-available count alongside the items. Every article has the required fields, uses the same Article class, and is ready for the aggregation pipeline on the next page.
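
The required-fields check in the test script only confirms that published_at is non-empty. If you want a stricter check, one option is to confirm that every timestamp also parses as ISO 8601. The helper name parses_as_iso8601 is hypothetical, and the sketch assumes timestamps end in Z or a numeric UTC offset.

```python
from datetime import datetime
from typing import Any


def parses_as_iso8601(value: Any) -> bool:
    """Return True if value is a string that datetime can parse as ISO 8601."""
    try:
        # Older Python versions reject a trailing 'Z', so swap in an explicit offset.
        datetime.fromisoformat(value.replace("Z", "+00:00"))
        return True
    except (AttributeError, ValueError):
        # AttributeError: value is not a string; ValueError: not a valid timestamp.
        return False


samples = ["2025-01-15T14:30:00Z", "yesterday", None]
print([parses_as_iso8601(s) for s in samples])  # [True, False, False]
```

In test_all_normalizers.py this could be applied as all(parses_as_iso8601(a.published_at) for a in all_articles), alongside the existing checks.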

Side-by-side reference

Aspect | NewsAPI | Guardian | HackerNews
Container | articles | response.results | hits
Title field | title | webTitle | title
Timestamp | publishedAt (ISO) | webPublicationDate (ISO) | created_at_i / created_at (via try_fields)
Source | source.name (nested) | sectionName | Constant: "HackerNews"
Special handling | Nested source object | Optional fields object | URL construction, dual-format timestamps
Error handling | Identical across all three: try/except per item, continue on failure

Despite the different field mappings, all three normalizers use identical defensive techniques: early validation of required fields, continue on item failures, string cleaning, safe extraction of optional fields, and consistent None defaults. That's not accidental; it's Chapter 10's toolkit applied systematically.

With three working normalizers, the next page builds the aggregation pipeline that fetches from every source independently, contains failures per source, deduplicates, and sorts the unified results.