4. Source-specific normalizers
A normalizer takes a raw API response and returns a list of Article objects in canonical format, paired with a meta dict carrying per-response bookkeeping like total-available counts. You write three of them: one complete walkthrough for NewsAPI, then deltas for Guardian and HackerNews so you see what changes per source rather than three near-identical files.
The five steps every normalizer runs
Each normalizer handles one API's quirks (field name mappings, nested structures, timestamp conversions, optional field extraction) using Chapter 10's defensive helpers. The shape stays the same; only the field access changes.
1. Extract items: find the article array using extract_items_and_meta()
2. Loop defensively: process each item with try/except; skip bad items rather than crashing
3. Validate required fields: get title, URL, and timestamp; continue on any missing or empty value (source_name gets a default in step 4, so it's never blank by construction)
4. Map fields: translate source field names to canonical names, including any format conversions (timestamps, source-name lookups, URL fallbacks)
5. Construct Article: create the canonical object with the extracted values
One thing worth flagging up front: step 1's extract_items_and_meta normalizes each API's total-results key (totalResults / response.total / nbHits, per the comparison table on the previous page) into a canonical meta["total"] alongside the items. Every normalizer below reads that uniform key without branching on source, and the aggregator on the next page uses it for the "X of Y matching" display.
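To make the total-key normalization concrete, here is a minimal sketch of what such a helper could look like. It is not the Chapter 10 implementation: the name extract_items_and_meta_sketch, the DEFAULT_CONTAINERS list, and the TOTAL_KEYS list are assumptions chosen for illustration, and the real helper may search differently.

```python
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical defaults; the real Chapter 10 helper may use other names.
DEFAULT_CONTAINERS = ["articles", "results", "hits"]
TOTAL_KEYS = ["totalResults", "total", "nbHits"]


def extract_items_and_meta_sketch(
    response: Any, container_hints: Optional[List[str]] = None
) -> Tuple[List[dict], Dict[str, Any]]:
    """Return (items, meta) with the API's total count under meta["total"]."""
    if not isinstance(response, dict):
        return [], {}

    # Take the first hinted key that actually holds a list.
    items: List[dict] = []
    for key in container_hints or DEFAULT_CONTAINERS:
        candidate = response.get(key)
        if isinstance(candidate, list):
            items = candidate
            break

    # Map whichever total key is present onto the canonical name.
    meta: Dict[str, Any] = {}
    for key in TOTAL_KEYS:
        value = response.get(key)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, int) and not isinstance(value, bool):
            meta["total"] = value
            break
    return items, meta


newsapi = {"totalResults": 12847, "articles": [{"title": "A"}]}
guardian_inner = {"total": 8453, "results": [{"webTitle": "B"}]}
hackernews = {"nbHits": 583201, "hits": [{"title": "C"}]}

for payload in (newsapi, guardian_inner, hackernews):
    items, meta = extract_items_and_meta_sketch(payload)
    print(len(items), meta["total"])
```

Whatever the real helper does internally, the point is the same: callers read meta["total"] and never need to know which API produced the response.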
Read the NewsAPI walkthrough carefully; skim the Guardian and HackerNews sections for what each one changes.
NewsAPI normalizer (complete)
NewsAPI has the most straightforward structure of the three, which makes it the right place to demonstrate the complete pattern. Save this at the project root as normalize_newsapi.py:
from typing import List, Tuple

from api_helpers import safe_get, extract_items_and_meta
from models import Article


def normalize_newsapi(response: dict) -> Tuple[List[Article], dict]:
    """
    Transform NewsAPI response into canonical Article format.

    Returns a tuple of (articles, meta). The meta dict carries per-response
    bookkeeping; the aggregator uses meta["total"] for the "X of Y matching"
    display, while the dataclass holds only per-article fields.

    NewsAPI structure:
    - articles in 'articles' array at root
    - source nested as object with 'name' field
    - timestamp in 'publishedAt' (ISO 8601)
    - optional: author, description, urlToImage, content
    """
    # Step 1: extract items + meta using Chapter 10 helper
    items, meta = extract_items_and_meta(response)
    articles = []

    # Step 2: loop defensively
    for item in items:
        try:
            # Step 3: extract and validate required fields; fail fast on missing data
            title = item.get("title", "").strip()
            url = item.get("url", "").strip()
            published_at = item.get("publishedAt", "").strip()
            if not title or not url or not published_at:
                continue

            # Step 4: map remaining fields (source name, optionals, format conversions)
            source_name = safe_get(item, "source.name", "Unknown Source")

            description = item.get("description")
            if description:
                description = description.strip()

            author = item.get("author")
            if author:
                author = author.strip()

            image_url = item.get("urlToImage")
            if image_url:
                image_url = image_url.strip()

            content = item.get("content")
            content_preview = None
            if content:
                content_preview = content[:200] + "..." if len(content) > 200 else content

            # Step 5: construct the canonical Article
            article = Article(
                title=title,
                url=url,
                published_at=published_at,
                source_name=source_name,
                description=description,
                author=author,
                image_url=image_url,
                content_preview=content_preview,
            )
            articles.append(article)
        except (AttributeError, ValueError, KeyError, TypeError) as e:
            # Log and skip malformed items; don't crash entire batch.
            # AttributeError covers a required field that is present but null
            # (None has no .strip()) and items that are not dicts.
            print(f"Warning: skipping malformed NewsAPI item: {e}")
            continue

    return articles, meta
Five defensive techniques to notice:
- Early validation: check required fields before processing optionals, so failures bail per item
- Continue on failure: skip individual bad items rather than failing the entire batch
- String cleaning: .strip() removes whitespace that APIs sometimes include
- Safe nested access: safe_get(item, "source.name") handles missing source object
- Content truncation: the preview field caps long text so display layouts stay predictable
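The safe nested access in that list is easier to trust once you have seen how a dotted-path lookup can be written. The following is a minimal sketch only: safe_get_sketch is a hypothetical stand-in, and Chapter 10's safe_get may differ in details such as how it treats explicit nulls.

```python
from typing import Any


def safe_get_sketch(data: Any, path: str, default: Any = None) -> Any:
    """Walk a dotted path through nested dicts; return default on any miss."""
    current = data
    for key in path.split("."):
        # Stop as soon as a level is missing or is not a dict.
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    # Assumption: an explicit null is treated like a missing key.
    return default if current is None else current


item = {"source": {"id": None, "name": "TechCrunch"}}
print(safe_get_sketch(item, "source.name", "Unknown Source"))              # TechCrunch
print(safe_get_sketch({"source": None}, "source.name", "Unknown Source"))  # Unknown Source
print(safe_get_sketch({}, "fields.trailText"))                             # None
```

The second call is the case that matters for NewsAPI: a source key that exists but holds null would break a chained item["source"]["name"] lookup, while the path walk simply returns the default.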
Test it against real data. Save this at the project root as test_newsapi_normalizer.py:
import os

import requests
from dotenv import load_dotenv

from normalize_newsapi import normalize_newsapi

load_dotenv()
NEWSAPI_KEY = os.environ.get("NEWSAPI_KEY")

response = requests.get(
    "https://newsapi.org/v2/everything",
    params={"q": "python", "pageSize": 5, "apiKey": NEWSAPI_KEY},
    timeout=10,
).json()

articles, meta = normalize_newsapi(response)
total = meta.get("total")
suffix = f" of {total} matching" if total else ""
print(f"NewsAPI: normalized {len(articles)}{suffix} articles\n")

for i, article in enumerate(articles[:3], 1):
    print(f"{i}. {article.title}")
    print(f" Source: {article.source_name}")
    print(f" Published: {article.format_timestamp()}")
    if article.author:
        print(f" By: {article.author}")
    print()
Run it from the project root:
$ python test_newsapi_normalizer.py
NewsAPI: normalized 5 of 12847 matching articles
1. Python 3.13 Performance Improvements
Source: TechCrunch
Published: January 15, 2025 at 02:30 PM
By: Sarah Chen
2. Machine Learning with Python: 2025 Guide
Source: Medium
Published: January 14, 2025 at 11:45 AM
3. Django 5.0 Released
Source: Django Project
Published: January 13, 2025 at 09:00 AM
By: Django Team
Guardian normalizer: where it diverges
The Guardian normalizer follows the same five-step pattern. The loop (Step 2), the required-field check in Step 3, and construction (Step 5) match NewsAPI line for line; only the source label in the warning message differs. The Guardian-specific work sits in Step 1 (extract items through the response wrapper) and in field access: read the "web"-prefixed names for the required fields, drill into the optional fields sub-object, and use sectionName instead of a nested source.
Field extraction: double nesting and name mapping
Guardian wraps everything in a response object and uses "web" prefixes. Extract through the extra layer and rename to canonical fields:
from api_helpers import extract_items_and_meta, safe_get
# Guardian wraps everything in 'response'; bail to (articles, meta) shape on bad input.
if not isinstance(response, dict) or "response" not in response:
    return [], {}
response_obj = response["response"]
items, meta = extract_items_and_meta(response_obj, container_hints=["results"])
# Inside the loop: extract Guardian's field names
title = item.get("webTitle", "").strip() # not 'title'
url = item.get("webUrl", "").strip() # not 'url'
published_at = item.get("webPublicationDate", "").strip() # not 'publishedAt'
# Source: use section name as identifier
source_name = item.get("sectionName", "The Guardian")
Same concept, different labels. webTitle, webUrl, and webPublicationDate stand in for NewsAPI's simple names.
Validation: identical to NewsAPI
Validation logic stays unchanged. This is the invariant part of the pattern; every normalizer rejects items missing the same required canonical fields:
# Skip items missing required fields; identical pattern to NewsAPI
if not title or not url or not published_at:
    continue
Optional fields: drilling into the nested object
Guardian buries optional metadata in a nested fields object. Reach for safe_get to drill into it with a single call per field; the helper handles the missing-fields-object case for free, so the explicit guard the legacy code needed disappears:
# safe_get drills into the optional 'fields' sub-object in one call.
description = safe_get(item, "fields.trailText")
if description:
    description = description.strip()

author = safe_get(item, "fields.byline")
if author:
    author = author.strip()

image_url = safe_get(item, "fields.thumbnail")
if image_url:
    image_url = image_url.strip()

body = safe_get(item, "fields.bodyText")
content_preview = None
if body:
    content_preview = body[:200] + "..." if len(body) > 200 else body
The dotted path ("fields.trailText") is what makes safe_get earn its keep here: same tool that handled NewsAPI's source.name, same tool here, no per-API guard logic to maintain. The rest (strip whitespace, truncate content) matches NewsAPI exactly.
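Because the strip-and-truncate handling repeats for every optional field, one possible refactor is to pull it into two small helpers. This is a sketch, not part of the chapter's code: clean_optional, make_preview, and PREVIEW_LIMIT are hypothetical names. Note one behavioral difference from the inline version: clean_optional maps empty or whitespace-only strings to None rather than leaving an empty string.

```python
from typing import Optional

PREVIEW_LIMIT = 200  # same cap the normalizers on this page use


def clean_optional(value: object) -> Optional[str]:
    """Strip a string; return None for missing, empty, or non-string input."""
    if not isinstance(value, str):
        return None
    stripped = value.strip()
    return stripped or None


def make_preview(text: object, limit: int = PREVIEW_LIMIT) -> Optional[str]:
    """Cap text at `limit` characters, appending "..." only when truncated."""
    if not isinstance(text, str) or not text:
        return None
    return text[:limit] + "..." if len(text) > limit else text


print(clean_optional("  Jane Doe  "))   # Jane Doe
print(clean_optional(None))             # None
print(len(make_preview("x" * 250)))     # 203
print(make_preview("short body"))       # short body
```

With helpers like these, each optional field in a normalizer becomes a single line, for example description = clean_optional(safe_get(item, "fields.trailText")).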
Construction: identical dataclass call
Once fields are extracted and transformed, construction is byte-for-byte identical to NewsAPI:
article = Article(
    title=title,
    url=url,
    published_at=published_at,
    source_name=source_name,
    description=description,
    author=author,
    image_url=image_url,
    content_preview=content_preview,
)
articles.append(article)
Compare Guardian's phases to NewsAPI's: extraction changed (different field names, extra nesting), validation stayed identical, optional-field transformation changed (different locations), construction stayed identical. That's how professional normalizers earn their reuse: customize what varies, keep what doesn't.
Complete normalize_guardian.py
The fragments above teach the deltas; this is the assembled file you save at the project root. The next page's test_all_normalizers.py imports from it.
from typing import List, Tuple

from api_helpers import extract_items_and_meta, safe_get
from models import Article


def normalize_guardian(response: dict) -> Tuple[List[Article], dict]:
    """
    Transform Guardian API response into canonical Article format.

    Returns (articles, meta), same shape as normalize_newsapi. The meta dict
    surfaces Guardian's response.total alongside the items.

    Guardian structure:
    - articles in response.results (extra 'response' wrapper)
    - 'web' prefix on field names (webTitle, webUrl, webPublicationDate)
    - section name in 'sectionName'
    - optional metadata nested in 'fields' (trailText, byline, thumbnail, bodyText)
    """
    # Step 1: unwrap 'response', then extract items + meta
    if not isinstance(response, dict) or "response" not in response:
        return [], {}
    response_obj = response["response"]
    items, meta = extract_items_and_meta(response_obj, container_hints=["results"])
    articles = []

    # Step 2: loop defensively
    for item in items:
        try:
            # Step 3: extract and validate required fields
            title = item.get("webTitle", "").strip()
            url = item.get("webUrl", "").strip()
            published_at = item.get("webPublicationDate", "").strip()
            if not title or not url or not published_at:
                continue

            # Step 4: map remaining fields
            source_name = item.get("sectionName", "The Guardian")

            description = safe_get(item, "fields.trailText")
            if description:
                description = description.strip()

            author = safe_get(item, "fields.byline")
            if author:
                author = author.strip()

            image_url = safe_get(item, "fields.thumbnail")
            if image_url:
                image_url = image_url.strip()

            body = safe_get(item, "fields.bodyText")
            content_preview = None
            if body:
                content_preview = body[:200] + "..." if len(body) > 200 else body

            # Step 5: construct the canonical Article
            article = Article(
                title=title,
                url=url,
                published_at=published_at,
                source_name=source_name,
                description=description,
                author=author,
                image_url=image_url,
                content_preview=content_preview,
            )
            articles.append(article)
        except (AttributeError, ValueError, KeyError, TypeError) as e:
            # AttributeError covers null required fields and non-dict items.
            print(f"Warning: skipping malformed Guardian item: {e}")
            continue

    return articles, meta
HackerNews normalizer: where it diverges
HackerNews presents different challenges from both NewsAPI and Guardian: minimal metadata, dual-format timestamps (Unix integer or ISO string), and missing URLs for some post types. The five-step pattern stays the same; the field access logic changes.
Field extraction: URL fallback for Ask HN posts
HackerNews items sometimes lack URLs (Ask HN, Show HN posts). When missing, construct the URL from the object ID. This is the only API requiring URL construction:
from api_helpers import extract_items_and_meta, try_fields
# Standard pattern: HackerNews uses 'hits' as the container
items, meta = extract_items_and_meta(response, container_hints=["hits"])
# Inside the loop: HackerNews uses simple field names
title = item.get("title", "").strip()
url = item.get("url", "").strip()
# HackerNews items without URLs are Ask HN or Show HN posts
# Construct URL from objectID when missing
if not url:
    object_id = item.get("objectID", "")
    if object_id:
        url = f"https://news.ycombinator.com/item?id={object_id}"
# If still no URL after construction, validation below will skip the item
# Source is constant for HackerNews
source_name = "HackerNews"
Same reasoning as the timestamp conversion below: the canonical URL doesn't exist until the objectID fallback has had a chance to reconstruct it, so validation has to wait until after that step.
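The fallback is small enough to isolate and try against hand-written items. The sketch below assumes a helper name, resolve_hn_url, that does not appear in the chapter's code, and the object IDs are made up for illustration.

```python
from typing import Optional


def resolve_hn_url(item: dict) -> Optional[str]:
    """Return the item's own URL, or a discussion-page URL built from objectID."""
    url = (item.get("url") or "").strip()
    if url:
        return url
    object_id = str(item.get("objectID") or "").strip()
    if object_id:
        return f"https://news.ycombinator.com/item?id={object_id}"
    # Neither a URL nor an ID: the caller's validation step should skip this item.
    return None


link_post = {"title": "A link post", "url": "https://example.com/post", "objectID": "111"}
ask_post = {"title": "Ask HN: example question", "url": None, "objectID": "222"}
broken = {"title": "No url, no id"}

print(resolve_hn_url(link_post))  # https://example.com/post
print(resolve_hn_url(ask_post))   # https://news.ycombinator.com/item?id=222
print(resolve_hn_url(broken))     # None
```

The (item.get("url") or "") form also covers a url key that is present but null, which a plain item.get("url", "") would pass through as None.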
Optional fields and Unix timestamp conversion
The Algolia API exposes the same publication time twice: as created_at_i (Unix integer) and as created_at (ISO string). Reach for Chapter 10's try_fields to take whichever is present and let the type tell you which conversion path to take. The canonical model wants an ISO 8601 string, so the integer branch converts and the string branch passes through. The optional fields are sparse; most items have just a title, URL, and author username:
from datetime import datetime, timezone
# try_fields returns the first present value; the type tells us which branch.
timestamp = try_fields(item, ["created_at_i", "created_at"], default="")
if isinstance(timestamp, int):
    try:
        dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)
        published_at = dt.isoformat().replace("+00:00", "Z")
    except (OverflowError, OSError, ValueError):
        # Out-of-range Unix value; fall through to validation, which skips
        # the item because published_at is empty.
        published_at = ""
else:
    # Already an ISO string (or "" if neither field was present).
    published_at = timestamp

# HackerNews has minimal metadata; most fields are sparse
description = item.get("story_text")
if description:
    description = description.strip()

author = item.get("author")  # Username string
if author:
    author = author.strip()
image_url = None # No images in HackerNews
content_preview = None
Validation and construction: still identical
The validation check itself is byte-for-byte identical to NewsAPI and Guardian (same if not title or not url or not published_at: continue), but it has to run after the timestamp resolution above, not before. NewsAPI and Guardian's published_at arrives from the API ready to validate; HackerNews's doesn't exist until try_fields and the conversion produce it. The check is the same; only its position in source order shifts. Once fields are normalized, the final Article(...) call is identical across all three.
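To see why the check has to wait, it helps to run the timestamp resolution on its own. The sketch below is self-contained, so it includes try_fields_sketch, a hypothetical stand-in for Chapter 10's try_fields whose real behavior may differ; resolve_published_at is likewise an illustrative name.

```python
from datetime import datetime, timezone
from typing import Any, Iterable


def try_fields_sketch(item: dict, names: Iterable[str], default: Any = None) -> Any:
    """Hypothetical stand-in: first field that is present and not None."""
    for name in names:
        value = item.get(name)
        if value is not None:
            return value
    return default


def resolve_published_at(item: dict) -> str:
    """Return an ISO 8601 UTC string, or "" when no usable timestamp exists."""
    raw = try_fields_sketch(item, ["created_at_i", "created_at"], default="")
    if isinstance(raw, int) and not isinstance(raw, bool):
        try:
            dt = datetime.fromtimestamp(raw, tz=timezone.utc)
        except (OverflowError, OSError, ValueError):
            return ""  # out-of-range Unix value
        return dt.isoformat().replace("+00:00", "Z")
    return raw.strip() if isinstance(raw, str) else ""


print(resolve_published_at({"created_at_i": 1705329000}))            # 2024-01-15T14:30:00Z
print(resolve_published_at({"created_at": "2024-01-15T14:30:00Z"}))  # 2024-01-15T14:30:00Z
print(repr(resolve_published_at({})))                                # ''
```

Only after this function returns is there a published_at value to test, and an empty string from the last case is exactly what the shared required-field check rejects.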
Complete normalize_hackernews.py
The fragments above teach the deltas; this is the assembled file you save at the project root. The next page's test_all_normalizers.py imports from it.
from datetime import datetime, timezone
from typing import List, Tuple

from api_helpers import extract_items_and_meta, try_fields
from models import Article


def normalize_hackernews(response: dict) -> Tuple[List[Article], dict]:
    """
    Transform HackerNews (Algolia) response into canonical Article format.

    Returns (articles, meta), same shape as the other normalizers. The meta
    dict surfaces HackerNews's nbHits alongside the items.

    HackerNews structure:
    - items in 'hits' array
    - simple field names (title, url, author)
    - dual-format timestamp: created_at_i (Unix int) or created_at (ISO string)
    - URL missing for Ask HN / Show HN posts; constructed from objectID
    - sparse optional metadata
    """
    items, meta = extract_items_and_meta(response, container_hints=["hits"])
    articles = []

    for item in items:
        try:
            title = item.get("title", "").strip()
            url = item.get("url", "").strip()

            # Ask HN / Show HN posts have no URL; construct from objectID.
            if not url:
                object_id = item.get("objectID", "")
                if object_id:
                    url = f"https://news.ycombinator.com/item?id={object_id}"

            source_name = "HackerNews"

            # try_fields returns the first present value; the type tells us
            # which conversion path to take.
            timestamp = try_fields(item, ["created_at_i", "created_at"], default="")
            if isinstance(timestamp, int):
                try:
                    dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)
                    published_at = dt.isoformat().replace("+00:00", "Z")
                except (OverflowError, OSError, ValueError):
                    published_at = ""
            else:
                published_at = timestamp

            # Validation runs after timestamp resolution because published_at
            # doesn't exist until try_fields produces it.
            if not title or not url or not published_at:
                continue

            description = item.get("story_text")
            if description:
                description = description.strip()

            author = item.get("author")
            if author:
                author = author.strip()

            image_url = None
            content_preview = None

            article = Article(
                title=title,
                url=url,
                published_at=published_at,
                source_name=source_name,
                description=description,
                author=author,
                image_url=image_url,
                content_preview=content_preview,
            )
            articles.append(article)
        except (AttributeError, ValueError, KeyError, TypeError) as e:
            # AttributeError covers null required fields and non-dict items.
            print(f"Warning: skipping malformed HackerNews item: {e}")
            continue

    return articles, meta
What stays constant across all three
The loop (Step 2), the required-field check (Step 3), and construction (Step 5) are the same across NewsAPI, Guardian, and HackerNews; HackerNews only moves the check to after timestamp resolution. What changes per API is Step 1 (extract items) and the field access and mapping around Step 4. That's the leverage of the canonical model: the constant part is the bulk of every normalizer; the variable part is just the API-specific mapping logic.
- NewsAPI: simple fields, nested source object
- Guardian: "web" prefixes, extra fields object, double nesting
- HackerNews: URL construction, dual-format timestamps resolved via try_fields, sparse metadata
Same pattern, different quirks. When you add a fourth API, you'll follow this exact five-step pattern and the bulk of the existing code stays unchanged.
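As a rough illustration of that claim, here is the five-step pattern applied to a made-up fourth source. Everything source-specific is invented for the example: "ExampleFeed", its count and posts keys, and the headline, link, posted, outlet, and writer field names. ArticleStub stands in for models.Article, and Step 1 is inlined rather than calling the Chapter 10 helper, so that the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ArticleStub:
    """Stand-in for models.Article so the sketch runs on its own."""
    title: str
    url: str
    published_at: str
    source_name: str
    description: Optional[str] = None
    author: Optional[str] = None
    image_url: Optional[str] = None
    content_preview: Optional[str] = None


def normalize_examplefeed(response: dict) -> Tuple[List[ArticleStub], dict]:
    """Normalize a hypothetical payload shaped like {"count": int, "posts": [...]}."""
    # Step 1: extract items + meta (inlined; real code would use the shared helper)
    if not isinstance(response, dict):
        return [], {}
    posts = response.get("posts")
    items = posts if isinstance(posts, list) else []
    meta = {"total": response["count"]} if "count" in response else {}

    articles = []
    for item in items:  # Step 2: loop defensively
        try:
            # Step 3: required fields, read under this source's invented names
            title = (item.get("headline") or "").strip()
            url = (item.get("link") or "").strip()
            published_at = (item.get("posted") or "").strip()
            if not title or not url or not published_at:
                continue
            # Step 4: map the remaining fields
            source_name = item.get("outlet") or "ExampleFeed"
            author = (item.get("writer") or "").strip() or None
            # Step 5: construct the canonical object
            articles.append(ArticleStub(
                title=title, url=url, published_at=published_at,
                source_name=source_name, author=author,
            ))
        except (AttributeError, KeyError, TypeError, ValueError) as e:
            print(f"Warning: skipping malformed ExampleFeed item: {e}")
            continue
    return articles, meta


sample = {
    "count": 3,
    "posts": [
        {"headline": " Hello ", "link": "https://example.com/a",
         "posted": "2025-01-15T14:30:00Z"},
        {"headline": "No link", "posted": "2025-01-15T14:30:00Z"},  # skipped in Step 3
        "not-a-dict",  # skipped by the except handler
    ],
}
articles, meta = normalize_examplefeed(sample)
print(len(articles), meta)
```

Only the key names in Steps 1, 3, and 4 are specific to the invented source; the loop, the check, the constructor call, and the except handler carry over unchanged.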
Pressure-test all three together
Verify all three normalizers produce articles in the same canonical format. Save this at the project root as test_all_normalizers.py:
import os

import requests
from dotenv import load_dotenv

from models import Article
from normalize_newsapi import normalize_newsapi
from normalize_guardian import normalize_guardian
from normalize_hackernews import normalize_hackernews

load_dotenv()
NEWSAPI_KEY = os.environ.get("NEWSAPI_KEY")
GUARDIAN_KEY = os.environ.get("GUARDIAN_KEY")


def _summary(name, articles, meta):
    total = meta.get("total")
    suffix = f" of {total} matching" if total else ""
    print(f"{name}: {len(articles)}{suffix} articles normalized")


def test_all_normalizers():
    """Test all three normalizers produce canonical output."""
    print("=== TESTING ALL NORMALIZERS ===\n")

    newsapi_response = requests.get(
        "https://newsapi.org/v2/everything",
        params={"q": "python", "pageSize": 2, "apiKey": NEWSAPI_KEY},
        timeout=10,
    ).json()
    newsapi_articles, newsapi_meta = normalize_newsapi(newsapi_response)
    _summary("NewsAPI", newsapi_articles, newsapi_meta)

    guardian_response = requests.get(
        "https://content.guardianapis.com/search",
        params={"q": "python", "page-size": 2, "show-fields": "all",
                "api-key": GUARDIAN_KEY},
        timeout=10,
    ).json()
    guardian_articles, guardian_meta = normalize_guardian(guardian_response)
    _summary("Guardian", guardian_articles, guardian_meta)

    hn_response = requests.get(
        "https://hn.algolia.com/api/v1/search?query=python&hitsPerPage=2",
        timeout=10,
    ).json()
    hn_articles, hn_meta = normalize_hackernews(hn_response)
    _summary("HackerNews", hn_articles, hn_meta)

    all_articles = newsapi_articles + guardian_articles + hn_articles
    print(f"\nTotal: {len(all_articles)} articles")
    print(f"All are Article instances: {all(isinstance(a, Article) for a in all_articles)}")
    print(
        "All have required fields: "
        f"{all(a.title and a.url and a.published_at and a.source_name for a in all_articles)}"
    )


if __name__ == "__main__":
    test_all_normalizers()
Run it from the project root:
$ python test_all_normalizers.py
=== TESTING ALL NORMALIZERS ===
NewsAPI: 2 of 12847 matching articles normalized
Guardian: 2 of 8453 matching articles normalized
HackerNews: 2 of 583201 matching articles normalized
Total: 6 articles
All are Article instances: True
All have required fields: True
All three normalizers transform their respective API formats into identical canonical structures, and the meta dict surfaces each API's total-available count alongside the items. Every article has the required fields, uses the same Article class, and is ready for the aggregation pipeline on the next page.
Side-by-side reference
| Aspect | NewsAPI | Guardian | HackerNews |
|---|---|---|---|
| Container | articles | response.results | hits |
| Title field | title | webTitle | title |
| Timestamp | publishedAt (ISO) | webPublicationDate (ISO) | created_at_i / created_at (via try_fields) |
| Source | source.name (nested) | sectionName | Constant: "HackerNews" |
| Special handling | Nested source object | Optional fields object | URL construction, dual-format timestamps |
| Error handling | Identical across all three: try/except per item, continue on failure | | |
Despite the different field mappings, all three normalizers use identical defensive techniques: early validation of required fields, continue on item failures, string cleaning, safe extraction of optional fields, and consistent None defaults. That's not accidental; it's Chapter 10's toolkit applied systematically.
With three working normalizers, the next page builds the aggregation pipeline that fetches from every source independently, contains failures per source, deduplicates, and sorts the unified results.