3. Designing the canonical model
Three response shapes go in, one Article shape comes out: the canonical model is the internal language every downstream feature speaks. This page picks the required fields, the optional fields, and the format standards, then writes the dataclass that enforces them.
From three formats to one standard
The previous page revealed three completely different response structures. Your aggregator needs to work with all three, but you don't want the rest of your application (search logic, display formatting, storage) dealing with this variation. The solution is a canonical model: a single internal representation that all three APIs get normalized into.
Think of it like translation. "Hello," "bonjour," and "hola" all mean the same thing. Your canonical model is the internal language your application speaks. Once articles are normalized, every part of your codebase sees consistent structure regardless of which API provided the data.
Why canonical models matter
- Single interface: display code works with one structure, not three
- API changes contained: when Guardian changes field names, only the normalizer updates
- Testing simplified: test against one format instead of multiple variations
- Extensions easier: adding a fourth API means writing one normalizer, not updating all consumers
Identify the required fields
Start by identifying what all three APIs provide. These become your required fields: the core data that every article must have to be useful. Looking at the previous page's comparison table:
| Field | Why required | Present in |
|---|---|---|
| Title | The headline; without this, there's nothing to show | Expected from all three APIs; required by our normalizers |
| URL | Link to original content; the aggregator's core value | Expected from all three APIs; required by our normalizers |
| Timestamp | Publication time; essential for sorting and recency | All three APIs (formats differ) |
| Source | Where it came from; users need to assess credibility | All three (representation varies) |
These four fields are the universal shape your application needs. Every normalized article must have them; if any are missing or empty, the article gets rejected during normalization.
One critical decision about the URL field: every Article must have a non-empty URL string. Beyond letting users access the original content, URLs serve as the deduplication key. When multiple APIs return the same story (Python 3.13 release coverage appearing in NewsAPI, Guardian, and HackerNews), the URL is the best cheap identifier regardless of varying headlines. The aggregation page will show the URL-based deduplication implementation, and the limits page will show why full URL validation still belongs in the validation layer.
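To make the URL-as-key idea concrete before the aggregation page covers it properly, here is a minimal sketch. The function name `dedupe_by_url`, the dict-shaped records, and the example URLs are hypothetical stand-ins; the comparison is an exact string match after trimming whitespace, which is the simplest possible policy and not necessarily the one the aggregator ends up using.

```python
def dedupe_by_url(articles):
    """Keep the first article seen for each URL (exact match after trimming)."""
    seen = set()
    unique = []
    for article in articles:
        key = article["url"].strip()
        if key in seen:
            continue  # same story already collected from another source
        seen.add(key)
        unique.append(article)
    return unique


# Illustrative records only: same URL, different headlines and sources
stories = [
    {"title": "Python 3.13 Released", "url": "https://example.com/python-313", "source": "NewsAPI"},
    {"title": "Python 3.13 is out", "url": "https://example.com/python-313", "source": "HackerNews"},
    {"title": "Unrelated story", "url": "https://example.com/other", "source": "Guardian"},
]
unique = dedupe_by_url(stories)
print(len(unique))  # 2
```

The first occurrence wins here, so source order decides which headline survives.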
Handle the optional fields
Next, identify fields that enhance the experience but aren't universal. These become optional properties with defensive defaults. The key is balancing usefulness against availability:
| Field | Value when present | Availability |
|---|---|---|
| Description | Context without clicking through | NewsAPI: usually present; Guardian: in `fields.trailText`; HackerNews: rarely present |
| Author | Credibility and attribution | NewsAPI: sometimes; Guardian: in `fields.byline`; HackerNews: username |
| Image | Visual presentation | NewsAPI: `urlToImage`; Guardian: `fields.thumbnail`; HackerNews: none |
| Content preview | Helps users decide to click | NewsAPI: sometimes; Guardian: in `fields.bodyText`; HackerNews: usually absent |
Use None as the default for absent optional fields. This makes the distinction clear: None means "field doesn't exist," while an empty string means "field exists but is empty." Display code can check if article.author without exception handling.
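One caveat with the truthiness check: `if article.author` is false for both `None` and an empty string, which is usually what display code wants. When the difference matters, compare against `None` explicitly. A small sketch, where the `byline` helper and its messages are hypothetical:

```python
from typing import Optional


def byline(author: Optional[str]) -> str:
    """Render an author line, treating None and blank strings differently."""
    if author is None:
        return "Author not provided by source"
    if not author.strip():
        return "Author field was empty"
    return f"By {author}"


print(byline(None))        # Author not provided by source
print(byline(""))          # Author field was empty
print(byline("Jane Doe"))  # By Jane Doe
```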
One field from the comparison table doesn't earn a place on Article: total results (NewsAPI's totalResults, Guardian's response.total, HackerNews's nbHits). It's a per-response value, not a per-article one, so it lives on the aggregator's stats dict (covered on the aggregation page) rather than the dataclass. The normalizers return it alongside the article list as part of a meta dict.
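A rough sketch of that split, assuming a NewsAPI-style payload with top-level `totalResults` and `articles` keys. The function name `split_newsapi_payload` and the meta key names `source` and `total_results` are illustrative assumptions, and the sketch skips the actual normalization step, which the next page covers.

```python
from typing import Any, Dict, List, Tuple


def split_newsapi_payload(
    payload: Dict[str, Any],
) -> Tuple[List[Dict[str, Any]], Dict[str, Any]]:
    """Separate per-article items from per-response metadata."""
    raw_articles = payload.get("articles") or []
    meta = {
        "source": "newsapi",                              # assumed key name
        "total_results": payload.get("totalResults", 0),  # per-response count
    }
    return raw_articles, meta


# Illustrative payload, not a captured API response
sample = {
    "status": "ok",
    "totalResults": 1287,
    "articles": [{"title": "Example", "url": "https://example.com/a"}],
}
raw_articles, meta = split_newsapi_payload(sample)
print(len(raw_articles), meta["total_results"])  # 1 1287
```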
Standardise the formats
Some fields appear in all APIs but with different formats. Your canonical model picks one standard format and lets normalizers handle the conversion:
- Timestamps to ISO 8601. HackerNews exposes the same publication time twice: as a Unix integer (`created_at_i: 1736950951`) and as an ISO string (`created_at: 2025-01-15T14:30:00Z`); NewsAPI and Guardian use ISO strings only. Standard: ISO 8601 strings for consistency and human readability. The HackerNews normalizer reaches for `try_fields` to take whichever is present and converts the integer form only when that's what the source returned. You'll see this implemented on the next page.
- Source names to simple strings. NewsAPI nests source in an object (`source.name`), Guardian uses `sectionName`, HackerNews is implicit. Standard: single string field. Normalizers extract from wherever it lives in each API.
- URLs to full URLs only. All three provide full URLs, but HackerNews sometimes has items without URLs (Ask HN posts). Standard: full URLs required. Normalizers construct HN links from object IDs when needed.
These standardisation decisions cost conversion work in normalizers but gain predictable access patterns everywhere else in your application.
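Two of those conversions are small enough to sketch now. This is an illustration under assumptions, not the normalizer from the next page: the helper names `unix_to_iso` and `resolve_hn_url` are hypothetical, and the sketch assumes the HackerNews item ID arrives in a field named `objectID`.

```python
from datetime import datetime, timezone


def unix_to_iso(timestamp: int) -> str:
    """Convert a Unix integer (seconds, UTC) to an ISO 8601 string ending in Z."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


def resolve_hn_url(hit: dict) -> str:
    """Return the item's own URL, or fall back to its HackerNews discussion page."""
    url = (hit.get("url") or "").strip()
    if url:
        return url
    # Assumption: the item ID lives under "objectID"
    return f"https://news.ycombinator.com/item?id={hit['objectID']}"


print(unix_to_iso(1736951400))
# 2025-01-15T14:30:00Z
print(resolve_hn_url({"objectID": "123", "url": None}))
# https://news.ycombinator.com/item?id=123
```

Passing `tz=timezone.utc` matters: without it, `fromtimestamp` uses the machine's local timezone and the trailing `Z` would be wrong.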
Write the Article dataclass
Now formalise the canonical model using Python's dataclass. This gives you type hints, default values, a `__post_init__` hook for validation, and a clean structure that documents your design. Save this at the project root as models.py:
```python
from dataclasses import dataclass
from typing import Optional
from datetime import datetime
from urllib.parse import urlparse


@dataclass
class Article:
    """
    Canonical article format for all news sources.

    Required fields are enforced for all normalized articles.
    Optional fields use None as default when source doesn't provide them.
    """

    # Required fields: every article must have these
    title: str
    url: str
    published_at: str  # ISO 8601 format: "2025-01-15T14:30:00Z"
    source_name: str

    # Optional fields: enhance experience but not universal
    description: Optional[str] = None
    author: Optional[str] = None
    image_url: Optional[str] = None
    content_preview: Optional[str] = None

    def __post_init__(self):
        """Validate required fields after initialization."""
        if not self.title or not self.title.strip():
            raise ValueError("Article title cannot be empty")
        if not self.url or not self.url.strip():
            raise ValueError("Article URL cannot be empty")
        if not self.published_at or not self.published_at.strip():
            raise ValueError("Article published_at cannot be empty")
        if not self.source_name or not self.source_name.strip():
            raise ValueError("Article source_name cannot be empty")

    def format_timestamp(self) -> str:
        """Convert ISO timestamp to human-readable format."""
        try:
            dt = datetime.fromisoformat(self.published_at.replace("Z", "+00:00"))
            return dt.strftime("%B %d, %Y at %I:%M %p")
        except (ValueError, AttributeError):
            return self.published_at

    def get_domain(self) -> str:
        """Extract domain from URL for display."""
        try:
            parsed = urlparse(self.url)
            return parsed.netloc or "unknown"
        except (ValueError, AttributeError):
            return "unknown"
```
Four design choices worth highlighting:
- Validation in `__post_init__`: catches empty required fields immediately, preventing invalid articles from entering the system
- Helper methods: `format_timestamp()` and `get_domain()` provide display utilities without cluttering consumer code
- Type hints: `Optional[str]` makes optionality explicit for tools and developers
- Standardised timestamp: always ISO 8601 string, even though HackerNews exposes both ISO and Unix-integer forms (the normalizer picks via `try_fields`)
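The first choice pays off when a whole batch is processed: a constructor that raises `ValueError` lets the caller skip one bad item without abandoning the rest. A minimal sketch of that pattern, using a trimmed stand-in class (`MiniArticle`) and a hypothetical `build_articles` helper so the example runs on its own:

```python
from dataclasses import dataclass


@dataclass
class MiniArticle:
    """Trimmed stand-in for Article: two required fields, same style of check."""
    title: str
    url: str

    def __post_init__(self):
        if not self.title or not self.title.strip():
            raise ValueError("Article title cannot be empty")
        if not self.url or not self.url.strip():
            raise ValueError("Article URL cannot be empty")


def build_articles(raw_items):
    """Construct articles, collecting rejects instead of crashing the batch."""
    accepted, rejected = [], []
    for item in raw_items:
        try:
            accepted.append(
                MiniArticle(title=item.get("title", ""), url=item.get("url", ""))
            )
        except ValueError as exc:
            rejected.append((item, str(exc)))
    return accepted, rejected


accepted, rejected = build_articles([
    {"title": "Valid story", "url": "https://example.com/a"},
    {"title": "   ", "url": "https://example.com/b"},  # whitespace-only title
    {"title": "No link"},                              # missing URL
])
print(len(accepted), len(rejected))  # 1 2
```

Keeping the rejected items alongside their error messages makes it easy to log which source is sending incomplete data.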
Pressure-test the model
Before building normalizers, verify the model handles edge cases. Save this at the project root as test_models.py:
```python
from models import Article

# Test 1: Complete article with all fields
article1 = Article(
    title="Python 3.13 Released",
    url="https://www.python.org/downloads/release/python-3130/",
    published_at="2025-01-15T14:30:00Z",
    source_name="Python.org",
    description="Latest Python release includes performance improvements",
    author="Python Core Team",
    image_url="https://www.python.org/static/img/python-logo.png",
    content_preview="Python 3.13.0 is now available...",
)
print(f"Complete article: {article1.title}")
print(f" Formatted: {article1.format_timestamp()}")
print(f" Domain: {article1.get_domain()}")

# Test 2: Minimal article (only required fields)
article2 = Article(
    title="Breaking: Tech News",
    url="https://example.com/article",
    published_at="2025-01-15T10:00:00Z",
    source_name="Example News",
)
print(f"\nMinimal article: {article2.title}")
print(f" Description: {article2.description}")  # None
print(f" Author: {article2.author}")  # None

# Test 3: Invalid article (empty title)
try:
    article3 = Article(
        title="",
        url="https://example.com/article",
        published_at="2025-01-15T10:00:00Z",
        source_name="Example",
    )
    print("\nShould have raised ValueError")
except ValueError as e:
    print(f"\nValidation caught error: {e}")

# Test 4: Empty published_at (validated by __post_init__ alongside title/url/source_name)
try:
    article4 = Article(
        title="Valid title",
        url="https://example.com/article",
        published_at="",
        source_name="Example",
    )
    print("Should have raised ValueError")
except ValueError as e:
    print(f"Validation caught error: {e}")
```
Run it from the project root:
```
$ python test_models.py
Complete article: Python 3.13 Released
 Formatted: January 15, 2025 at 02:30 PM
 Domain: www.python.org

Minimal article: Breaking: Tech News
 Description: None
 Author: None

Validation caught error: Article title cannot be empty
Validation caught error: Article published_at cannot be empty
```
The model accepts valid articles with any combination of optional fields, validates required fields at construction, and provides helper methods for display. That's the foundation for the next page's normalizers.
When not to normalize
Canonical models and normalizers add complexity (abstraction layers, conversion logic, maintenance overhead). Before building them, consider whether simpler approaches suffice for your use case.
- Single API integration. If you're only using one API and don't anticipate adding others, work directly with its structure. The Guardian API's `webTitle` and `webPublicationDate` fields are fine if that's your only source. Normalization's value comes from unifying multiple formats; without that need, it's just extra code.
- API structure matches your needs. Sometimes an API's response format is exactly what you need for display or storage. If NewsAPI's article structure works perfectly for your application, don't normalize it into something else just for the sake of abstraction.
- Performance-critical applications. Normalization adds processing time (extracting fields, converting formats, constructing objects). For high-throughput systems processing thousands of articles per second, this overhead matters. Profile first, but if normalization becomes a bottleneck, consider working with raw responses and accepting some code duplication.
- Prototype or proof-of-concept stage. Early in development you're still discovering requirements, so building a canonical model before you understand which fields matter or how the APIs will be used is premature abstraction. Start with direct API access, identify patterns as they emerge, then normalize when the abstraction provides clear value.
- API-specific features required. If your application needs API-specific features that don't map to other sources (HackerNews vote counts, Guardian section hierarchies, NewsAPI's per-source category and country metadata), normalizing loses information. Either work directly with each API's full structure or design a canonical model that accommodates all unique fields (which often becomes unwieldy).
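The "profile first" advice in the performance item above can be tried with the standard library's `timeit`. The sketch below compares reading fields straight from a raw dict against constructing a small dataclass first. `SlimArticle`, `use_raw`, and `use_normalized` are hypothetical names, and the numbers depend entirely on your machine and on how much conversion a real normalizer does, so treat the result as a rough signal only.

```python
import timeit
from dataclasses import dataclass


@dataclass
class SlimArticle:
    """Hypothetical two-field stand-in for the canonical model."""
    title: str
    url: str


raw = {"title": "Example", "url": "https://example.com/a"}


def use_raw():
    """Read fields directly from the API-shaped dict."""
    return raw["title"], raw["url"]


def use_normalized():
    """Build the canonical object first, then read from it."""
    article = SlimArticle(title=raw["title"], url=raw["url"])
    return article.title, article.url


runs = 100_000
raw_seconds = timeit.timeit(use_raw, number=runs)
normalized_seconds = timeit.timeit(use_normalized, number=runs)
print(f"raw dict access: {raw_seconds:.4f}s for {runs} runs")
print(f"with dataclass:  {normalized_seconds:.4f}s for {runs} runs")
```

If the gap is small relative to network latency for the API calls themselves, normalization is unlikely to be the bottleneck.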
This chapter demonstrates normalization because it's integrating three different news APIs: the exact scenario where canonical models earn their keep. The remaining pages assume you've made that call and focus on doing it well.
With the canonical model defined, tested, and its appropriate use cases understood, the next page builds the normalizers that transform each API's response into this consistent format. You'll see one complete implementation that demonstrates the pattern, then the two adaptations needed for the other APIs.