3. Designing the canonical model

Three response shapes go in, one Article shape comes out: the canonical model is the internal language every downstream feature speaks. This page picks the required fields, the optional fields, and the format standards, then writes the dataclass that enforces them.

From three formats to one standard

The previous page revealed three completely different response structures. Your aggregator needs to work with all three, but you don't want the rest of your application (search logic, display formatting, storage) dealing with this variation. The solution is a canonical model: a single internal representation that all three APIs get normalized into.

Think of it like translation. "Hello," "bonjour," and "hola" all mean the same thing. Your canonical model is the internal language your application speaks. Once articles are normalized, every part of your codebase sees consistent structure regardless of which API provided the data.

Why canonical models matter

  • Single interface: display code works with one structure, not three
  • API changes contained: when Guardian changes field names, only the normalizer updates
  • Testing simplified: test against one format instead of multiple variations
  • Extensions easier: adding a fourth API means writing one normalizer, not updating all consumers

Identify the required fields

Start by identifying what all three APIs provide. These become your required fields: the core data that every article must have to be useful. Looking at the previous page's comparison table:

Field | Why required | Present in
--- | --- | ---
Title | The headline; without this, there's nothing to show | Expected from all three APIs; required by our normalizers
URL | Link to original content; the aggregator's core value | Expected from all three APIs; required by our normalizers
Timestamp | Publication time; essential for sorting and recency | All three APIs (formats differ)
Source | Where it came from; users need to assess credibility | All three APIs (representation varies)

These four fields are the universal shape your application needs. Every normalized article must have them; if any are missing or empty, the article gets rejected during normalization.

One critical decision about the URL field: every Article must have a non-empty URL string. Beyond letting users access the original content, URLs serve as the deduplication key. When multiple APIs return the same story (Python 3.13 release coverage appearing in NewsAPI, Guardian, and HackerNews), the URL is the best cheap identifier regardless of varying headlines. The aggregation page will show the URL-based deduplication implementation, and the limits page will show why full URL validation still belongs in the validation layer.
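As a rough preview of why the URL works as a deduplication key, the sketch below keeps the first article seen for each URL. It operates on plain dicts with a url key so it runs on its own. The dedupe_by_url name and the light cleanup (trimming whitespace and a trailing slash) are assumptions for illustration, not the implementation the aggregation page uses.

```python
def dedupe_by_url(articles):
    """Keep the first article seen for each URL (hypothetical helper)."""
    seen = set()
    unique = []
    for article in articles:
        # Assumption: trimming whitespace and a trailing slash is enough
        # to treat two URLs as the same story.
        key = article["url"].strip().rstrip("/")
        if key in seen:
            continue
        seen.add(key)
        unique.append(article)
    return unique


stories = [
    {"title": "Python 3.13 Released", "url": "https://example.com/python-313", "source": "NewsAPI"},
    {"title": "Python 3.13 is out", "url": "https://example.com/python-313/", "source": "HackerNews"},
    {"title": "Unrelated story", "url": "https://example.com/other", "source": "Guardian"},
]
unique_stories = dedupe_by_url(stories)
print(len(unique_stories))  # 2
```

The two Python 3.13 entries carry different headlines but collapse to one because their URLs match after cleanup. A title-based key would have kept both.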

Handle the optional fields

Next, identify fields that enhance the experience but aren't universal. These become optional properties with defensive defaults. The key is balancing usefulness against availability:

Field | Value when present | Availability
--- | --- | ---
Description | Context without clicking through | NewsAPI: usually present; Guardian: in fields.trailText; HackerNews: rarely present
Author | Credibility and attribution | NewsAPI: sometimes; Guardian: in fields.byline; HackerNews: username
Image | Visual presentation | NewsAPI: urlToImage; Guardian: fields.thumbnail; HackerNews: none
Content preview | Helps users decide to click | NewsAPI: sometimes; Guardian: in fields.bodyText; HackerNews: usually absent

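Use None as the default for absent optional fields. This keeps the distinction clear: None means the source did not provide the field, while an empty string means the field was provided but empty. Display code can use a plain truthiness check (if article.author:) without exception handling, because both None and an empty string are falsy; reach for an explicit is None check only where the difference matters.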
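A minimal sketch of how display code might consume an optional field follows. The byline and author_was_provided helpers are hypothetical, and SimpleNamespace objects stand in for Article instances so the block runs without models.py.

```python
from types import SimpleNamespace


def byline(article) -> str:
    """Build a display byline; truthiness covers both None and ''."""
    if article.author:
        return f"By {article.author}"
    return "Author unknown"


def author_was_provided(article) -> bool:
    """Explicit None check, for the cases where the distinction matters."""
    return article.author is not None


# Stand-ins for Article instances (assumption: only .author is needed here)
with_author = SimpleNamespace(author="Python Core Team")
missing_author = SimpleNamespace(author=None)
empty_author = SimpleNamespace(author="")

print(byline(with_author))                  # By Python Core Team
print(byline(missing_author))               # Author unknown
print(author_was_provided(empty_author))    # True
print(author_was_provided(missing_author))  # False
```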

One field from the comparison table doesn't earn a place on Article: total results (NewsAPI's totalResults, Guardian's response.total, HackerNews's nbHits). It's a per-response value, not a per-article one, so it lives on the aggregator's stats dict (covered on the aggregation page) rather than the dataclass. The normalizers return it alongside the article list as part of a meta dict.
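To make the split between per-article and per-response data concrete, here is one way the total could travel next to the article list. The function names and the keys of the meta dict (api, total_results) are assumptions; the real normalizers and their return shape are defined on later pages.

```python
def extract_total(raw: dict, api: str) -> int:
    """Pull the per-response total out of each API's envelope."""
    if api == "newsapi":
        return raw.get("totalResults", 0)
    if api == "guardian":
        return raw.get("response", {}).get("total", 0)
    if api == "hackernews":
        return raw.get("nbHits", 0)
    raise ValueError(f"Unknown API: {api}")


def normalize_response(raw: dict, api: str):
    """Return (articles, meta). Article normalization is elided in this sketch."""
    articles = []  # placeholder: the next page builds the real normalizers
    meta = {"api": api, "total_results": extract_total(raw, api)}
    return articles, meta


articles, meta = normalize_response({"response": {"total": 120, "results": []}}, "guardian")
print(meta)  # {'api': 'guardian', 'total_results': 120}
```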

Standardise the formats

Some fields appear in all APIs but with different formats. Your canonical model picks one standard format and lets normalizers handle the conversion:

  • Timestamps to ISO 8601. HackerNews exposes the same publication time twice: as a Unix integer (created_at_i: 1736951400) and as an ISO string (created_at: 2025-01-15T14:30:00Z); NewsAPI and Guardian use ISO strings only. Standard: ISO 8601 strings for consistency and human readability. The HackerNews normalizer uses try_fields to take whichever form is present and converts the integer only when that is all the source returned. You'll see this implemented on the next page.
  • Source names to simple strings. NewsAPI nests source in an object (source.name), Guardian uses sectionName, HackerNews is implicit. Standard: single string field. Normalizers extract from wherever it lives in each API.
  • URLs to full URLs only. All three APIs return full URLs when a URL is present, but some HackerNews items (Ask HN posts, for example) have no external URL at all. Standard: full URLs required. Normalizers construct HN links from object IDs when needed.

These standardisation decisions cost conversion work in normalizers but gain predictable access patterns everywhere else in your application.
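The three conversions above can be sketched as small standalone functions. All function names here are hypothetical, and the sample values are illustrative; the next page's normalizers may structure this differently.

```python
from datetime import datetime, timezone
from typing import Optional


def unix_to_iso(timestamp: int) -> str:
    """Convert a Unix timestamp (seconds) to an ISO 8601 UTC string."""
    dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")


def hn_item_url(object_id: str) -> str:
    """Build a discussion link for an HN item that has no external URL."""
    return f"https://news.ycombinator.com/item?id={object_id}"


def hn_url(hit: dict) -> str:
    """Prefer the external URL; fall back to the HN discussion page."""
    return hit.get("url") or hn_item_url(hit["objectID"])


def newsapi_source_name(article: dict) -> Optional[str]:
    """NewsAPI nests the source name inside a source object."""
    source = article.get("source") or {}
    return source.get("name")


print(unix_to_iso(1736951400))                      # 2025-01-15T14:30:00Z
print(hn_url({"objectID": "12345", "url": None}))   # https://news.ycombinator.com/item?id=12345
print(newsapi_source_name({"source": {"id": None, "name": "Example News"}}))  # Example News
```

Passing tz=timezone.utc matters: without it, fromtimestamp returns local time, and the trailing Z would be wrong on any machine not set to UTC.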

Write the Article dataclass

Now formalise the canonical model using Python's dataclass decorator. This gives you type hints, default values, a generated constructor, and a __post_init__ hook for validation, all in a clean structure that documents your design. Save this at the project root as models.py:

models.py
from dataclasses import dataclass
from typing import Optional
from datetime import datetime
from urllib.parse import urlparse


@dataclass
class Article:
    """
    Canonical article format for all news sources.

    Required fields are enforced for all normalized articles.
    Optional fields use None as default when source doesn't provide them.
    """
    # Required fields: every article must have these
    title: str
    url: str
    published_at: str  # ISO 8601 format: "2025-01-15T14:30:00Z"
    source_name: str

    # Optional fields: enhance experience but not universal
    description: Optional[str] = None
    author: Optional[str] = None
    image_url: Optional[str] = None
    content_preview: Optional[str] = None

    def __post_init__(self):
        """Validate required fields after initialization."""
        if not self.title or not self.title.strip():
            raise ValueError("Article title cannot be empty")
        if not self.url or not self.url.strip():
            raise ValueError("Article URL cannot be empty")
        if not self.published_at or not self.published_at.strip():
            raise ValueError("Article published_at cannot be empty")
        if not self.source_name or not self.source_name.strip():
            raise ValueError("Article source_name cannot be empty")

    def format_timestamp(self) -> str:
        """Convert ISO timestamp to human-readable format."""
        try:
            dt = datetime.fromisoformat(self.published_at.replace("Z", "+00:00"))
            return dt.strftime("%B %d, %Y at %I:%M %p")
        except (ValueError, AttributeError):
            return self.published_at

    def get_domain(self) -> str:
        """Extract domain from URL for display."""
        try:
            parsed = urlparse(self.url)
            return parsed.netloc or "unknown"
        except (ValueError, AttributeError):
            return "unknown"

Four design choices worth highlighting:

  • Validation in __post_init__: catches empty required fields immediately, preventing invalid articles from entering the system
  • Helper methods: format_timestamp() and get_domain() provide display utilities without cluttering consumer code
  • Type hints: Optional[str] makes optionality explicit for tools and developers
  • Standardised timestamp: always ISO 8601 string, even though HackerNews exposes both ISO and Unix-integer forms (the normalizer picks via try_fields)
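Because validation raises at construction time, calling code has to decide what to do with a rejected article. One option, sketched below, is to collect rejects rather than let a single bad record stop the batch. The Article here is a trimmed stand-in (required fields only, slightly different error messages) so the block runs on its own; in the project you would import the real class from models.py. build_articles is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Article:
    """Trimmed stand-in for models.Article (required fields only)."""
    title: str
    url: str
    published_at: str
    source_name: str

    def __post_init__(self):
        for name in ("title", "url", "published_at", "source_name"):
            value = getattr(self, name)
            if not value or not value.strip():
                raise ValueError(f"Article {name} cannot be empty")


def build_articles(candidates):
    """Construct Articles, collecting rejects instead of crashing."""
    accepted, rejected = [], []
    for fields in candidates:
        try:
            accepted.append(Article(**fields))
        except ValueError as exc:
            # Only validation failures are caught; a missing key would
            # still surface as a TypeError from the constructor.
            rejected.append((fields, str(exc)))
    return accepted, rejected


candidates = [
    {"title": "Valid story", "url": "https://example.com/a",
     "published_at": "2025-01-15T10:00:00Z", "source_name": "Example"},
    {"title": "No link", "url": "   ",
     "published_at": "2025-01-15T10:00:00Z", "source_name": "Example"},
]
accepted, rejected = build_articles(candidates)
print(len(accepted), len(rejected))  # 1 1
```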

Pressure-test the model

Before building normalizers, verify the model handles edge cases. Save this at the project root as test_models.py:

test_models.py
from models import Article

# Test 1: Complete article with all fields
article1 = Article(
    title="Python 3.13 Released",
    url="https://www.python.org/downloads/release/python-3130/",
    published_at="2025-01-15T14:30:00Z",
    source_name="Python.org",
    description="Latest Python release includes performance improvements",
    author="Python Core Team",
    image_url="https://www.python.org/static/img/python-logo.png",
    content_preview="Python 3.13.0 is now available...",
)
print(f"Complete article: {article1.title}")
print(f"  Formatted: {article1.format_timestamp()}")
print(f"  Domain: {article1.get_domain()}")

# Test 2: Minimal article (only required fields)
article2 = Article(
    title="Breaking: Tech News",
    url="https://example.com/article",
    published_at="2025-01-15T10:00:00Z",
    source_name="Example News",
)
print(f"\nMinimal article: {article2.title}")
print(f"  Description: {article2.description}")  # None
print(f"  Author: {article2.author}")  # None

# Test 3: Invalid article (empty title)
try:
    article3 = Article(
        title="",
        url="https://example.com/article",
        published_at="2025-01-15T10:00:00Z",
        source_name="Example",
    )
    print("\nShould have raised ValueError")
except ValueError as e:
    print(f"\nValidation caught error: {e}")

# Test 4: Empty published_at (validated by __post_init__ alongside title/url/source_name)
try:
    article4 = Article(
        title="Valid title",
        url="https://example.com/article",
        published_at="",
        source_name="Example",
    )
    print("Should have raised ValueError")
except ValueError as e:
    print(f"Validation caught error: {e}")

Run it from the project root:

Terminal
$ python test_models.py
Complete article: Python 3.13 Released
  Formatted: January 15, 2025 at 02:30 PM
  Domain: www.python.org

Minimal article: Breaking: Tech News
  Description: None
  Author: None

Validation caught error: Article title cannot be empty
Validation caught error: Article published_at cannot be empty

The model accepts valid articles with any combination of optional fields, validates required fields at construction, and provides helper methods for display. That's the foundation for the next page's normalizers.
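It is also worth knowing what the model does not catch. __post_init__ checks only that required fields are non-empty, so a malformed timestamp or a URL without a scheme still constructs successfully, and the helper methods fall back quietly instead of raising. The sketch below copies the bodies of format_timestamp() and get_domain() into free functions so it runs without models.py; the inputs are illustrative.

```python
from datetime import datetime
from urllib.parse import urlparse


def format_timestamp(published_at: str) -> str:
    """Same logic as Article.format_timestamp, as a free function."""
    try:
        dt = datetime.fromisoformat(published_at.replace("Z", "+00:00"))
        return dt.strftime("%B %d, %Y at %I:%M %p")
    except (ValueError, AttributeError):
        return published_at


def get_domain(url: str) -> str:
    """Same logic as Article.get_domain, as a free function."""
    try:
        return urlparse(url).netloc or "unknown"
    except (ValueError, AttributeError):
        return "unknown"


# Not ISO 8601: parsing fails, so the raw string comes back unchanged
print(format_timestamp("yesterday"))        # yesterday

# No scheme: urlparse treats the whole string as a path, so netloc is empty
print(get_domain("example.com/article"))    # unknown

# Well-formed URL for comparison
print(get_domain("https://www.python.org/downloads/"))  # www.python.org
```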

When not to normalize

Canonical models and normalizers add complexity (abstraction layers, conversion logic, maintenance overhead). Before building them, consider whether simpler approaches suffice for your use case.

  • Single API integration. If you're only using one API and don't anticipate adding others, work directly with its structure. The Guardian API's webTitle and webPublicationDate fields are fine if that's your only source. Normalization's value comes from unifying multiple formats; without that need, it's just extra code.
  • API structure matches your needs. Sometimes an API's response format is exactly what you need for display or storage. If NewsAPI's article structure works perfectly for your application, don't normalize it into something else just for the sake of abstraction.
  • Performance-critical applications. Normalization adds processing time (extracting fields, converting formats, constructing objects). For high-throughput systems processing thousands of articles per second, this overhead matters. Profile first, but if normalization becomes a bottleneck, consider working with raw responses and accepting some code duplication.
  • Prototype or proof-of-concept stage. Early in development you're still discovering requirements, so building a canonical model before you understand what fields matter or how APIs will be used is premature abstraction. Start with direct API access, identify patterns as they emerge, then normalize when the abstraction provides clear value.
  • API-specific features required. If your application needs features that exist in only one source (HackerNews vote counts, Guardian section hierarchies), normalizing loses information. Either work directly with each API's full structure or design a canonical model that accommodates all unique fields (which often becomes unwieldy).
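For the performance point, "profile first" can start as a few lines with the standard library's timeit module. The sketch below times a hypothetical normalize function that only renames Guardian-style fields. Real normalizers do more work, so treat the figure it prints as a lower bound measured on your own machine, not a benchmark.

```python
import timeit

# Illustrative Guardian-style record
RAW = {
    "webTitle": "Python 3.13 Released",
    "webUrl": "https://example.com/python-313",
    "webPublicationDate": "2025-01-15T14:30:00Z",
    "sectionName": "Technology",
}


def normalize(raw: dict) -> dict:
    """Hypothetical stand-in for a normalizer: rename fields into one shape."""
    return {
        "title": raw["webTitle"],
        "url": raw["webUrl"],
        "published_at": raw["webPublicationDate"],
        "source_name": raw["sectionName"],
    }


runs = 10_000
seconds = timeit.timeit(lambda: normalize(RAW), number=runs)
per_article_us = seconds / runs * 1_000_000
print(f"{per_article_us:.2f} microseconds per article")
```

Compare the per-article cost against the time spent waiting on the network for each API response before deciding that normalization is the bottleneck.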

This chapter demonstrates normalization because it's integrating three different news APIs: the exact scenario where canonical models earn their keep. The remaining pages assume you've made that call and focus on doing it well.

With the canonical model defined, tested, and its appropriate use cases understood, the next page builds the normalizers that transform each API's response into this consistent format. You'll see one complete implementation that demonstrates the pattern, then the two adaptations needed for the other APIs.