2. Exploring three API structures

Three news services, all returning "articles," all incompatible. Before writing extraction code, you map the shapes; the comparison table that comes out of this page drives every normalizer design decision on the next.

Get API access first

You'll need free API keys for two of the three services. Register for each, then store the keys in environment variables; the rest of the chapter assumes they're available.

  • NewsAPI. Register at newsapi.org for a free key (500 requests/day).
  • The Guardian. Register at open-platform.theguardian.com for a free key (5,000 requests/day).
  • HackerNews. No registration. Hit hn.algolia.com/api directly.

Store the two keys in your project .env file, using the names NEWSAPI_KEY and GUARDIAN_KEY. The code examples load that file with python-dotenv, then read from those names directly.
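A missing key is easy to overlook, because os.environ.get returns None without complaint and the request then goes out with the literal text "None" as the key. As a minimal sketch, a small guard can fail early with a readable message. The helper name require_env is hypothetical, not part of api_helpers.py, and it assumes load_dotenv() has already been called.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or fail with a clear message.

    Hypothetical helper for illustration; not part of api_helpers.py.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Missing environment variable {name}. Add it to your .env file."
        )
    return value


if __name__ == "__main__":
    # Call this after load_dotenv() so values from .env are in os.environ.
    for key_name in ("NEWSAPI_KEY", "GUARDIAN_KEY"):
        try:
            require_env(key_name)
            print(f"{key_name}: set")
        except RuntimeError as err:
            print(err)
```

The check prints only whether each key is set, never the key itself, so the output is safe to paste into a bug report.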

Run the exploration helper against all three

You already have explore_api_structure in your api_helpers.py from Chapter 10. Save the following script at the project root as explore_three_apis.py:

explore_three_apis.py
import os
from dotenv import load_dotenv
from api_helpers import explore_api_structure

load_dotenv()

NEWSAPI_KEY = os.environ.get("NEWSAPI_KEY")
GUARDIAN_KEY = os.environ.get("GUARDIAN_KEY")

print("=== EXPLORING THREE NEWS APIS ===\n")

print("1. NEWSAPI")
newsapi_url = (
    "https://newsapi.org/v2/everything"
    f"?q=python&pageSize=3&apiKey={NEWSAPI_KEY}"
)
newsapi_data = explore_api_structure(newsapi_url)

print("2. GUARDIAN API")
guardian_url = (
    "https://content.guardianapis.com/search"
    f"?q=python&page-size=3&show-fields=all&api-key={GUARDIAN_KEY}"
)
guardian_data = explore_api_structure(guardian_url)

print("3. HACKERNEWS API")
hn_url = "https://hn.algolia.com/api/v1/search?query=python&hitsPerPage=3"
hn_data = explore_api_structure(hn_url)

Run it from the project root:

Terminal
$ python explore_three_apis.py
=== EXPLORING THREE NEWS APIS ===

1. NEWSAPI
API Response Analysis
============================================================
Response type: dict
Top-level keys: ['status', 'totalResults', 'articles']
Possible data containers found: ['articles']

Sample structure:
{
  "status": "ok",
  "totalResults": 12847,
  "articles": [
    {
      "source": {"id": "...", "name": "..."},
      "author": "...",
      "title": "...",
      "url": "...",
      "publishedAt": "2025-01-15T10:30:00Z",
      "...": "(5 more keys)"
    },
    "... (2 more items)"
  ]
}

2. GUARDIAN API
API Response Analysis
============================================================
Response type: dict
Top-level keys: ['response']
Possible data containers found: ['response']

Sample structure:
{
  "response": {
    "status": "ok",
    "total": 8453,
    "results": [
      {
        "webTitle": "...",
        "webUrl": "...",
        "webPublicationDate": "2025-01-15T12:45:00Z",
        "fields": {
          "byline": "...",
          "trailText": "...",
          "...": "(4 more keys)"
        },
        "...": "(6 more keys)"
      },
      "... (2 more items)"
    ],
    "...": "(6 more keys)"
  }
}

3. HACKERNEWS API
API Response Analysis
============================================================
Response type: dict
Top-level keys: ['hits', 'nbHits', 'page', 'nbPages', ...]
Possible data containers found: ['hits']

Sample structure:
{
  "hits": [
    {
      "title": "...",
      "url": "...",
      "author": "username123",
      "created_at": "2025-01-15T14:22:31.000Z",
      "created_at_i": 1736950951,
      "points": 142,
      "...": "(8 more keys)"
    },
    "... (2 more items)"
  ],
  "nbHits": 583201,
  "...": "(10 more keys)"
}

Three responses, three different shapes. NewsAPI puts its list under articles at the root, and each item carries a nested source object. The Guardian wraps everything in a response object, puts the list in results, and prefixes its core field names with "web" (webTitle, webUrl). HackerNews puts its list under hits and supplies each timestamp twice, as an ISO string and as a Unix integer. Without the helper, you'd be guessing from documentation. With it, the differences are concrete.
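To make the container difference tangible, here is a minimal sketch that reaches the item list in each shape with plain dictionary access. The three dictionaries are trimmed stand-ins for the responses above, with placeholder values ("A", "B", "C", "Example") that are not real API data.

```python
# Trimmed stand-ins for the three responses; values are placeholders.
newsapi_data = {
    "status": "ok",
    "totalResults": 12847,
    "articles": [{"title": "A", "source": {"id": None, "name": "Example"}}],
}
guardian_data = {
    "response": {
        "status": "ok",
        "total": 8453,
        "results": [{"webTitle": "B", "fields": {"byline": "..."}}],
    }
}
hn_data = {
    "hits": [{"title": "C", "created_at_i": 1736950951}],
    "nbHits": 583201,
}

# Same concept (the list of articles), three different access paths.
newsapi_items = newsapi_data["articles"]
guardian_items = guardian_data["response"]["results"]
hn_items = hn_data["hits"]

# The title lives under a different key in the Guardian's items.
first_titles = [
    newsapi_items[0]["title"],
    guardian_items[0]["webTitle"],
    hn_items[0]["title"],
]
print(first_titles)
```

Direct indexing like this raises KeyError as soon as a response deviates from the expected shape, which is fine for exploration but is the reason the normalizers use safer access helpers.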

Build the comparison table

The next step turns those three exploration outputs into a single reference table, the spec your normalizers will implement against:

Concept        | NewsAPI                     | Guardian                            | HackerNews
---------------+-----------------------------+-------------------------------------+----------------------------------------
Container path | articles                    | response.results                    | hits
Title          | title                       | webTitle                            | title
URL            | url                         | webUrl                              | url
Timestamp      | publishedAt (ISO string)    | webPublicationDate (ISO string)     | created_at (ISO) or created_at_i (Unix)
Source name    | source.name (nested object) | sectionName (string)                | "HackerNews" (implied)
Author         | author (string, optional)   | fields.byline (nested, optional)    | author (username string)
Description    | description (optional)      | fields.trailText (nested, optional) | story_text (often null)
Image          | urlToImage (optional)       | fields.thumbnail (nested, optional) | None available
Total results  | totalResults                | response.total                      | nbHits

Same conceptual information, completely different access paths. Some differences are cosmetic (field naming); others are structural (nesting depth, object versus string). All three have titles and URLs, but the Guardian prefixes its names with "web". HackerNews provides each timestamp in two formats. The Guardian keeps byline, trail text, and thumbnail inside an optional fields object. None of this is a bug; each API reflects its own history and use cases. The table makes the canonical-model fields explicit so each can be handled systematically. Supplementary fields such as the Guardian's fields.bodyText (used for content previews) are introduced on the normalizers page, where they're applied.
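One way to keep the table close to the code is to write its rows down as data. The sketch below records each source's access paths as dotted strings and resolves them with a small lookup function. FIELD_PATHS and resolve_path are hypothetical names for illustration only; they are not the Chapter 10 helpers, whose signatures may differ.

```python
from typing import Any, Optional

# Rows of the comparison table as dotted paths; None means "not available".
# Hypothetical structure for illustration, not part of api_helpers.py.
FIELD_PATHS = {
    "newsapi": {
        "title": "title",
        "url": "url",
        "timestamp": "publishedAt",
        "source_name": "source.name",
        "author": "author",
        "description": "description",
        "image": "urlToImage",
    },
    "guardian": {
        "title": "webTitle",
        "url": "webUrl",
        "timestamp": "webPublicationDate",
        "source_name": "sectionName",
        "author": "fields.byline",
        "description": "fields.trailText",
        "image": "fields.thumbnail",
    },
    "hackernews": {
        "title": "title",
        "url": "url",
        "timestamp": "created_at",
        "source_name": None,  # implied: always "HackerNews"
        "author": "author",
        "description": "story_text",
        "image": None,
    },
}


def resolve_path(record: dict, path: Optional[str], default: Any = None) -> Any:
    """Follow a dotted path through nested dicts; return default if any step is missing or null."""
    if path is None:
        return default
    current: Any = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return default if current is None else current


# A Guardian item without a "fields" object falls back to the default.
guardian_item = {"webTitle": "B", "sectionName": "Technology"}
byline = resolve_path(guardian_item, FIELD_PATHS["guardian"]["author"], "Unknown")
print(byline)
```

Treating a null value the same as a missing key is a design choice made here so that HackerNews's often-null story_text and the Guardian's absent fields object take the same fallback route.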

The three patterns your normalizers must handle

From the exploration and comparison, three structural patterns emerge:

  • Container variation. Articles live at different paths: articles (root) versus response.results (nested) versus hits (root). Your access layer must locate the container regardless of nesting. extract_items_and_meta from Chapter 10 handles all three with a single container_hints argument.
  • Field naming drift. Same data, different names: title versus webTitle, publishedAt versus webPublicationDate versus created_at. Cross-API drift is handled structurally by writing one normalizer per source rather than one that branches on which API it's looking at. Within a single API, try_fields covers the alternative-key case (HackerNews exposes each timestamp as both a created_at_i Unix integer and a created_at ISO string).
  • Optional field inconsistency. What's present in one API is missing in another. NewsAPI usually includes a description, the Guardian requires checking for a fields object first, and HackerNews rarely has one. Your model must accommodate all three patterns; safe_get with a default covers the missing case.
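The alternative-key case can be sketched with the HackerNews timestamp. The function below prefers the Unix integer, falls back to the ISO string, and returns None when neither is present. parse_hn_timestamp is a hypothetical name, and the sketch deliberately uses only the standard library rather than try_fields, whose exact signature is not shown on this page.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_hn_timestamp(hit: dict) -> Optional[datetime]:
    """Return a timezone-aware UTC datetime for a HackerNews hit, or None.

    Hypothetical helper for illustration; not part of api_helpers.py.
    """
    unix_seconds = hit.get("created_at_i")
    if isinstance(unix_seconds, (int, float)):
        # Unix seconds need no parsing, so try them first.
        return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

    iso_text = hit.get("created_at")
    if isinstance(iso_text, str):
        # Older Python versions of fromisoformat reject a trailing "Z",
        # so swap it for an explicit UTC offset.
        return datetime.fromisoformat(iso_text.replace("Z", "+00:00"))

    return None


# Both representations from the sample output describe the same instant.
from_unix = parse_hn_timestamp({"created_at_i": 1736950951})
from_iso = parse_hn_timestamp({"created_at": "2025-01-15T14:22:31.000Z"})
print(from_unix == from_iso)
```

Returning an aware datetime in both branches means downstream code can compare timestamps from any of the three sources without caring which format they arrived in.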

With structures documented, the next page designs the canonical article model that unifies all three formats into a single internal representation. The comparison table you just built drives every design decision.