2. Exploring three API structures
Three news services, all returning "articles," all incompatible. Before writing extraction code, you map the shapes; the comparison table that comes out of this page drives every normalizer design decision on the next page.
Get API access first
You'll need free API keys for two of the three services. Register and store the keys in environment variables; the rest of the chapter assumes they're available.
- NewsAPI. Register at newsapi.org for a free key (500 requests/day).
- The Guardian. Register at open-platform.theguardian.com for a free key (5,000 requests/day).
- HackerNews. No registration. Hit hn.algolia.com/api directly.
Store the two keys in your project .env file, using the names NEWSAPI_KEY and GUARDIAN_KEY. The code examples load that file with python-dotenv, then read from those names directly.
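A missing key tends to show up later as an authentication error from the API, so it can help to check for both names right after loading .env. The sketch below is one way to do that. require_keys is a hypothetical helper written for illustration, not part of api_helpers.py, and it accepts the environment as a parameter so it can be tried without touching real keys.

```python
import os


def require_keys(names, env=None):
    """Return {name: value} for each name, or raise listing what is missing.

    Hypothetical helper for illustration; not part of api_helpers.py.
    """
    env = os.environ if env is None else env
    # Treat unset and empty values the same way: both are unusable as keys.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
            + ". Add them to your .env file."
        )
    return {name: env[name] for name in names}


# Example with a stand-in environment (placeholder values, not real keys).
fake_env = {"NEWSAPI_KEY": "abc123", "GUARDIAN_KEY": ""}
try:
    require_keys(["NEWSAPI_KEY", "GUARDIAN_KEY"], env=fake_env)
except RuntimeError as exc:
    print(exc)
```

In a real script this would be called once after load_dotenv(), without the env argument, so that it reads os.environ.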
Run the exploration helper against all three
You already have explore_api_structure in your api_helpers.py from Chapter 10. Save the following script at the project root as explore_three_apis.py:
```python
import os

from dotenv import load_dotenv

from api_helpers import explore_api_structure

# Load NEWSAPI_KEY and GUARDIAN_KEY from the project's .env file.
load_dotenv()

NEWSAPI_KEY = os.environ.get("NEWSAPI_KEY")
GUARDIAN_KEY = os.environ.get("GUARDIAN_KEY")

print("=== EXPLORING THREE NEWS APIS ===\n")

print("1. NEWSAPI")
newsapi_url = (
    "https://newsapi.org/v2/everything"
    f"?q=python&pageSize=3&apiKey={NEWSAPI_KEY}"
)
newsapi_data = explore_api_structure(newsapi_url)

print("2. GUARDIAN API")
guardian_url = (
    "https://content.guardianapis.com/search"
    f"?q=python&page-size=3&show-fields=all&api-key={GUARDIAN_KEY}"
)
guardian_data = explore_api_structure(guardian_url)

print("3. HACKERNEWS API")
# The HackerNews search endpoint needs no key.
hn_url = "https://hn.algolia.com/api/v1/search?query=python&hitsPerPage=3"
hn_data = explore_api_structure(hn_url)
```
Run it from the project root:
```
$ python explore_three_apis.py
=== EXPLORING THREE NEWS APIS ===

1. NEWSAPI
API Response Analysis
============================================================
Response type: dict
Top-level keys: ['status', 'totalResults', 'articles']
Possible data containers found: ['articles']
Sample structure:
{
  "status": "ok",
  "totalResults": 12847,
  "articles": [
    {
      "source": {"id": "...", "name": "..."},
      "author": "...",
      "title": "...",
      "url": "...",
      "publishedAt": "2025-01-15T10:30:00Z",
      "...": "(5 more keys)"
    },
    "... (2 more items)"
  ]
}

2. GUARDIAN API
API Response Analysis
============================================================
Response type: dict
Top-level keys: ['response']
Possible data containers found: ['response']
Sample structure:
{
  "response": {
    "status": "ok",
    "total": 8453,
    "results": [
      {
        "webTitle": "...",
        "webUrl": "...",
        "webPublicationDate": "2025-01-15T12:45:00Z",
        "fields": {
          "byline": "...",
          "trailText": "...",
          "...": "(4 more keys)"
        },
        "...": "(6 more keys)"
      },
      "... (2 more items)"
    ],
    "...": "(6 more keys)"
  }
}

3. HACKERNEWS API
API Response Analysis
============================================================
Response type: dict
Top-level keys: ['hits', 'nbHits', 'page', 'nbPages', ...]
Possible data containers found: ['hits']
Sample structure:
{
  "hits": [
    {
      "title": "...",
      "url": "...",
      "author": "username123",
      "created_at": "2025-01-15T14:22:31.000Z",
      "created_at_i": 1736950951,
      "points": 142,
      "...": "(8 more keys)"
    },
    "... (2 more items)"
  ],
  "nbHits": 583201,
  "...": "(10 more keys)"
}
```
Three responses, three completely different shapes. NewsAPI uses articles at the root level with nested source objects. Guardian wraps everything in a response object, puts articles in results, and uses unique field names like webTitle. HackerNews uses hits with both ISO strings and Unix timestamps. Without the helper, you'd be guessing from documentation. With it, the differences are concrete.
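To make those access paths concrete, here is a small sketch that uses hand-trimmed stand-ins for the three responses. The titles are placeholders rather than real API output; the HackerNews timestamps are the two values from the sample above. It reaches the article list in each shape and checks that HackerNews's two timestamp fields describe the same instant.

```python
from datetime import datetime, timezone

# Hand-trimmed stand-ins that mirror the shapes printed above.
newsapi = {"status": "ok", "articles": [{"title": "NewsAPI example"}]}
guardian = {
    "response": {"status": "ok", "results": [{"webTitle": "Guardian example"}]}
}
hackernews = {
    "hits": [
        {
            "title": "HackerNews example",
            "created_at": "2025-01-15T14:22:31.000Z",
            "created_at_i": 1736950951,
        }
    ]
}

# Same concept, three different paths to the list of articles.
newsapi_items = newsapi["articles"]
guardian_items = guardian["response"]["results"]
hn_items = hackernews["hits"]

print(newsapi_items[0]["title"])
print(guardian_items[0]["webTitle"])
print(hn_items[0]["title"])

# HackerNews carries the timestamp twice: an ISO string and a Unix integer.
hit = hn_items[0]
# Swapping the trailing "Z" for an explicit offset keeps this working on
# Python versions older than 3.11, where fromisoformat rejects "Z".
from_iso = datetime.fromisoformat(hit["created_at"].replace("Z", "+00:00"))
from_unix = datetime.fromtimestamp(hit["created_at_i"], tz=timezone.utc)
print(from_iso == from_unix)  # True
```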
Build the comparison table
The next step turns those three exploration outputs into a single reference table, the spec your normalizers will implement against:
| Concept | NewsAPI | Guardian | HackerNews |
|---|---|---|---|
| Container path | articles | response.results | hits |
| Title | title | webTitle | title |
| URL | url | webUrl | url |
| Timestamp | publishedAt (ISO string) | webPublicationDate (ISO string) | created_at (ISO) or created_at_i (Unix) |
| Source name | source.name (nested object) | sectionName (string) | "HackerNews" (implied) |
| Author | author (string, optional) | fields.byline (nested, optional) | author (username string) |
| Description | description (optional) | fields.trailText (nested, optional) | story_text (often null) |
| Image | urlToImage (optional) | fields.thumbnail (nested, optional) | None available |
| Total results | totalResults | response.total | nbHits |
Same conceptual information, completely different access paths. Some differences are cosmetic (field naming); others are structural (nesting depth, object versus string). All three have titles and URLs, but Guardian prefixes them with "web". HackerNews provides the timestamp in two formats. Guardian buries useful content in an optional fields object. None of this is a bug; each API reflects its own history and use cases. The table makes the canonical-model fields explicit so each can be handled systematically; supplementary fields like Guardian's fields.bodyText (used for content previews) are introduced on the normalizers page, where they're applied.
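One way to make the table directly usable is to write it down as data. The sketch below records several rows as dotted paths per source and resolves them with a small lookup function. FIELD_PATHS and resolve are illustrative names, not helpers from Chapter 10, and the sample items are hand-written stand-ins rather than real API output.

```python
# Rows of the comparison table, written as dotted paths per source.
# None means the API has no field for that concept.
FIELD_PATHS = {
    "newsapi": {
        "title": "title",
        "url": "url",
        "timestamp": "publishedAt",
        "source_name": "source.name",
        "author": "author",
        "image": "urlToImage",
    },
    "guardian": {
        "title": "webTitle",
        "url": "webUrl",
        "timestamp": "webPublicationDate",
        "source_name": "sectionName",
        "author": "fields.byline",
        "image": "fields.thumbnail",
    },
    "hackernews": {
        "title": "title",
        "url": "url",
        "timestamp": "created_at",
        "source_name": None,  # implied: always "HackerNews"
        "author": "author",
        "image": None,  # none available
    },
}


def resolve(item, path, default=None):
    """Follow a dotted path through nested dicts; return default if a step is missing."""
    if path is None:
        return default
    current = item
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return default if current is None else current


# Stand-in items: the Guardian one deliberately has no "fields" object.
newsapi_item = {"title": "NewsAPI example", "source": {"id": None, "name": "Example Source"}}
guardian_item = {"webTitle": "Guardian example", "sectionName": "Technology"}

for concept in ("title", "source_name", "author"):
    print(concept, "->", resolve(guardian_item, FIELD_PATHS["guardian"][concept]))
print(resolve(newsapi_item, FIELD_PATHS["newsapi"]["source_name"]))
```

Keeping the paths in one mapping means a renamed or relocated field becomes a one-line change, at the cost of hiding per-source quirks that a dedicated normalizer function would make explicit.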
The three patterns your normalizers must handle
From the exploration and comparison, three structural patterns emerge:
- Container variation. Articles live at different paths: articles (root) versus response.results (nested) versus hits (root). Your access layer must locate the container regardless of nesting. extract_items_and_meta from Chapter 10 handles all three with a single container_hints argument.
- Field naming drift. Same data, different names: title versus webTitle, publishedAt versus webPublicationDate versus created_at. Cross-API drift is handled structurally by writing one normalizer per source rather than one that branches on which API it's looking at. Within a single API, try_fields covers the alternative-key case (HackerNews exposes timestamps as both created_at_i Unix integer and created_at ISO string).
- Optional field inconsistency. What's present in one API is missing in another. NewsAPI provides description consistently; Guardian requires checking for a fields object; HackerNews rarely has descriptions. Your model must accommodate all patterns. safe_get with a default does this safely.
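Each of the three patterns can be sketched in a few lines. The functions below are simplified stand-ins written for this page: find_container and first_present are hypothetical names, and the Chapter 10 helpers (extract_items_and_meta, try_fields, safe_get) may differ in signature and behavior.

```python
def find_container(payload, hints):
    """Return the first list found at one of the dotted paths in hints, else []."""
    for hint in hints:
        current = payload
        for key in hint.split("."):
            current = current.get(key) if isinstance(current, dict) else None
        if isinstance(current, list):
            return current
    return []


def first_present(item, keys, default=None):
    """Return the value of the first key that is present with a non-None value."""
    for key in keys:
        value = item.get(key)
        if value is not None:
            return value
    return default


HINTS = ["articles", "response.results", "hits"]

# Pattern 1: container variation. One list of hints covers all three shapes.
guardian = {"response": {"results": [{"webTitle": "Guardian example"}]}}
items = find_container(guardian, HINTS)

# Pattern 2: alternative keys within one API. Prefer the Unix integer and
# fall back to the ISO string when it is absent, as in this stand-in hit.
hit = {"title": "HackerNews example", "created_at": "2025-01-15T14:22:31.000Z"}
timestamp = first_present(hit, ["created_at_i", "created_at"])

# Pattern 3: optional fields. This Guardian item has no "fields" object,
# so the lookup falls through to the default instead of raising KeyError.
byline = (items[0].get("fields") or {}).get("byline", "Unknown")

print(len(items), timestamp, byline)
```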
With structures documented, the next page designs the canonical article model that unifies all three formats into a single internal representation. The comparison table you just built drives every design decision.