5. Phase 3: choose your extension
Phase 3 is one feature, picked with intent and shipped well. The core platform from Phase 1 plus Phase 2 already exercises every pattern the book covers; the extension is where you go a layer deeper into one area that interests you, rather than spreading thinly across all of them.
The point isn't to demonstrate every option. Each extension below has its own technical centre of gravity (machine learning, real-time messaging, recommendation algorithms, scheduled jobs, multi-tier caching, internationalisation); picking one and seeing it through end-to-end teaches more than picking three and finishing none. The chapter walks through sentiment analysis as the worked example because it has the smallest surface area; the others are sketched in enough detail to scope.
Extension options overview
The core platform is the same across every reader; the extension is the one place where the projects diverge. The factors below are the ones that actually decide it.
Career interests: real-time systems point at the WebSocket alerts extension; ML and data science point at sentiment analysis; infrastructure work points at advanced caching or email scheduling.
Learning goals: pick the extension that takes you furthest from what you've already done. If you're comfortable with CRUD but have never touched NLP, sentiment analysis is the right call; if you're comfortable with backend work but have never built a stateful front-end, pick the WebSocket alerts extension.
Time availability: sentiment analysis is the smallest commitment (5-6 hours); multi-language support is the largest (10-12 hours). Underestimating scope is the most common failure mode in this phase; the time estimates below assume the core platform is already running on AWS.
Extension feature matrix.
| Extension | Complexity | Time est. | Technical skills | Career path |
|---|---|---|---|---|
| Sentiment analysis | ☆ | 5-6 hrs | ML integration, data analysis | Data/ML engineering |
| Real-time alerts | ☆☆ | 8-10 hrs | WebSockets, async, pub/sub | Real-time systems, frontend |
| Smart recommendations | ☆☆ | 7-9 hrs | Algorithms, user modeling | Recommendation systems, product |
| Email digests | ☆☆ | 6-8 hrs | Background jobs, email services | Platform/infrastructure |
| Advanced caching | ☆☆☆ | 8-10 hrs | Performance optimization, metrics | Performance/SRE |
| Multi-language | ☆☆☆ | 10-12 hrs | Internationalisation, translation APIs | Global products |
- What each extension adds.
  - A: sentiment analysis. Polarity and subjectivity scores attached to every article, plus aggregate trend queries over time. Filtering articles by sentiment becomes a query parameter; "positive coverage of climate stories in the past 7 days" is now a single endpoint call.
  - B: real-time alerts. WebSocket subscriptions keyed by user keyword. A background worker polls the upstream APIs on a short interval and publishes matching new articles onto a Redis pub/sub channel, and the WebSocket endpoint fans out to subscribed clients. Front-end and back-end work together; the front-end is a small HTML/JS demo, not a production UI.
  - C: smart recommendations. Two algorithms running side by side: collaborative filtering (users who read X also read Y) and content-based filtering (articles similar to ones the user has saved), with a popularity-weighted ranking layer on top. Interaction tracking goes through the existing `search_history` table plus a new `article_interactions` table.
  - D: email digests. Scheduled background jobs (Celery Beat) generate per-user email summaries by topic preference, delivered through SendGrid. The work is in the infrastructure: scheduling, templating, unsubscribe handling, bounce processing, open and click tracking.
  - E: advanced caching. A multi-tier cache (in-memory LRU as L1, Redis as L2) with cache warming for the platform's top queries, invalidation on new-article ingestion, and per-tier hit-rate metrics surfaced as a CloudWatch custom dashboard.
  - F: multi-language support. Add international sources (BBC Mundo, Le Monde, Deutsche Welle, Al Jazeera), detect article language with `langdetect`, and translate on the fly through the Google Translate API, with the translation cached against the article. Language-based filtering becomes another query parameter.
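The publish-and-fan-out step in option B can be sketched without any infrastructure. The block below is a minimal illustration, not the chapter's implementation: `AlertHub`, `subscribe`, and `publish` are hypothetical names, a dictionary of `asyncio.Queue` objects stands in for both the Redis pub/sub channel and the open WebSocket connections, and keyword matching is a naive whole-word check on the title.

```python
import asyncio
from collections import defaultdict
from typing import Dict, List, Set


class AlertHub:
    """In-memory stand-in for Redis pub/sub plus the WebSocket fan-out (hypothetical)."""

    def __init__(self) -> None:
        # keyword -> one queue per subscribed connection
        self._subscribers: Dict[str, Set[asyncio.Queue]] = defaultdict(set)

    def subscribe(self, keyword: str) -> asyncio.Queue:
        """Register a connection for a keyword and return its delivery queue."""
        queue: asyncio.Queue = asyncio.Queue()
        self._subscribers[keyword.lower()].add(queue)
        return queue

    def publish(self, title: str) -> int:
        """Push the title to every queue whose keyword appears in it.

        Returns the number of deliveries made.
        """
        words = set(title.lower().split())
        delivered = 0
        for keyword, queues in self._subscribers.items():
            if keyword in words:
                for queue in queues:
                    queue.put_nowait(title)
                    delivered += 1
        return delivered


async def demo() -> List[str]:
    hub = AlertHub()
    climate = hub.subscribe("climate")
    markets = hub.subscribe("markets")

    # The polling worker would call publish() for each newly ingested article.
    hub.publish("Climate summit opens in Geneva")
    hub.publish("Local team wins final")

    received = [await climate.get()]
    assert markets.empty()  # no market headline was published
    return received


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the real extension each queue would be replaced by a WebSocket send, and `publish` by a Redis channel that the worker and the web process share; the matching and fan-out logic keeps the same shape.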
Extension A: sentiment analysis (worked example)
This is the extension the chapter walks end-to-end; the implementation pattern (dependencies, model, service, endpoint integration, tests, documentation) is the same for the other five.
- What you're building. Polarity and subjectivity scores attached to every article via TextBlob, stored in a new `article_sentiments` table, exposed via three endpoints (per-article score, aggregate trends, distribution summary), and surfaced as a query-parameter filter on the existing search endpoint.
Why TextBlob: it runs offline, needs no API keys, and gets you to a working end-to-end demo without paying for a managed model API. A production system with strict accuracy requirements would reach for a fine-tuned transformer (BERT, RoBERTa, or a domain-specific model); for the capstone the point is the integration shape, not the model choice.
- Step 1: install dependencies. Add TextBlob to `requirements.txt`, install it, and download its language corpora:
textblob==0.17.1
pip install -r requirements.txt
python -m textblob.download_corpora
- Step 2: sentiment table and SQLAlchemy model. One row per article, identified by `article_id` with a `UNIQUE` constraint so re-analysing updates rather than duplicates. Polarity is the signed score (-1.0 to 1.0); subjectivity is the orthogonal axis (0 = factual, 1 = opinion). The categorical label is derived from polarity in the service layer.
"""
Sentiment analysis database model.
"""
from sqlalchemy import Column, Integer, Float, String, ForeignKey, DateTime, Index
from sqlalchemy.sql import func

from app.database import Base


class ArticleSentiment(Base):
    """Sentiment scores for articles."""

    __tablename__ = "article_sentiments"

    id = Column(Integer, primary_key=True, index=True)
    article_id = Column(Integer, ForeignKey("articles.id", ondelete="CASCADE"),
                        nullable=False, unique=True)

    # Sentiment metrics
    polarity = Column(Float, nullable=False)  # -1.0 (negative) to 1.0 (positive)
    subjectivity = Column(Float, nullable=False)  # 0.0 (objective) to 1.0 (subjective)

    # Categorical classification
    sentiment_label = Column(String(20), nullable=False)  # 'positive', 'negative', 'neutral'
    confidence = Column(Float)  # How confident we are (based on polarity magnitude)

    # Analysis metadata
    analyzed_at = Column(DateTime(timezone=True), server_default=func.now())
    text_length = Column(Integer)  # Characters analyzed

    __table_args__ = (
        Index('idx_sentiment_label', 'sentiment_label'),
        Index('idx_polarity', 'polarity'),
        Index('idx_article_id', 'article_id'),
    )
Generate and apply the database migration:
alembic revision --autogenerate -m "add article sentiment analysis"
alembic upgrade head
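To see what the `UNIQUE` constraint on `article_id` buys you, here is a minimal sketch using the standard-library `sqlite3` module in place of the project's database. It is an illustration under stated assumptions: the table is a trimmed copy of the model above, `upsert_sentiment` is a hypothetical helper, and it uses SQLite's `ON CONFLICT ... DO UPDATE` clause (SQLite 3.24 or newer), which is one way to get update-rather-than-duplicate behaviour and not necessarily how the chapter's service does it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE article_sentiments (
        id INTEGER PRIMARY KEY,
        article_id INTEGER NOT NULL UNIQUE,
        polarity REAL NOT NULL,
        subjectivity REAL NOT NULL,
        sentiment_label TEXT NOT NULL
    )
    """
)


def upsert_sentiment(conn, article_id, polarity, subjectivity, label):
    """Insert a sentiment row, or overwrite the existing row for this article."""
    conn.execute(
        """
        INSERT INTO article_sentiments
            (article_id, polarity, subjectivity, sentiment_label)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(article_id) DO UPDATE SET
            polarity = excluded.polarity,
            subjectivity = excluded.subjectivity,
            sentiment_label = excluded.sentiment_label
        """,
        (article_id, polarity, subjectivity, label),
    )
    conn.commit()


# Analysing the same article twice leaves exactly one row with the latest scores.
upsert_sentiment(conn, 42, 0.05, 0.3, "neutral")
upsert_sentiment(conn, 42, 0.40, 0.6, "positive")
rows = conn.execute(
    "SELECT article_id, polarity, sentiment_label FROM article_sentiments"
).fetchall()
print(rows)  # [(42, 0.4, 'positive')]

# A plain INSERT for the same article is rejected by the constraint.
try:
    conn.execute(
        "INSERT INTO article_sentiments "
        "(article_id, polarity, subjectivity, sentiment_label) "
        "VALUES (42, 0.0, 0.0, 'neutral')"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The constraint is the safety net: even if application code forgets to check for an existing row, the database refuses a second row for the same article.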
- Step 3: sentiment analysis service. The service wraps TextBlob, derives the categorical label from polarity using thresholds, and writes through to the database with idempotent upsert semantics.
Polarity: ranges from -1.0 (very negative) to +1.0 (very positive). "This is terrible" scores at or near -1.0. "This is amazing" scores around +0.6. Neutral factual statements score near 0.0.
Subjectivity: ranges from 0.0 (objective fact) to 1.0 (subjective opinion). "The temperature is 72 degrees" scores 0.0. "I love sunny weather" scores in the upper half of the range. TextBlob's polarity signal is more meaningful on subjective text than on factual text.
Classification thresholds: polarity > 0.1 is positive, < -0.1 is negative, -0.1 to 0.1 is neutral. The dead zone around zero stops barely-positive text from being labelled positive.
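The threshold rule is small enough to check in isolation. This is a minimal sketch of the rule as stated above, with `classify_polarity` as a hypothetical helper name; note that the comparisons are strict, so a polarity of exactly 0.1 or -0.1 falls inside the dead zone.

```python
POSITIVE_THRESHOLD = 0.1
NEGATIVE_THRESHOLD = -0.1


def classify_polarity(polarity: float) -> str:
    """Map a polarity score in [-1.0, 1.0] to a categorical label."""
    if polarity > POSITIVE_THRESHOLD:
        return "positive"
    if polarity < NEGATIVE_THRESHOLD:
        return "negative"
    # Everything from -0.1 to 0.1 inclusive is treated as neutral.
    return "neutral"


if __name__ == "__main__":
    for score in (0.6, 0.1, 0.05, -0.1, -0.7):
        print(f"{score:+.2f} -> {classify_polarity(score)}")
```

Widening the dead zone (say, to 0.2 on either side) trades recall for precision on the positive and negative labels; the values here are the ones the chapter uses.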
Complete SentimentAnalyzer service implementation:
"""
Sentiment analysis service using TextBlob.
"""
from typing import Any, Dict, Optional

from textblob import TextBlob
from sqlalchemy.orm import Session

# This import must resolve to the SQLAlchemy model from Step 2
# (not a Pydantic schema), because the service queries and inserts with it.
from app.schemas.sentiment import ArticleSentiment


class SentimentAnalyzer:
    """Analyzes text sentiment and stores results."""

    # Classification thresholds
    POSITIVE_THRESHOLD = 0.1
    NEGATIVE_THRESHOLD = -0.1

    def analyze_text(self, text: str) -> Dict[str, Any]:
        """
        Analyze sentiment of given text.

        Args:
            text: Text to analyze

        Returns:
            Dictionary with polarity, subjectivity, sentiment_label,
            confidence, and text_length
        """
        if not text or not text.strip():
            return {
                "polarity": 0.0,
                "subjectivity": 0.0,
                "sentiment_label": "neutral",
                "confidence": 0.0,
                "text_length": 0
            }

        # Analyze with TextBlob
        blob = TextBlob(text)
        polarity = blob.sentiment.polarity
        subjectivity = blob.sentiment.subjectivity

        # Classify sentiment
        if polarity > self.POSITIVE_THRESHOLD:
            sentiment_label = "positive"
        elif polarity < self.NEGATIVE_THRESHOLD:
            sentiment_label = "negative"
        else:
            sentiment_label = "neutral"

        # Calculate confidence (how far from neutral)
        confidence = abs(polarity)

        return {
            "polarity": polarity,
            "subjectivity": subjectivity,
            "sentiment_label": sentiment_label,
            "confidence": confidence,
            "text_length": len(text)
        }

    def analyze_article(
        self,
        article_id: int,
        title: str,
        description: Optional[str],
        content: Optional[str],
        db: Session
    ) -> ArticleSentiment:
        """
        Analyze article sentiment and store in database.

        Args:
            article_id: ID of article to analyze
            title: Article title
            description: Article description/summary
            content: Full article content (if available)
            db: Database session

        Returns:
            ArticleSentiment database object
        """
        # Combine all available text
        text_parts = [title]
        if description:
            text_parts.append(description)
        if content:
            text_parts.append(content)
        combined_text = " ".join(text_parts)

        # Analyze sentiment
        sentiment_data = self.analyze_text(combined_text)

        # Check if sentiment already exists
        existing = db.query(ArticleSentiment).filter(
            ArticleSentiment.article_id == article_id
        ).first()

        if existing:
            # Update existing record
            existing.polarity = sentiment_data["polarity"]
            existing.subjectivity = sentiment_data["subjectivity"]
            existing.sentiment_label = sentiment_data["sentiment_label"]
            existing.confidence = sentiment_data["confidence"]
            existing.text_length = sentiment_data["text_length"]
            db.commit()
            db.refresh(existing)
            return existing

        # Create new sentiment record
        sentiment = ArticleSentiment(
            article_id=article_id,
            polarity=sentiment_data["polarity"],
            subjectivity=sentiment_data["subjectivity"],
            sentiment_label=sentiment_data["sentiment_label"],
            confidence=sentiment_data["confidence"],
            text_length=sentiment_data["text_length"]
        )
        db.add(sentiment)
        db.commit()
        db.refresh(sentiment)
        return sentiment
- Steps 4-8: integration and testing.
  - Step 4: storage hook. Call the analyzer from the article-write path so every new article is analysed when it is stored. Re-analysis is idempotent: the service updates the existing row instead of inserting a second one, and the `UNIQUE(article_id)` constraint enforces one row per article at the database level.
  - Step 5: sentiment endpoints. `GET /articles/{id}/sentiment` for the per-article score, `GET /articles/sentiment/trends` for time-series aggregates, `GET /articles/sentiment/summary` for the distribution snapshot.
  - Step 6: search filter. Add a `?sentiment=positive|negative|neutral` query parameter to the existing search endpoint; the join goes against the new table.
  - Step 7: Pydantic response models. One per endpoint, so the OpenAPI schema documents the response shape and runtime validation catches drift.
  - Step 8: tests. The unit tests cover positive, negative, neutral, and empty-text cases against the analyzer; the integration tests hit the three endpoints and the search filter end-to-end.
- Steps 9-10: documentation and demo. Add a sentiment-analysis section to the README covering the model choice, the threshold rationale, the three endpoints, and a usage example. Record a short demo video showing a mixed-sentiment search, the same search filtered to positives, the trend endpoint, and the per-article detail.
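The queries behind Steps 5 and 6 can be prototyped before any endpoint exists. The sketch below uses the standard-library `sqlite3` module with a trimmed, hypothetical schema: the `title` and `published_on` columns on `articles`, the sample rows, and the helper names `search`, `summary`, and `trends` are all assumptions made for illustration, not the platform's actual schema or API.

```python
import sqlite3
from typing import Dict, List, Optional, Tuple

ALLOWED_LABELS = {"positive", "negative", "neutral"}


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with four articles and their sentiment rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE articles (
            id INTEGER PRIMARY KEY,
            title TEXT NOT NULL,
            published_on TEXT NOT NULL
        );
        CREATE TABLE article_sentiments (
            article_id INTEGER NOT NULL UNIQUE REFERENCES articles(id),
            polarity REAL NOT NULL,
            sentiment_label TEXT NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO articles VALUES (?, ?, ?)",
        [
            (1, "Solar output hits record", "2024-05-01"),
            (2, "Floods damage coastal towns", "2024-05-01"),
            (3, "Council publishes budget", "2024-05-02"),
            (4, "Reef recovery beats forecasts", "2024-05-02"),
        ],
    )
    conn.executemany(
        "INSERT INTO article_sentiments VALUES (?, ?, ?)",
        [(1, 0.5, "positive"), (2, -0.5, "negative"),
         (3, 0.0, "neutral"), (4, 0.25, "positive")],
    )
    return conn


def search(conn: sqlite3.Connection,
           sentiment: Optional[str] = None) -> List[Tuple[int, str]]:
    """Step 6: join against the sentiment table only when a filter is requested."""
    sql = "SELECT a.id, a.title FROM articles a"
    params: Tuple[str, ...] = ()
    if sentiment is not None:
        if sentiment not in ALLOWED_LABELS:
            raise ValueError(f"unknown sentiment: {sentiment!r}")
        sql += (" JOIN article_sentiments s ON s.article_id = a.id"
                " WHERE s.sentiment_label = ?")
        params = (sentiment,)
    return conn.execute(sql + " ORDER BY a.id", params).fetchall()


def summary(conn: sqlite3.Connection) -> Dict[str, int]:
    """Step 5 (summary): number of articles per label."""
    rows = conn.execute(
        "SELECT sentiment_label, COUNT(*) FROM article_sentiments "
        "GROUP BY sentiment_label"
    ).fetchall()
    return dict(rows)


def trends(conn: sqlite3.Connection) -> List[Tuple[str, float, int]]:
    """Step 5 (trends): average polarity and article count per publication day."""
    return conn.execute(
        "SELECT a.published_on, AVG(s.polarity), COUNT(*) "
        "FROM articles a JOIN article_sentiments s ON s.article_id = a.id "
        "GROUP BY a.published_on ORDER BY a.published_on"
    ).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(search(db, "positive"))
    print(summary(db))
    print(trends(db))
```

These same three queries make reasonable integration-test fixtures for Step 8: seed a handful of rows with known labels, then assert on the filtered search, the distribution, and the per-day averages.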
Additional extension options
The remaining extensions follow the same shape as the sentiment walkthrough: dependencies, model, service, endpoint integration, tests, README. The specifications below give enough detail to scope each one and pick.
- B: real-time alerts (8-10 hours). WebSocket subscriptions per user keyword. Architecture: Celery worker polls upstream APIs every 5 minutes; new articles are matched against the user-subscription table; matches publish to a Redis pub/sub channel; the WebSocket endpoint fans out to subscribed connections; a small HTML/JS demo page consumes the channel. Surface area: a `/ws/alerts` endpoint with JWT auth, a `subscriptions` table keyed by user and keyword, the worker, and the front-end.
- C: smart recommendations (7-9 hours). Two recommendation paths, surfaced through one endpoint. Algorithms: collaborative filtering (users who read X also read Y) and content-based filtering (articles similar to ones the user has saved), with popularity-weighted ranking as the tiebreaker. Surface area: an `article_interactions` table (click, save, time-on-article), similarity computation via TF-IDF or precomputed embeddings, a `/recommendations` endpoint, and a small A/B harness so the two algorithms can be compared on the same user set.
- D: email digests (6-8 hours). Scheduled per-user digests. Components: Celery Beat for the schedule, SendGrid (or SES) for delivery, an HTML email template with the article cards, a per-user preference table (frequency, topics, delivery time), an unsubscribe token, and webhook handling for bounces, opens, and clicks.
- E: advanced caching (8-10 hours). Multi-tier cache with observable hit rates. Components: L1 in-memory LRU (bounded to ~1000 entries), L2 Redis with a 5-minute TTL, a warmer that pre-populates the top-100 queries on a schedule, invalidation triggered by the article ingestion path, and CloudWatch custom metrics for per-tier hit rate, average response time with and without each tier, and L1 eviction rate.
- F: multi-language support (10-12 hours). Add four international sources (BBC Mundo, Le Monde, Deutsche Welle, Al Jazeera), detect article language with `langdetect`, translate via the Google Translate API, and cache the translation against the article row. Complexity drivers: four new client classes, non-Latin character handling, RTL display in the demo UI, translation-cost optimisation so the API bill doesn't grow linearly with traffic.
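To make the caching extension (E) concrete, here is a minimal sketch of a two-tier lookup with per-tier counters. It is an illustration under stated assumptions: `TwoTierCache` and its methods are hypothetical names, a plain dictionary with expiry timestamps stands in for Redis, the counters would be pushed to CloudWatch in the real extension, and L1 entries have no TTL of their own, so they leave only through LRU eviction or explicit invalidation.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Dict, Tuple


class TwoTierCache:
    """L1 in-memory LRU in front of an L2 store with a TTL (dict stand-in for Redis)."""

    def __init__(self, l1_max_entries: int = 1000, l2_ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._l1: "OrderedDict[str, Any]" = OrderedDict()
        self._l2: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)
        self._l1_max = l1_max_entries
        self._l2_ttl = l2_ttl_seconds
        self._clock = clock
        self.stats = {"l1_hits": 0, "l2_hits": 0, "misses": 0, "l1_evictions": 0}

    def get(self, key: str, loader: Callable[[], Any]) -> Any:
        """Return the cached value, calling loader() only when both tiers miss."""
        if key in self._l1:
            self._l1.move_to_end(key)  # mark as most recently used
            self.stats["l1_hits"] += 1
            return self._l1[key]

        entry = self._l2.get(key)
        if entry is not None and entry[0] > self._clock():
            self.stats["l2_hits"] += 1
            value = entry[1]
        else:
            self.stats["misses"] += 1
            value = loader()
            self._l2[key] = (self._clock() + self._l2_ttl, value)

        self._store_l1(key, value)
        return value

    def _store_l1(self, key: str, value: Any) -> None:
        self._l1[key] = value
        self._l1.move_to_end(key)
        while len(self._l1) > self._l1_max:
            self._l1.popitem(last=False)  # drop the least recently used entry
            self.stats["l1_evictions"] += 1

    def invalidate(self, key: str) -> None:
        """Drop a key from both tiers, e.g. when new articles are ingested."""
        self._l1.pop(key, None)
        self._l2.pop(key, None)


calls = []


def load(key: str) -> str:
    """Stand-in for the expensive search query."""
    calls.append(key)
    return key.upper()


cache = TwoTierCache(l1_max_entries=2)
for key in ("a", "a", "b", "c", "a"):
    cache.get(key, lambda k=key: load(k))
print(cache.stats)  # {'l1_hits': 1, 'l2_hits': 1, 'misses': 3, 'l1_evictions': 2}
```

With L1 capped at two entries, the final lookup of "a" has been evicted from L1 but is still live in L2, which is exactly the case the per-tier hit-rate metrics are meant to make visible.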
If you ship two extensions, pick a pair that share data. Sentiment analysis plus real-time alerts lets the alert carry a sentiment label; smart recommendations plus advanced caching gives the recommendation engine a cache surface that's worth measuring; email digests plus sentiment analysis filters the digest by tone. Pairs that don't share data add coordination cost without any payoff.
Next, in section 6, we treat the README, the architecture diagram, the demo video, and the written reflection as deliverables the same way the code is: each one has to land for the project to be runnable and explainable by someone who's never seen it.