5. Phase 3: choose your extension

Phase 3 is one feature, picked with intent and shipped well. The core platform from Phase 1 plus Phase 2 already exercises every pattern the book covers; the extension is where you go a layer deeper into one area that interests you, rather than spreading thinly across all of them.

The point isn't to demonstrate every option. Each extension below has its own technical centre of gravity (machine learning, real-time messaging, recommendation algorithms, scheduled jobs, multi-tier caching, internationalisation); picking one and seeing it through end-to-end teaches more than picking three and finishing none. The chapter walks sentiment analysis as the worked example because it has the smallest surface area; the others are sketched in enough detail to scope.

Extension options overview

The core platform is the same across every reader; the extension is the one place where the projects diverge. The factors below are the ones that actually decide it.

Career interests: real-time systems point at the WebSocket alerts extension; ML and data science point at sentiment analysis; infrastructure work points at advanced caching or email scheduling.

Learning goals: pick the extension that takes you furthest from what you've already done. If you're comfortable with CRUD but have never touched NLP, sentiment analysis is the right call; if you're comfortable with backend work but have never built a stateful front-end, choose the WebSocket alerts extension.

Time availability: sentiment analysis is the smallest commitment (5-6 hours); multi-language support is the largest (10-12 hours). Underestimating scope is the most common failure mode in this phase; the time estimates below assume the core platform is already running on AWS.

Extension feature matrix.

| Extension             | Complexity | Time est. | Technical skills                       | Career path                     |
|-----------------------|------------|-----------|----------------------------------------|---------------------------------|
| Sentiment analysis    |            | 5-6 hrs   | ML integration, data analysis          | Data/ML engineering             |
| Real-time alerts      | ☆☆         | 8-10 hrs  | WebSockets, async, pub/sub             | Real-time systems, frontend     |
| Smart recommendations | ☆☆         | 7-9 hrs   | Algorithms, user modeling              | Recommendation systems, product |
| Email digests         | ☆☆         | 6-8 hrs   | Background jobs, email services        | Platform/infrastructure         |
| Advanced caching      | ☆☆☆        | 8-10 hrs  | Performance optimization, metrics      | Performance/SRE                 |
| Multi-language        | ☆☆☆        | 10-12 hrs | Internationalisation, translation APIs | Global products                 |

  • What each extension adds.
  • A: sentiment analysis. Polarity and subjectivity scores attached to every article, plus aggregate trend queries over time. Filtering articles by sentiment becomes a query parameter; "positive coverage of climate stories in the past 7 days" is now a single endpoint call.
  • B: real-time alerts. WebSocket subscriptions keyed by user keyword. A background worker polls the upstream APIs on a tight interval and publishes matching new articles onto a Redis pub/sub channel; the WebSocket endpoint then fans each message out to the subscribed clients. Front-end and back-end work together; the front-end is a small HTML/JS demo, not a production UI.
  • C: smart recommendations. Two algorithms running side by side: collaborative filtering (users who read X also read Y) and content-based filtering (articles similar to ones the user has saved), with a popularity-weighted ranking layer on top. Interaction tracking goes through the existing search_history table plus a new article_interactions table.
  • D: email digests. Scheduled background jobs (Celery Beat) generate per-user email summaries by topic preference, delivered through SendGrid. The work is in the infrastructure: scheduling, templating, unsubscribe handling, bounce processing, open and click tracking.
  • E: advanced caching. A multi-tier cache (in-memory LRU as L1, Redis as L2) with cache warming for the platform's top queries, invalidation on new-article ingestion, and per-tier hit-rate metrics surfaced as a CloudWatch custom dashboard.
  • F: multi-language support. Add international sources (BBC Mundo, Le Monde, Deutsche Welle, Al Jazeera), detect article language with langdetect, translate on the fly through the Google Translate API with the translation cached against the article. Language-based filtering becomes another query parameter.
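
To make the fan-out step in extension B concrete, here is a minimal in-process sketch. It assumes nothing about the platform's actual code: AlertHub, subscribe, and publish are hypothetical names, an asyncio.Queue stands in for each WebSocket connection, and the Redis channel is replaced by a direct method call. Keyword matching is a plain substring test on the title, which a real worker would likely refine.

```python
import asyncio
from collections import defaultdict
from typing import Dict, List, Set


class AlertHub:
    """Routes published articles to per-connection queues, keyed by keyword."""

    def __init__(self) -> None:
        # keyword -> queues, one queue per subscribed connection (hypothetical layout)
        self._subscribers: Dict[str, Set[asyncio.Queue]] = defaultdict(set)

    def subscribe(self, keyword: str) -> asyncio.Queue:
        """Register a connection for a keyword and return its private queue."""
        queue: asyncio.Queue = asyncio.Queue()
        self._subscribers[keyword.lower()].add(queue)
        return queue

    def unsubscribe(self, keyword: str, queue: asyncio.Queue) -> None:
        """Drop a connection's queue, e.g. when the socket closes."""
        self._subscribers[keyword.lower()].discard(queue)

    def publish(self, article: dict) -> int:
        """Fan the article out to every queue whose keyword appears in the title."""
        title = article["title"].lower()
        delivered = 0
        for keyword, queues in self._subscribers.items():
            if keyword in title:
                for queue in queues:
                    queue.put_nowait(article)
                    delivered += 1
        return delivered


async def demo() -> List[str]:
    hub = AlertHub()
    climate = hub.subscribe("climate")
    sport = hub.subscribe("sport")
    hub.publish({"title": "Climate summit opens in Geneva"})
    received = [(await climate.get())["title"]]
    assert sport.empty()  # the sport subscriber saw nothing
    return received
```

Running asyncio.run(demo()) returns the single title delivered to the "climate" subscriber. In the extension itself, the call to publish would be driven by messages arriving on the Redis channel, and each queue would be drained by the coroutine that owns a WebSocket connection.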

Extension A: sentiment analysis (worked example)

This is the extension the chapter walks end-to-end; the implementation pattern (dependencies, model, service, endpoint integration, tests, documentation) is the same for the other five.

  • What you're building. Polarity and subjectivity scores attached to every article via TextBlob, stored in a new article_sentiments table, exposed via three endpoints (per-article score, aggregate trends, distribution summary), and surfaced as a query-parameter filter on the existing search endpoint.

Why TextBlob: it runs offline, needs no API keys, and gets you to a working end-to-end demo without paying for a managed model API. A production system with strict accuracy requirements would reach for a fine-tuned transformer (BERT, RoBERTa, or a domain-specific model); for the capstone the point is the integration shape, not the model choice.

  • Step 1: install dependencies. Add TextBlob to requirements.txt and download its language corpora:
requirements.txt (add this)
textblob==0.17.1
Terminal
pip install -r requirements.txt
python -m textblob.download_corpora
  • Step 2: sentiment table and SQLAlchemy model. One row per article, identified by article_id with a UNIQUE constraint so re-analysing updates rather than duplicates. Polarity is the signed magnitude (-1.0 to 1.0); subjectivity is the orthogonal axis (0 = factual, 1 = opinion). The categorical label is derived from polarity in the service layer.
app/schemas/sentiment.py
"""
Sentiment analysis database model.
"""
from sqlalchemy import Column, Integer, Float, String, ForeignKey, DateTime, Index
from sqlalchemy.sql import func
from app.database import Base


class ArticleSentiment(Base):
    """Sentiment scores for articles."""
    __tablename__ = "article_sentiments"

    id = Column(Integer, primary_key=True, index=True)
    article_id = Column(Integer, ForeignKey("articles.id", ondelete="CASCADE"), 
                       nullable=False, unique=True)

    # Sentiment metrics
    polarity = Column(Float, nullable=False)  # -1.0 (negative) to 1.0 (positive)
    subjectivity = Column(Float, nullable=False)  # 0.0 (objective) to 1.0 (subjective)

    # Categorical classification
    sentiment_label = Column(String(20), nullable=False)  # 'positive', 'negative', 'neutral'
    confidence = Column(Float)  # How confident we are (based on polarity magnitude)

    # Analysis metadata
    analyzed_at = Column(DateTime(timezone=True), server_default=func.now())
    text_length = Column(Integer)  # Characters analyzed

    __table_args__ = (
        Index('idx_sentiment_label', 'sentiment_label'),
        Index('idx_polarity', 'polarity'),
        # No explicit index on article_id: unique=True above already creates one.
    )

Generate and apply the database migration:

Terminal
alembic revision --autogenerate -m "add article sentiment analysis"
alembic upgrade head
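
To see what the UNIQUE constraint on article_id buys once the migration is applied, the sketch below replays a re-analysis against an in-memory SQLite database. It is an illustration under assumptions, not the platform's code: SQLite stands in for the real database, the table is trimmed to three columns, and the ON CONFLICT upsert (available in SQLite 3.24 and later) is a database-level alternative to the query-then-update approach the service in Step 3 uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE article_sentiments (
        id INTEGER PRIMARY KEY,
        article_id INTEGER NOT NULL UNIQUE,
        polarity REAL NOT NULL,
        sentiment_label TEXT NOT NULL
    )
    """
)

# Insert a row, or overwrite the scores if the article already has one.
UPSERT = """
    INSERT INTO article_sentiments (article_id, polarity, sentiment_label)
    VALUES (?, ?, ?)
    ON CONFLICT(article_id) DO UPDATE SET
        polarity = excluded.polarity,
        sentiment_label = excluded.sentiment_label
"""

conn.execute(UPSERT, (42, 0.05, "neutral"))
conn.execute(UPSERT, (42, 0.40, "positive"))  # re-analysis of the same article

rows = conn.execute(
    "SELECT article_id, polarity, sentiment_label FROM article_sentiments"
).fetchall()
print(rows)  # [(42, 0.4, 'positive')]
```

The second write replaces the first instead of adding a row, so the table still holds exactly one entry for article 42.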
  • Step 3: sentiment analysis service. The service wraps TextBlob, derives the categorical label from polarity using thresholds, and writes through to the database with idempotent upsert semantics.

Polarity: ranges from -1.0 (very negative) to +1.0 (very positive). With TextBlob's default English lexicon, "This is terrible" scores -1.0 and "This is amazing" scores about +0.6. Neutral factual statements score near 0.0.

Subjectivity: ranges from 0.0 (objective fact) to 1.0 (subjective opinion). "The temperature is 72 degrees" scores 0.0. "I love sunny weather" scores 0.8. TextBlob's polarity signal is more meaningful on subjective text than on factual text.

Classification thresholds: polarity > 0.1 is positive, < -0.1 is negative, and -0.1 to 0.1 is neutral. The dead zone around zero stops barely-positive or barely-negative text from receiving a polar label.
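
The boundary behaviour is easy to get wrong, so here is a small standalone sketch of the rule. label_for is a hypothetical helper name used only for this illustration; the thresholds are the ones stated above, and both comparisons are strict, so a score of exactly 0.1 or -0.1 lands in the neutral band.

```python
POSITIVE_THRESHOLD = 0.1
NEGATIVE_THRESHOLD = -0.1


def label_for(polarity: float) -> str:
    """Map a polarity score in [-1.0, 1.0] to a categorical label."""
    if polarity > POSITIVE_THRESHOLD:
        return "positive"
    if polarity < NEGATIVE_THRESHOLD:
        return "negative"
    return "neutral"


for score in (0.6, 0.1, 0.05, -0.1, -0.35):
    print(f"{score:+.2f} -> {label_for(score)}")
# +0.60 -> positive
# +0.10 -> neutral
# +0.05 -> neutral
# -0.10 -> neutral
# -0.35 -> negative
```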

Complete SentimentAnalyzer service implementation:
app/services/sentiment_analyzer.py
"""
Sentiment analysis service using TextBlob.
"""
from typing import Any, Dict, Optional
from textblob import TextBlob
from sqlalchemy.orm import Session
from app.schemas.sentiment import ArticleSentiment


class SentimentAnalyzer:
    """Analyzes text sentiment and stores results."""

    # Classification thresholds
    POSITIVE_THRESHOLD = 0.1
    NEGATIVE_THRESHOLD = -0.1

    # Values are mixed: floats for the scores, str for the label, int for the length
    def analyze_text(self, text: str) -> Dict[str, Any]:
        """
        Analyze sentiment of given text.
        
        Args:
            text: Text to analyze
            
        Returns:
            Dictionary with polarity, subjectivity, sentiment_label, confidence
        """
        if not text or not text.strip():
            return {
                "polarity": 0.0,
                "subjectivity": 0.0,
                "sentiment_label": "neutral",
                "confidence": 0.0,
                "text_length": 0
            }
        
        # Analyze with TextBlob
        blob = TextBlob(text)
        polarity = blob.sentiment.polarity
        subjectivity = blob.sentiment.subjectivity
        
        # Classify sentiment
        if polarity > self.POSITIVE_THRESHOLD:
            sentiment_label = "positive"
        elif polarity < self.NEGATIVE_THRESHOLD:
            sentiment_label = "negative"
        else:
            sentiment_label = "neutral"
        
        # Calculate confidence (how far from neutral)
        confidence = abs(polarity)
        
        return {
            "polarity": polarity,
            "subjectivity": subjectivity,
            "sentiment_label": sentiment_label,
            "confidence": confidence,
            "text_length": len(text)
        }
    
    def analyze_article(
        self, 
        article_id: int, 
        title: str, 
        description: Optional[str],
        content: Optional[str],
        db: Session
    ) -> ArticleSentiment:
        """
        Analyze article sentiment and store in database.
        
        Args:
            article_id: ID of article to analyze
            title: Article title
            description: Article description/summary
            content: Full article content (if available)
            db: Database session
            
        Returns:
            ArticleSentiment database object
        """
        # Combine all available text
        text_parts = [title]
        if description:
            text_parts.append(description)
        if content:
            text_parts.append(content)
        
        combined_text = " ".join(text_parts)
        
        # Analyze sentiment
        sentiment_data = self.analyze_text(combined_text)
        
        # Check if sentiment already exists
        existing = db.query(ArticleSentiment).filter(
            ArticleSentiment.article_id == article_id
        ).first()
        
        if existing:
            # Update existing record
            existing.polarity = sentiment_data["polarity"]
            existing.subjectivity = sentiment_data["subjectivity"]
            existing.sentiment_label = sentiment_data["sentiment_label"]
            existing.confidence = sentiment_data["confidence"]
            existing.text_length = sentiment_data["text_length"]
            db.commit()
            db.refresh(existing)
            return existing
        
        # Create new sentiment record
        sentiment = ArticleSentiment(
            article_id=article_id,
            polarity=sentiment_data["polarity"],
            subjectivity=sentiment_data["subjectivity"],
            sentiment_label=sentiment_data["sentiment_label"],
            confidence=sentiment_data["confidence"],
            text_length=sentiment_data["text_length"]
        )
        
        db.add(sentiment)
        db.commit()
        db.refresh(sentiment)
        
        return sentiment
  • Steps 4-8: integration and testing.
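  • Step 4: storage hook. Call the analyzer from the article-write path so every new article is analysed as it is stored. Re-analysis is idempotent: the service updates the existing row in place, and the UNIQUE(article_id) constraint guarantees a second row for the same article can never be inserted.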
  • Step 5: sentiment endpoints. GET /articles/{id}/sentiment for the per-article score, GET /articles/sentiment/trends for time-series aggregates, GET /articles/sentiment/summary for the distribution snapshot.
  • Step 6: search filter. Add a ?sentiment=positive|negative|neutral query parameter to the existing search endpoint; the join goes against the new table.
  • Step 7: Pydantic response models. One per endpoint, so the OpenAPI schema documents the response shape and runtime validation catches drift.
  • Step 8: tests. The unit tests cover positive, negative, neutral, and empty-text cases against the analyzer; the integration tests hit the three endpoints and the search filter end-to-end.
  • Steps 9-10: documentation and demo. Add a sentiment-analysis section to the README covering the model choice, the threshold rationale, the three endpoints, and a usage example. Record a short demo video showing a mixed-sentiment search, the same search filtered to positives, the trend endpoint, and the per-article detail.
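
The search filter from Step 6 and the distribution summary from Step 5 both reduce to short queries against the new table. The sketch below shows their shape using the standard-library sqlite3 module in place of the platform's SQLAlchemy session; the trimmed schema, the sample rows, and the function names search and summary are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE article_sentiments (
        article_id INTEGER NOT NULL UNIQUE REFERENCES articles(id),
        polarity REAL NOT NULL,
        sentiment_label TEXT NOT NULL
    );
    INSERT INTO articles VALUES
        (1, 'Climate deal praised as a breakthrough'),
        (2, 'Climate talks stall again'),
        (3, 'Climate report published');
    INSERT INTO article_sentiments VALUES
        (1, 0.55, 'positive'),
        (2, -0.30, 'negative'),
        (3, 0.00, 'neutral');
    """
)


def search(keyword, sentiment=None):
    """Title search; joins the sentiment table only when a filter is requested."""
    sql = "SELECT a.id, a.title FROM articles a"
    params = [f"%{keyword}%"]
    if sentiment is not None:
        sql += " JOIN article_sentiments s ON s.article_id = a.id"
    sql += " WHERE a.title LIKE ?"
    if sentiment is not None:
        sql += " AND s.sentiment_label = ?"
        params.append(sentiment)
    sql += " ORDER BY a.id"
    return conn.execute(sql, params).fetchall()


def summary():
    """Number of analysed articles per sentiment label."""
    rows = conn.execute(
        "SELECT sentiment_label, COUNT(*) FROM article_sentiments "
        "GROUP BY sentiment_label"
    ).fetchall()
    return dict(rows)
```

Here search("Climate") returns all three articles, while search("Climate", "positive") returns only the first. Adding the join only when the filter is present keeps unfiltered searches from silently dropping articles that have not been analysed yet.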

Additional extension options

The remaining extensions follow the same shape as the sentiment walkthrough: dependencies, model, service, endpoint integration, tests, README. The specifications below give enough detail to scope each one and pick.

  • B: real-time alerts (8-10 hours). WebSocket subscriptions per user keyword. Architecture: Celery worker polls upstream APIs every 5 minutes; new articles are matched against the user-subscription table; matches publish to a Redis pub/sub channel; the WebSocket endpoint fans out to subscribed connections; a small HTML/JS demo page consumes the channel. Surface area: a /ws/alerts endpoint with JWT auth, a subscriptions table keyed by user and keyword, the worker, and the front-end.
  • C: smart recommendations (7-9 hours). Two recommendation paths, surfaced through one endpoint. Algorithms: collaborative filtering (users who read X also read Y) and content-based filtering (articles similar to ones the user has saved), with popularity-weighted ranking as the tiebreaker. Surface area: an article_interactions table (click, save, time-on-article), similarity computation via TF-IDF or precomputed embeddings, a /recommendations endpoint, and a small A/B harness so the two algorithms can be compared on the same user set.
  • D: email digests (6-8 hours). Scheduled per-user digests. Components: Celery Beat for the schedule, SendGrid (or SES) for delivery, an HTML email template with the article cards, a per-user preference table (frequency, topics, delivery time), an unsubscribe token, and webhook handling for bounces, opens, and clicks.
  • E: advanced caching (8-10 hours). Multi-tier cache with observable hit rates. Components: L1 in-memory LRU (bounded to ~1000 entries), L2 Redis with a 5-minute TTL, a warmer that pre-populates the top-100 queries on a schedule, invalidation triggered by the article ingestion path, and CloudWatch custom metrics for per-tier hit rate, average response time with and without each tier, and L1 eviction rate.
  • F: multi-language support (10-12 hours). Add four international sources (BBC Mundo, Le Monde, Deutsche Welle, Al Jazeera), detect article language with langdetect, translate via the Google Translate API and cache the translation against the article row. Complexity drivers: four new client classes, non-Latin character handling, RTL display in the demo UI, translation-cost optimisation so the API bill doesn't grow linearly with traffic.
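
As a scoping aid for extension E, the two tiers and their counters can be sketched in plain Python. This is a simplified model under stated assumptions: TwoTierCache and its method names are hypothetical, a dict with expiry timestamps stands in for Redis, and the stats dict stands in for the CloudWatch metrics. The defaults mirror the figures above (about 1000 L1 entries, a 5-minute L2 TTL).

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Dict, Optional, Tuple


class TwoTierCache:
    """L1: bounded in-memory LRU. L2: TTL store (a dict standing in for Redis)."""

    def __init__(self, l1_size: int = 1000, l2_ttl: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.l1: "OrderedDict[str, Any]" = OrderedDict()
        self.l2: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)
        self.l1_size = l1_size
        self.l2_ttl = l2_ttl
        self.clock = clock
        self.stats = {"l1_hit": 0, "l2_hit": 0, "miss": 0, "l1_evict": 0}

    def _put_l1(self, key: str, value: Any) -> None:
        self.l1[key] = value
        self.l1.move_to_end(key)  # mark as most recently used
        if len(self.l1) > self.l1_size:
            self.l1.popitem(last=False)  # drop the least recently used entry
            self.stats["l1_evict"] += 1

    def get(self, key: str) -> Optional[Any]:
        if key in self.l1:
            self.l1.move_to_end(key)
            self.stats["l1_hit"] += 1
            return self.l1[key]
        entry = self.l2.get(key)
        if entry is not None and entry[0] > self.clock():
            self.stats["l2_hit"] += 1
            self._put_l1(key, entry[1])  # promote the value into L1
            return entry[1]
        self.stats["miss"] += 1
        return None

    def set(self, key: str, value: Any) -> None:
        self._put_l1(key, value)
        self.l2[key] = (self.clock() + self.l2_ttl, value)

    def invalidate(self, key: str) -> None:
        """Called from the ingestion path when new articles make a result stale."""
        self.l1.pop(key, None)
        self.l2.pop(key, None)
```

Note that L1 entries in this sketch carry no TTL of their own, so freshness depends on invalidate being called from the ingestion path; a fuller version would probably give L1 a short expiry as well. Per-tier hit rate falls out of the counters, for example l1_hit divided by the total of l1_hit, l2_hit, and miss.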

If you ship two extensions, pick a pair that share data. Sentiment analysis plus real-time alerts lets the alert carry a sentiment label; smart recommendations plus advanced caching gives the recommendation engine a cache surface that's worth measuring; email digests plus sentiment analysis filters the digest by tone. Pairs that don't share data add coordination cost without any payoff.

Next, in section 6, we treat the README, the architecture diagram, the demo video, and the written reflection as deliverables the same way the code is: each one has to land for the project to be runnable and explainable by someone who's never seen it.