If the interviewer asks for search as a follow-up, here’s the shape. Twitter’s real system is called Earlybird; you can describe a simplified version that captures the same ideas.
The problem
Given a query like "climate change", return matching tweets in (mostly) reverse-chronological order, in under 200ms, across hundreds of billions of tweets.
Two access patterns: keyword match, and time-ranged keyword match.
Inverted index
The core data structure. For each term, store a posting list of tweet IDs that contain it.
"climate" → [t_99213, t_99201, t_99012, ...]
"change" → [t_99213, t_99005, t_98870, ...]
A query for "climate change" intersects the two posting lists. Sort by tweet ID descending (Snowflake IDs ≈ time-sorted) and return the top N.
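The intersection is a two-pointer walk over the sorted lists. A minimal sketch (list values and function names are illustrative, not Earlybird's actual API), assuming posting lists are kept sorted descending by tweet ID:

```python
def intersect(a, b, limit):
    """Intersect two posting lists sorted descending by tweet ID.

    Walks both lists with two pointers; matching IDs are emitted
    until `limit` results are found or either list is exhausted.
    """
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b) and len(out) < limit:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] > b[j]:  # descending order: advance past the larger ID
            i += 1
        else:
            j += 1
    return out

climate = [99213, 99201, 99012, 98870]
change  = [99213, 99005, 98870, 98500]
intersect(climate, change, 20)  # → [99213, 98870]
```

Real engines speed this up with skip lists so the pointer can jump ahead instead of advancing one entry at a time, which is exactly the custom structure the Earlybird note below refers to.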
This is what Lucene/Elasticsearch do. You can use Elasticsearch as a starting point and explain its role. At Twitter scale you’d build something custom (Earlybird is in-memory Lucene with custom skip lists) — mention it but don’t pretend to design it from scratch.
Sharding the index
By time, not by term:
- Each shard owns a time range (e.g. one shard per day, or hot shard = last 7 days, warm = last 30, cold = older).
- Queries fan out to relevant shards, results merged at a coordinator.
- Most queries care about recent tweets, so most queries hit the hot shard only.
- Old data can move to cheaper storage; you rarely query it.
Sharding by term would let you target shards precisely — but term frequency follows a Zipf distribution, so a few shards (the ones holding “the”, “trump”, “lol”) would be permanent hotspots.
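The routing logic for time-sharding is just an interval-overlap check. A sketch under the hot/warm/cold tiering above (shard names, bounds, and the `shards_for` helper are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Shard:
    name: str
    start: int  # inclusive lower bound (epoch seconds)
    end: int    # exclusive upper bound

# Hypothetical tiering: hot = last 7 days, warm = last 30, cold = older.
NOW = 1_700_000_000
DAY = 86_400
SHARDS = [
    Shard("hot",  NOW - 7 * DAY,  NOW + DAY),
    Shard("warm", NOW - 30 * DAY, NOW - 7 * DAY),
    Shard("cold", 0,              NOW - 30 * DAY),
]

def shards_for(query_start, query_end):
    """Return shards whose time range overlaps [query_start, query_end)."""
    return [s.name for s in SHARDS if s.start < query_end and query_start < s.end]

shards_for(NOW - 3 * DAY, NOW)   # → ['hot']
shards_for(NOW - 60 * DAY, NOW)  # → ['hot', 'warm', 'cold']
```

The second call shows why wide time-range queries are expensive: they fan out to every tier, while the common "what's happening now" query touches a single shard.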
Real-time indexing
When a tweet is posted, it must be searchable within seconds.
Tweet Service → Kafka topic "tweets" → Search Indexer
                                              │
                                              ▼
                                 Hot index shard (in-memory)
The Search Indexer:
- Tokenizes the tweet text (lowercase, split on whitespace, strip punctuation, optionally stem).
- For each term, appends the tweet ID to that term’s posting list in the hot shard.
- Periodically flushes the hot shard to disk and starts a new in-memory segment.
Posting lists are append-only — perfect for a log-structured layout.
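The tokenize-and-append steps are small enough to sketch end to end. A toy version of the indexer's hot segment (the regex and `index_tweet` helper are illustrative; real tokenizers handle Unicode, URLs, and stemming):

```python
import re
from collections import defaultdict

# In-memory hot segment: term → append-only posting list of tweet IDs.
# Snowflake IDs arrive roughly in time order, so appends keep lists sorted.
hot_segment = defaultdict(list)

def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace (no stemming here)."""
    return re.findall(r"[a-z0-9#@]+", text.lower())

def index_tweet(tweet_id, text):
    for term in set(tokenize(text)):  # dedupe terms within one tweet
        hot_segment[term].append(tweet_id)

index_tweet(99213, "Climate change is accelerating")
index_tweet(99214, "No change in the forecast")
hot_segment["change"]  # → [99213, 99214]
```

Because appends only ever touch the tail of each list, flushing a segment to disk is a sequential write of immutable data, which is what makes the log-structured layout work.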
Query path
GET /v1/search?q=climate+change&limit=20
- Search Service parses the query, tokenizes terms.
- Fans out to relevant shards (hot first; older shards only if hot returns < N results).
- Each shard intersects posting lists for the query terms, returns top N tweet IDs by recency.
- Coordinator merges results, applies ranking (recency + engagement signal), and hydrates tweet bodies via the Tweet Cache.
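The coordinator's merge step is a k-way merge of already-sorted shard results. A sketch using the standard library (function name is illustrative), assuming each shard returns its hits sorted descending by tweet ID:

```python
import heapq
import itertools

def merge_shard_results(per_shard, limit):
    """Merge per-shard result lists (each sorted descending by tweet ID)
    into a single top-`limit` list, newest first."""
    merged = heapq.merge(*per_shard, reverse=True)
    return list(itertools.islice(merged, limit))

hot  = [99213, 99201]
warm = [98870, 98500]
merge_shard_results([hot, warm], 3)  # → [99213, 99201, 98870]
```

`heapq.merge` is lazy, so the coordinator stops pulling from shard iterators as soon as it has `limit` results, rather than materializing every shard's full response.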
Ranking
Pure chronological is fine for “Latest.” For “Top,” you need a ranker — usually a learned model that scores (query, tweet) pairs on engagement features. The infra question is just: where does the ranker run, and what features does it consume? Answer: a separate ranking service that scores the candidate set returned by the index, with features pulled from a low-latency feature store.
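To make the "where does the ranker run" answer concrete, here is a toy linear scorer over the candidate set. This stands in for the learned model; the feature names, weights, and decay constant are all made up for illustration:

```python
import math

def score(candidate, now, w_recency=1.0, w_engagement=0.5):
    """Toy linear scorer: recency decay plus log-scaled engagement.
    A real ranker is a learned model over many more features,
    fetched from a low-latency feature store."""
    age_hours = (now - candidate["created_at"]) / 3600
    recency = math.exp(-age_hours / 24)  # decays over roughly a day
    engagement = math.log1p(candidate["likes"] + candidate["retweets"])
    return w_recency * recency + w_engagement * engagement

candidates = [
    {"id": 1, "created_at": 1_700_000_000, "likes": 5,    "retweets": 1},
    {"id": 2, "created_at": 1_699_950_000, "likes": 2000, "retweets": 400},
]
ranked = sorted(candidates, key=lambda c: score(c, now=1_700_000_000), reverse=True)
# A viral tweet from ~14 hours ago outranks a fresh tweet with no engagement.
```

The structural point survives the toy model: the index narrows billions of tweets to a small candidate set cheaply, and the expensive per-candidate scoring runs in a separate service on only that set.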
What to skip
Don’t design typeahead, autocorrect, language detection, or query understanding unless asked. Get the inverted index, time-sharding, and Kafka-driven indexing across, and you’ve covered the core.