If the interviewer asks for search as a follow-up, here’s the shape. Twitter’s real system is called Earlybird; you can describe a simplified version that captures the same ideas.
The problem
Given a query like "climate change", return matching tweets in (mostly) reverse-chronological order, in under 200ms, across hundreds of billions of tweets.
Two access patterns: keyword match, and time-ranged keyword match.
Inverted index
The core data structure. For each term, store a posting list of tweet IDs that contain it.
"climate" → [t_99213, t_99201, t_99012, ...]
"change" → [t_99213, t_99005, t_98870, ...]
A query for "climate change" intersects the two posting lists. Sort by tweet ID descending (Snowflake IDs ≈ time-sorted) and return the top N.
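The intersection is a two-pointer walk over the sorted lists. A minimal sketch (list values and function names are illustrative, not Earlybird's actual API), assuming posting lists are kept sorted descending by tweet ID:

```python
def intersect(a, b, limit):
    """Intersect two posting lists sorted descending by tweet ID.

    Walks both lists with two pointers; matching IDs are emitted
    until `limit` results are found or either list is exhausted.
    """
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b) and len(out) < limit:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] > b[j]:  # descending order: advance past the larger ID
            i += 1
        else:
            j += 1
    return out

climate = [99213, 99201, 99012, 98870]
change  = [99213, 99005, 98870, 98500]
intersect(climate, change, 20)  # → [99213, 98870]
```

Real engines speed this up with skip lists so the pointer can jump ahead instead of advancing one entry at a time, which is exactly the custom structure the Earlybird note below refers to.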
This is what Lucene/Elasticsearch do. You can use Elasticsearch as a starting point and explain its role. At Twitter scale you’d build something custom (Earlybird is in-memory Lucene with custom skip lists) — mention it but don’t pretend to design it from scratch.
Sharding the index
By time, not by term:
- Each shard owns a time range (e.g. one shard per day, or hot shard = last 7 days, warm = last 30, cold = older).
- Queries fan out to relevant shards, results merged at a coordinator.
- Most queries care about recent tweets, so most queries hit the hot shard only.
- Old data can move to cheaper storage; you rarely query it.
Sharding by term would let you target shards precisely — but term frequency follows a Zipf distribution, so a few shards (the ones holding “the”, “trump”, “lol”) would be permanent hotspots.
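The routing logic for time-sharding is just an interval-overlap check. A sketch under the hot/warm/cold tiering above (shard names, bounds, and the `shards_for` helper are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Shard:
    name: str
    start: int  # inclusive lower bound (epoch seconds)
    end: int    # exclusive upper bound

# Hypothetical tiering: hot = last 7 days, warm = last 30, cold = older.
NOW = 1_700_000_000
DAY = 86_400
SHARDS = [
    Shard("hot",  NOW - 7 * DAY,  NOW + DAY),
    Shard("warm", NOW - 30 * DAY, NOW - 7 * DAY),
    Shard("cold", 0,              NOW - 30 * DAY),
]

def shards_for(query_start, query_end):
    """Return shards whose time range overlaps [query_start, query_end)."""
    return [s.name for s in SHARDS if s.start < query_end and query_start < s.end]

shards_for(NOW - 3 * DAY, NOW)   # → ['hot']
shards_for(NOW - 60 * DAY, NOW)  # → ['hot', 'warm', 'cold']
```

The second call shows why wide time-range queries are expensive: they fan out to every tier, while the common "what's happening now" query touches a single shard.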
Real-time indexing
When a tweet is posted, it must be searchable within seconds.
Tweet Service → Kafka topic "tweets" → Search Indexer
                                              │
                                              ▼
                                 Hot index shard (in-memory)
The Search Indexer:
- Tokenizes the tweet text (lowercase, split on whitespace, strip punctuation, optionally stem).
- For each term, appends the tweet ID to that term’s posting list in the hot shard.
- Periodically flushes the hot shard to disk and starts a new in-memory segment.
Posting lists are append-only — perfect for a log-structured layout.
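The tokenize-and-append steps are small enough to sketch end to end. A toy version of the indexer's hot segment (the regex and `index_tweet` helper are illustrative; real tokenizers handle Unicode, URLs, and stemming):

```python
import re
from collections import defaultdict

# In-memory hot segment: term → append-only posting list of tweet IDs.
# Snowflake IDs arrive roughly in time order, so appends keep lists sorted.
hot_segment = defaultdict(list)

def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace (no stemming here)."""
    return re.findall(r"[a-z0-9#@]+", text.lower())

def index_tweet(tweet_id, text):
    for term in set(tokenize(text)):  # dedupe terms within one tweet
        hot_segment[term].append(tweet_id)

index_tweet(99213, "Climate change is accelerating")
index_tweet(99214, "No change in the forecast")
hot_segment["change"]  # → [99213, 99214]
```

Because appends only ever touch the tail of each list, flushing a segment to disk is a sequential write of immutable data, which is what makes the log-structured layout work.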
Query path
GET /v1/search?q=climate+change&limit=20
- Search Service parses the query, tokenizes terms.
- Fans out to relevant shards (hot first; older shards only if hot returns < N results).
- Each shard intersects posting lists for the query terms, returns top N tweet IDs by recency.
- Coordinator merges results, applies ranking (recency + engagement signal), and hydrates tweet bodies via the Tweet Cache.
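The coordinator's merge step is a k-way merge of already-sorted shard results. A sketch using the standard library (function name is illustrative), assuming each shard returns its hits sorted descending by tweet ID:

```python
import heapq
import itertools

def merge_shard_results(per_shard, limit):
    """Merge per-shard result lists (each sorted descending by tweet ID)
    into a single top-`limit` list, newest first."""
    merged = heapq.merge(*per_shard, reverse=True)
    return list(itertools.islice(merged, limit))

hot  = [99213, 99201]
warm = [98870, 98500]
merge_shard_results([hot, warm], 3)  # → [99213, 99201, 98870]
```

`heapq.merge` is lazy, so the coordinator stops pulling from shard iterators as soon as it has `limit` results, rather than materializing every shard's full response.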
Ranking
Pure chronological is fine for “Latest.” For “Top,” you need a ranker — usually a learned model that scores (query, tweet) pairs on engagement features. The infra question is just: where does the ranker run, and what features does it consume? Answer: a separate ranking service that scores the candidate set returned by the index, with features pulled from a low-latency feature store.
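To make the "where does the ranker run" answer concrete, here is a toy linear scorer over the candidate set. This stands in for the learned model; the feature names, weights, and decay constant are all made up for illustration:

```python
import math

def score(candidate, now, w_recency=1.0, w_engagement=0.5):
    """Toy linear scorer: recency decay plus log-scaled engagement.
    A real ranker is a learned model over many more features,
    fetched from a low-latency feature store."""
    age_hours = (now - candidate["created_at"]) / 3600
    recency = math.exp(-age_hours / 24)  # decays over roughly a day
    engagement = math.log1p(candidate["likes"] + candidate["retweets"])
    return w_recency * recency + w_engagement * engagement

candidates = [
    {"id": 1, "created_at": 1_700_000_000, "likes": 5,    "retweets": 1},
    {"id": 2, "created_at": 1_699_950_000, "likes": 2000, "retweets": 400},
]
ranked = sorted(candidates, key=lambda c: score(c, now=1_700_000_000), reverse=True)
# A viral tweet from ~14 hours ago outranks a fresh tweet with no engagement.
```

The structural point survives the toy model: the index narrows billions of tweets to a small candidate set cheaply, and the expensive per-candidate scoring runs in a separate service on only that set.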
What to skip
Don’t design typeahead, autocorrect, language detection, or query understanding unless asked. Get the inverted index, time-sharding, and Kafka-driven indexing across, and you’ve covered the core.