10. Trending topics and the bottlenecks worth naming

Last post in the series. Two things to cover: trending topics (a common follow-up) and the failure modes you should mention proactively to show you’ve thought past the happy path.

The problem: identify the top K hashtags or terms by frequency over a sliding window (last 5 minutes, last hour, last day), per region, in real time.

Naive: count every term, sort, take top K. Doesn’t scale — billions of term occurrences per day.

Approximate counting

You don’t need exact counts. “Trending” only cares about the heavy hitters at the top. Use approximate algorithms: a Count-Min Sketch for per-term frequency estimates, plus a small top-K structure tracking the current leaders (sketched below).
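
A minimal sketch of that pair — a Count-Min Sketch feeding a bounded top-K candidate set. The width, depth, and hashing below are illustrative, not tuned values:

```python
import hashlib

class CountMinSketch:
    """Fixed-memory approximate counter: may overestimate, never underestimates."""

    def __init__(self, width=2**16, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, term):
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{term}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, term, count=1):
        for row, col in self._buckets(term):
            self.table[row][col] += count

    def estimate(self, term):
        return min(self.table[row][col] for row, col in self._buckets(term))


class TopK:
    """Bounded set of trending candidates, ranked by their sketch estimates."""

    def __init__(self, k, sketch):
        self.k = k
        self.sketch = sketch
        self.candidates = {}  # term -> latest estimate, at most ~k entries

    def offer(self, term):
        self.sketch.add(term)
        est = self.sketch.estimate(term)
        if term in self.candidates or len(self.candidates) < self.k:
            self.candidates[term] = est
            return
        weakest = min(self.candidates, key=self.candidates.get)
        if est > self.candidates[weakest]:  # evict the weakest current candidate
            del self.candidates[weakest]
            self.candidates[term] = est

    def top(self):
        return sorted(self.candidates.items(), key=lambda kv: -kv[1])[:self.k]
```

The sketch only ever overestimates, which is fine here: a term that looks slightly hotter than it really is does no harm in a trending list.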

Pipeline

Tweets → Kafka → Stream processor (Flink/Spark Streaming)
                   → per-region Count-Min Sketches + Top-K heaps
Trending API ← reads the top K every few seconds

Sketches are kept per region (US, EU, JP, …) so you can serve localized trends without resharding.
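
The sliding windows ("last 5 minutes, last hour, last day") can sit on top of this with a ring of small sketches, one per time bucket, summed over the buckets inside the window on read. A rough sketch; the bucket size, ring length, and factory wiring are assumptions:

```python
import time

class WindowedSketch:
    """Ring of per-minute sketches; a window query sums the buckets inside it.
    Pass in a factory, e.g. the CountMinSketch class from the sketch above."""

    def __init__(self, sketch_factory, buckets=60, bucket_seconds=60):
        self.factory = sketch_factory
        self.bucket_seconds = bucket_seconds
        self.ring = [sketch_factory() for _ in range(buckets)]
        self.stamps = [0] * buckets  # absolute bucket number each slot currently holds

    def _slot(self, now):
        bucket = int(now // self.bucket_seconds)
        slot = bucket % len(self.ring)
        if self.stamps[slot] != bucket:  # slot still holds an old bucket: reset it
            self.ring[slot] = self.factory()
            self.stamps[slot] = bucket
        return slot

    def add(self, term, now=None):
        self.ring[self._slot(now if now is not None else time.time())].add(term)

    def estimate(self, term, window_buckets, now=None):
        # window_buckets must be <= len(self.ring), e.g. 5 for "last 5 minutes"
        now = now if now is not None else time.time()
        current = int(now // self.bucket_seconds)
        total = 0
        for slot, bucket in enumerate(self.stamps):
            if 0 <= current - bucket < window_buckets:
                total += self.ring[slot].estimate(term)
        return total
```

You'd keep one of these per region (a dict keyed by region code, say), so the "top trends in JP" query only touches JP's ring.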

Spam and quality

Raw frequency is gameable. Real ranking de-dupes near-identical tweets, weights by unique authors, filters slurs, and applies a “novelty” boost so persistently-popular terms don’t always trend. Mention this — most candidates miss it.
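
A hand-wavy version of that scoring, just to show the shape; the breadth weighting and novelty formula here are made up for illustration, and dedup/blocklist filtering would run before any of it:

```python
def trend_score(term, window_count, unique_authors, baseline_rate):
    """Rank trend candidates by more than raw frequency.

    window_count   -- approximate occurrences in the current window
    unique_authors -- distinct authors using the term (approximate, e.g. HyperLogLog)
    baseline_rate  -- the term's long-run average rate per window, for the novelty boost
    """
    # Weight by breadth of authorship so one spammy account can't trend a term.
    breadth = min(unique_authors / max(window_count, 1), 1.0)
    # Novelty: how far above its own baseline the term is running right now,
    # so persistently popular terms don't sit in the list forever.
    novelty = window_count / (baseline_rate + 1.0)
    return window_count * breadth * novelty
```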

Bottlenecks worth naming

Before the interviewer asks “what could go wrong,” name them yourself.

The celebrity problem (revisited)

Even with hybrid fanout, when a celebrity tweets, every follower’s next timeline read does a live pull. If a celebrity tweets at the exact moment of a peak event, you get a read spike on the celebrity’s user-timeline cache. Mitigation: cache the celebrity’s most recent tweets in an extra-warm tier with high replication; pre-warm on tweet publish.
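
Roughly what the pre-warm hook on the publish path could look like, assuming a Redis-style cache client and a follower-count cutoff; the names, keys, and threshold are all illustrative:

```python
CELEBRITY_FOLLOWER_THRESHOLD = 1_000_000  # assumed cutoff for the hot tier

def on_tweet_published(author, tweet, cache, hot_cache):
    # Normal path: append to the author's cached user timeline (Redis-style list ops).
    cache.lpush(f"user_timeline:{author.id}", tweet.id)
    cache.ltrim(f"user_timeline:{author.id}", 0, 799)

    # Celebrity path: also push the full tweet into an extra-warm, highly
    # replicated tier so the flood of live pulls never reaches the database.
    if author.follower_count >= CELEBRITY_FOLLOWER_THRESHOLD:
        hot_cache.set(f"tweet:{tweet.id}", tweet.to_json(), ex=3600)
        hot_cache.lpush(f"celebrity_recent:{author.id}", tweet.id)
        hot_cache.ltrim(f"celebrity_recent:{author.id}", 0, 49)
```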

Hot shards

Sharding by user_id means a single celebrity user is always on one shard. That shard handles disproportionate read load for the user’s profile and tweets. Mitigation: replicate hot users to multiple shards (read-only copies), route reads round-robin.
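
The routing change is small: keep a map of known-hot users to their read-only copies and rotate across them, falling back to the home shard for everyone else. A sketch with hypothetical wiring:

```python
import itertools

class ShardRouter:
    def __init__(self, shards, hot_replicas):
        self.shards = shards  # list of shard connections, indexed by hash(user_id)
        # user_id -> round-robin iterator over that user's read-only replica copies
        self.hot = {uid: itertools.cycle(conns) for uid, conns in hot_replicas.items()}

    def home_shard(self, user_id):
        return self.shards[hash(user_id) % len(self.shards)]

    def read_conn(self, user_id):
        rr = self.hot.get(user_id)
        return next(rr) if rr else self.home_shard(user_id)

    def write_conn(self, user_id):
        # Writes always land on the home shard; replicas are refreshed asynchronously.
        return self.home_shard(user_id)
```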

Cache stampede

Already covered — request coalescing, replicas, gradual warmup.

Fanout backlog

If the fanout workers can’t keep up (say ten celebrities tweet at once and they’re still configured for push), Kafka consumer lag grows and timelines go stale. Mitigation: consumer autoscaling, per-author parallelism, and lowering the celebrity threshold during incidents so more accounts flip to pull.
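
The incident lever can be as blunt as a periodic job that watches consumer lag and reacts; the thresholds, replica counts, and the scaler/config objects below are placeholders:

```python
def adjust_fanout(kafka_lag, scaler, config):
    """Periodic check of the fanout consumer group's total lag (hypothetical wiring)."""
    if kafka_lag > 10_000_000:
        # Severe backlog: add consumers and flip mid-sized accounts to pull,
        # trading cheaper writes for live pulls until the queue drains.
        scaler.set_replicas("fanout-worker", 64)
        config.celebrity_threshold = 100_000
    elif kafka_lag > 1_000_000:
        scaler.set_replicas("fanout-worker", 32)
        config.celebrity_threshold = 500_000
    else:
        scaler.set_replicas("fanout-worker", 16)
        config.celebrity_threshold = 1_000_000
```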

Replication lag

A user posts, then immediately reloads, and their tweet isn’t there because the read hit a lagging replica. Solution: route that session’s reads to the primary for a short window after each write (read-your-writes), or serve the user’s own recent tweets from a write-through cache.
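
The pin-to-primary window is a few lines if the app tracks a per-session last-write timestamp; the names and the window length here are assumptions:

```python
import time

READ_YOUR_WRITES_WINDOW = 5.0  # seconds; assumed to comfortably exceed typical replica lag

def record_write(session):
    session["last_write_at"] = time.time()

def choose_db(session, primary, replica):
    # Wrote recently? Read from the primary so the user sees their own tweet.
    if time.time() - session.get("last_write_at", 0.0) < READ_YOUR_WRITES_WINDOW:
        return primary
    return replica
```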

Cross-region writes

Global users mean you either do multi-region writes (CRDTs or last-write-wins for counters, a single primary region for tweets) or accept that users write to their nearest region and tweets propagate asynchronously. Twitter went with the latter: followers in another region see your tweet a few seconds later. State this tradeoff explicitly.
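
For the counters, the CRDT option usually means something like a per-region grow-only counter that merges by element-wise max; a toy version:

```python
class GCounter:
    """Per-region grow-only counter: each region increments only its own slot,
    and replicas converge by taking the element-wise max on merge."""

    def __init__(self, region):
        self.region = region
        self.slots = {}  # region -> count

    def incr(self, n=1):
        self.slots[self.region] = self.slots.get(self.region, 0) + n

    def merge(self, other):
        for region, count in other.slots.items():
            self.slots[region] = max(self.slots.get(region, 0), count)

    def value(self):
        return sum(self.slots.values())
```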

What to leave the interviewer with

In the last two minutes, summarize the design in three sentences:

  1. Hybrid fanout — push for normal users, pull for celebrities — to balance read and write costs given the 100:1 ratio.
  2. A read path that’s Redis → DynamoDB by key for sub-200ms latency.
  3. Async everything that doesn’t block user feedback — fanout, indexing, counters, media processing — via Kafka.

Then list what you’d build next: notifications, DMs, ranking, ads. Showing you know what’s missing is as important as showing what you built.

That’s the system. Good luck in the interview.