Three core entities: users, tweets, and the follow graph. Each has different access patterns, which is why they end up in different stores.
Users
users (
id BIGINT PRIMARY KEY, -- Snowflake
username VARCHAR(15) UNIQUE,
email VARCHAR(255),
password VARCHAR(255), -- bcrypt
bio TEXT,
created_at TIMESTAMP
)
Standard relational. Postgres or MySQL. Sharded by id once you outgrow a single primary, but you won’t for a long time — even 1B users at 1 KB each is only 1 TB.
Why relational?
Tweets and the follow graph end up in DynamoDB (next sections) — so why not users as well? Two reasons specific to this table:
- Username uniqueness. When someone signs up as @alice, you need to guarantee no one else can grab that handle at the same time. In Postgres that’s one line: `UNIQUE (username)`. The database refuses the second insert. In DynamoDB there’s no unique constraint on a non-key attribute — you have to fake it by writing a second item like `USERNAME#alice → user_id` inside a transactional write that only succeeds if that item doesn’t already exist (see the sketch after this list). It works, but it’s extra plumbing you have to get right, and email uniqueness needs the same trick again.
- Ad-hoc queries you haven’t designed yet. Over time you’ll want to look users up by email (login, password reset), search by partial name (admin tools), filter by signup date (analytics), etc. In Postgres you add an index and write the query. In DynamoDB every new access pattern needs a Global Secondary Index declared up front, and patterns like “search by partial name” don’t fit at all without bolting on a separate search index.
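A minimal sketch of that plumbing, assuming boto3 and a single-table layout keyed on `pk` (the table name, key names, and `create_user` helper are all illustrative):

```python
import boto3

dynamodb = boto3.client("dynamodb")

def create_user(user_id: str, username: str) -> None:
    """Reserve the handle and create the user in one atomic transaction."""
    try:
        dynamodb.transact_write_items(
            TransactItems=[
                # The real user record.
                {"Put": {
                    "TableName": "users",  # table name is illustrative
                    "Item": {"pk": {"S": f"USER#{user_id}"},
                             "username": {"S": username}},
                    "ConditionExpression": "attribute_not_exists(pk)",
                }},
                # Marker item whose only job is to make the handle unique.
                {"Put": {
                    "TableName": "users",
                    "Item": {"pk": {"S": f"USERNAME#{username.lower()}"},
                             "user_id": {"S": user_id}},
                    "ConditionExpression": "attribute_not_exists(pk)",
                }},
            ]
        )
    except dynamodb.exceptions.TransactionCanceledException:
        raise ValueError(f"username {username!r} is already taken")
```

If either item already exists, the whole transaction is cancelled — that is the “extra plumbing” Postgres gives you in one declaration.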
Users are the one entity in this system small enough (1 TB tops) that you don’t need DynamoDB’s scaling, and structured enough that you actually benefit from SQL’s constraints and query flexibility. So put them in Postgres and move on.
Tweets
tweets
partition key: user_id
sort key: tweet_id -- Snowflake, time-sortable
attributes: text, media_ids, created_at
DynamoDB. 55 TB/year of text means you’re sharding from day one anyway, and once you shard you’ve already given up cross-shard joins and global transactions — which is most of what SQL was buying you. Access patterns are narrow and known: “get a tweet by id” and “get a user’s recent tweets”. Partitioning by user_id with tweet_id as the sort key gives you both for free, and the user-timeline query becomes a single partition read in reverse sort order.
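As a sketch with boto3 (the table name is assumed):

```python
import boto3
from boto3.dynamodb.conditions import Key

tweets = boto3.resource("dynamodb").Table("tweets")  # table name assumed

def recent_tweets(user_id: int, limit: int = 20):
    # One partition, newest first: Snowflake tweet_ids sort by time,
    # so descending sort-key order is reverse chronological.
    resp = tweets.query(
        KeyConditionExpression=Key("user_id").eq(user_id),
        ScanIndexForward=False,  # reverse sort order
        Limit=limit,
    )
    return resp["Items"]
```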
Sharded SQL (Postgres/MySQL partitioned by user_id) is a defensible alternative — you keep per-shard secondary indexes you can add later without pre-declaring a GSI, and per-shard transactions for things like “insert tweet + bump users.tweet_count” when the user’s row lives on the same shard.
The catch is operational. Running sharded Postgres yourself means you own the shard map, the routing layer, resharding (consistent hashing or virtual shards — see post 7), per-shard primary/replica failover (Patroni or similar), backups per shard, schema migrations applied N times, and hot-shard rebalancing when a celebrity blows up one node. DynamoDB hands you all of that as a managed service: partitions split automatically, replication is built in, and there’s nothing to fail over. Unless you have a strong reason to run your own database, that ops bill alone usually decides it.
Why Snowflake IDs
A Snowflake ID is a 64-bit integer: [timestamp | machine_id | sequence].
- `timestamp` (~41 bits): milliseconds since a custom epoch. Gives the ID its time-ordering.
- `machine_id` (~10 bits): identifies the generator (shard, worker, or pod) producing the ID. Lets every machine mint IDs independently without colliding — no central coordinator needed.
- `sequence` (~12 bits): a per-millisecond counter on each machine. Resets every millisecond and disambiguates IDs generated on the same machine within the same millisecond (~4096 IDs/ms/machine).
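A toy generator makes the layout concrete (bit widths from above; the epoch constant is an arbitrary choice):

```python
import threading
import time

EPOCH_MS = 1_577_836_800_000  # custom epoch (2020-01-01); any fixed past date works

class SnowflakeGenerator:
    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024  # must fit in 10 bits
        self.machine_id = machine_id
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = time.time_ns() // 1_000_000
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF  # 12-bit counter
                if self.seq == 0:
                    # 4096 IDs minted this millisecond; spin to the next one
                    while now <= self.last_ms:
                        now = time.time_ns() // 1_000_000
            else:
                self.seq = 0
            self.last_ms = now
            # [41-bit timestamp | 10-bit machine_id | 12-bit sequence]
            return ((now - EPOCH_MS) << 22) | (self.machine_id << 12) | self.seq
```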
Three properties matter:
- Globally unique without coordination — each shard generates its own.
- Roughly time-sortable — the timestamp is the high bits, so `ORDER BY id DESC` ≈ `ORDER BY created_at DESC`. Cursor pagination becomes trivial.
- Compact — 8 bytes vs 16 for a UUID, which matters when timeline rows reference millions of them.
Why partition by user_id
DynamoDB only lets you query efficiently by partition key. If tweet_id were the partition key, “give me a user’s recent tweets” would need a Scan (full table) or a GSI on user_id — and a GSI on user_id is just a second copy of the table partitioned by user_id, paid for on every write. So skip the indirection: partition by user_id directly, with tweet_id as the sort key. The user-timeline query is then a single partition read in reverse sort order.
The cost is the celebrity hot partition. DynamoDB’s adaptive capacity isolates the heat so it doesn’t starve other partitions, but a single partition still caps out (~3000 RCU / 1000 WCU). Absorb it by caching a celebrity’s recent tweets in Redis so the hot partition only sees misses. Mention the tradeoff explicitly in the interview.
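A read-through sketch of that cache, reusing the hypothetical `recent_tweets` query from above with redis-py (the key scheme and TTL are made up):

```python
import json
import redis

r = redis.Redis()

def recent_tweets_cached(user_id: int, limit: int = 20):
    key = f"tweets:recent:{user_id}"  # key scheme is illustrative
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)[:limit]
    items = recent_tweets(user_id, limit)  # cache miss: hit the hot partition
    # default=str coerces DynamoDB's Decimal values; short TTL, tune per traffic
    r.setex(key, 60, json.dumps(items, default=str))
    return items
```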
Why not DAX?
DAX is AWS’s in-memory cache that sits in front of DynamoDB and caches GetItem / Query results automatically — looks like a great fit for “cache hot tweets”. Two problems:
- It only caches the shapes DynamoDB already speaks. The home timeline cache needs a list per user that you can keep adding new tweet IDs to the front of, throw away anything past the most recent 800, and then read a page out of by position. Redis has built-in commands for exactly that (sketched at the end of this section). DAX doesn’t — it caches the result of a DynamoDB query, not a structure you can mutate item by item. You need Redis for timelines regardless.
- Once you have Redis, adding a second cache isn’t free. DAX is a separate cluster with its own nodes, failover, monitoring, and bill. If Redis is already in your architecture for timelines, putting tweet hydration on it too means one cache to run instead of two.
DAX is the right answer when DynamoDB is your only datastore and you don’t otherwise need Redis. That’s not this design.
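For contrast, a sketch of how the timeline structure described above maps onto Redis list commands (redis-py; the key scheme is made up):

```python
import redis

r = redis.Redis()
TIMELINE_CAP = 800  # keep only the most recent 800 tweet IDs

def push_to_timeline(user_id: int, tweet_id: int) -> None:
    key = f"timeline:{user_id}"  # key scheme is illustrative
    pipe = r.pipeline()
    pipe.lpush(key, tweet_id)             # new tweet goes to the front
    pipe.ltrim(key, 0, TIMELINE_CAP - 1)  # drop anything past the cap
    pipe.execute()

def read_timeline_page(user_id: int, offset: int, count: int):
    # Read a page by position — exactly the shape DAX can't give you.
    return r.lrange(f"timeline:{user_id}", offset, offset + count - 1)
```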
The follow graph
follows (
follower_id BIGINT,
followee_id BIGINT,
created_at TIMESTAMP,
PRIMARY KEY (follower_id, followee_id)
)
Two access patterns:
- “Who do I follow?” — query by `follower_id`. Used at fanout time.
- “Who follows me?” — query by `followee_id`. Used to compute reach, and at fanout-on-write time.
You need both, so you keep either two indexes on one table or two denormalized tables (`following` and `followers`). At Twitter scale this lives in a wide-column / KV store like DynamoDB, not a single Postgres instance — partition one table by `follower_id` and the other by `followee_id`.
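With the two-table layout, each follow is a dual write — one item per table — ideally transactional so the two views can’t drift. A sketch with boto3 (table and attribute names assumed):

```python
import time
import boto3

dynamodb = boto3.client("dynamodb")

def follow(follower_id: int, followee_id: int) -> None:
    now = str(int(time.time()))
    dynamodb.transact_write_items(
        TransactItems=[
            # "Who do I follow?" view, partitioned by follower_id
            {"Put": {"TableName": "following",
                     "Item": {"follower_id": {"N": str(follower_id)},
                              "followee_id": {"N": str(followee_id)},
                              "created_at": {"N": now}}}},
            # "Who follows me?" view, partitioned by followee_id
            {"Put": {"TableName": "followers",
                     "Item": {"followee_id": {"N": str(followee_id)},
                              "follower_id": {"N": str(follower_id)},
                              "created_at": {"N": now}}}},
        ]
    )
```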
Likes and counters
likes (user_id, tweet_id, created_at)
tweet_counts (tweet_id, like_count, retweet_count, reply_count)
Counters are denormalized and updated asynchronously — a like writes to `likes`, then a stream processor increments `tweet_counts`. Doing it inline would lock the tweet row on every like, and a celebrity tweet would melt the database.
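A sketch of the increment path, assuming `likes` lives in DynamoDB with a stream feeding a Lambda (table, key, and handler names are all assumed):

```python
import boto3

counts = boto3.resource("dynamodb").Table("tweet_counts")  # name assumed

def handle_likes_stream(event, context):
    """Consume INSERT events from the likes table's stream and bump counters."""
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue  # ignore deletes/updates in this sketch
        tweet_id = int(record["dynamodb"]["Keys"]["tweet_id"]["N"])
        counts.update_item(
            Key={"tweet_id": tweet_id},
            # ADD is an atomic increment; it creates the attribute if missing
            UpdateExpression="ADD like_count :one",
            ExpressionAttributeValues={":one": 1},
        )
```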
Where things live
| Data | Store | Why |
|---|---|---|
| Users | Postgres / MySQL | Uniqueness constraints, ad-hoc queries, small scale |
| Tweets | DynamoDB (partition: user_id, sort: tweet_id) | Massive scale, narrow access patterns, managed sharding |
| Follow graph | DynamoDB | Massive scale, simple key lookups |
| Media blobs | Object store (S3) + CDN | Cheap, scales independently |
| Counters | Redis or stream-processed in KV | High write rate, eventual consistency fine |
| Home timelines | Redis | Read latency budget — see next post |
The shape of your data and the access pattern decide the store. Don’t pick the store first.