4. Data model

Three core entities: users, tweets, and the follow graph. Each has different access patterns, which is why they end up in different stores.

Users

users (
  id          BIGINT PRIMARY KEY,   -- Snowflake
  username    VARCHAR(15) UNIQUE,
  email       VARCHAR(255),
  password    VARCHAR(255),         -- bcrypt
  bio         TEXT,
  created_at  TIMESTAMP
)

Standard relational. Postgres or MySQL. Sharded by id once you outgrow a single primary, but you won’t for a long time — even 1B users at 1 KB each is only 1 TB.

Why relational?

Tweets and the follow graph end up in DynamoDB (next sections) — so why not users as well? Two reasons specific to this table:

  1. Scale: users are the one entity in this system small enough (1 TB tops) that you don't need DynamoDB's scaling.
  2. Shape: the table is structured enough that you actually benefit from SQL's constraints (UNIQUE on username) and query flexibility.

So put them in Postgres and move on.

Tweets

tweets
  partition key: user_id
  sort key:      tweet_id        -- Snowflake, time-sortable
  attributes:    text, media_ids, created_at

DynamoDB. 55 TB/year of text means you’re sharding from day one anyway, and once you shard you’ve already given up cross-shard joins and global transactions — which is most of what SQL was buying you. Access patterns are narrow and known: “get a tweet by id” and “get a user’s recent tweets”. Partitioning by user_id with tweet_id as the sort key gives you both for free, and the user-timeline query becomes a single partition read in reverse sort order.
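
To make that concrete, here is a minimal sketch of the user-timeline read in Python with boto3. The table and key names come from the schema above; everything else (the page size, the resource setup) is an illustrative assumption.

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
tweets = dynamodb.Table("tweets")

def recent_tweets(user_id: int, limit: int = 20) -> list:
    # One partition, read newest-first: ScanIndexForward=False walks the
    # sort key (time-ordered Snowflake tweet_id) in descending order.
    resp = tweets.query(
        KeyConditionExpression=Key("user_id").eq(user_id),
        ScanIndexForward=False,
        Limit=limit,
    )
    return resp["Items"]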

Sharded SQL (Postgres/MySQL partitioned by user_id) is a defensible alternative — you keep per-shard secondary indexes you can add later without pre-declaring a GSI, and per-shard transactions for things like “insert tweet + bump users.tweet_count” when the user’s row lives on the same shard.

The catch is operational. Running sharded Postgres yourself means you own the shard map, the routing layer, resharding (consistent hashing or virtual shards — see post 7), per-shard primary/replica failover (Patroni or similar), backups per shard, schema migrations applied N times, and hot-shard rebalancing when a celebrity blows up one node. DynamoDB hands you all of that as a managed service: partitions split automatically, replication is built in, and there’s nothing to fail over. Unless you have a strong reason to run your own database, that ops bill alone usually decides it.

Why Snowflake IDs

A Snowflake ID is a 64-bit integer: [timestamp | machine_id | sequence].

Three properties matter:

  1. Globally unique without coordination — each shard generates its own.
  2. Roughly time-sortable — the timestamp is the high bits, so ORDER BY id DESC is effectively ORDER BY created_at DESC. Cursor pagination becomes trivial.
  3. Compact — 8 bytes vs 16 for a UUID, which matters when timeline rows reference millions of them.
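
A minimal generator sketch. The 41-bit timestamp / 10-bit machine id / 12-bit sequence split and the custom epoch are common choices, not fixed by the design; clock-skew handling is omitted.

import threading
import time

EPOCH_MS = 1_577_836_800_000  # 2020-01-01 UTC; a hypothetical custom epoch

class Snowflake:
    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024  # 10 bits
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit wrap
                if self.sequence == 0:
                    # 4096 ids issued this millisecond; spin to the next one.
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            # Timestamp in the high bits, so numeric order tracks time.
            return ((now - EPOCH_MS) << 22) | (self.machine_id << 12) | self.sequence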

Why partition by user_id

DynamoDB only lets you query efficiently by partition key. If tweet_id were the partition key, “give me a user’s recent tweets” would need a Scan (full table) or a GSI on user_id — and a GSI on user_id is just a second copy of the table partitioned by user_id, paid for on every write. So skip the indirection: partition by user_id directly, with tweet_id as the sort key. The user-timeline query is then a single partition read in reverse sort order.
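
Cursor pagination falls out of the same query: the cursor is just the last tweet_id the client saw, since Snowflake ids sort by time. A sketch, assuming the same table handle as above:

from boto3.dynamodb.conditions import Key

def tweets_before(table, user_id: int, cursor: int, limit: int = 20) -> list:
    # Same single-partition read, restricted to tweet_ids older than the
    # cursor: "< cursor" means "posted earlier" for time-ordered ids.
    resp = table.query(
        KeyConditionExpression=Key("user_id").eq(user_id) & Key("tweet_id").lt(cursor),
        ScanIndexForward=False,
        Limit=limit,
    )
    return resp["Items"]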

The cost is the celebrity hot partition. DynamoDB’s adaptive capacity isolates the heat so it doesn’t starve other partitions, but a single partition still caps out (~3000 RCU / 1000 WCU). Absorb it by caching a celebrity’s recent tweets in Redis so the hot partition only sees misses. Mention the tradeoff explicitly in the interview.
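
A read-through cache sketch for that. The key name, the 60-second TTL, and the fetch-function plumbing are illustrative assumptions, not part of the design.

import json

import redis

r = redis.Redis()

def cached_recent_tweets(user_id: int, fetch, limit: int = 20) -> list:
    key = f"celeb_tweets:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)[:limit]
    # Cache miss: this one read is what actually hits the hot partition.
    items = fetch(user_id, limit)
    # Short TTL keeps tweets fresh; default=str copes with DynamoDB Decimals.
    r.set(key, json.dumps(items, default=str), ex=60)
    return items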

Why not DAX?

DAX is AWS’s in-memory cache that sits in front of DynamoDB and caches GetItem / Query results automatically — looks like a great fit for “cache hot tweets”. Two problems:

  1. DAX only fronts DynamoDB calls. The hottest read path in this design, the assembled home timeline, lives in Redis, so DAX can't cache it.
  2. You're already running Redis for timelines and counters, so DAX would be a second cache fleet to operate and pay for.

DAX is the right answer when DynamoDB is your only datastore and you don’t otherwise need Redis. That’s not this design.

The follow graph

follows (
  follower_id  BIGINT,
  followee_id  BIGINT,
  created_at   TIMESTAMP,
  PRIMARY KEY (follower_id, followee_id)
)

Two access patterns:

  1. Who does X follow? (read when building X's home timeline)
  2. Who follows X? (read when fanning out X's new tweet)

You need both, so either two indexes on one table, or two denormalized tables (following and followers). At Twitter scale this lives in a wide-column / KV store like DynamoDB, not a single Postgres — partition by follower_id for one table and followee_id for the other.
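
A sketch of the dual write for "A follows B" under the two-table layout. The table names and the use of a transaction are illustrative; a plain BatchWriteItem also works if you can tolerate repairing a half-applied follow.

import boto3
from datetime import datetime, timezone

client = boto3.client("dynamodb")

def follow(follower_id: int, followee_id: int) -> None:
    now = datetime.now(timezone.utc).isoformat()
    # The transaction keeps the two denormalized tables in step.
    client.transact_write_items(
        TransactItems=[
            {"Put": {
                "TableName": "following",
                "Item": {"follower_id": {"N": str(follower_id)},
                         "followee_id": {"N": str(followee_id)},
                         "created_at": {"S": now}}}},
            {"Put": {
                "TableName": "followers",
                "Item": {"followee_id": {"N": str(followee_id)},
                         "follower_id": {"N": str(follower_id)},
                         "created_at": {"S": now}}}},
        ]
    )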

Likes and counters

likes (user_id, tweet_id, created_at)
tweet_counts (tweet_id, like_count, retweet_count, reply_count)

Counters are denormalized and updated asynchronously — a like writes to likes, then a stream processor increments tweet_counts. Doing it inline would lock the tweet row on every like, and a celebrity tweet would melt the database.
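
A sketch of the async bump, assuming tweet_counts lives in DynamoDB and the stream processor hands us one event per new likes row (DynamoDB Streams, Kafka, whatever):

import boto3

dynamodb = boto3.resource("dynamodb")
tweet_counts = dynamodb.Table("tweet_counts")

def on_like(tweet_id: int) -> None:
    # ADD is an atomic, server-side increment: no read-modify-write,
    # no row lock, and it creates the item if it doesn't exist yet.
    tweet_counts.update_item(
        Key={"tweet_id": tweet_id},
        UpdateExpression="ADD like_count :one",
        ExpressionAttributeValues={":one": 1},
    )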

Where things live

Data            Store                                          Why
Users           Postgres / MySQL                               Uniqueness constraints, ad-hoc queries, small scale
Tweets          DynamoDB (partition: user_id, sort: tweet_id)  Massive scale, narrow access patterns, managed sharding
Follow graph    DynamoDB                                       Massive scale, simple key lookups
Media blobs     Object store (S3) + CDN                        Cheap, scales independently
Counters        Redis or stream-processed in KV                High write rate, eventual consistency fine
Home timelines  Redis                                          Read latency budget (see next post)

The shape of your data and the access pattern decide the store. Don’t pick the store first.