7. Storage and media

Two storage problems, very different shapes: structured tweet rows (small, lots of them) and media blobs (big, fewer of them). Solve each with the right tool.

Tweets

Tweets are small structured records read by key. DynamoDB, partitioned by user_id with tweet_id as the sort key (see post 4 for why).
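A sketch of that key schema in boto3 terms (the table name and on-demand billing mode are illustrative choices, not something this series prescribes):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Partition key = user_id, sort key = tweet_id: all of one user's tweets
# hash to the same partition key and sort by tweet_id within it.
dynamodb.create_table(
    TableName="tweets",  # illustrative name
    KeySchema=[
        {"AttributeName": "user_id", "KeyType": "HASH"},    # partition key
        {"AttributeName": "tweet_id", "KeyType": "RANGE"},  # sort key
    ],
    AttributeDefinitions=[
        {"AttributeName": "user_id", "AttributeType": "S"},
        {"AttributeName": "tweet_id", "AttributeType": "S"},
    ],
    BillingMode="PAY_PER_REQUEST",  # on-demand: no capacity planning here
)
```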

Partitioning and resharding

DynamoDB splits and merges partitions for you as the table grows or as traffic shifts. There’s no shard map to maintain in your application, no resharding event to plan, no consistent-hashing ring to operate. The Tweet Service just calls PutItem and Query — the routing is the database’s problem.
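To make "the routing is the database's problem" concrete, here is roughly what those two calls look like in boto3. The ids and attribute names are illustrative; the point is that no shard lookup appears anywhere in the application code:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Write one tweet. DynamoDB hashes user_id to pick the partition;
# the Tweet Service never knows or cares which one.
dynamodb.put_item(
    TableName="tweets",
    Item={
        "user_id": {"S": "u_123"},
        "tweet_id": {"S": "0001700000000000"},  # time-sortable id, illustrative
        "text": {"S": "hello"},
    },
)

# Read a user's recent tweets: one Query against the partition key,
# newest first via the sort key.
resp = dynamodb.query(
    TableName="tweets",
    KeyConditionExpression="user_id = :u",
    ExpressionAttributeValues={":u": {"S": "u_123"}},
    ScanIndexForward=False,  # descending by tweet_id
    Limit=20,
)
tweets = resp["Items"]
```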

This is the main reason to pick a managed wide-column store over running your own sharded Postgres. If you did roll your own, you’d be responsible for the shard map, a routing layer, and a resharding strategy (consistent hashing, or virtual shards mapped onto physical ones) — all of which you avoid by using the managed service.
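For contrast, a sketch of the kind of routing code you would own in the roll-your-own setup: a fixed pool of virtual shards hashed from user_id, each mapped to a physical Postgres host. Everything here (shard count, hostnames) is illustrative:

```python
import hashlib

NUM_VIRTUAL_SHARDS = 4096  # fixed forever; only the mapping below ever changes

# Virtual shard -> physical database. Resharding means editing this map and
# migrating the affected virtual shards' data yourself.
SHARD_MAP = {
    range(0, 2048): "postgres-a.internal",
    range(2048, 4096): "postgres-b.internal",
}

def physical_host(user_id: str) -> str:
    """Route a user's tweets to the Postgres host that owns their virtual shard."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    vshard = int(digest, 16) % NUM_VIRTUAL_SHARDS
    for vrange, host in SHARD_MAP.items():
        if vshard in vrange:
            return host
    raise RuntimeError(f"no host mapped for virtual shard {vshard}")
```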

Replication

DynamoDB synchronously replicates every write across three AZs in the region, acknowledging once it is durable on a majority of them; durability is built in, with no primary/replica setup to run and no failover to test. For multi-region (post 10) you turn on Global Tables and get async cross-region replication; the cost is that conflict resolution becomes last-write-wins, which is fine for tweets (immutable) but matters for counters.
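Enabling a replica region looks roughly like this in boto3, assuming the table already meets the Global Tables prerequisites (e.g. streams enabled); region names are illustrative:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Add a replica region to an existing table (Global Tables, 2019.11.21 version).
# Replication is async; conflicting writes resolve last-write-wins.
dynamodb.update_table(
    TableName="tweets",
    ReplicaUpdates=[{"Create": {"RegionName": "us-west-2"}}],
)
```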

Media: don’t put it in the DB

A 280-byte tweet referencing a 5 MB video is fine. The 5 MB video in the DB row is a disaster — your buffer cache fills with one user’s vacation footage and tweet read latency collapses.

Upload flow

1. Client → POST /v1/media/init       → server returns presigned S3 PUT URL + media_id
2. Client → PUT  https://s3.../...    → uploads bytes directly to S3
3. Client → POST /v1/tweets {media_ids:[...]}

The application server never sees the bytes: the upload goes straight from the client to S3, so big files never tie up app-server bandwidth, memory, or connection slots.
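A sketch of what the /v1/media/init handler might do under that flow; the bucket name, uuid-based media_id, and 15-minute expiry are all assumptions:

```python
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "media-uploads"  # illustrative bucket name

def init_media_upload(content_type: str) -> dict:
    """Handle POST /v1/media/init: mint a media_id and a presigned PUT URL."""
    media_id = uuid.uuid4().hex
    upload_url = s3.generate_presigned_url(
        "put_object",
        Params={
            "Bucket": BUCKET,
            "Key": f"raw/{media_id}",
            "ContentType": content_type,
        },
        ExpiresIn=900,  # client has 15 minutes to start the upload
    )
    # Client PUTs the bytes to upload_url, then references media_id
    # in POST /v1/tweets.
    return {"media_id": media_id, "upload_url": upload_url}
```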

Processing pipeline

After the PUT, an S3 event notification triggers a worker that processes the upload: transcoding video into the delivery formats, generating thumbnails, and validating the file.

This is async. The tweet can post before processing completes; the client shows a “processing…” state until it’s done.
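A minimal shape for that worker, written here as a Lambda-style handler on the S3 ObjectCreated event; the actual processing steps are stubbed and every name is illustrative:

```python
import boto3

s3 = boto3.client("s3")

def process(bucket: str, key: str, media_id: str) -> None:
    # Placeholder for the real pipeline: download, transcode, thumbnail,
    # write outputs under a processed/ prefix, then flip the "processing"
    # flag the client polls on.
    ...

def handle_s3_event(event: dict, context=None) -> None:
    """Triggered by s3:ObjectCreated on the raw upload prefix."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]    # e.g. raw/<media_id>
        media_id = key.rsplit("/", 1)[-1]
        process(bucket, key, media_id)
```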

Serving via CDN

Origin S3 is too slow and too expensive to serve every read. Put a CDN (CloudFront, Fastly) in front so the edge absorbs the bulk of media reads and the origin only sees cache misses.

Cache key: media URL. TTL: long (assets are immutable — a new version means a new media_id). Invalidation: rarely needed.
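One way to make that immutability explicit is for the processing worker to stamp long-lived Cache-Control headers on the objects it writes, so the CDN and browsers can cache without revalidating; the bucket, key, and header values below are illustrative:

```python
import boto3

s3 = boto3.client("s3")

# Processed assets never change in place (a new version gets a new media_id),
# so a year-long, immutable cache policy is safe.
s3.put_object(
    Bucket="media-processed",         # illustrative bucket
    Key="processed/abc123/720p.mp4",  # illustrative key
    Body=b"...",                      # transcoded bytes
    ContentType="video/mp4",
    CacheControl="public, max-age=31536000, immutable",
)
```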

What lives where

Data             Storage                              Why
Tweet rows       DynamoDB (partitioned by user_id)    Small, structured, key-lookup; managed sharding and replication
Media blobs      S3                                   Big, immutable, accessed by URL
Media delivery   CDN                                  Latency, bandwidth cost
Hot tweets       Redis (read-through)                 Hide DB read latency
Home timelines   Redis (precomputed)                  See fanout post

The pattern repeats: data in its natural store, plus a cache that matches the access pattern. Don’t design the hot path around the cold store.