Architecture
flowchart TD
Client(["Client A"])
GW["Edge Gateway
(EC2, WebSocket)
holds the socket"]
MS["Message Service
(Fargate)
mints Snowflake ID"]
DB[("Message Store
(DynamoDB)
PK conversation_id
SK message_id")]
Client -->|"① open socket
② send msg"| GW
GW -->|"③ forward"| MS
MS -->|"④ write"| DB
MS -->|"⑤ ack"| GW
GW -->|"⑥ ack"| Client
The connection, not the request
WhatsApp does not work like a REST app. The client opens a long-lived connection — a TLS-wrapped WebSocket or a custom framed protocol over TCP — to the nearest edge gateway when the app starts, and keeps it open. Sending a message is a frame written into that already-open socket, not a fresh HTTPS request. Two reasons this matters:
- Latency: no TLS handshake on the hot path. A new HTTPS request on a phone network costs hundreds of milliseconds.
- Push: the same socket is what the server uses to push incoming messages to this client (chapter 5). One socket, both directions.
The gateway tier is its own fleet. Each box holds something like a million idle sockets and does almost nothing else — it just shuttles frames between clients and the backend services. These run on EC2, not Fargate: holding that many sockets per host needs kernel-level tuning (file-descriptor limits, ephemeral port range, TCP keepalive) that a serverless container runtime doesn’t expose. App servers that handle real work — writing to the database, fanning out, transcoding media — are stateless and run on Fargate behind the gateway over normal request-response RPC.
The message frame
The client already has the recipient’s user_id from the contact-sync cache (chapter 3), so the frame can address the recipient directly without a lookup on the send path. What it sends looks roughly like:
SEND
client_msg_id: c_7f3a... (idempotency key)
recipient_id: user_b
body: "hello"
ts: 1731000000123
Four things to call out:
- No sender_id in the body. The gateway already authenticated the connection at handshake; the server resolves the sender from the socket. Trusting a sender field in the frame would let anyone send as anyone else.
- No conversation_id either. The Message Service derives it from (sender, recipient) by sorting the pair and hashing: conversation_id = sha1(min(a, b) || ":" || max(a, b)). Both clients land on the same partition without a “create conversation” round trip, and there’s nothing for a malicious client to lie about.
- client_msg_id is the idempotency key. If the client retries on a flaky network, the server returns the same server-side message_id rather than creating a duplicate. The key is stored briefly (hours, not days), which is enough to cover retries.
- Rate limiting belongs at the gateway: per-user token bucket. A compromised client shouldn’t be able to flood the message service.
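The pair-hash derivation described above can be sketched in a few lines. This is a minimal illustration, not the production code: it assumes user IDs are strings that never contain ":", and the hex encoding of the digest is a choice made here for readability.

```python
import hashlib


def conversation_id(user_a: str, user_b: str) -> str:
    """Derive a stable 1:1 conversation ID from an unordered pair of user IDs."""
    # Sorting gives min(a, b) and max(a, b), so both participants
    # compute the same ID regardless of who is the sender.
    lo, hi = sorted((user_a, user_b))
    # ":" is the separator from the formula in the text; this assumes
    # user IDs cannot contain ":" themselves (otherwise pairs could collide).
    return hashlib.sha1(f"{lo}:{hi}".encode("utf-8")).hexdigest()
```

Because the function is symmetric in its arguments, `conversation_id("user_a", "user_b")` and `conversation_id("user_b", "user_a")` address the same partition.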
Groups don’t have a natural pair to hash, so group creation mints a Snowflake conversation_id and stores it on the group record. The client sends that ID directly. Covered in chapter 9.
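The per-user token bucket mentioned in the list above could look roughly like the following. It is a sketch under assumptions: the class names, the `rate` and `capacity` parameters, and the injectable `clock` are all hypothetical, and a real gateway would also need to evict idle buckets.

```python
import time


class TokenBucket:
    """Classic token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity        # start full so a short burst is allowed
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1          # spend one token per frame
            return True
        return False


class PerUserLimiter:
    """One bucket per authenticated user, created lazily on first frame."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self._buckets = {}

    def allow(self, user_id: str) -> bool:
        bucket = self._buckets.get(user_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self._buckets[user_id] = bucket
        return bucket.allow()
```

The gateway would call `allow(user_id)` with the identity resolved from the socket, before forwarding the frame to the Message Service.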
The server-side ID
The Message Service mints a Snowflake ID — a 64-bit integer laid out as [timestamp | machine_id | sequence].
- timestamp (~41 bits): milliseconds since a custom epoch. Gives the ID its time-ordering.
- machine_id (~10 bits): identifies the generator. Lets every machine mint IDs without coordinating.
- sequence (~12 bits): per-millisecond counter on each machine.
Three properties matter:
- Globally unique without coordination: each machine mints its own IDs, and its machine_id keeps them distinct from every other machine's.
- Roughly time-sortable: the timestamp is the high bits, so sorting by id ≈ sorting by time. Pagination through history becomes trivial.
- Compact: 8 bytes vs 16 for a UUID. Across hundreds of billions of messages, 8 bytes saved per key adds up to terabytes.
Snowflake IDs show up again for receipts (chapter 6) and group fanout (chapter 9).
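To make the bit layout concrete, here is a minimal generator sketch using the 41/10/12 split described above. The epoch value, the class and function names, and the clock-went-backwards policy are assumptions for illustration; the text does not specify them.

```python
import threading
import time

TIMESTAMP_BITS, MACHINE_BITS, SEQUENCE_BITS = 41, 10, 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1
CUSTOM_EPOCH_MS = 1_700_000_000_000  # arbitrary epoch chosen for this sketch


class SnowflakeGenerator:
    def __init__(self, machine_id: int, clock_ms=None):
        if not 0 <= machine_id < (1 << MACHINE_BITS):
            raise ValueError("machine_id does not fit in 10 bits")
        self.machine_id = machine_id
        self._clock_ms = clock_ms or (lambda: time.time_ns() // 1_000_000)
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self) -> int:
        with self._lock:
            now = self._clock_ms()
            if now < self._last_ms:
                # Assumed policy: refuse rather than risk duplicate IDs.
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # 4096 IDs already minted this millisecond: spin to the next.
                    while now <= self._last_ms:
                        now = self._clock_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            # Timestamp in the high bits makes IDs sort by creation time.
            return (((now - CUSTOM_EPOCH_MS) << (MACHINE_BITS + SEQUENCE_BITS))
                    | (self.machine_id << SEQUENCE_BITS)
                    | self._sequence)


def decode(snowflake: int) -> tuple:
    """Split an ID back into (timestamp_ms, machine_id, sequence)."""
    sequence = snowflake & MAX_SEQUENCE
    machine_id = (snowflake >> SEQUENCE_BITS) & ((1 << MACHINE_BITS) - 1)
    timestamp_ms = (snowflake >> (MACHINE_BITS + SEQUENCE_BITS)) + CUSTOM_EPOCH_MS
    return timestamp_ms, machine_id, sequence
```

The three fields use 63 bits in total, which leaves the top bit clear so the ID stays positive in a signed 64-bit column.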
Where the message lands
A wide-column store partitioned by conversation. One table:
messages
partition key: conversation_id
sort key: message_id — Snowflake, time-sortable
attributes: sender_id, body, ts, status
The dominant read pattern is “give me the recent messages in this conversation” — opening a chat, scrolling back. Partitioning by conversation_id with message_id as the sort key makes that a single-partition reverse scan.
DynamoDB fits the shape directly: the table above maps one-to-one onto its partition key and sort key. Because a chat’s history lives in one partition, a recent-messages read is a single Query in descending sort-key order, with no scatter-gather across partitions.
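As an illustration of that read, the sketch below builds the parameters for a DynamoDB Query that returns the newest messages first. The helper name, the default page size, and the attribute types (string conversation_id, numeric message_id) are assumptions; only the table and key names come from the schema above.

```python
from typing import Optional


def recent_messages_query(conversation_id: str,
                          limit: int = 50,
                          before_message_id: Optional[int] = None) -> dict:
    """Build Query parameters for one page of a conversation, newest first."""
    params = {
        "TableName": "messages",
        # Equality on the partition key pins the read to one partition.
        "KeyConditionExpression": "conversation_id = :cid",
        "ExpressionAttributeValues": {":cid": {"S": conversation_id}},
        # False = descending sort-key order, i.e. most recent message_id first.
        "ScanIndexForward": False,
        "Limit": limit,
    }
    if before_message_id is not None:
        # Scrolling back: continue from the oldest message already on screen.
        params["KeyConditionExpression"] += " AND message_id < :before"
        params["ExpressionAttributeValues"][":before"] = {
            "N": str(before_message_id)
        }
    return params
```

With boto3, a dict like this could be passed to the low-level client as `client.query(**params)`. Paging by `message_id < :before` works here because the Snowflake sort key is time-ordered.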
A relational database is the wrong tool. The workload is append-heavy at a scale (hundreds of billions of rows, millions of writes per second) where a single Postgres instance falls over, so you’d be sharding by conversation_id manually — recreating DynamoDB’s partition model on top of a database whose features (joins, multi-row transactions, secondary indexes across shards) you can’t use anyway. No conversation ever needs to be joined to another; no read crosses a partition. The access pattern is exactly “point to a partition, range-scan the sort key,” which is the one thing DynamoDB is built to do cheaply.
What the sender sees
The synchronous path is short:
- Client writes a SEND frame to its open socket.
- Edge gateway forwards to the Message Service.
- Message Service mints the Snowflake ID, writes the row, returns the ID.
- Gateway pushes an ACK frame back to the client with the server message_id.
The client paints a single grey tick the moment the ack arrives. It already has the body; the ack only confirms that the server has stored the message durably. Turning that single grey tick into a double tick once the recipient has the message is a separate concern, covered in chapters 5 and 6.
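The synchronous path, including the retry behaviour of client_msg_id, can be sketched with in-memory stand-ins. Everything here is hypothetical scaffolding: a counter replaces the Snowflake generator, dicts replace the message table and the idempotency store, the TTL value is an assumed reading of "hours, not days", and the "sent" status value is a guess.

```python
import hashlib
import itertools
import time

IDEMPOTENCY_TTL_S = 6 * 3600  # assumed value for "hours, not days"


class MessageService:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._next_id = itertools.count(1)  # stand-in for a Snowflake generator
        self._seen = {}   # (sender_id, client_msg_id) -> (message_id, expires_at)
        self.store = {}   # conversation_id -> list of rows (stand-in for the table)

    def handle_send(self, sender_id: str, frame: dict) -> dict:
        """sender_id comes from the authenticated socket, never from the frame."""
        key = (sender_id, frame["client_msg_id"])
        now = self._clock()
        hit = self._seen.get(key)
        if hit is not None and hit[1] > now:
            # Retry of a frame already stored: return the same server ID.
            return self._ack(frame, hit[0])

        # Derive the 1:1 conversation ID from the sorted pair.
        lo, hi = sorted((sender_id, frame["recipient_id"]))
        conv = hashlib.sha1(f"{lo}:{hi}".encode("utf-8")).hexdigest()

        message_id = next(self._next_id)
        self.store.setdefault(conv, []).append({
            "message_id": message_id,
            "sender_id": sender_id,
            "body": frame["body"],
            "ts": frame["ts"],
            "status": "sent",  # assumed status value
        })
        self._seen[key] = (message_id, now + IDEMPOTENCY_TTL_S)
        return self._ack(frame, message_id)

    @staticmethod
    def _ack(frame: dict, message_id: int) -> dict:
        # The gateway relays this to the client, which paints the grey tick.
        return {"type": "ACK",
                "client_msg_id": frame["client_msg_id"],
                "message_id": message_id}
```

Sending the same frame twice yields two ACKs carrying the same message_id and only one stored row, which is the property the client relies on when it retries over a flaky network.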