5. Receiving a 1:1 message

Once the Message Service has durably written a message (chapter 4), it has to get it to the recipient. That means answering two questions on every send: where is the recipient right now, and what do we do if they’re not there. Two new components carry the load. A session store in Redis holds a user_id → gateway_id mapping, written by the gateway on connect and read by the Message Service on every delivery. A pending queue in DynamoDB holds undelivered messages per user until the recipient reconnects. The send path didn’t need either — sending is one-way into storage; delivery is where routing and offline state appear.

Architecture

Connect and register

flowchart TD
Client(["Client B"])
GW["Edge Gateway B
(EC2, WebSocket)"]
Sess[("Session Store
(Redis)
user → gateway")]

Client -->|"① open socket"| GW
GW -->|"② register
   (TTL, refreshed by heartbeat)"| Sess

Delivering to an online recipient

flowchart TD
MS["Message Service
(Fargate)"]
DB[("Message Store
(DynamoDB)
PK conversation_id
SK message_id")]
Sess[("Session Store
(Redis)
user → gateway")]
GW["Edge Gateway B
(EC2, WebSocket)"]
Client(["Client B"])

MS -->|"① durable write"| DB
MS -->|"② lookup recipient"| Sess
MS -->|"③ push frame"| GW
GW -->|"④ deliver frame"| Client

Queueing for an offline recipient

flowchart TD
MS["Message Service
(Fargate)"]
DB[("Message Store
(DynamoDB)
PK conversation_id
SK message_id")]
Sess[("Session Store
(Redis)
user → gateway")]
PQ[("Pending Queue
(DynamoDB)
PK user_id
SK message_id")]

MS -->|"① durable write"| DB
MS -->|"② lookup recipient
   (miss / stale)"| Sess
MS -->|"③ enqueue"| PQ

Reconnect and drain

flowchart TD
Client(["Client B"])
GW["Edge Gateway B
(EC2, WebSocket)"]
Sess[("Session Store
(Redis)
user → gateway")]
PQ[("Pending Queue
(DynamoDB)
PK user_id
SK message_id")]

Client -->|"① open socket"| GW
GW -->|"② register"| Sess
Client -->|"③ drain since cursor"| GW
GW -->|"④ read range"| PQ

The session store

Backend services need to answer “where is user B right now?” on every delivery. That mapping lives in a session store: one key per user, session:{user_id} → gateway_id, written by the gateway when a client connects and given a TTL so the entry expires if the socket dies and the heartbeat lapses. It’s read on every send, tiny per entry, and tolerant of loss (a missing entry just means “treat as offline”) — Redis fits the shape.
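A minimal sketch of that contract, using an in-memory dictionary in place of Redis so it runs on its own. The class name, method names, and the 60-second default TTL are assumptions made for illustration, not part of the design above. In Redis the same behaviour would presumably come from a SET with an expiry on connect, an EXPIRE on each heartbeat, and a GET on lookup.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class SessionStore:
    """In-memory stand-in for the Redis session store (illustrative only)."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds          # assumed value; tune to the heartbeat interval
        self._clock = clock              # injectable so expiry can be tested
        # key -> (gateway_id, expiry time on the injected clock)
        self._entries: Dict[str, Tuple[str, float]] = {}

    @staticmethod
    def _key(user_id: str) -> str:
        return f"session:{user_id}"

    def register(self, user_id: str, gateway_id: str) -> None:
        """Called by the gateway on connect; overwrites any older mapping."""
        self._entries[self._key(user_id)] = (gateway_id, self._clock() + self._ttl)

    def heartbeat(self, user_id: str) -> bool:
        """Refresh the TTL; returns False if the entry is missing or expired."""
        key = self._key(user_id)
        entry = self._entries.get(key)
        if entry is None or entry[1] <= self._clock():
            self._entries.pop(key, None)
            return False
        self._entries[key] = (entry[0], self._clock() + self._ttl)
        return True

    def lookup(self, user_id: str) -> Optional[str]:
        """Return the gateway_id, or None, which callers treat as offline."""
        key = self._key(user_id)
        entry = self._entries.get(key)
        if entry is None:
            return None
        gateway_id, expires_at = entry
        if expires_at <= self._clock():
            del self._entries[key]       # lazily expire, as a TTL would
            return None
        return gateway_id
```

A lookup that returns None covers both "never connected" and "heartbeat lapsed", which is exactly the loss tolerance the store relies on.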

DynamoDB would technically work but is the wrong tool: every entry is short-lived, the access pattern is single-key get/set on the hot path, and durability buys nothing because a stale entry has to be treated as a miss anyway. Redis is one to two orders of magnitude cheaper per operation at this access pattern, and native TTLs do the expiry without a sweep job.

Don’t try to route by hashing the user to a fixed gateway. Phones move networks, drop connections, and reconnect to whichever gateway is closest. The mapping has to be dynamic.

Two paths, one decision

After the Message Service has written the message (chapter 4), it needs to get it to the recipient. It looks up the recipient in the session store and branches:

Recipient online — there’s a live entry pointing to a gateway. Push the frame to that gateway, which writes it into the recipient’s open socket.
Recipient offline — no entry, or the entry is stale. Enqueue the message in a per-user pending queue, to be drained when the client reconnects.

The branch is taken on every message. Either way, the message is already in the durable conversation history (the message store from chapter 4) before the branch is taken; the pending queue is a delivery aid, not the source of truth.
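The decision can be sketched as a single function. Everything named here (deliver, GatewayUnavailable, and the three callables passed in) is hypothetical; the sketch assumes the message has already been persisted and that a failed push surfaces as an exception.

```python
from typing import Callable, Optional


class GatewayUnavailable(Exception):
    """Raised by the (hypothetical) push function when the gateway or socket is gone."""


def deliver(
    message: dict,
    recipient_id: str,
    lookup: Callable[[str], Optional[str]],
    push: Callable[[str, str, dict], None],
    enqueue: Callable[[str, dict], None],
) -> str:
    """Route one already-persisted message; returns 'pushed' or 'queued'."""
    gateway_id = lookup(recipient_id)
    if gateway_id is not None:
        try:
            # Online path: hand the frame to the gateway holding the socket.
            push(gateway_id, recipient_id, message)
            return "pushed"
        except GatewayUnavailable:
            # The session entry was stale; treat it like a miss.
            pass
    # Offline path: no entry, or the push to the recorded gateway failed.
    enqueue(recipient_id, message)
    return "queued"
```

Folding the stale-entry case into the offline path keeps the branch to two outcomes: a failed push is handled exactly like a missing session.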

The pending queue

One queue per user, keyed by user_id. Each entry is a small envelope: (message_id, conversation_id, sender, body, ts). Entries are appended in arrival order and drained in arrival order on reconnect.

Why not just query the message store on reconnect? It’s keyed by conversation_id, so “what did I miss” would fan out into one range read per conversation the user is in — hundreds of queries for an active user. A GSI on recipient_id would collapse that to one query but pays a GSI write on every message (multiplied by member count for groups) and needs a delivered-flag update to stop re-sending. The pending queue only takes a write when the recipient is actually offline, and entries are deleted on ack — cheaper at steady state and cleaner semantics.

A few properties this needs:

Per-user FIFO, so the recipient sees messages in the order they arrived at the server.
Cheap append, cheap range read, so a long-offline user catching up isn’t a tail latency event.
Retention bounded by acknowledgement, not by time: once delivered and acked by the client, the entry is removed. If a user is offline for weeks, the queue keeps their messages for the whole period. The conversation history in the message store is the long-term record, but the queue is what makes “open the app and see the new messages immediately” work without scanning every conversation.

DynamoDB fits the shape: partition key user_id, sort key message_id. Append is a single put; drain is a forward range read from the client’s last-acked cursor. Snowflake IDs (chapter 4) double as the queue ordering, so no separate sequence number is needed. A Redis sorted set per user (score = Snowflake ID, bearing in mind that scores are doubles and hold integers exactly only up to 2^53) works too and is faster, but it can lose entries if the node dies before the next snapshot. For an offline user catching up after days, that’s not acceptable, so the queue lives in DynamoDB.
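To make the three operations concrete, here is a small in-memory model of the table. It is a sketch under stated assumptions: the class and method names are invented for illustration, and a sorted Python list stands in for DynamoDB’s sort-key ordering. In DynamoDB the range read would presumably be a Query whose key condition matches user_id and asks for message_id greater than the cursor.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Envelope(NamedTuple):
    message_id: int        # Snowflake-style ID; sorting by it gives arrival order
    conversation_id: str
    sender: str
    body: str
    ts: int


class PendingQueue:
    """In-memory stand-in for the pending table (PK user_id, SK message_id)."""

    def __init__(self) -> None:
        # user_id -> envelopes kept sorted by message_id
        self._items: Dict[str, List[Envelope]] = defaultdict(list)

    def append(self, user_id: str, env: Envelope) -> None:
        """Single put: insert in sort-key order, overwriting an identical key."""
        items = self._items[user_id]
        ids = [e.message_id for e in items]
        pos = bisect.bisect_left(ids, env.message_id)
        if pos < len(items) and items[pos].message_id == env.message_id:
            items[pos] = env
        else:
            items.insert(pos, env)

    def read_after(self, user_id: str, cursor: int, limit: int = 100) -> List[Envelope]:
        """Forward range read: entries with message_id strictly above the cursor."""
        items = self._items.get(user_id, [])
        ids = [e.message_id for e in items]
        start = bisect.bisect_right(ids, cursor)
        return items[start:start + limit]

    def ack(self, user_id: str, message_id: int) -> None:
        """Delete one entry once the client has acknowledged it."""
        items = self._items.get(user_id, [])
        self._items[user_id] = [e for e in items if e.message_id != message_id]
```

Because the sort key is the message ID, entries come back in order even if they were written out of order, which is what lets the ID double as the sequence number.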

Reconnect and drain

When the recipient’s app comes back online — phone unlocked, network restored, app foregrounded — it opens a fresh socket to an edge gateway, which registers the new user_id → gateway mapping in the session store. The client then asks the gateway: “give me everything since last_seen_message_id.”

The gateway reads the user’s pending queue from that cursor forward, streams the messages down the socket, and the client acks each one. Acked messages are removed from the queue. If the connection drops mid-drain, the next reconnect resumes from the new last_seen_message_id. Nothing is lost, because the cursor advances only on client ack; a message that was sent but not yet acked may be sent again, and the client deduplicates it by message_id.

This is why the queue is keyed by user, not conversation. A returning user wants one ordered stream of “what happened while I was away,” not N per-conversation pulls.
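The drain loop, including the resume-after-drop behaviour, can be sketched as follows. This is an illustration, not the gateway’s actual code: the function name, the (message_id, body) tuple shape, and the send_and_wait_ack callback (which returns False when the connection drops before the ack arrives) are all assumptions.

```python
from typing import Callable, List, Tuple

Message = Tuple[int, str]  # (message_id, body); a simplified envelope


def drain(
    pending: List[Message],
    cursor: int,
    send_and_wait_ack: Callable[[Message], bool],
) -> int:
    """Stream messages with id > cursor in ID order; return the new cursor.

    `pending` is mutated: entries are removed only after the client acks them.
    """
    # sorted() builds a new list, so removing from `pending` below is safe.
    for msg in sorted(m for m in pending if m[0] > cursor):
        if not send_and_wait_ack(msg):
            # Connection dropped: stop here and leave the rest queued.
            break
        pending.remove(msg)   # acked, so the entry can go
        cursor = msg[0]       # the cursor advances only on ack
    return cursor
```

Calling drain again with the returned cursor after a dropped connection picks up at the first unacked message, so an interrupted drain never skips anything.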

Idempotency and races

Two races to handle:

Reconnect mid-flight. A message arrives at the Message Service while the recipient is in the brief window between dropping their old socket and registering a new one. The session lookup returns stale data; the push to the old gateway fails. The Message Service falls back to enqueueing. The client picks it up on next drain.
Duplicate delivery. A push succeeded on the wire but the client died before acking. Without the ack the server cannot tell that delivery happened, so the message stays in (or is moved to) the pending queue and is delivered again on reconnect. The client deduplicates by server message_id, which is why every message has one (chapter 4) and why the client persists the high-water mark of acked IDs locally.

The principle: the server retries; the client deduplicates. Anything else makes one side carry both burdens and gets it wrong under partition.
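On the client side, deduplication can be as small as the sketch below. The Inbox class and its fields are hypothetical, and the sketch assumes messages for a user arrive in ascending message_id order, so a single high-water mark is enough to recognise a redelivery. A client that can receive IDs out of order would need to track recently seen IDs as well.

```python
from typing import List, Tuple


class Inbox:
    """Client-side receiver that drops redelivered messages (illustrative)."""

    def __init__(self, high_water_mark: int = 0) -> None:
        # A real client would load this from local storage on startup.
        self.high_water_mark = high_water_mark
        self.messages: List[Tuple[int, str]] = []

    def receive(self, message_id: int, body: str) -> bool:
        """Return True if the message is new, False if it is a duplicate.

        The caller should ack in both cases so the server can delete the entry.
        """
        if message_id <= self.high_water_mark:
            return False                      # already seen: do not show it twice
        self.messages.append((message_id, body))
        self.high_water_mark = message_id     # persist before acking, in practice
        return True
```

Acking duplicates as well as new messages matters: it is what lets the server finally remove an entry whose first ack was lost.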

What the recipient sees

A new chat row pops to the top of the list with an unread count. Inside the conversation, the message slides in. The sender’s grey tick becomes a double grey tick, which the next chapter covers.