9. Group messaging

Architecture

flowchart TD
Client_A(["Client A"])
GW["Edge Gateway<br/>(EC2, WebSocket)"]
MS["Message Service<br/>(Fargate)"]
GMS["Group Metadata Service<br/>(Fargate)"]
GDB[("Groups<br/>(DynamoDB)<br/>members, admins")]
Hist[("Message Store<br/>(DynamoDB)")]
PQ[("Pending Queues<br/>(DynamoDB)<br/>per member")]
Sess[("Session Store<br/>(Redis)")]

Client_A -->|"① SEND group_id"| GW
GW --> MS
MS -->|"② lookup members"| GMS
GMS --> GDB
MS -->|"③ write history once"| Hist
MS -->|"④ enqueue per member"| PQ
MS -->|"⑤ push online members"| Sess

A group is a conversation with N members. Sending to a group is the 1:1 delivery path repeated once per recipient, with one extra lookup at the front: who are the members?

Group metadata

A separate Group Metadata Service owns groups: name, members, admins, settings (who can send, who can edit info). The data is small and read-heavy, since every group send hits it. It is stored in DynamoDB across two tables: groups (partition key group_id) and group_members (partition key group_id, sort key user_id):

groups
  group_id      — Snowflake
  name, icon
  admins        — small list
  created_at
group_members
  group_id, user_id, role, joined_at

A relational schema with a groups table and a group_members join table is the obvious alternative, and at small scale it would work. DynamoDB wins for the same reason it did in chapter 4: the access pattern is “give me all members of one group” — a single-partition range read keyed by group_id — and that maps to one DynamoDB read with no joins. As the user base grows past what a single Postgres instance handles, the relational version forces sharding by group_id and the joins disappear anyway. Going straight to DynamoDB skips the migration.
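As a rough illustration of that access pattern, the sketch below builds the parameters for a single DynamoDB Query over one partition. It only constructs the request and does not call AWS. The table and attribute names follow the schema above; the function name members_query and the choice to store IDs as string attributes are assumptions made for the example.

```python
def members_query(group_id: str, table: str = "group_members") -> dict:
    """Build parameters for one DynamoDB Query over a single partition.

    Assumption: group_id and user_id are stored as string ("S") attributes.
    """
    return {
        "TableName": table,
        # Partition-key equality: all items for this group share one partition.
        "KeyConditionExpression": "group_id = :g",
        "ExpressionAttributeValues": {":g": {"S": group_id}},
        # Fetch only what the fanout needs. `role` is aliased in case it
        # collides with a DynamoDB reserved word.
        "ProjectionExpression": "user_id, #r",
        "ExpressionAttributeNames": {"#r": "role"},
    }
```

With boto3, these parameters could be passed as client.query(**members_query(group_id)). There is no join and no second table in the read path, which is the point of the comparison with the relational schema.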

The group_members table is queried every time someone sends to the group. With small groups (cap ~256 members for now; large groups in chapter 10), the member list is small enough to cache per-group in Redis (group:{group_id}:members → set of user_id) fronted by an in-process LRU at the Message Service. Cache invalidation is event-driven: when someone joins or leaves, the Group Metadata Service publishes a change event and both caches evict.
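A minimal sketch of the two cache tiers and their event-driven eviction follows. A plain dict stands in for Redis and a callable stands in for the Group Metadata Service; the class name MemberCache, the method names, and the default capacity are all hypothetical.

```python
from collections import OrderedDict
from typing import Callable, Iterable


class MemberCache:
    """In-process LRU in front of a shared cache, with explicit invalidation."""

    def __init__(self, shared: dict, load: Callable[[str], Iterable[str]],
                 capacity: int = 1024):
        self.shared = shared      # stand-in for Redis: key -> set of user_id
        self.load = load          # stand-in for a Group Metadata Service call
        self.capacity = capacity
        self.local: OrderedDict = OrderedDict()

    def members(self, group_id: str) -> frozenset:
        if group_id in self.local:
            self.local.move_to_end(group_id)   # mark as most recently used
            return self.local[group_id]
        key = f"group:{group_id}:members"
        if key not in self.shared:
            # Miss in both tiers: fall back to the metadata service.
            self.shared[key] = set(self.load(group_id))
        members = frozenset(self.shared[key])
        self.local[group_id] = members
        if len(self.local) > self.capacity:
            self.local.popitem(last=False)     # evict least recently used
        return members

    def on_membership_changed(self, group_id: str) -> None:
        """Handle a join/leave event: evict from both tiers."""
        self.local.pop(group_id, None)
        self.shared.pop(f"group:{group_id}:members", None)
```

In a real deployment each Message Service instance would evict only its own LRU on receiving the event, and the shared entry would be evicted once; the sketch folds both into one handler for brevity.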

The fanout

Sender sends one frame with group_id. The Message Service:

  1. Resolves members. Cache hit, or fall back to the metadata service.
  2. Writes the message once to the message store, partitioned by conversation_id = group_id. There’s a single canonical row per message, not N copies. Group history is shared.
  3. Fans out delivery. For each member except the sender, enqueue an envelope into the member’s pending queue (chapter 5) and, if the session store says they’re online, push the frame to their gateway.
  4. Acks the sender once the durable write and the enqueues have committed.

This is fanout-on-write: the work happens at send time, not read time. For a group of 10 it’s 9 cheap enqueues, one per member other than the sender. For 256 it’s 255. That’s fine: the queue store is built for this.
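The numbered steps can be sketched as one function. This is an illustration under stated assumptions, not the service's actual code: the Stores class holds in-memory stand-ins for the message store, pending queues, and session store, and the names send_to_group, Stores, and the envelope fields are hypothetical. Member resolution (step 1) is assumed to have happened already.

```python
from dataclasses import dataclass, field


@dataclass
class Stores:
    """In-memory stand-ins for the three stores touched by a group send."""
    history: dict = field(default_factory=dict)  # conversation_id -> [message]
    pending: dict = field(default_factory=dict)  # user_id -> [envelope]
    online: dict = field(default_factory=dict)   # user_id -> gateway id
    pushed: list = field(default_factory=list)   # (gateway, user_id, message_id)


def send_to_group(stores, members, sender, group_id, message_id, body):
    # Step 2: one canonical row, keyed by conversation_id = group_id.
    stores.history.setdefault(group_id, []).append(
        {"message_id": message_id, "sender": sender, "body": body})

    # Step 3: one envelope per member except the sender.
    recipients = [m for m in members if m != sender]
    for user_id in recipients:
        stores.pending.setdefault(user_id, []).append(
            {"conversation_id": group_id, "message_id": message_id})
        gateway = stores.online.get(user_id)
        if gateway is not None:
            # Online members also get an immediate push via their gateway.
            stores.pushed.append((gateway, user_id, message_id))

    # Step 4: ack only after the write and all enqueues have completed.
    return {"ack": message_id, "enqueued": len(recipients)}
```

Note that the history write happens once regardless of group size, while the loop grows linearly with membership. That linear term is what makes this fanout-on-write.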

The opposite pattern, fanout-on-read, would have each member pull “what’s new in groups I’m in” from a shared timeline. WhatsApp doesn’t do that, and shouldn’t: with a persistent connection model, push delivery is the whole point. Pull-on-read would mean the message sits unseen until the client next polls, so delivery latency is bounded by the polling interval. Push over an open connection can meet a latency budget of a few milliseconds; pull cannot.

Per-recipient receipts

Receipts (chapter 6) are per-recipient. For a group message with 10 recipients, one message produces up to 10 delivered rows and up to 10 read rows in the MessageStatus store. The sender’s UI rolls them up: “Read by 7 of 10,” with details on tap.

The receipt fanout is the same shape as messages but in reverse: each recipient’s DELIVERED and READ frames arrive at the Receipt Service, which updates that recipient’s row. Rather than forwarding every change, the Receipt Service coalesces them into a periodic rollup and pushes a single aggregate update to the sender, so the sender doesn’t get 10 separate STATUS frames for one group message. Without this, a big group of fast readers would flood the sender with receipt frames.
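The coalescing step might look like the sketch below. The state names, the assumption that a read receipt implies delivery, and the ReceiptRollup class with its record and flush methods are illustrative only; the actual receipt state machine is the one defined in chapter 6.

```python
from collections import defaultdict

# Assumed ordering of receipt states; later states supersede earlier ones.
ORDER = {"sent": 0, "delivered": 1, "read": 2}


class ReceiptRollup:
    def __init__(self):
        self.status = defaultdict(dict)  # message_id -> {recipient: state}
        self.dirty = set()               # messages changed since last flush

    def record(self, message_id, recipient, state):
        """Apply one DELIVERED/READ frame; ignore stale or duplicate frames."""
        current = self.status[message_id].get(recipient, "sent")
        if ORDER[state] > ORDER[current]:
            self.status[message_id][recipient] = state
            self.dirty.add(message_id)

    def flush(self, recipients_total):
        """Called on a timer: emit one aggregate frame per changed message."""
        frames = []
        for message_id in sorted(self.dirty):
            states = list(self.status[message_id].values())
            frames.append({
                "message_id": message_id,
                # A read receipt is counted as delivered as well.
                "delivered": sum(ORDER[s] >= ORDER["delivered"] for s in states),
                "read": sum(s == "read" for s in states),
                "of": recipients_total[message_id],
            })
        self.dirty.clear()
        return frames
```

However many receipt frames arrive between two flushes, the sender sees at most one aggregate frame per message per interval.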

Sending as a member

Group sends use the same client frame as 1:1, with group_id in place of recipient_id. The server skips the pair-hash derivation and uses group_id directly as the conversation_id. The client doesn’t need to know how many members there are. The only thing the client adds is permission awareness: if the group is set to “admins-only post” and the user isn’t an admin, the client greys out the input. The server still enforces the rule — never trust the client — but checking up front avoids a UI that lets you type a message that will be rejected.
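A sketch of the two server-side decisions described here: choosing the conversation_id and enforcing the posting rule. The pair hash shown is a hypothetical stand-in (an order-independent SHA-256 digest of the two user IDs), not the derivation the 1:1 path actually uses, and the settings key "send" with value "admins_only" is an assumed encoding of the “admins-only post” setting.

```python
import hashlib


def conversation_id(sender, recipient_id=None, group_id=None):
    """Groups use group_id directly; 1:1 falls back to a pair hash."""
    if group_id is not None:
        return group_id
    # Hypothetical pair hash: sort the two ids so both directions agree.
    a, b = sorted([sender, recipient_id])
    return hashlib.sha256(f"{a}:{b}".encode()).hexdigest()[:16]


def may_send(group, user_id):
    """Server-side enforcement; the client-side check is only a UI nicety."""
    if user_id not in group["members"]:
        return False
    if group["settings"].get("send") == "admins_only":
        return user_id in group["admins"]
    return True
```

A send that fails may_send would be rejected before any write or enqueue happens, regardless of what the client UI allowed.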

What stays the same, what changes

The send path differs from 1:1 in exactly two ways: the member lookup at the front, and the per-recipient enqueue at the back. Everything else — Snowflake ID, durable write, idempotency, the receipt state machine — is unchanged. The pending queue, the session store, and the gateway tier carry groups for free.

What this design does not handle is groups with thousands of members, where fanout-on-write becomes prohibitive and receipt aggregation gets harder. That’s the next chapter.