10. Large groups and broadcast

Architecture

flowchart TD
Client_A(["Client A"])
GW["Edge Gateway
(EC2, WebSocket)"]
MS["Message Service
(Fargate)"]
Hist[("Message Store
(DynamoDB)")]
PQ[("Pending Queues
(DynamoDB)
active members only")]
PullQ[/"Inactive members
pull on connect"/]
RA["Receipt
Aggregator
(Fargate)"]

Client_A --> GW --> MS
MS -->|"write once"| Hist
MS -->|"enqueue active"| PQ
Hist -.->|"on reconnect"| PullQ
MS -.->|"receipts"| RA
RA --> Client_A

Where the chapter 9 design breaks

The fanout-on-write pattern from chapter 9 enqueues one envelope per member. For a group of 256, that’s 256 cheap writes. For a community of 100,000 members, one send fans out to 100,000 queue writes, and a community of active senders multiplies this. A handful of large groups can dominate the entire write load of the queue store.

Receipts get worse. A 100,000-member group where 10% read the message in the first minute means 10,000 READ frames flowing back through the Receipt Service, all targeting the sender’s UI. Even coalesced, that’s a flood.

Cap the fanout, defer the rest

The trick is that not every member is worth a queue write. Most large groups have a long tail of mostly-inactive members who haven’t opened the app in days or weeks. Their pending queue would just hold the message until they reconnect — at which point they’d drain it sequentially anyway.

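Split members into active and inactive. Active members get a pending-queue write at send time, exactly as in chapter 9. Inactive members get no queue write at all.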

Inactive members catch up on reconnect by reading the shared message history (the canonical write the Message Service already does, partitioned by conversation_id). The client tracks a per-conversation last_seen_message_id and asks: “give me messages in this group since X.” It’s a single partition scan in the message store. No queue write was needed at send time.
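The catch-up read can be sketched as below. This is a minimal in-memory illustration, not the real store: `MessageStore`, `append`, and `since` are hypothetical names, and the sketch assumes message ids are integers that increase within a conversation so that "since X" becomes a range read over one partition.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class MessageStore:
    """In-memory stand-in for the message store (hypothetical, for illustration)."""

    # conversation_id -> list of (message_id, body), appended in id order
    partitions: dict = field(default_factory=dict)

    def append(self, conversation_id, message_id, body):
        # Assumption: callers append in increasing message_id order.
        self.partitions.setdefault(conversation_id, []).append((message_id, body))

    def since(self, conversation_id, last_seen_message_id, limit=100):
        """Return up to `limit` messages with message_id > last_seen_message_id."""
        rows = self.partitions.get(conversation_id, [])
        ids = [message_id for message_id, _ in rows]
        start = bisect_right(ids, last_seen_message_id)
        return rows[start:start + limit]


store = MessageStore()
for i in range(1, 6):
    store.append("g1", i, f"msg {i}")

# A reconnecting client that last saw message 3 receives messages 4 and 5.
missed = store.since("g1", last_seen_message_id=3)
```

The `limit` parameter is also an assumption: a member returning after weeks would presumably page through the backlog rather than pull it in one response.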

This shifts the cost: active members get push delivery, inactive members get pull-on-reconnect. Total work scales with active membership, not total membership.
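A minimal sketch of the send path under this split follows. The activity threshold is an assumption (the text only says "days or weeks"), and `split_members`, `send_to_group`, and the dictionary-backed stores are hypothetical stand-ins for the Message Service and its tables.

```python
ACTIVE_WINDOW_SECONDS = 7 * 24 * 3600  # hypothetical cutoff: seen within 7 days


def split_members(last_active_at, now, window=ACTIVE_WINDOW_SECONDS):
    """Partition member ids by how recently each one was last seen."""
    active, inactive = [], []
    for member_id, seen_at in last_active_at.items():
        if now - seen_at <= window:
            active.append(member_id)
        else:
            inactive.append(member_id)
    return active, inactive


def send_to_group(message, conversation_id, last_active_at, history, pending_queues, now):
    """Write the message once to history, then enqueue for active members only."""
    history.setdefault(conversation_id, []).append(message)  # canonical write
    active, _inactive = split_members(last_active_at, now)
    for member_id in active:
        pending_queues.setdefault(member_id, []).append(message)
    return len(active)  # queue writes performed for this send


now = 1_000_000.0
last_active_at = {
    "a": now - 60,              # active a minute ago
    "b": now - 30 * 24 * 3600,  # last seen a month ago
    "c": now - 3600,            # active an hour ago
}
history, pending_queues = {}, {}
writes = send_to_group({"id": 1}, "g1", last_active_at, history, pending_queues, now)
```

Here the send costs two queue writes instead of three. Member "b" finds the message in the shared history on reconnect.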

Receipt aggregation

For large groups, drop per-recipient receipts on the wire and replace them with counts. The Receipt Service still records per-recipient status rows (so “Read by” details work on tap), but what it pushes back to the sender is a periodic aggregate:

STATUS_ROLLUP
  message_id: m_123
  delivered: 42,310
  read: 8,402

The rollup is pushed at most once every few seconds per group, not once per receipt. The sender’s UI shows “8.4K read” and updates as the rollup advances. Tapping “Read by” issues a one-shot read of the per-recipient rows, a cost paid only when the user asks.
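One way to sketch the aggregation is a small in-memory aggregator. The class and method names are hypothetical, the 3-second interval is an assumed value for "every few seconds", and the sketch assumes a READ receipt also counts as delivered.

```python
from dataclasses import dataclass, field

RANK = {"DELIVERED": 1, "READ": 2}  # a status only ever moves forward


@dataclass
class ReceiptAggregator:
    min_interval: float = 3.0                      # assumed push interval, seconds
    rows: dict = field(default_factory=dict)       # message_id -> {recipient_id: status}
    dirty: dict = field(default_factory=dict)      # group_id -> message_ids changed since last push
    last_push: dict = field(default_factory=dict)  # group_id -> time of last push

    def record(self, group_id, message_id, recipient_id, status):
        """Store the per-recipient row; mark the message dirty if it advanced."""
        per_recipient = self.rows.setdefault(message_id, {})
        current = per_recipient.get(recipient_id)
        if current is None or RANK[status] > RANK[current]:
            per_recipient[recipient_id] = status
            self.dirty.setdefault(group_id, set()).add(message_id)

    def flush(self, group_id, now):
        """Return STATUS_ROLLUP frames for the group, at most once per interval."""
        if not self.dirty.get(group_id):
            return []
        last = self.last_push.get(group_id)
        if last is not None and now - last < self.min_interval:
            return []  # too soon; changes stay pending until the next flush
        rollups = []
        for message_id in sorted(self.dirty.pop(group_id)):
            statuses = list(self.rows[message_id].values())
            rollups.append({
                "type": "STATUS_ROLLUP",
                "message_id": message_id,
                "delivered": len(statuses),  # every stored row is at least DELIVERED
                "read": sum(1 for s in statuses if s == "READ"),
            })
        self.last_push[group_id] = now
        return rollups

    def read_by(self, message_id):
        """One-shot per-recipient lookup, run only when the sender taps 'Read by'."""
        return sorted(r for r, s in self.rows.get(message_id, {}).items() if s == "READ")
```

Calling `record` for every incoming receipt is cheap and local. Only `flush` produces traffic towards the sender, and its rate is bounded by `min_interval` regardless of how many receipts arrived in between.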

Some products (WhatsApp, in fact) just disable per-recipient receipts above a threshold. That’s a valid product decision and falls out cleanly from this architecture: above N members, stop computing the rollup and stop returning per-recipient status entirely.

Broadcast lists

A broadcast list is a different beast: the sender wants to message N people, but each recipient sees it as a 1:1 chat with the sender. There’s no shared conversation. From the system’s perspective it’s pure fanout — N independent 1:1 sends sharing one source frame.

Implementation:

  1. Sender uploads the message body once (and any media once, dedup’d by content hash from chapter 8).
  2. The Message Service iterates the list, minting a distinct message_id per recipient, writing to each recipient’s 1:1 conversation history, and enqueueing per recipient.
  3. From each recipient’s view, it’s an ordinary 1:1 message. No group_id, no member list visible.

The optimisation: the body is materialised once (a content-addressed blob) and referenced by N message rows. For a long broadcast text or a media item, this avoids storing the same body N times.
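The steps above can be sketched roughly as follows. Everything here is illustrative: the dictionary-backed stores, the `conversation_id_for` naming scheme, the use of SHA-256 as the content hash, and random UUIDs as message ids are all assumptions rather than details from the design.

```python
import hashlib
import uuid


def conversation_id_for(user_a, user_b):
    """Hypothetical 1:1 conversation key, independent of argument order."""
    return "dm:" + ":".join(sorted((user_a, user_b)))


def send_broadcast(sender_id, recipient_ids, body, blobs, histories, pending_queues):
    """Store the body once, then create one ordinary 1:1 message per recipient."""
    blob_key = hashlib.sha256(body).hexdigest()  # content address of the body
    blobs.setdefault(blob_key, body)             # materialised at most once
    message_ids = []
    for recipient_id in recipient_ids:
        message_id = uuid.uuid4().hex            # distinct id per recipient
        row = {"message_id": message_id, "sender_id": sender_id, "body_ref": blob_key}
        histories.setdefault(conversation_id_for(sender_id, recipient_id), []).append(row)
        pending_queues.setdefault(recipient_id, []).append(row)
        message_ids.append(message_id)
    return message_ids


blobs, histories, pending_queues = {}, {}, {}
ids = send_broadcast("alice", ["bob", "carol", "dave"], b"hello everyone",
                     blobs, histories, pending_queues)
```

After this call there are three message rows in three separate 1:1 histories, all pointing at a single stored body. None of the rows carries a group identifier, so recipients cannot tell the message came from a list.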

Hot-sender protection

The other large-group failure mode is a single user sending many messages quickly to a huge group. The gateway already has per-user rate limits (chapter 4), but those are tuned for normal use. For broadcast and large-community senders, layer a per-(sender, group) token bucket on top, enforced at the Message Service before the fanout step. This catches the case where a script sends 100 messages a second to a 100,000-member community and would otherwise generate 10M queue writes per second (100 × 100,000) from one user.
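A per-(sender, group) token bucket could look like the sketch below. The class names and the rate and capacity values are hypothetical, and a real deployment would presumably keep bucket state somewhere shared across Message Service instances rather than in process memory.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    rate: float        # tokens refilled per second
    capacity: float    # maximum burst size
    tokens: float      # tokens currently available
    updated_at: float  # time of the last refill

    def allow(self, now, cost=1.0):
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class GroupSendLimiter:
    """One bucket per (sender_id, group_id), checked before the fanout step."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, sender_id, group_id, now):
        key = (sender_id, group_id)
        bucket = self.buckets.get(key)
        if bucket is None:
            # New buckets start full so the first burst is not penalised.
            bucket = TokenBucket(self.rate, self.capacity, self.capacity, now)
            self.buckets[key] = bucket
        return bucket.allow(now)


# Hypothetical limits: bursts of 5, then 1 message per second per (sender, group).
limiter = GroupSendLimiter(rate=1.0, capacity=5.0)
burst = [limiter.allow("script_user", "big_group", now=0.0) for _ in range(100)]
```

With these example limits, only 5 of the 100 sends in that burst reach the fanout step. The rest are rejected before any queue write happens.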

What this looks like to users

For active members, identical to chapter 9. For inactive members, slightly slower catch-up — they pull from history on reconnect instead of finding the messages already queued. For senders to large groups, receipts collapse to a count and individual ticks disappear. All acceptable tradeoffs at the scale where they kick in.