Every other chapter assumes the client already knows the recipient’s user_id. It doesn’t — the user only knows phone numbers. So before any messaging works, the app has to translate “the 200 phone numbers in my address book” into “the subset of those who are on WhatsApp, and their internal user IDs.” That’s contact sync.
Architecture
flowchart TD
Client(["Client"])
Dir["User Directory
(Fargate)"]
Store[("Directory Store
(DynamoDB)
PK phone_hash
attr user_id")]
Client -->|"① POST hashed numbers"| Dir
Dir -->|"② batch lookup"| Store
Dir -->|"③ return user_ids
for matches"| Client
The flow
On install, and periodically afterwards (once a day is plenty), the client:
- Reads phone numbers from the OS address book.
- Hashes each one — SHA-256 over the E.164-normalised number.
- Batches them into a single request to the User Directory service.
The directory does a batched read against the Directory Store keyed by phone_hash. Numbers with a row come back as (phone_hash, user_id) pairs; numbers without a row are simply absent from the response. The client merges the result into a local cache keyed by raw phone number, so tapping “Alice” in the contact list resolves to her user_id instantly with no network round trip on the send path.
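The client side of the flow above can be sketched as follows. This is a minimal illustration, not the app's actual code: the function names, the hex encoding of the hash, the default country code, and the choice to evict numbers that no longer match are all assumptions. The normalisation here is deliberately crude; a real client would use a proper phone-number library.

```python
import hashlib
import re


def normalise_e164(raw: str, default_country_code: str = "44") -> str:
    """Crude E.164 normalisation (illustrative only).

    Assumes a number without '+' or '00' is a national number with a
    leading trunk '0' in the default country.
    """
    stripped = raw.strip()
    digits = re.sub(r"\D", "", stripped)  # keep digits only
    if stripped.startswith("+"):
        return "+" + digits
    if digits.startswith("00"):
        return "+" + digits[2:]
    return "+" + default_country_code + digits.lstrip("0")


def phone_hash(e164: str) -> str:
    """SHA-256 over the normalised number; hex output is an assumption."""
    return hashlib.sha256(e164.encode("utf-8")).hexdigest()


def build_sync_request(address_book: list[str]) -> tuple[list[str], dict[str, list[str]]]:
    """Return (hashes to upload, hash -> raw numbers map kept on the device)."""
    hash_to_raws: dict[str, list[str]] = {}
    for raw in address_book:
        hash_to_raws.setdefault(phone_hash(normalise_e164(raw)), []).append(raw)
    return sorted(hash_to_raws), hash_to_raws


def merge_response(
    cache: dict[str, str],
    hash_to_raws: dict[str, list[str]],
    matches: list[tuple[str, str]],
) -> dict[str, str]:
    """Merge (phone_hash, user_id) pairs into the cache keyed by raw number.

    Numbers that were uploaded but are absent from the response are evicted
    (an assumption: they are treated as no longer on the service).
    """
    matched = dict(matches)
    for h, raws in hash_to_raws.items():
        for raw in raws:
            if h in matched:
                cache[raw] = matched[h]
            else:
                cache.pop(raw, None)
    return cache
```

Because the server only echoes hashes back, the device has to keep the hash-to-raw-number map from the request around long enough to translate the response; that map never leaves the phone.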
Why hash, and what it actually protects
Hashing isn’t real anonymisation — phone numbers come from a small, enumerable space (a few billion globally), so a server willing to spend the compute can reverse any hash by trying every number. What hashing does buy:
- Bystanders are protected. Hashes of numbers that don’t belong to any account match no row, so the server has nothing to link them to and can’t tell whose numbers they are without deliberately brute-forcing them. People who’ve never installed the app don’t get their phone numbers harvested just because they’re in someone else’s contacts. With raw uploads, the server would learn every number in every address book it sees.
- No plaintext address book on disk. The server only ever receives and stores hashes, so a leaked database dump doesn’t immediately give attackers a human-readable list of who knows whom.
The privacy posture is “minimum needed to do the matching,” not “uncrackable.”
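To make the "not uncrackable" point concrete, here is a small sketch of the brute-force attack described above. It assumes the attacker knows (or guesses) a number prefix and enumerates the remaining digits; the function names and the hex-encoded hash are illustrative assumptions. The demo searches only 100,000 candidates, but the same loop over a full national number space is a matter of compute, not cryptography.

```python
import hashlib
from typing import Optional


def phone_hash(e164: str) -> str:
    """SHA-256 over the normalised number (hex encoding is an assumption)."""
    return hashlib.sha256(e164.encode("utf-8")).hexdigest()


def reverse_hash(target: str, prefix: str, subscriber_digits: int) -> Optional[str]:
    """Try every number under `prefix` until one hashes to `target`."""
    for n in range(10 ** subscriber_digits):
        # Zero-pad so that, for example, 42 becomes "00042" for 5 digits.
        candidate = f"{prefix}{n:0{subscriber_digits}d}"
        if phone_hash(candidate) == target:
            return candidate
    return None


# Recover a number from its hash by enumerating the last five digits.
leaked = phone_hash("+442079460958")
recovered = reverse_hash(leaked, prefix="+4420794", subscriber_digits=5)
```

Nothing in the scheme stops this loop; the protection comes from the server having no reason to run it against hashes that match no account.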
Storage
DynamoDB, partition key phone_hash, single attribute user_id. The access pattern is point lookup by hash and nothing else: no scans, no range reads, no secondary indexes. Adding a user is one row insert at signup; deleting an account is one row delete. Read volume is the dominant cost: every install and every periodic refresh fires a batch read containing hundreds of items.
A relational database buys nothing here. There are no joins, no transactions, no secondary access patterns — just a billions-of-rows hash table where every read is WHERE phone_hash = ?. To handle the scale, a relational engine would need manual sharding by phone_hash, which is exactly what DynamoDB does for you. Pick the database whose shape matches the workload.
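One practical detail of the batched lookup: DynamoDB's BatchGetItem accepts at most 100 keys per call, so a request carrying a few hundred hashes has to be split into several calls. The sketch below shows that chunking under stated assumptions. `lookup_user_ids`, `chunked`, and `make_fake_store` are hypothetical names, and `batch_get` is an in-memory stand-in for the real table call, which would also need to retry any `UnprocessedKeys` the API returns.

```python
from typing import Callable, Iterable, Iterator

MAX_KEYS_PER_BATCH = 100  # BatchGetItem limit on keys per call

# Stand-in signature for one batched read: keys in, {phone_hash: user_id} out.
BatchGet = Callable[[list[str]], dict[str, str]]


def chunked(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def lookup_user_ids(phone_hashes: Iterable[str], batch_get: BatchGet) -> dict[str, str]:
    """Return only the hashes that have a row; misses are simply absent."""
    unique = list(dict.fromkeys(phone_hashes))  # dedupe, keep first-seen order
    matches: dict[str, str] = {}
    for chunk in chunked(unique, MAX_KEYS_PER_BATCH):
        matches.update(batch_get(chunk))
    return matches


def make_fake_store(rows: dict[str, str]) -> tuple[BatchGet, list[int]]:
    """In-memory stand-in for the Directory Store that records batch sizes."""
    calls: list[int] = []

    def batch_get(keys: list[str]) -> dict[str, str]:
        calls.append(len(keys))
        return {k: rows[k] for k in keys if k in rows}

    return batch_get, calls
```

With 250 distinct hashes, this issues three reads of 100, 100, and 50 keys. The chunks are independent, so a real service could issue them concurrently.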
The directory service is stateless and runs on Fargate — it’s just a fan-out over a batched key-value lookup with a thin layer of rate limiting per user.
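The per-user rate limiting could take many forms; one common choice is a token bucket charged per hash looked up, which also raises the cost of using the directory to enumerate phone numbers. The sketch below is an assumption, not the service's actual limiter: the class name, the capacity, and the refill rate are hypothetical. It keeps buckets in process memory, so a truly stateless fleet would need to move this state to a shared store or accept per-instance limits.

```python
import time
from typing import Callable


class PerUserRateLimiter:
    """Token bucket per user; each request spends `cost` tokens."""

    def __init__(
        self,
        capacity: float,
        refill_per_sec: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock  # injectable so tests can control time
        self.buckets: dict[str, tuple[float, float]] = {}  # user -> (tokens, updated_at)

    def allow(self, user_id: str, cost: float = 1.0) -> bool:
        now = self.clock()
        # New users start with a full bucket.
        tokens, updated_at = self.buckets.get(user_id, (self.capacity, now))
        # Refill for the time elapsed since the last request, capped at capacity.
        tokens = min(self.capacity, tokens + (now - updated_at) * self.refill_per_sec)
        allowed = tokens >= cost
        if allowed:
            tokens -= cost
        self.buckets[user_id] = (tokens, now)
        return allowed
```

A caller would pass the number of hashes in the request as `cost`, for example `limiter.allow(user_id, cost=len(phone_hashes))`, so a daily sync of a normal address book passes while bulk enumeration runs out of tokens.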
What this enables
With the cache populated, the rest of the system can treat the recipient’s user_id as known. Chapter 4 (sending) drops that user_id straight into the SEND frame; chapter 9 (groups) lets users build a group by picking from the same cached contacts. None of the messaging chapters need to talk to the directory on the hot path.