Design WhatsApp - Messaging Platform
Explain Like I'm 5
Imagine you have a magical walkie-talkie that can send messages to your friends instantly, no matter where they are in the world! You can type a message like "Hi!" and press send, and BOOM - your friend gets it right away, even if they're in another country! It's like having a super-fast mailman who delivers your letters in less than a second! You can also send pictures of your drawings, voice messages where you talk, and even see a checkmark that tells you if your friend has read your message! The magic part is that your messages are locked in a special box (encryption) so only you and your friend can read them - not even the mailman can peek inside!
Key Features
- Send and receive text messages in real time
- Send photos, videos, voice messages, and documents
- End-to-end encryption for privacy
- Message delivery status (sent, delivered, read receipts)
- Group chats with up to 256 participants
- Online/offline status and last seen
- Message history stored on device and backed up
Requirements
Functional Requirements:
- One-to-one messaging with real-time delivery
- Group messaging with multiple participants
- Send multimedia content (images, videos, voice)
- Message status indicators (sent ✓, delivered ✓✓, read 🔵🔵)
- End-to-end encryption for all messages
- Offline message storage and sync when online
Non-Functional Requirements:
- Low latency (<200ms message delivery)
- High availability (99.99% uptime)
- Scalable to 2 billion+ users
- Minimal data usage for users with limited bandwidth
- Support for offline message queuing
Capacity Estimation
Assumptions:
- 2 billion total users worldwide
- 500 million daily active users
- Each user sends 40 messages per day on average
- 20% of messages contain media (images, videos)
- Average message size: 100 bytes (text), 500KB (media)
Storage Calculation:
- Daily messages: 500M users × 40 msgs = 20 billion messages/day
- Text storage: 20B × 0.8 × 100 bytes = 1.6TB/day
- Media storage: 20B × 0.2 × 500KB = 2PB/day
- Total per year: (1.6TB + 2PB) × 365 ≈ 730PB/year
Bandwidth Calculation:
- Messages per second: 20B / 86400 ≈ 230,000 msgs/sec
- Peak traffic (3x average): ~700,000 msgs/sec
- Peak bandwidth: 700K msgs/sec × (100 bytes text + 100KB media averaged per message, i.e. 20% × 500KB) ≈ 70GB/sec
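The estimates above can be re-derived mechanically. The following is a minimal sketch that encodes the stated assumptions as constants; the class and method names are illustrative, and decimal units (1KB = 1,000 bytes) are assumed throughout.

```java
// Back-of-envelope capacity math using the assumptions listed above.
// All names here are illustrative; units are decimal (1 KB = 1,000 bytes).
final class CapacityEstimate {
    static final long DAILY_ACTIVE_USERS = 500_000_000L;
    static final long MESSAGES_PER_USER_PER_DAY = 40;
    static final double MEDIA_FRACTION = 0.20;
    static final long TEXT_BYTES = 100;
    static final long MEDIA_BYTES = 500_000;
    static final int PEAK_FACTOR = 3;

    static long messagesPerDay() {
        return DAILY_ACTIVE_USERS * MESSAGES_PER_USER_PER_DAY;
    }

    static double textBytesPerDay() {
        return messagesPerDay() * (1 - MEDIA_FRACTION) * TEXT_BYTES;
    }

    static double mediaBytesPerDay() {
        return messagesPerDay() * MEDIA_FRACTION * MEDIA_BYTES;
    }

    static double totalBytesPerYear() {
        return (textBytesPerDay() + mediaBytesPerDay()) * 365;
    }

    static double averageMessagesPerSecond() {
        return messagesPerDay() / 86_400.0;
    }

    // Text payload plus media size amortized over every message (20% x 500 KB = 100 KB).
    static double averageBytesPerMessage() {
        return TEXT_BYTES + MEDIA_FRACTION * MEDIA_BYTES;
    }

    static double peakBytesPerSecond() {
        return averageMessagesPerSecond() * PEAK_FACTOR * averageBytesPerMessage();
    }

    public static void main(String[] args) {
        System.out.printf("messages/day:     %d%n", messagesPerDay());
        System.out.printf("text TB/day:      %.2f%n", textBytesPerDay() / 1e12);
        System.out.printf("media PB/day:     %.2f%n", mediaBytesPerDay() / 1e15);
        System.out.printf("total PB/year:    %.1f%n", totalBytesPerYear() / 1e15);
        System.out.printf("avg msgs/sec:     %.0f%n", averageMessagesPerSecond());
        System.out.printf("peak GB/sec:      %.1f%n", peakBytesPerSecond() / 1e9);
    }
}
```

The unrounded peak comes out slightly under 70GB/sec because the list above rounds the peak rate up to 700K msgs/sec before multiplying.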
API Design
1. Send Message:
POST /api/v1/messages/send
Request:
{
"sender_id": 123456789,
"receiver_id": 987654321,
"message_type": "text",
"content": "Hello! How are you?",
"client_message_id": "msg_abc123",
"timestamp": 1698765432000
}
Response:
{
"message_id": "msg_server_xyz789",
"status": "sent",
"timestamp": 1698765432100
}
2. Receive Messages (WebSocket):
// Client subscribes to WebSocket
WS /api/v1/messages/subscribe?user_id=987654321
// Server pushes new messages
{
"message_id": "msg_server_xyz789",
"sender_id": 123456789,
"receiver_id": 987654321,
"message_type": "text",
"content": "Hello! How are you?",
"timestamp": 1698765432100,
"encrypted": true
}
3. Update Delivery Status:
POST /api/v1/messages/status
Request:
{
"message_id": "msg_server_xyz789",
"receiver_id": 987654321,
"status": "delivered" // or "read"
}
Response:
{
"success": true,
"updated_at": 1698765433000
}
4. Send Media:
POST /api/v1/media/upload
Content-Type: multipart/form-data
Request:
{
"sender_id": 123456789,
"file": <binary data>,
"file_type": "image/jpeg"
}
Response:
{
"media_id": "media_abc123",
"media_url": "https://cdn.whatsapp.com/...",
"thumbnail_url": "https://cdn.whatsapp.com/thumb/...",
"file_size": 524288
}
Database Design
Users Table (PostgreSQL):
CREATE TABLE users (
    user_id BIGINT PRIMARY KEY,
    phone_number VARCHAR(20) UNIQUE NOT NULL,  -- UNIQUE already creates an index on phone_number
    username VARCHAR(100),
    profile_photo_url VARCHAR(255),
    status_message TEXT,
    is_online BOOLEAN DEFAULT FALSE,
    last_seen TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- PostgreSQL declares secondary indexes separately, not inline in CREATE TABLE
CREATE INDEX idx_online ON users (is_online);
Messages Table (Cassandra - High Write Throughput):
CREATE TABLE messages (
    message_id UUID,
    sender_id BIGINT,
    receiver_id BIGINT,
    conversation_id UUID,
    message_type TEXT,        -- 'text', 'image', 'video', 'voice'
    content TEXT,
    media_url TEXT,
    encrypted_content BLOB,
    status TEXT,              -- 'sent', 'delivered', 'read'
    timestamp TIMESTAMP,
    -- A table has exactly one PRIMARY KEY: partition by conversation, cluster by time
    PRIMARY KEY ((conversation_id), timestamp, message_id)
) WITH CLUSTERING ORDER BY (timestamp DESC, message_id ASC);
-- Query messages by conversation efficiently (one partition per conversation)
-- Recent messages appear first
Conversations Table:
CREATE TABLE conversations (
    conversation_id UUID PRIMARY KEY,
    participant_ids LIST<BIGINT>,
    conversation_type TEXT,   -- 'one_to_one', 'group'
    last_message_id UUID,
    last_message_timestamp TIMESTAMP,
    created_at TIMESTAMP
);
-- CQL has no inline INDEX clause; a secondary index on the list's values is created separately
CREATE INDEX idx_participants ON conversations (participant_ids);
Groups Table:
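CREATE TABLE groups (
    group_id UUID PRIMARY KEY,
    group_name TEXT,          -- CQL text types take no length argument
    group_icon_url TEXT,
    admin_ids LIST<BIGINT>,
    member_ids LIST<BIGINT>,
    created_at TIMESTAMP
);
-- Secondary index on list values, declared separately as CQL requires
CREATE INDEX idx_members ON groups (member_ids);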
High-Level Architecture
┌──────────────┐      ┌──────────────┐
│   Client 1   │      │   Client 2   │
│   (Mobile)   │      │   (Mobile)   │
└──────┬───────┘      └──────┬───────┘
       │                     │
       │  WebSocket / HTTP   │
       └──────────┬──────────┘
                  │
           ┌──────▼──────┐
           │    Load     │
           │  Balancer   │
           └──────┬──────┘
                  │
       ┌──────────┼──────────────┐
       │          │              │
┌──────▼──────┐ ┌─▼────────┐ ┌───▼───────┐
│  WebSocket  │ │ Message  │ │   Media   │
│   Server    │ │ Service  │ │  Service  │
│ (Real-time) │ │  (API)   │ │ (Upload)  │
└──────┬──────┘ └─┬────────┘ └───┬───────┘
       │          │              │
       └──────────┼──────────────┘
                  │
       ┌──────────┼──────────────┐
       │          │              │
┌──────▼──────┐ ┌─▼────────┐ ┌───▼───────┐
│    Redis    │ │Cassandra │ │PostgreSQL │
│   (Online   │ │(Messages │ │  (Users)  │
│  Presence)  │ │ Storage) │ │           │
└─────────────┘ └─┬────────┘ └───────────┘
                  │
            ┌─────▼─────┐
            │   Kafka   │
            │ (Events)  │
            └─────┬─────┘
                  │
          ┌───────┴────────┐
          │                │
     ┌────▼─────┐     ┌────▼────┐
     │    S3    │     │   CDN   │
     │  (Media  │     │ (Media  │
     │ Storage) │     │Delivery)│
     └──────────┘     └─────────┘
Deep Dive: Key Components
1. Message Delivery System (Real-Time)
Messages must be delivered instantly when the recipient is online, or stored for later delivery when offline.
import java.security.PublicKey;
import java.util.List;
import java.util.UUID;

// WebSocketConnectionPool, MessageQueue, CassandraClient, RedisCache, the
// KafkaProducer wrapper and CryptoUtil are application-level helpers assumed
// by this design, not library classes.
public class MessageDeliveryService {
    private WebSocketConnectionPool wsPool;
    private MessageQueue offlineQueue;
    private CassandraClient cassandra;
    private RedisCache redis;
    private KafkaProducer kafkaProducer;

    /**
     * Sends a message from sender to receiver.
     * Delivers immediately if online, queues if offline.
     */
    public MessageResponse sendMessage(SendMessageRequest request) {
        // 1. Generate unique message ID
        String messageId = UUID.randomUUID().toString();
        long timestamp = System.currentTimeMillis();

        // 2. Encrypt message content.
        // Shown here for illustration only: with real end-to-end encryption this
        // step runs on the sender's device and the server relays ciphertext.
        byte[] encryptedContent = encryptMessage(
            request.getContent(),
            request.getReceiverId()
        );

        // 3. Create message object
        Message message = new Message(
            messageId,
            request.getSenderId(),
            request.getReceiverId(),
            request.getMessageType(),
            encryptedContent,
            timestamp
        );

        // 4. Store message in Cassandra (persistent storage) before any delivery attempt
        cassandra.insertMessage(message);

        // 5. Check if receiver is online (a missing key means offline)
        boolean isReceiverOnline =
            Boolean.TRUE.equals(redis.get("user:online:" + request.getReceiverId()));

        if (isReceiverOnline) {
            // 6a. Receiver is ONLINE - send via WebSocket immediately
            WebSocketConnection receiverConnection = wsPool.getConnection(
                request.getReceiverId()
            );
            if (receiverConnection != null && receiverConnection.isOpen()) {
                receiverConnection.send(message);
                message.setStatus("delivered");
                // Update status in database
                cassandra.updateMessageStatus(messageId, "delivered");
            } else {
                // Connection dropped, queue for later
                queueOfflineMessage(request.getReceiverId(), messageId);
                message.setStatus("sent");
            }
        } else {
            // 6b. Receiver is OFFLINE - queue message
            queueOfflineMessage(request.getReceiverId(), messageId);
            message.setStatus("sent");
        }

        // 7. Send delivery receipt to sender
        notifySender(request.getSenderId(), messageId, message.getStatus());

        // 8. Publish event to Kafka for analytics
        kafkaProducer.send("message-sent", new MessageEvent(
            messageId,
            request.getSenderId(),
            request.getReceiverId(),
            timestamp
        ));

        return new MessageResponse(messageId, message.getStatus(), timestamp);
    }

    /**
     * Handles user coming online - delivers queued messages.
     */
    public void handleUserOnline(long userId) {
        System.out.println("User " + userId + " came online");

        // 1. Mark user as online in Redis
        redis.set("user:online:" + userId, true, 3600); // 1 hour TTL

        // 2. Get all queued messages for this user
        List<String> queuedMessageIds = offlineQueue.getMessages(userId);
        if (queuedMessageIds.isEmpty()) {
            return;
        }
        System.out.println("Delivering " + queuedMessageIds.size()
            + " queued messages to user " + userId);

        // 3. Get WebSocket connection
        WebSocketConnection connection = wsPool.getConnection(userId);
        if (connection == null || !connection.isOpen()) {
            System.err.println("Failed to get connection for user " + userId);
            return;
        }

        // 4. Fetch messages from Cassandra and deliver
        for (String messageId : queuedMessageIds) {
            Message message = cassandra.getMessage(messageId);
            if (message != null) {
                // Send message via WebSocket
                connection.send(message);
                // Update status to delivered
                cassandra.updateMessageStatus(messageId, "delivered");
                // Notify sender about delivery
                notifySender(message.getSenderId(), messageId, "delivered");
            }
            // Remove from both queue structures so the entry is not redelivered
            offlineQueue.removeMessage(userId, messageId);
            redis.zrem("offline:messages:" + userId, messageId);
        }
    }

    /**
     * Queues message for offline user.
     */
    private void queueOfflineMessage(long userId, String messageId) {
        offlineQueue.addMessage(userId, messageId);
        // Also store in Redis sorted set with timestamp for quick access
        redis.zadd("offline:messages:" + userId, System.currentTimeMillis(), messageId);
    }

    /**
     * Encrypts message using receiver's public key (end-to-end encryption).
     */
    private byte[] encryptMessage(String content, long receiverId) {
        // In reality: use Signal Protocol or similar E2E encryption
        // For demo, simplified
        PublicKey receiverPublicKey = getPublicKey(receiverId);
        return CryptoUtil.encrypt(content.getBytes(), receiverPublicKey);
    }

    /**
     * Notifies sender about message status update.
     */
    private void notifySender(long senderId, String messageId, String status) {
        WebSocketConnection senderConnection = wsPool.getConnection(senderId);
        if (senderConnection != null && senderConnection.isOpen()) {
            senderConnection.send(new StatusUpdate(messageId, status));
        }
    }
}
2. Group Messaging
Group messages must be delivered to multiple recipients efficiently. Use fan-out pattern to send to all members.
import com.google.common.collect.Lists; // Guava, for Lists.partition
import java.util.List;
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class GroupMessagingService {
    private static final long SYSTEM_SENDER_ID = 0;

    private MessageDeliveryService messageDelivery;
    private CassandraClient cassandra;
    private RedisCache redis;
    private WebSocketConnectionPool wsPool;
    private MessageQueue offlineQueue;
    // One shared pool instead of a new pool per message
    private final ExecutorService fanOutExecutor = Executors.newFixedThreadPool(10);

    /**
     * Sends a message to a group chat.
     * Fans out message to all group members.
     */
    public GroupMessageResponse sendGroupMessage(GroupMessageRequest request) {
        String groupId = request.getGroupId();
        long senderId = request.getSenderId();

        // 1. Verify sender is a member of the group
        //    (system messages use sender 0 and skip the membership check)
        Group group = getGroup(groupId);
        if (group == null) {
            throw new IllegalArgumentException("Unknown group: " + groupId);
        }
        if (senderId != SYSTEM_SENDER_ID && !group.getMembers().contains(senderId)) {
            throw new UnauthorizedException("User not in group");
        }

        // 2. Generate message ID and timestamp
        String messageId = UUID.randomUUID().toString();
        long timestamp = System.currentTimeMillis();

        // 3. Store group message once
        GroupMessage groupMessage = new GroupMessage(
            messageId,
            groupId,
            senderId,
            request.getContent(),
            timestamp
        );
        cassandra.insertGroupMessage(groupMessage);

        // 4. Fan out to all group members (except sender)
        List<Long> recipients = group.getMembers().stream()
            .filter(memberId -> memberId != senderId)
            .collect(Collectors.toList());
        System.out.println("Fanning out message to " + recipients.size() + " members");

        // 5. Use parallel processing for large groups
        if (recipients.size() > 50) {
            // Large group: use async processing
            fanOutAsync(messageId, recipients, groupMessage);
        } else {
            // Small group: send synchronously
            fanOutSync(messageId, recipients, groupMessage);
        }

        // 6. Return success to sender
        return new GroupMessageResponse(messageId, "sent", timestamp);
    }

    /**
     * Synchronous fan-out for small groups.
     */
    private void fanOutSync(String messageId, List<Long> recipients, GroupMessage message) {
        for (Long recipientId : recipients) {
            try {
                // Check if recipient is online (a missing key means offline)
                boolean isOnline = Boolean.TRUE.equals(redis.get("user:online:" + recipientId));
                WebSocketConnection conn = isOnline ? wsPool.getConnection(recipientId) : null;
                if (conn != null && conn.isOpen()) {
                    // Deliver via WebSocket
                    conn.send(message);
                } else {
                    // Offline, or the connection dropped: queue for later delivery
                    offlineQueue.addMessage(recipientId, messageId);
                }
            } catch (Exception e) {
                System.err.println("Failed to deliver to " + recipientId + ": " + e.getMessage());
            }
        }
    }

    /**
     * Asynchronous fan-out for large groups (>50 members).
     */
    private void fanOutAsync(String messageId, List<Long> recipients, GroupMessage message) {
        // Split recipients into batches of 100
        int batchSize = 100;
        List<List<Long>> batches = Lists.partition(recipients, batchSize);

        // Process batches in parallel on the shared thread pool
        for (List<Long> batch : batches) {
            fanOutExecutor.submit(() -> fanOutSync(messageId, batch, message));
        }
    }

    /**
     * Gets group information from cache or database.
     */
    private Group getGroup(String groupId) {
        // Try cache first
        String cacheKey = "group:" + groupId;
        Group cached = redis.get(cacheKey);
        if (cached != null) {
            return cached;
        }

        // Cache miss - fetch from database
        Group group = cassandra.getGroup(groupId);
        if (group != null) {
            // Cache for 1 hour
            redis.setWithExpiry(cacheKey, group, 3600);
        }
        return group;
    }

    /**
     * Adds member to group.
     */
    public void addMemberToGroup(String groupId, long userId, long adminId) {
        Group group = getGroup(groupId);

        // Verify admin permissions
        if (!group.getAdmins().contains(adminId)) {
            throw new UnauthorizedException("Only admins can add members");
        }

        // Add member to group
        group.getMembers().add(userId);
        cassandra.updateGroup(group);

        // Invalidate cache
        redis.delete("group:" + groupId);

        // Send system message to group
        sendSystemMessage(groupId, userId + " joined the group");
    }

    /**
     * Sends system message to group (e.g., "Alice added Bob").
     */
    private void sendSystemMessage(String groupId, String content) {
        GroupMessageRequest systemMsg = new GroupMessageRequest(
            groupId,
            SYSTEM_SENDER_ID, // System sender
            "system",
            content
        );
        sendGroupMessage(systemMsg);
    }
}
3. Message Status & Read Receipts
Track message delivery and read status with checkmarks (✓ sent, ✓✓ delivered, 🔵🔵 read).
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class MessageStatusService {
    private CassandraClient cassandra;
    private WebSocketConnectionPool wsPool;
    private RedisCache redis;

    /**
     * Updates message status when user receives/reads a message.
     */
    public void updateMessageStatus(String messageId, long userId, MessageStatus newStatus) {
        // 1. Fetch message from database
        Message message = cassandra.getMessage(messageId);
        if (message == null) {
            System.err.println("Message not found: " + messageId);
            return;
        }

        // 2. Verify user is the receiver
        if (message.getReceiverId() != userId) {
            System.err.println("User " + userId + " not authorized for message " + messageId);
            return;
        }

        // 3. Update status (only allow forward progression)
        MessageStatus currentStatus = message.getStatus();
        if (!canTransition(currentStatus, newStatus)) {
            System.err.println("Invalid status transition: " + currentStatus + " -> " + newStatus);
            return;
        }

        // 4. Update in database
        message.setStatus(newStatus);
        message.setStatusUpdatedAt(System.currentTimeMillis());
        cassandra.updateMessage(message);

        // 5. Send status update to sender (blue checkmarks!)
        notifySenderOfStatusChange(message.getSenderId(), messageId, newStatus);

        // 6. Update last seen if status is "read"
        if (newStatus == MessageStatus.READ) {
            updateLastSeen(userId);
        }
    }

    /**
     * Checks if status transition is valid.
     * Status progression: SENT -> DELIVERED -> READ
     */
    private boolean canTransition(MessageStatus current, MessageStatus next) {
        // Can only move forward, not backward
        return getStatusLevel(next) > getStatusLevel(current);
    }

    private int getStatusLevel(MessageStatus status) {
        switch (status) {
            case SENT: return 1;
            case DELIVERED: return 2;
            case READ: return 3;
            default: return 0;
        }
    }

    /**
     * Sends status update notification to sender.
     * This is how the blue checkmarks appear!
     */
    private void notifySenderOfStatusChange(long senderId, String messageId, MessageStatus status) {
        WebSocketConnection senderConn = wsPool.getConnection(senderId);
        if (senderConn != null && senderConn.isOpen()) {
            // Send status update message
            StatusUpdateNotification notification = new StatusUpdateNotification(
                messageId,
                status.toString().toLowerCase(),
                System.currentTimeMillis()
            );
            senderConn.send(notification);
            System.out.println("Sent status update to sender " + senderId + ": "
                + messageId + " is now " + status);
        } else {
            // Sender offline - they'll see the updated status when they come online
            System.out.println("Sender " + senderId + " offline, will sync later");
        }
    }

    /**
     * Marks multiple messages as read at once (bulk operation).
     */
    public void markConversationAsRead(long userId, String conversationId) {
        // 1. Get all unread messages in conversation
        List<Message> unreadMessages = cassandra.getUnreadMessages(conversationId, userId);
        if (unreadMessages.isEmpty()) {
            return;
        }
        System.out.println("Marking " + unreadMessages.size()
            + " messages as read in conversation " + conversationId);

        // 2. Batch update all messages to READ status
        List<String> messageIds = unreadMessages.stream()
            .map(Message::getMessageId)
            .collect(Collectors.toList());
        cassandra.batchUpdateStatus(messageIds, MessageStatus.READ);

        // 3. Notify senders (group by sender to avoid duplicate notifications)
        Map<Long, List<String>> messagesBySender = unreadMessages.stream()
            .collect(Collectors.groupingBy(
                Message::getSenderId,
                Collectors.mapping(Message::getMessageId, Collectors.toList())
            ));
        for (Map.Entry<Long, List<String>> entry : messagesBySender.entrySet()) {
            long senderId = entry.getKey();
            List<String> senderMessageIds = entry.getValue();
            // Send single notification with all message IDs
            notifyBulkStatusChange(senderId, senderMessageIds, MessageStatus.READ);
        }

        // 4. Update last seen
        updateLastSeen(userId);
    }

    /**
     * Sends bulk status update for multiple messages.
     */
    private void notifyBulkStatusChange(long senderId, List<String> messageIds, MessageStatus status) {
        WebSocketConnection conn = wsPool.getConnection(senderId);
        if (conn != null && conn.isOpen()) {
            BulkStatusUpdate update = new BulkStatusUpdate(
                messageIds,
                status.toString().toLowerCase(),
                System.currentTimeMillis()
            );
            conn.send(update);
        }
    }

    /**
     * Updates user's last seen timestamp.
     */
    private void updateLastSeen(long userId) {
        long timestamp = System.currentTimeMillis();
        // Update in Redis for fast access
        redis.set("user:last_seen:" + userId, timestamp);
        // Also update in database (async)
        CompletableFuture.runAsync(() -> cassandra.updateUserLastSeen(userId, timestamp));
    }

    /**
     * Gets user's online status and last seen.
     */
    public UserStatus getUserStatus(long userId) {
        // Check if online (a missing key means offline)
        boolean isOnline = Boolean.TRUE.equals(redis.get("user:online:" + userId));
        if (isOnline) {
            return new UserStatus(userId, true, "online");
        }

        // Get last seen from cache
        Long lastSeen = redis.get("user:last_seen:" + userId);
        if (lastSeen == null) {
            // Cache miss - fetch from database
            lastSeen = cassandra.getUserLastSeen(userId);
            if (lastSeen != null) {
                redis.set("user:last_seen:" + userId, lastSeen);
            }
        }
        return new UserStatus(userId, false, formatLastSeen(lastSeen));
    }

    /**
     * Formats last seen timestamp (e.g., "5 minutes ago", "yesterday").
     */
    private String formatLastSeen(Long timestamp) {
        if (timestamp == null) {
            return "last seen a long time ago";
        }
        long now = System.currentTimeMillis();
        long diffSeconds = (now - timestamp) / 1000;
        if (diffSeconds < 60) {
            return "last seen just now";
        } else if (diffSeconds < 3600) {
            return "last seen " + (diffSeconds / 60) + " minutes ago";
        } else if (diffSeconds < 86400) {
            return "last seen " + (diffSeconds / 3600) + " hours ago";
        } else {
            return "last seen " + (diffSeconds / 86400) + " days ago";
        }
    }

    enum MessageStatus {
        SENT,
        DELIVERED,
        READ
    }
}
Trade-offs and Optimizations
1. WebSocket vs Long Polling
WebSocket: True real-time, persistent connection, better for active users. Long Polling: Fallback for restricted networks, higher latency. Use WebSocket with long polling fallback.
2. Store-and-Forward vs Direct Delivery
Store-and-Forward: Always persist message first (WhatsApp approach), guarantees delivery even if server crashes. Direct Delivery: Faster but can lose messages. WhatsApp uses store-and-forward.
3. Message Storage Duration
Store forever: Users can search old messages, expensive storage. Store limited time: Cheaper, but users lose history. WhatsApp stores messages on device, optional cloud backup.
4. Group Message Fan-out
Synchronous: Deliver to all immediately, slower for large groups. Asynchronous: Queue for background workers, faster response but delayed delivery. Use async for groups >50 members.
Optimizations:
- ✓ Use Redis for online presence with a TTL refreshed by heartbeats (the key auto-expires once a disconnected user stops sending them)
- ✓ Batch status updates (send one notification for 10 read receipts)
- ✓ Use Cassandra for messages (optimized for write-heavy workloads)
- ✓ Implement message compression (savings depend on the payload; the 60% figure applies at best to text, since media is already compressed)
- ✓ Use CDN for media delivery (images, videos cached at edge)
- ✓ Connection pooling for WebSocket servers (handle millions of connections)
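The presence-with-TTL optimization can be illustrated without a Redis server. The sketch below is an in-memory stand-in, not Redis client code: `PresenceTracker`, `heartbeat`, and the injected clock are hypothetical names chosen for the example. Each heartbeat pushes the expiry forward, which is the same effect as re-setting a Redis key with an expiry.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// In-memory stand-in for "presence key with TTL". Hypothetical helper, not a Redis API.
final class PresenceTracker {
    private final Map<Long, Long> expiresAtMillis = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injected so expiry can be tested deterministically

    PresenceTracker(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    // Called on connect and on every keep-alive ping: refreshes the TTL.
    void heartbeat(long userId) {
        expiresAtMillis.put(userId, clock.getAsLong() + ttlMillis);
    }

    // Called on a clean disconnect; a crashed client simply stops sending heartbeats.
    void disconnect(long userId) {
        expiresAtMillis.remove(userId);
    }

    // A user is online only while the most recent heartbeat is younger than the TTL.
    boolean isOnline(long userId) {
        Long expiresAt = expiresAtMillis.get(userId);
        if (expiresAt == null) {
            return false;
        }
        if (clock.getAsLong() >= expiresAt) {
            expiresAtMillis.remove(userId, expiresAt); // lazy cleanup of the stale entry
            return false;
        }
        return true;
    }
}
```

A short TTL (tens of seconds) with frequent heartbeats detects dropped connections quickly at the cost of more writes; a long TTL such as one hour is cheaper but leaves crashed clients marked online for longer.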
Follow-up Interview Questions
Q: How do you implement end-to-end encryption?
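A: Use the Signal Protocol. Each user has a public/private key pair; private keys stay on the device and only public keys are uploaded to the server. The two clients derive shared session keys from these key pairs and encrypt each message with them, so the server relays ciphertext it cannot read. The protocol's key ratchet keeps rotating message keys, which provides forward secrecy: compromising one key does not expose earlier messages.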
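The core idea, two devices agreeing on a key the server never sees, can be sketched with the JDK's built-in crypto (Java 11+). This is deliberately not the Signal Protocol: it has no prekeys, no ratchet, and no authentication of public keys. It only shows X25519 key agreement followed by AES-GCM encryption; the class name `E2eSketch` and the SHA-256 key derivation step are simplifying assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.MessageDigest;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyAgreement;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Simplified illustration only: static Diffie-Hellman + AES-GCM, NOT the Signal Protocol.
final class E2eSketch {
    private static final SecureRandom RANDOM = new SecureRandom();
    private static final int NONCE_BYTES = 12;
    private static final int TAG_BITS = 128;

    // Generated on each device; the private half never leaves it.
    static KeyPair newKeyPair() throws GeneralSecurityException {
        return KeyPairGenerator.getInstance("X25519").generateKeyPair();
    }

    // Both sides compute the same secret from their own private key and the peer's public key.
    static SecretKeySpec deriveKey(PrivateKey mine, PublicKey theirs) throws GeneralSecurityException {
        KeyAgreement agreement = KeyAgreement.getInstance("X25519");
        agreement.init(mine);
        agreement.doPhase(theirs, true);
        // Assumption: a plain SHA-256 stands in for a proper KDF such as HKDF.
        byte[] keyBytes = MessageDigest.getInstance("SHA-256").digest(agreement.generateSecret());
        return new SecretKeySpec(keyBytes, "AES");
    }

    // Output layout: 12-byte random nonce followed by ciphertext and authentication tag.
    static byte[] encrypt(SecretKeySpec key, String plaintext) throws GeneralSecurityException {
        byte[] nonce = new byte[NONCE_BYTES];
        RANDOM.nextBytes(nonce);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, nonce));
        byte[] ciphertext = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
        byte[] out = new byte[NONCE_BYTES + ciphertext.length];
        System.arraycopy(nonce, 0, out, 0, NONCE_BYTES);
        System.arraycopy(ciphertext, 0, out, NONCE_BYTES, ciphertext.length);
        return out;
    }

    // Throws if the message was tampered with or the key is wrong (GCM tag check fails).
    static String decrypt(SecretKeySpec key, byte[] message) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, message, 0, NONCE_BYTES));
        byte[] plaintext = cipher.doFinal(message, NONCE_BYTES, message.length - NONCE_BYTES);
        return new String(plaintext, StandardCharsets.UTF_8);
    }
}
```

In use, the sender calls `deriveKey(senderPrivate, receiverPublic)` and the receiver calls `deriveKey(receiverPrivate, senderPublic)`; the server only ever stores and forwards the byte array returned by `encrypt`.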
Q: How do you handle messages when both users are offline?
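A: While the sender is offline nothing reaches the server, so the sender's client keeps the message locally as pending and uploads it on reconnect. The server then stores the message in Cassandra, adds it to the receiver's offline queue in Redis, and acknowledges it with 'sent'. When the receiver comes online, the server delivers from the queue and the sender gets 'delivered' (immediately if connected, otherwise on its next sync).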
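The server-side half of that flow, queue while the receiver is offline and drain in order on reconnect, can be sketched with plain collections. `OfflineInbox` and its method names are hypothetical; in the design above the queue would live in Redis and the status in Cassandra.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory illustration of a per-receiver offline queue. Hypothetical helper class.
final class OfflineInbox {
    private final Map<Long, Deque<String>> queuedByReceiver = new HashMap<>();
    private final Map<String, String> statusByMessage = new HashMap<>();

    // The server accepted a message while the receiver was offline: status is "sent".
    synchronized void storeForLater(long receiverId, String messageId) {
        queuedByReceiver.computeIfAbsent(receiverId, id -> new ArrayDeque<>()).addLast(messageId);
        statusByMessage.put(messageId, "sent");
    }

    // The receiver reconnected: return queued ids oldest-first and mark them "delivered".
    synchronized List<String> drainOnConnect(long receiverId) {
        Deque<String> queue = queuedByReceiver.remove(receiverId);
        List<String> delivered = new ArrayList<>();
        if (queue != null) {
            for (String messageId : queue) {
                statusByMessage.put(messageId, "delivered");
                delivered.add(messageId);
            }
        }
        return delivered;
    }

    // Returns null for ids the server has never seen.
    synchronized String statusOf(String messageId) {
        return statusByMessage.get(messageId);
    }
}
```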
Q: How would you implement message search?
A: Local search: Index messages on device using SQLite FTS (Full-Text Search). Server search: Use Elasticsearch for backed-up messages, shard by user_id. Challenge with E2E encryption: Can't search encrypted content on server. Solution: Client-side decryption + search, or encrypt search index with user's key.
Q: How do you handle network failures during message send?
A: Implement a retry mechanism with exponential backoff. Store pending messages locally with 'sending' status. Keep trying to send (retry after 2s, then 4s, 8s, and so on, capped at 30s between attempts). If it still fails after 5 minutes, show a 'failed' icon and let the user retry manually. Use client_message_id for deduplication so that a retried send is not stored or delivered twice.
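A minimal sketch of that schedule and of server-side deduplication follows. The constants mirror the numbers in the answer (2s initial delay, 30s cap, 5-minute budget); `RetryPolicy` and `DedupFilter` are hypothetical names, and a production client would normally add random jitter to the delays.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Client side: capped exponential backoff. Hypothetical helper, no jitter for clarity.
final class RetryPolicy {
    static final long INITIAL_DELAY_MS = 2_000;
    static final long MAX_DELAY_MS = 30_000;
    static final long GIVE_UP_AFTER_MS = 300_000; // 5 minutes

    // Delay before retry number `attempt` (0-based): 2s, 4s, 8s, 16s, then 30s.
    static long delayMillis(int attempt) {
        int shift = Math.min(Math.max(attempt, 0), 20); // bound the shift to avoid overflow
        return Math.min(INITIAL_DELAY_MS << shift, MAX_DELAY_MS);
    }

    // All retry delays that fit inside the give-up budget.
    static List<Long> schedule() {
        List<Long> delays = new ArrayList<>();
        long elapsed = 0;
        for (int attempt = 0; ; attempt++) {
            long delay = delayMillis(attempt);
            if (elapsed + delay > GIVE_UP_AFTER_MS) {
                break; // next retry would exceed the budget: mark the message as failed
            }
            delays.add(delay);
            elapsed += delay;
        }
        return delays;
    }
}

// Server side: accept each client_message_id once, so retries are idempotent.
final class DedupFilter {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // Returns true the first time an id is seen, false for every retry of it.
    boolean firstTime(String clientMessageId) {
        return seen.add(clientMessageId);
    }
}
```

In a real deployment the seen-id set would need an expiry (for example, a TTL a little longer than the client's retry window) so that it does not grow without bound.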
Q: How would you scale WebSocket servers to millions of connections?
A: Use multiple WebSocket servers behind load balancer with consistent hashing (route user_id to specific server). Each server handles ~100K connections. Use Redis Pub/Sub for server-to-server communication (route messages between servers). Implement connection pooling and keep-alive pings. Monitor CPU/memory, auto-scale based on active connections.
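Consistent hashing for routing a user_id to a WebSocket server can be sketched with a sorted map as the hash ring. The class name, the virtual-node count, and the use of SHA-256 as the hash function are assumptions made for the example; any well-distributed hash would do.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hash ring with virtual nodes. Hypothetical sketch, not tied to any load balancer product.
final class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    // Each server is placed on the ring many times to even out the load.
    void addServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    void removeServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(server + "#" + i));
        }
    }

    // Walk clockwise from the user's position to the next virtual node, wrapping around.
    String serverFor(long userId) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no servers registered");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash("user:" + userId));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // First 8 bytes of SHA-256 interpreted as a long.
    private static long hash(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(key.getBytes(StandardCharsets.UTF_8));
            long value = 0;
            for (int i = 0; i < 8; i++) {
                value = (value << 8) | (digest[i] & 0xFF);
            }
            return value;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }
}
```

The property that matters for WebSocket fleets is that removing or adding one server remaps only the users on the affected ring segments, so most clients keep their existing connection target during scaling events.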
Real-World Implementation
WhatsApp's architecture, as publicly described (items marked as this design's choice are not confirmed WhatsApp components):
- ✓ Erlang for messaging servers (handles millions of concurrent connections)
- ✓ XMPP protocol (modified) for real-time communication
- ✓ Signal Protocol for end-to-end encryption
- ✓ Cassandra for message storage (this design's choice; WhatsApp's engineering talks describe Erlang's built-in Mnesia database for server-side state)
- ✓ Redis for online presence and message queues (this design's choice)
- ✓ FreeBSD servers optimized for network throughput
- ✓ Client-side SQLite for local message storage
- ✓ Media stored in Facebook's infrastructure (S3-like)