Case Study: Chat System
Design a real-time chat system (like WhatsApp/Slack). Core: WebSockets for real-time messaging, message queue for delivery guarantees, database for persistence. Key features: 1:1 and group chat, online presence, read receipts, push notifications. Use WebSocket servers behind a load balancer, Kafka for message routing, Cassandra for message storage (write-heavy, time-series).
System Design
Step 1: Requirements
Functional:

- 1:1 messaging and group chat
- Online/offline status
- Message history
- Read receipts
- Push notifications for offline users

Non-Functional:

- Low latency (< 100 ms delivery)
- At-least-once delivery
- Messages must be persistent and ordered
- Handle millions of concurrent connections
Step 2: High-Level Architecture
┌──────────┐    ┌──────────────┐    ┌──────────────┐
│  Client  │←ws→│ Chat Server  │←──→│   Message    │
│  (App)   │    │ (WebSocket)  │    │    Queue     │
└──────────┘    └──────┬───────┘    │   (Kafka)    │
                       │            └──────┬───────┘
                ┌──────┴───────┐           │
                │   Presence   │    ┌──────┴───────┐
                │   Service    │    │  Message DB  │
                │   (Redis)    │    │ (Cassandra)  │
                └──────────────┘    └──────────────┘
Step 3: Real-Time Communication
A WebSocket gives each client a persistent, bidirectional connection to a chat server:
1. Client connects via WebSocket
2. Server maintains mapping: userId → WebSocket connection
3. On message send:
sender → Chat Server → route to recipient's Chat Server → recipient
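The userId → connection mapping from step 2 can be sketched as a small per-server registry. This is a minimal illustration, not a production design: `ConnectionRegistry` and its methods are hypothetical names, and a plain callable stands in for a live WebSocket so the example runs without a network library.

```python
from typing import Callable, Dict, List

# Assumption: a callable that accepts a payload stands in for a live WebSocket.
Connection = Callable[[str], None]


class ConnectionRegistry:
    """Tracks which local connection belongs to which user on one chat server."""

    def __init__(self) -> None:
        self._connections: Dict[str, Connection] = {}

    def register(self, user_id: str, conn: Connection) -> None:
        # Called when a client completes its WebSocket handshake.
        self._connections[user_id] = conn

    def unregister(self, user_id: str) -> None:
        # Called when the connection closes; a missing user is not an error.
        self._connections.pop(user_id, None)

    def deliver(self, user_id: str, payload: str) -> bool:
        """Send payload if the user is connected here; report whether it was sent."""
        conn = self._connections.get(user_id)
        if conn is None:
            # Not on this server: the caller routes elsewhere or falls back to push.
            return False
        conn(payload)
        return True


# Usage: "bob" is connected to this server, "carol" is not.
inbox: List[str] = []
registry = ConnectionRegistry()
registry.register("bob", inbox.append)
delivered_bob = registry.deliver("bob", "hi bob")
delivered_carol = registry.deliver("carol", "hi carol")
```

A `False` return is the signal to hand the message to whichever server holds the recipient's connection, or to treat the recipient as offline.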
Why not HTTP polling?

| Method | Latency | Resource Usage |
|--------|---------|----------------|
| HTTP Polling | High (interval) | Wasteful |
| Long Polling | Medium | Moderate |
| WebSocket | Low (real-time) | Efficient |
Step 4: Message Flow
1. Sender sends message via WebSocket
2. Chat Server publishes the message to Kafka (durability)
3. Chat Server stores in Cassandra (persistence)
4. If recipient online → deliver via WebSocket
5. If recipient offline → push notification
6. Recipient's client acknowledges delivery; once the user opens the message, the client sends a read receipt back to the sender
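The flow above can be sketched with in-memory stand-ins. Everything here is an assumption made for illustration: a list stands in for the Kafka topic, a dict for the Cassandra table, and `ChatServer.handle_send` is a hypothetical name. The duplicate check is also an addition: with at-least-once delivery a sender may retry, so the sketch treats `message_id` as an idempotency key.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class Message:
    message_id: str
    chat_id: str
    sender_id: str
    recipient_id: str
    content: str


@dataclass
class ChatServer:
    # In-memory stand-ins (assumptions), not real Kafka or Cassandra clients.
    queue_log: List[Message] = field(default_factory=list)
    store: Dict[str, Message] = field(default_factory=dict)
    online: Set[str] = field(default_factory=set)
    delivered: List[Message] = field(default_factory=list)
    pushed: List[str] = field(default_factory=list)

    def handle_send(self, msg: Message) -> str:
        if msg.message_id in self.store:
            return "duplicate"                 # retry of a message already processed
        self.queue_log.append(msg)             # step 2: append to the durable log
        self.store[msg.message_id] = msg       # step 3: persist for history
        if msg.recipient_id in self.online:
            self.delivered.append(msg)         # step 4: deliver over the live connection
            return "delivered"
        self.pushed.append(msg.recipient_id)   # step 5: notify the offline recipient
        return "pushed"


# Usage: bob is online, carol is offline, and alice retries her first message.
server = ChatServer(online={"bob"})
first = server.handle_send(Message("m1", "c1", "alice", "bob", "hi"))
retry = server.handle_send(Message("m1", "c1", "alice", "bob", "hi"))
offline = server.handle_send(Message("m2", "c2", "alice", "carol", "hello"))
```

In a real deployment the two writes would not be atomic, so the retry path matters: a sender that never sees an acknowledgement resends, and the idempotency check keeps the resend from creating a second copy.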
Message storage (Cassandra):
CREATE TABLE messages (
    chat_id    UUID,       -- partition key: one partition per conversation
    message_id TIMEUUID,   -- clustering key: time-based, unique per message
    sender_id  UUID,
    content    TEXT,
    created_at TIMESTAMP,
    PRIMARY KEY (chat_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
Partitioned by chat_id and ordered by time within each partition (newest first), so "get last N messages" reads the first N rows of a single partition.
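To see why this key layout suits the "get last N messages" query, here is a toy in-memory model of it. It is only an analogy and says nothing about Cassandra's internals: a dict keyed by `chat_id` plays the role of partitions, an integer `ts_ms` stands in for the TIMEUUID, and `insert` and `last_n` are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# chat_id -> rows of (ts_ms, content), kept newest first within each chat.
partitions: Dict[str, List[Tuple[int, str]]] = defaultdict(list)


def insert(chat_id: str, ts_ms: int, content: str) -> None:
    rows = partitions[chat_id]
    rows.append((ts_ms, content))
    # Keep descending time order, mirroring CLUSTERING ORDER BY (message_id DESC).
    rows.sort(key=lambda row: row[0], reverse=True)


def last_n(chat_id: str, n: int) -> List[str]:
    # Touches one partition only and takes a prefix of its already-sorted rows.
    return [content for _, content in partitions[chat_id][:n]]


# Usage: messages arrive out of order and in two different chats.
insert("chat-a", 1000, "first")
insert("chat-a", 3000, "third")
insert("chat-a", 2000, "second")
insert("chat-b", 1500, "other chat")
recent = last_n("chat-a", 2)
```

The read never looks at `chat-b` and never sorts at query time, which is the property the table definition is designed to provide.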
Step 5: Presence & Scaling
Online presence (Redis):
SETEX presence:user123 30 "online" # 30 second TTL
# Client sends heartbeat every 10 seconds to refresh
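The TTL-plus-heartbeat idea can be sketched without a Redis server. The class below is an in-memory stand-in that imitates the expiry behavior of the key above; `PresenceTracker`, `heartbeat`, and `is_online` are hypothetical names, and the injectable clock exists only to make the example deterministic.

```python
import time
from typing import Callable, Dict, List


class PresenceTracker:
    """In-memory imitation of a presence key with a TTL (not a Redis client)."""

    def __init__(
        self,
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def heartbeat(self, user_id: str) -> None:
        # Each heartbeat resets the expiry, like re-issuing SETEX with the same TTL.
        self._expires_at[user_id] = self._clock() + self._ttl

    def is_online(self, user_id: str) -> bool:
        # A user is online only while their latest expiry is still in the future.
        expires = self._expires_at.get(user_id)
        return expires is not None and self._clock() < expires


# Usage with a fake clock: the user refreshes once, then stops sending heartbeats.
now: List[float] = [0.0]
presence = PresenceTracker(ttl_seconds=30.0, clock=lambda: now[0])
presence.heartbeat("user123")      # expires at t=30
now[0] = 10.0
presence.heartbeat("user123")      # refreshed, now expires at t=40
now[0] = 35.0
online_after_refresh = presence.is_online("user123")
now[0] = 41.0
online_after_silence = presence.is_online("user123")
```

With a 30-second TTL and a 10-second heartbeat, a client can lose roughly two heartbeats in a row before it is shown as offline, which trades a little staleness for tolerance of brief network hiccups.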
Scaling WebSocket servers:

- Each server handles ~100K connections
- Use consistent hashing to route users to servers
- Message routing between servers via Kafka
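Consistent hashing for user-to-server routing can be sketched as a hash ring with virtual nodes. This is one common way to build it, offered as an assumption and not as the design this case study prescribes; `HashRing`, the server names, and the choice of 100 virtual nodes per server are all illustrative.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(key: str) -> int:
    # First 8 bytes of SHA-256 as an integer position on the ring.
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, servers: List[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._points: List[int] = []       # sorted ring positions
        self._owner: Dict[int, str] = {}   # ring position -> server name
        for server in servers:
            self.add(server)

    def add(self, server: str) -> None:
        # Several virtual nodes per server smooth out the load distribution.
        for i in range(self._vnodes):
            point = _hash(f"{server}#{i}")
            self._owner[point] = server
            bisect.insort(self._points, point)

    def remove(self, server: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != server]
        self._owner = {p: s for p, s in self._owner.items() if s != server}

    def lookup(self, user_id: str) -> str:
        # First ring position clockwise from the user's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, _hash(user_id)) % len(self._points)
        return self._owner[self._points[idx]]


# Usage: take one server out and see which users have to reconnect elsewhere.
ring = HashRing(["ws-1", "ws-2", "ws-3"])
users = [f"user{i}" for i in range(1000)]
before = {u: ring.lookup(u) for u in users}
ring.remove("ws-3")
after = {u: ring.lookup(u) for u in users}
moved = [u for u in users if before[u] != after[u]]
```

Only users that were mapped to the removed server get a new assignment; everyone else keeps their server, which is the reason to prefer this over `hash(user_id) % server_count` when the fleet size changes.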
Common Follow-up Questions
- How do you handle message ordering?
- How do you implement group chat?
- How do you handle offline users?
- How do you implement end-to-end encryption?
- How do you handle media (images, videos)?