Scalability
🍕 Explain Like I'm 5: The Restaurant Story
Imagine you have a pizza restaurant that's getting more and more customers! How do you handle all those hungry people?
🏪 Small Restaurant (10 customers/day):
- ✓ One chef, one oven, one waiter
- ✓ Takes orders, makes pizzas, serves customers
- ✓ Everything works great!
🏬 Big Restaurant (1000 customers/day):
- ✗ Same chef gets overwhelmed - lines get SUPER long!
- ✗ Customers wait hours for pizza
- ✗ Chef is tired, makes mistakes
💡 How To Fix This?
- • Scale up: Buy a super-sized oven that can make 10 pizzas at once! More powerful equipment for the same chef.
- • Scale out: Hire 5 more chefs, each with their own oven! Multiple workers doing the same job.
- • Load balancing: Someone at the door sends customers to the chef who's least busy!
- • Auto-scaling: Call in extra chefs during lunch rush, send them home when it's quiet!
What is Scalability?
Scalability is the ability of a system to handle growing amounts of work by adding resources. It's about making sure your application can serve 10 users, 1000 users, or 1 million users without breaking!
Types of Scaling
1. Vertical Scaling (Scale Up)
Add more power to your existing machine - more CPU, RAM, storage.
Before (Small Server):        After (Bigger Server):
┌──────────────┐              ┌──────────────────┐
│ 2 CPU Cores  │   ═════>     │ 16 CPU Cores     │
│ 4GB RAM      │              │ 64GB RAM         │
│ 100GB SSD    │              │ 2TB SSD          │
└──────────────┘              └──────────────────┘
Pros:
- ✓ Simple to implement - just upgrade the server
- ✓ No application code changes needed
- ✓ Data consistency is easier (everything on one machine)
Cons:
- ✗ Hardware limits - can't upgrade forever!
- ✗ Single point of failure - if server dies, everything stops
- ✗ Expensive - high-end servers cost a LOT
- ✗ Downtime required for upgrades
2. Horizontal Scaling (Scale Out)
Add more machines to distribute the load across multiple servers.
Before (1 Server):          After (Multiple Servers):

┌──────────┐                ┌──────────┐
│ Server 1 │                │ Server 1 │
│   100%   │   ═════>       ├──────────┤
│   LOAD   │                │ Server 2 │
└──────────┘                ├──────────┤
                            │ Server 3 │
                            ├──────────┤
                            │ Server 4 │
                            └──────────┘
                            Each: 25% load

Pros:
- ✓ No hardware limits - keep adding servers!
- ✓ Fault tolerant - if one server fails, others continue
- ✓ Cost effective - use commodity hardware
- ✓ Can scale infinitely (in theory)
Cons:
- ✗ More complex to implement
- ✗ Need load balancers and distributed systems
- ✗ Data consistency challenges
- ✗ Network overhead between servers
Load Balancing
Distributes incoming traffic across multiple servers to ensure no single server gets overwhelmed.
                   ┌─────────────────┐
Users ────>        │  Load Balancer  │
🧑🧑🧑🧑           │  (Distributor)  │
                   └────────┬────────┘
                            │
         ┌──────────────────┼──────────────────┐
         │                  │                  │
   ┌─────▼─────┐      ┌─────▼─────┐      ┌─────▼─────┐
   │ Server 1  │      │ Server 2  │      │ Server 3  │
   │   (33%)   │      │   (33%)   │      │   (33%)   │
   └───────────┘      └───────────┘      └───────────┘

Load Balancing Strategies:
- • Round Robin: Send requests to servers in rotation (Server1 → Server2 → Server3 → Server1...)
- • Least Connections: Send to server with fewest active connections
- • IP Hash: Use client IP to determine which server handles the request
- • Weighted Round Robin: Some servers get more traffic based on their capacity
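To make the first three strategies concrete, here is a minimal sketch of just the selection logic (the server names are made up; a real load balancer such as NGINX or HAProxy implements these strategies for you):

```python
import itertools

servers = ["server-1", "server-2", "server-3"]  # hypothetical backend pool

# Round Robin: hand out servers in rotation, wrapping around at the end.
rotation = itertools.cycle(servers)

def round_robin() -> str:
    return next(rotation)

# Least Connections: pick the server with the fewest active connections.
active = {s: 0 for s in servers}

def least_connections() -> str:
    server = min(active, key=active.get)
    active[server] += 1  # this request now occupies the chosen server
    return server

# IP Hash: the same client IP always lands on the same server.
# (Note: Python salts str hashes per process; use hashlib for a stable mapping.)
def ip_hash(client_ip: str) -> str:
    return servers[hash(client_ip) % len(servers)]

# server-1, server-2, server-3, server-1 - in rotation.
print([round_robin() for _ in range(4)])
```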
Auto-scaling
Automatically add or remove servers based on current demand. Save money during low traffic, handle spikes during high traffic.
Morning (Low Traffic):        Evening (High Traffic):
2 Servers Running             6 Servers Running

┌────┐ ┌────┐                 ┌────┐ ┌────┐ ┌────┐
│ S1 │ │ S2 │                 │ S1 │ │ S2 │ │ S3 │
└────┘ └────┘                 └────┘ └────┘ └────┘
 30%    30%                   ┌────┐ ┌────┐ ┌────┐
                              │ S4 │ │ S5 │ │ S6 │
Cost: $10/hr                  └────┘ └────┘ └────┘
                               60%    60%    60%

                              Cost: $30/hr (only during peak!)

How it Works:
- 1️⃣ Monitor metrics: CPU usage, memory, request count, response time
- 2️⃣ Set thresholds: If CPU > 80% for 5 minutes, add a server
- 3️⃣ Scale up: Launch new instances automatically
- 4️⃣ Scale down: Terminate instances when traffic decreases
Result: You only pay for what you need, exactly when you need it!
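Here is a minimal sketch of that monitor → threshold → scale loop. Everything in it is a stand-in: a real system reads CPU from its monitoring stack and calls the cloud provider's API to launch or terminate instances.

```python
import random
import time

servers = ["server-1", "server-2"]  # pretend fleet

def get_average_cpu() -> float:
    return random.uniform(0, 100)  # stand-in for a real metrics query

def launch_server() -> None:
    servers.append(f"server-{len(servers) + 1}")

def terminate_server() -> None:
    servers.pop()

SCALE_UP_AT, SCALE_DOWN_AT = 80, 30  # CPU % thresholds
SUSTAINED = 5                        # must hold for 5 consecutive checks
MIN_SERVERS, MAX_SERVERS = 2, 10

def autoscale_loop() -> None:
    high = low = 0
    while True:
        cpu = get_average_cpu()                      # step 1: monitor
        high = high + 1 if cpu > SCALE_UP_AT else 0  # step 2: track thresholds
        low = low + 1 if cpu < SCALE_DOWN_AT else 0
        if high >= SUSTAINED and len(servers) < MAX_SERVERS:
            launch_server()      # step 3: scale up after 5 minutes of high load
            high = 0
        elif low >= SUSTAINED and len(servers) > MIN_SERVERS:
            terminate_server()   # step 4: scale down when traffic drops
            low = 0
        time.sleep(60)           # check metrics once per minute
```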
Stateless vs Stateful Applications
Stateless Application:
Each request is independent. Server doesn't remember previous requests.
Request 1 → Server A
Request 2 → Server B
Request 3 → Server A
✓ All work fine!
Example: Google Search - each search is independent, doesn't matter which server handles it
Easy to scale horizontally - any server can handle any request!
Stateful Application:
Server remembers information about the client (session, shopping cart, etc.)
Add item  → Server A
View cart → Server B
✗ Cart is empty! (Server B doesn't know about Server A's data)
Example: Shopping cart - server needs to remember what you added
Challenge: Either the user's session must stick to the same server, or the session must be stored externally (Redis, database)
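The second option in practice: keep every web server stateless and put the cart in a shared store. A minimal sketch using the redis-py client, assuming a Redis server on localhost (the key names are invented for illustration):

```python
import redis

# The cart lives in Redis, not in any one server's memory,
# so Server A and Server B both see the same data.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def add_item(user_id: str, item: str) -> None:
    r.rpush(f"cart:{user_id}", item)   # handled by Server A...
    r.expire(f"cart:{user_id}", 3600)  # carts expire after an hour idle

def view_cart(user_id: str) -> list:
    return r.lrange(f"cart:{user_id}", 0, -1)  # ...readable from Server B

add_item("user42", "margherita")
print(view_cart("user42"))  # ['margherita'] - no matter which server asks
```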
🌍 Real-World Examples
Netflix: Streaming to 230+ Million Users
Challenge:
- • Stream high-quality video to millions simultaneously
- • Handle evening peak hours (everyone watching after work)
- • Global audience across different time zones
Solution:
- ✓ Horizontal Scaling: Thousands of AWS servers worldwide
- ✓ CDN (Content Delivery Network): Videos cached near users
- ✓ Auto-scaling: Spin up servers during peak hours, down at night
- ✓ Microservices: Separate services for recommendations, billing, streaming
Result: Smooth streaming even when millions watch simultaneously!
Instagram: Handling Billions of Photos
Challenge:
- • Store and serve billions of photos quickly
- • Process photo uploads (resize, filter, thumbnails)
- • Fast feed loading for 2+ billion users
Solution:
- ✓ Horizontal Scaling: Distributed database (Cassandra) across many servers
- ✓ Caching: Frequently accessed photos cached in memory (Redis)
- ✓ Async Processing: Upload photo immediately, process filters in background queue
- ✓ CDN: Images served from servers closest to users
Result: Photos load in milliseconds, even with billions stored!
Uber: Real-time Matching at Scale
Challenge:
- • Match riders with nearby drivers in seconds
- • Track thousands of drivers' locations in real-time
- • Handle surge pricing during peak times
Solution:
- ✓ Geo-sharding: Divide map into regions, each region has dedicated servers
- ✓ Load Balancing: Route requests based on geographic location
- ✓ Auto-scaling: Scale up servers in busy cities during rush hour
- ✓ WebSocket Connections: Real-time updates for driver locations
Result: Riders matched with drivers in under 30 seconds!
Amazon: Black Friday Traffic Spikes
Challenge:
- • Normal day: 200 million visits. Black Friday: 2 BILLION visits!
- • 10x traffic spike in a few hours
- • Can't afford downtime - every minute = millions lost
Solution:
- ✓ Massive Auto-scaling: Pre-provision servers weeks before, auto-scale during event
- ✓ Queue Systems: Orders processed asynchronously, users get immediate confirmation
- ✓ Database Sharding: Product catalog split across hundreds of databases
- ✓ Caching: Product pages cached aggressively, reduce database load by 90%
Result: Handles 10x traffic without breaking a sweat!
Common Scalability Challenges
Database Bottlenecks:
Database can't keep up with read/write requests. Solution: Replication, sharding, caching (see the sharding sketch after this list)
Session Management:
Sticky sessions make scaling harder. Solution: Store sessions in Redis/database
Data Consistency:
Multiple servers = risk of showing outdated data. Solution: Eventual consistency, cache invalidation
Network Latency:
Servers far from users = slow responses. Solution: CDN, geo-distributed servers
File Storage:
Files saved to one server's local disk aren't visible to the others. Solution: Object storage (S3, Cloud Storage)
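For the sharding solution mentioned under Database Bottlenecks, the core trick is deterministically routing each key to one of several databases. A minimal sketch (the connection strings are hypothetical):

```python
NUM_SHARDS = 4

# Hypothetical connection strings - in reality, four separate database servers.
shards = [f"postgres://db-shard-{i}.internal/app" for i in range(NUM_SHARDS)]

def shard_for(user_id: int) -> str:
    # The same user always maps to the same shard, so reads find their writes.
    return shards[user_id % NUM_SHARDS]

print(shard_for(42))  # postgres://db-shard-2.internal/app
print(shard_for(43))  # postgres://db-shard-3.internal/app
```

Simple modulo sharding makes adding shards painful (most keys move to a new shard), which is why production systems often prefer consistent hashing.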
Best Practices
1️⃣ Design for Horizontal Scaling:
Build stateless applications from day one - they're much easier to scale later
2️⃣ Use Caching Aggressively:
Cache database queries, API responses, rendered pages - reduces load by 80-90% (see the cache-aside sketch after this list)
3️⃣ Async Processing:
Move slow tasks (email, image processing) to background queues (see the queue sketch after this list)
4️⃣ Database Optimization:
Add indexes, use read replicas, consider NoSQL for specific use cases
5️⃣ Monitor Everything:
Track CPU, memory, response times, error rates - catch issues before users notice
6️⃣ Plan for Failure:
Circuit breakers, retries, fallbacks - assume things will break (see the retry sketch after this list)
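A sketch of practice 2️⃣, the classic cache-aside pattern. The Redis calls are real redis-py (assuming a Redis server on localhost); query_database is a made-up stand-in for your data layer:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def query_database(user_id: int) -> dict:
    return {"id": user_id, "name": "Ada"}  # stand-in for a slow SQL query

def get_user(user_id: int) -> dict:
    key = f"user:{user_id}"
    cached = r.get(key)            # 1. try the cache first
    if cached is not None:
        return json.loads(cached)  # cache hit: the database never gets touched
    user = query_database(user_id)       # 2. cache miss: ask the database
    r.setex(key, 300, json.dumps(user))  # 3. cache the result for 5 minutes
    return user
```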
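Practice 3️⃣ in miniature: the request handler only enqueues the slow work and answers immediately, while a background worker drains the queue. This sketch uses Python's standard library to stay self-contained; in production you'd use a broker like RabbitMQ or a task framework like Celery:

```python
import queue
import threading
import time

tasks: queue.Queue = queue.Queue()

def handle_upload(photo: str) -> str:
    tasks.put(photo)          # enqueue the slow part (resize, filters, thumbnails)
    return "upload accepted"  # the user gets a response immediately

def worker() -> None:
    while True:
        photo = tasks.get()   # blocks until work arrives
        time.sleep(1)         # stand-in for slow image processing
        print(f"processed {photo}")
        tasks.task_done()

threading.Thread(target=worker, daemon=True).start()
print(handle_upload("beach.jpg"))  # returns instantly
tasks.join()                       # demo only: wait for the background work
```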
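And a sketch of practice 6️⃣: retry a flaky call with exponential backoff, then degrade gracefully instead of crashing. fetch_recommendations and the fallback list are invented for illustration:

```python
import time

def fetch_recommendations(user_id: int) -> list:
    raise ConnectionError("service unavailable")  # hypothetical flaky dependency

def recommendations_with_retries(user_id: int, attempts: int = 3) -> list:
    for attempt in range(attempts):
        try:
            return fetch_recommendations(user_id)
        except ConnectionError:
            time.sleep(0.5 * 2 ** attempt)  # back off: 0.5s, 1s, 2s...
    return ["popular-item-1", "popular-item-2"]  # fallback: show something sensible

print(recommendations_with_retries(42))
```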
Trade-offs to Consider
⚖️ Cost vs Performance:
More servers = better performance but higher costs. Find the sweet spot!
⚖️ Consistency vs Availability:
CAP theorem - when the network partitions, a system can't offer both perfect consistency AND availability. Choose based on needs.
⚖️ Complexity vs Simplicity:
Distributed systems are complex. Start simple, scale when needed.
⚖️ Speed vs Reliability:
Caching makes things fast but can serve stale data. Balance freshness vs speed.
Key Metrics to Monitor
Response Time:
How long does a request take? Target: < 200ms for web pages
Throughput:
Requests per second your system can handle
Error Rate:
Percentage of failed requests. Target: < 0.1%
CPU/Memory Usage:
Server resource utilization. Alert if consistently > 80%
Database Connections:
Connection pool usage. Running out = bottleneck!
Cache Hit Rate:
Percentage of requests served from cache. Target: > 80%
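These targets are easy to check mechanically. A tiny worked example with invented counter values:

```python
requests_total = 125_000
requests_failed = 90
cache_hits = 104_000
cache_lookups = 120_000

error_rate = requests_failed / requests_total * 100  # 0.072% -> under 0.1%, OK
cache_hit_rate = cache_hits / cache_lookups * 100    # 86.7%  -> above 80%, OK

print(f"error rate: {error_rate:.3f}% (target < 0.1%)")
print(f"cache hit rate: {cache_hit_rate:.1f}% (target > 80%)")
```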