Home/System Design/Databases in System Design

Databases in System Design

📚 Explain Like I'm 5...

Imagine a HUGE library with millions of books! 📖 How do you organize them so everyone can find what they need quickly?

🏛️ The Library System:

📖 Books are organized by category (Fiction, Science, History)
🔢 Each book has a unique ID number (like a barcode)
🗂️ There's an INDEX at the front - like a map to find books fast!
📚 Sometimes we make COPIES of popular books so more people can read them at once
🏢 We can split books into different buildings if one library gets too full!

🚀 Why Are Databases Important?

• Store MILLIONS of pieces of information safely
• Find information in MILLISECONDS - faster than blinking!
• Keep data safe even if computers crash
• Let thousands of people use data at the same time

🔄 SQL vs NoSQL: What's the Difference?

SQL Databases (Relational)

Like organized tables in a spreadsheet - everything has a place!

✓ Structured data in tables with rows and columns
✓ Strong rules (ACID) - no messy data!
✓ Great for complex relationships (users, orders, products)
✓ Examples: PostgreSQL, MySQL, Oracle, SQL Server

NoSQL Databases (Non-Relational)

Like flexible filing cabinets - adapt to any shape of data!

✓ Flexible structure - store JSON, documents, key-values
✓ Super fast for simple queries
✓ Scales easily across many servers
✓ Examples: MongoDB, Redis, Cassandra, DynamoDB

📊 Quick Comparison

	SQL	NoSQL
Schema	Fixed (must define structure first)	Flexible (can change anytime)
Scaling	Vertical (bigger servers)	Horizontal (more servers)
Transactions	Strong ACID guarantees	Eventually consistent
Queries	Complex joins, aggregations	Simple, fast lookups
Best For	Financial systems, e-commerce	Social media, real-time apps

⚛️ ACID Properties: The Gold Standard

ACID makes sure your data stays safe and consistent, like a bank vault!

A - Atomicity (All or Nothing)

If you transfer $100, BOTH the debit AND credit must happen. If one fails, both fail!

Example: Bank transfer - money can't disappear!

C - Consistency (Rules Always Apply)

Your database must follow all rules. No negative bank balances allowed!

Example: Account balance must always be valid

I - Isolation (Don't Interfere)

Two people withdrawing money at the same time don't mess up the balance

Example: Concurrent transactions don't see each other's incomplete work

D - Durability (Never Lose Data)

Once saved, data stays saved even if power goes out!

Example: Committed transactions survive crashes

🔺 CAP Theorem: Pick Two!

In distributed systems, you can only guarantee TWO of these THREE things:

Consistency

Everyone sees the same data at the same time

Availability

System always responds (even if data is slightly old)

Partition Tolerance

System works even if network has problems

The Trade-off:

• CP (Consistency + Partition): Bank systems - better to be unavailable than wrong!
• AP (Availability + Partition): Social media - show slightly old data, but stay online!
• CA (Consistency + Availability): Single server only - no network issues!

🗄️ Types of Databases

Relational (SQL)

Tables with relationships. Best for structured data with complex queries.

Use when: Banking, e-commerce, CRM systems

PostgreSQL, MySQL, Oracle

Document Store

Store JSON-like documents. Flexible schema, easy to scale.

Use when: Content management, user profiles, catalogs

MongoDB, CouchDB, Firestore

Key-Value Store

Super simple: key → value. Extremely fast for lookups.

Use when: Caching, session storage, shopping carts

Redis, Memcached, DynamoDB

Column-Family Store

Wide columns, great for time-series and analytics.

Use when: Analytics, time-series, IoT data

Cassandra, HBase, BigTable

Graph Database

Store relationships as first-class citizens. Perfect for connected data.

Use when: Social networks, fraud detection, recommendations

Neo4j, Amazon Neptune

📈 Scaling Strategies

Database Sharding

Split your database into smaller pieces (shards) across multiple servers

How it works:

1. Horizontal partitioning - divide data by rows
2. Each shard holds a subset of data (e.g., users A-M on server 1, N-Z on server 2)
3. Distribute load and storage across servers

✓ Pros: Better performance, handles more data

✗ Cons: Complex queries, rebalancing is hard

Database Replication

Make copies of your database for reliability and speed

• Master-Slave: Master-Slave: One writes, many read (Instagram, Twitter feeds)
• Master-Master: Master-Master: All can write (collaborative editing)
• Multi-region: Multi-region: Copies in different countries (Netflix, Spotify)

Benefits: High availability, faster reads, disaster recovery

Indexing: The Secret to Speed

Like a book index - jump directly to what you need instead of reading every page!

Think of it like this:

❌ Without index: Check EVERY row to find 'John' (slow!)
✅ With index: Jump directly to 'John' (lightning fast!)

Common index types:

• B-Tree: Most common, great for ranges
• Hash: Super fast for exact matches
• Full-text: Search within text (like Google)

⚖️ Trade-off: Faster reads, slower writes (need to update index)

🎯 Normalization vs Denormalization

Normalization (Remove Duplicates)

Store data once, reference it everywhere

Pros: No duplicate data, easier updates, saves space

Cons: Requires JOINs (slower reads)

Example: User table + Orders table (link by user_id)

Denormalization (Accept Duplicates)

Store copies of data to avoid JOINs

Pros: Faster reads, no complex JOINs

Cons: Takes more space, updates are harder

Example: Store username in every order (no JOIN needed)

🌍 Real-World Examples

Facebook: MySQL for Social Graph

• Uses MySQL (SQL) for structured data: users, friends, posts
• Massively sharded - thousands of database servers
• Master-slave replication for read scaling
• Custom tools: Vitess for managing shards

💡 Why SQL? Complex relationships between users, posts, and comments require ACID

Twitter: Redis for Timeline Caching

• Uses Redis (Key-Value) for ultra-fast timeline reads
• Each user's timeline cached as a list in Redis
• MySQL stores permanent data, Redis for hot data
• Fan-out on write: when someone tweets, push to followers' caches

💡 Why Redis? Need to serve millions of timeline requests per second

Uber: PostgreSQL + MySQL

• PostgreSQL with PostGIS for location data (trip routes, driver locations)
• MySQL for transactional data (payments, user accounts)
• Heavy use of read replicas for scaling
• Sharded by city/region for better performance

💡 Why PostgreSQL? Excellent geospatial support for location-based features

Netflix: Cassandra for Viewing History

• Cassandra (Column-family) for massive write scalability
• Stores billions of viewing events (play, pause, resume)
• Multi-region replication for global availability
• Eventual consistency is acceptable (don't need instant accuracy)

💡 Why Cassandra? Handles billions of writes per day with high availability

🤔 When to Use What?

Use SQL When:

✓ Data has clear structure and relationships
✓ Need ACID transactions (banking, e-commerce)
✓ Complex queries with JOINs and aggregations
✓ Data integrity is critical

Use NoSQL When:

✓ Flexible/evolving schema needed
✓ Massive scale (billions of records)
✓ High throughput reads/writes
✓ Eventually consistent is okay

Use Redis When:

✓ Need caching layer (sessions, API responses)
✓ Real-time features (leaderboards, counters)
✓ Pub/sub messaging
✓ Sub-millisecond latency required

Use Graph DB When:

✓ Data is highly connected (social networks)
✓ Need to traverse relationships
✓ Recommendation engines
✓ Fraud detection patterns

✅ Best Practices

1.Index Wisely: Add indexes on columns you query often, but don't overdo it

2.Use Connection Pooling: Reuse database connections instead of creating new ones

3.Cache Aggressively: Use Redis/Memcached for frequently accessed data

4.Monitor Performance: Track slow queries and optimize them

5.Plan for Failure: Use replication and backups

6.Choose Right Data Type: Use the smallest type that fits your data

7.Batch Operations: Group multiple writes into single transaction

8.Avoid N+1 Queries: Load related data in advance (eager loading)

9.Use Read Replicas: Distribute read load across multiple servers

10.Regular Maintenance: Vacuum, analyze, and optimize tables

🔑 Key Takeaways

🔑No 'best' database - each has trade-offs
🔑Most large systems use MULTIPLE database types
🔑CAP theorem forces choices in distributed systems
🔑Sharding and replication are key to scaling
🔑Indexes make reads fast but writes slower
🔑Denormalization trades space for speed

💡 Interview Tips

💡Always ask about data size and traffic patterns first
💡Consider both read and write patterns separately
💡Discuss trade-offs: consistency vs availability vs performance
💡Mention specific technologies (PostgreSQL, Redis, Cassandra)
💡Think about evolution: start simple, scale later
💡Don't forget about backups and disaster recovery