Databases in System Design
📚 Explain Like I'm 5...
Imagine a HUGE library with millions of books! 📖 How do you organize them so everyone can find what they need quickly?
🏛️ The Library System:
- 📖 Books are organized by category (Fiction, Science, History)
- 🔢 Each book has a unique ID number (like a barcode)
- 🗂️ There's an INDEX at the front - like a map to find books fast!
- 📚 Sometimes we make COPIES of popular books so more people can read them at once
- 🏢 We can split books into different buildings if one library gets too full!
🚀 Why Are Databases Important?
- • Store MILLIONS of pieces of information safely
- • Find information in MILLISECONDS - faster than blinking!
- • Keep data safe even if computers crash
- • Let thousands of people use data at the same time
🔄 SQL vs NoSQL: What's the Difference?
SQL Databases (Relational)
Like organized tables in a spreadsheet - everything has a place!
- ✓ Structured data in tables with rows and columns
- ✓ Strong rules (ACID) - no messy data!
- ✓ Great for complex relationships (users, orders, products)
- ✓ Examples: PostgreSQL, MySQL, Oracle, SQL Server
NoSQL Databases (Non-Relational)
Like flexible filing cabinets - adapt to any shape of data!
- ✓ Flexible structure - store JSON, documents, key-values
- ✓ Super fast for simple queries
- ✓ Scales easily across many servers
- ✓ Examples: MongoDB, Redis, Cassandra, DynamoDB
📊 Quick Comparison
| SQL | NoSQL | |
|---|---|---|
| Schema | Fixed (must define structure first) | Flexible (can change anytime) |
| Scaling | Vertical (bigger servers) | Horizontal (more servers) |
| Transactions | Strong ACID guarantees | Eventually consistent |
| Queries | Complex joins, aggregations | Simple, fast lookups |
| Best For | Financial systems, e-commerce | Social media, real-time apps |
⚛️ ACID Properties: The Gold Standard
ACID makes sure your data stays safe and consistent, like a bank vault!
A - Atomicity (All or Nothing)
If you transfer $100, BOTH the debit AND credit must happen. If one fails, both fail!
Example: Bank transfer - money can't disappear!
C - Consistency (Rules Always Apply)
Your database must follow all rules. No negative bank balances allowed!
Example: Account balance must always be valid
I - Isolation (Don't Interfere)
Two people withdrawing money at the same time don't mess up the balance
Example: Concurrent transactions don't see each other's incomplete work
D - Durability (Never Lose Data)
Once saved, data stays saved even if power goes out!
Example: Committed transactions survive crashes
🔺 CAP Theorem: Pick Two!
In distributed systems, you can only guarantee TWO of these THREE things:
Consistency
Everyone sees the same data at the same time
Availability
System always responds (even if data is slightly old)
Partition Tolerance
System works even if network has problems
The Trade-off:
- • CP (Consistency + Partition): Bank systems - better to be unavailable than wrong!
- • AP (Availability + Partition): Social media - show slightly old data, but stay online!
- • CA (Consistency + Availability): Single server only - no network issues!
🗄️ Types of Databases
Relational (SQL)
Tables with relationships. Best for structured data with complex queries.
Use when: Banking, e-commerce, CRM systems
PostgreSQL, MySQL, Oracle
Document Store
Store JSON-like documents. Flexible schema, easy to scale.
Use when: Content management, user profiles, catalogs
MongoDB, CouchDB, Firestore
Key-Value Store
Super simple: key → value. Extremely fast for lookups.
Use when: Caching, session storage, shopping carts
Redis, Memcached, DynamoDB
Column-Family Store
Wide columns, great for time-series and analytics.
Use when: Analytics, time-series, IoT data
Cassandra, HBase, BigTable
Graph Database
Store relationships as first-class citizens. Perfect for connected data.
Use when: Social networks, fraud detection, recommendations
Neo4j, Amazon Neptune
📈 Scaling Strategies
Database Sharding
Split your database into smaller pieces (shards) across multiple servers
How it works:
- 1. Horizontal partitioning - divide data by rows
- 2. Each shard holds a subset of data (e.g., users A-M on server 1, N-Z on server 2)
- 3. Distribute load and storage across servers
Database Replication
Make copies of your database for reliability and speed
- • Master-Slave: Master-Slave: One writes, many read (Instagram, Twitter feeds)
- • Master-Master: Master-Master: All can write (collaborative editing)
- • Multi-region: Multi-region: Copies in different countries (Netflix, Spotify)
Benefits: High availability, faster reads, disaster recovery
Indexing: The Secret to Speed
Like a book index - jump directly to what you need instead of reading every page!
Think of it like this:
- ❌ Without index: Check EVERY row to find 'John' (slow!)
- ✅ With index: Jump directly to 'John' (lightning fast!)
Common index types:
- • B-Tree: Most common, great for ranges
- • Hash: Super fast for exact matches
- • Full-text: Search within text (like Google)
⚖️ Trade-off: Faster reads, slower writes (need to update index)
🎯 Normalization vs Denormalization
Normalization (Remove Duplicates)
Store data once, reference it everywhere
Pros: No duplicate data, easier updates, saves space
Cons: Requires JOINs (slower reads)
Example: User table + Orders table (link by user_id)
Denormalization (Accept Duplicates)
Store copies of data to avoid JOINs
Pros: Faster reads, no complex JOINs
Cons: Takes more space, updates are harder
Example: Store username in every order (no JOIN needed)
🌍 Real-World Examples
Facebook: MySQL for Social Graph
- • Uses MySQL (SQL) for structured data: users, friends, posts
- • Massively sharded - thousands of database servers
- • Master-slave replication for read scaling
- • Custom tools: Vitess for managing shards
💡 Why SQL? Complex relationships between users, posts, and comments require ACID
Twitter: Redis for Timeline Caching
- • Uses Redis (Key-Value) for ultra-fast timeline reads
- • Each user's timeline cached as a list in Redis
- • MySQL stores permanent data, Redis for hot data
- • Fan-out on write: when someone tweets, push to followers' caches
💡 Why Redis? Need to serve millions of timeline requests per second
Uber: PostgreSQL + MySQL
- • PostgreSQL with PostGIS for location data (trip routes, driver locations)
- • MySQL for transactional data (payments, user accounts)
- • Heavy use of read replicas for scaling
- • Sharded by city/region for better performance
💡 Why PostgreSQL? Excellent geospatial support for location-based features
Netflix: Cassandra for Viewing History
- • Cassandra (Column-family) for massive write scalability
- • Stores billions of viewing events (play, pause, resume)
- • Multi-region replication for global availability
- • Eventual consistency is acceptable (don't need instant accuracy)
💡 Why Cassandra? Handles billions of writes per day with high availability
🤔 When to Use What?
Use SQL When:
- ✓ Data has clear structure and relationships
- ✓ Need ACID transactions (banking, e-commerce)
- ✓ Complex queries with JOINs and aggregations
- ✓ Data integrity is critical
Use NoSQL When:
- ✓ Flexible/evolving schema needed
- ✓ Massive scale (billions of records)
- ✓ High throughput reads/writes
- ✓ Eventually consistent is okay
Use Redis When:
- ✓ Need caching layer (sessions, API responses)
- ✓ Real-time features (leaderboards, counters)
- ✓ Pub/sub messaging
- ✓ Sub-millisecond latency required
Use Graph DB When:
- ✓ Data is highly connected (social networks)
- ✓ Need to traverse relationships
- ✓ Recommendation engines
- ✓ Fraud detection patterns
✅ Best Practices
🔑 Key Takeaways
- 🔑No 'best' database - each has trade-offs
- 🔑Most large systems use MULTIPLE database types
- 🔑CAP theorem forces choices in distributed systems
- 🔑Sharding and replication are key to scaling
- 🔑Indexes make reads fast but writes slower
- 🔑Denormalization trades space for speed
💡 Interview Tips
- 💡Always ask about data size and traffic patterns first
- 💡Consider both read and write patterns separately
- 💡Discuss trade-offs: consistency vs availability vs performance
- 💡Mention specific technologies (PostgreSQL, Redis, Cassandra)
- 💡Think about evolution: start simple, scale later
- 💡Don't forget about backups and disaster recovery