Master Notebook:
Database Systems

The definitive interview preparation guide covering every database concept tested at FAANG & Big Tech companies — from storage engines to distributed transactions at scale.

13 Modules • 26 Diagrams • 50+ Interview Q&A • Complete Cheat Sheet

Module 1

Database Fundamentals

Core properties, theorems, and classification — the foundation every database discussion builds upon.

ACID Properties

ACID is the gold standard for transactional databases. Mainstream relational databases guarantee these four properties for every transaction.

Atomicity

A transaction is an indivisible unit — it either completes entirely or has no effect at all. If any part of a transaction fails, the entire transaction is rolled back. The database uses the Write-Ahead Log (WAL) to achieve this: before modifying any data page, the change is first recorded in the WAL. On crash recovery, the WAL is replayed to restore committed changes, while incomplete transactions are rolled back so none of their effects survive.

Consistency

A transaction moves the database from one valid state to another. All constraints (PRIMARY KEY, FOREIGN KEY, UNIQUE, CHECK, NOT NULL) must be satisfied before and after the transaction. This is enforced by the database engine at commit time.

Isolation

Concurrent transactions execute as if they were serial. In practice, databases offer multiple isolation levels (from Read Uncommitted to Serializable) trading off consistency for performance. Isolation is implemented through locks (pessimistic) or MVCC (optimistic).

Durability

Once a transaction commits, its changes persist even through power failures or crashes. Achieved by flushing the WAL to disk before acknowledging the commit. The data pages can be written lazily via background checkpointing.

-- Example: ACID transaction in PostgreSQL
BEGIN;
  UPDATE accounts SET balance = balance - 500 WHERE id = 1;
  UPDATE accounts SET balance = balance + 500 WHERE id = 2;
  -- If either UPDATE fails, both are rolled back (Atomicity)
  -- Constraints checked at commit (Consistency)
  -- Other transactions see old OR new balances, never partial (Isolation)
  -- Once committed, changes survive crashes (Durability)
COMMIT;

BASE Properties

BASE is the alternative consistency model favored by many distributed NoSQL databases. It trades strong consistency for availability and partition tolerance.

  • Basically Available — The system guarantees availability as defined by the CAP theorem. Every request receives a response (not an error), though the response may contain stale data.
  • Soft state — The state of the system may change over time, even without new input, due to eventual consistency mechanisms like anti-entropy and read repair propagating updates.
  • Eventually consistent — Given enough time without new updates, all replicas will converge to the same value. The convergence window depends on the system — it could be milliseconds or seconds.

ACID vs BASE Comparison

Property          | ACID                             | BASE
Consistency model | Strong consistency               | Eventual consistency
Availability      | May sacrifice during partitions  | Prioritizes availability
Focus             | Correctness                      | Performance & availability
Scaling approach  | Vertical (scale up)              | Horizontal (scale out)
Transactions      | Full ACID transactions           | Relaxed, eventual
Use cases         | Banking, inventory, booking      | Social feeds, analytics, IoT
Examples          | PostgreSQL, MySQL, Oracle        | Cassandra, DynamoDB, Riak

CAP Theorem

Proposed by Eric Brewer in 2000 and proven by Gilbert and Lynch in 2002, the CAP theorem states that a distributed data store can provide at most two of three guarantees simultaneously:

  • Consistency (C) — Every read receives the most recent write or an error. All nodes see the same data at the same time.
  • Availability (A) — Every request receives a non-error response, without the guarantee that it contains the most recent write.
  • Partition Tolerance (P) — The system continues to operate despite network partitions (messages being dropped or delayed between nodes).

In practice, network partitions are inevitable in distributed systems, so the real choice is between CP (consistency + partition tolerance) and AP (availability + partition tolerance). A "CA" system only exists in a non-distributed context (single node).

Figure 1.1 — CAP Theorem triangle with database classification: CP (MongoDB, HBase), AP (Cassandra, DynamoDB), CA* (PostgreSQL). *CA is only meaningful for single-node, non-distributed systems.
Gotcha: CAP describes behavior DURING a partition, not during normal operation. During normal operation (no partitions), a system can provide both consistency and availability. CAP only forces a choice when a network partition actually occurs. MongoDB, for instance, is fully consistent and available when there is no partition — but during a partition, it chooses consistency (CP) by making the minority partition unavailable.

PACELC Extension

Daniel Abadi proposed PACELC (2012) to address a limitation of CAP: it says nothing about system behavior when there is no partition. PACELC adds the tradeoff during normal operation:

If Partition (P), choose Availability (A) or Consistency (C); Else (E), choose Latency (L) or Consistency (C).

Database   | During Partition (PAC) | Else (ELC)       | Classification
PostgreSQL | PC (consistency)       | EC (consistency) | PC/EC
Cassandra  | PA (availability)      | EL (latency)     | PA/EL
MongoDB    | PC (consistency)       | EC (consistency) | PC/EC
DynamoDB   | PA (availability)      | EL (latency)     | PA/EL
Cosmos DB  | PA (availability)      | EL (latency)     | PA/EL (tunable)
Spanner    | PC (consistency)       | EC (consistency) | PC/EC

Database Types

Relational (SQL)

Stores data in tables with rows and columns, enforces schemas, supports JOIN operations, and provides ACID transactions. Best for structured data with complex relationships. Examples: PostgreSQL, MySQL, Oracle, SQL Server.

Document

Stores semi-structured data as JSON/BSON documents. Each document can have a different structure, enabling schema flexibility. Best for content management, user profiles, catalogs. Examples: MongoDB, CouchDB, Firestore.

Key-Value

Simplest NoSQL model — stores data as key-value pairs. Extremely fast for point lookups. Best for caching, session management, feature flags. Examples: Redis, Memcached, DynamoDB, etcd.

Wide-Column (Column-Family)

Stores data in column families with flexible columns per row. Optimized for write-heavy workloads and time-series data. Best for IoT, logging, analytics. Examples: Cassandra, HBase, ScyllaDB.

Graph

Stores data as nodes and relationships (edges) with properties on both. Uses index-free adjacency for O(1) traversal. Best for social networks, fraud detection, recommendations. Examples: Neo4j, Amazon Neptune, TigerGraph.

Time-Series

Optimized for time-stamped data with high ingestion rates, automatic compression, and downsampling. Best for monitoring, metrics, IoT sensor data. Examples: TimescaleDB, InfluxDB, Prometheus.

Vector

Stores high-dimensional vectors for similarity search using algorithms like HNSW or IVFFlat. Best for AI/ML embeddings, semantic search, recommendation engines. Examples: Pinecone, Milvus, pgvector, Weaviate.

Figure 1.2 — Data model comparison: same data represented across different database types
DBMS Execution Stack — parser → planner → executor → storage
Module 2

Relational Model & SQL

Normalization, joins, subqueries, window functions, and query optimization — the bread and butter of database interviews.

Normalization

Normalization is the process of organizing a relational database to reduce data redundancy and improve data integrity. Each normal form builds on the previous one.

First Normal Form (1NF)

Each column contains only atomic (indivisible) values. No repeating groups or arrays.

-- Violation of 1NF: multi-valued column
-- | id | name  | phones              |
-- | 1  | Alice | 555-1234, 555-5678  |

-- Fixed (1NF compliant):
CREATE TABLE contacts (
  id    INT,
  name  VARCHAR(100),
  phone VARCHAR(20)
);
-- | id | name  | phone    |
-- | 1  | Alice | 555-1234 |
-- | 1  | Alice | 555-5678 |

Second Normal Form (2NF)

Must be in 1NF, and every non-key attribute must depend on the entire primary key (no partial dependencies). Only relevant for composite primary keys.

-- Violation of 2NF: student_name depends only on student_id, not the full PK
-- PK: (student_id, course_id)
-- | student_id | course_id | student_name | grade |
-- student_name depends only on student_id (partial dependency)

-- Fixed: separate the partial dependency
CREATE TABLE students (student_id INT PRIMARY KEY, student_name VARCHAR(100));
CREATE TABLE enrollments (student_id INT, course_id INT, grade CHAR(2),
  PRIMARY KEY (student_id, course_id));

Third Normal Form (3NF)

Must be in 2NF, and no non-key attribute depends on another non-key attribute (no transitive dependencies).

-- Violation of 3NF: city depends on zip_code, not directly on id
-- | id | name  | zip_code | city       |
-- city depends on zip_code (transitive dependency)

-- Fixed: separate the transitive dependency
CREATE TABLE users (id INT PRIMARY KEY, name VARCHAR(100), zip_code VARCHAR(10));
CREATE TABLE zip_codes (zip_code VARCHAR(10) PRIMARY KEY, city VARCHAR(100));

Boyce-Codd Normal Form (BCNF / 3.5NF)

A stricter version of 3NF. Every determinant must be a candidate key. Handles edge cases where 3NF still permits anomalies — typically tables with multiple overlapping candidate keys, where an attribute depends on a determinant that is not itself a candidate key.

Fourth Normal Form (4NF)

Must be in BCNF with no multi-valued dependencies. If an entity has two or more independent multi-valued facts, they should be in separate tables.

Fifth Normal Form (5NF)

Must be in 4NF with no join dependencies. A table is in 5NF if it cannot be decomposed into smaller tables without loss of data. Rarely needed in practice.

When to Denormalize

  • Read-heavy workloads — Joins are expensive at scale; denormalized data avoids them.
  • Reporting / analytics — Star/snowflake schemas deliberately denormalize for query performance.
  • Caching layers — Pre-computed, denormalized views reduce latency.
  • Microservices — Each service owns its data; cross-service joins are impossible, so denormalization is common.
💡 Interview tip: The most common normalization interview question tests 1NF through 3NF. Know concrete examples for each. For system design, be ready to argue when to denormalize — the answer is usually "when read performance matters more than write consistency."

Joins

SQL joins combine rows from two or more tables based on a related column. Understanding each join type and its performance implications is critical.

INNER JOIN

Returns only rows that have matching values in both tables.

SELECT e.name, d.dept_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.id;

LEFT (OUTER) JOIN

Returns all rows from the left table, with matching rows from the right table. Unmatched right-side columns are NULL.

SELECT e.name, d.dept_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.id;
-- Employees without departments will show dept_name as NULL

RIGHT (OUTER) JOIN

Returns all rows from the right table, with matching rows from the left table. Equivalent to swapping tables in a LEFT JOIN.

FULL OUTER JOIN

Returns all rows from both tables. Where there is no match, NULLs fill in the missing side.

CROSS JOIN

Returns the Cartesian product — every row from the left table paired with every row from the right table. If left has M rows and right has N rows, result has M × N rows.

SELF JOIN

Joins a table to itself. Commonly used for hierarchical data (employee → manager) or comparing rows within the same table.

SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;
INNER LEFT RIGHT FULL OUTER A B A B A B A B
Figure 2.1 — Venn diagram representation of SQL JOIN types
Gotcha: LEFT JOIN with WHERE on right table = INNER JOIN. If you filter on a column from the right table in the WHERE clause (e.g., WHERE d.location = 'NYC'), NULL rows from unmatched left rows are eliminated, effectively converting your LEFT JOIN into an INNER JOIN. Put the condition in the ON clause instead to preserve the LEFT JOIN behavior.

Subqueries & CTEs

Correlated vs Non-Correlated Subqueries

A non-correlated subquery executes once and its result is used by the outer query. A correlated subquery references columns from the outer query and executes once per outer row — much slower for large datasets.

-- Non-correlated: runs once
SELECT name FROM employees
WHERE dept_id IN (SELECT id FROM departments WHERE location = 'NYC');

-- Correlated: runs once per outer row
SELECT e.name, e.salary
FROM employees e
WHERE e.salary > (
  SELECT AVG(salary) FROM employees WHERE dept_id = e.dept_id
);

EXISTS vs IN

EXISTS short-circuits as soon as a match is found — faster for large subquery results. IN materializes the entire subquery result set first — faster when the subquery returns a small result set.

-- EXISTS: faster when subquery table is large
SELECT name FROM employees e
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.employee_id = e.id);

-- IN: faster when subquery returns few values
SELECT name FROM employees
WHERE dept_id IN (1, 2, 3);

Common Table Expressions (CTEs)

CTEs improve readability and enable recursive queries. They are defined with WITH and can reference themselves for hierarchical data.

-- Basic CTE
WITH dept_stats AS (
  SELECT dept_id, AVG(salary) AS avg_salary, COUNT(*) AS cnt
  FROM employees
  GROUP BY dept_id
)
SELECT d.name, ds.avg_salary, ds.cnt
FROM departments d
JOIN dept_stats ds ON d.id = ds.dept_id;

-- Recursive CTE: org chart traversal
WITH RECURSIVE org_tree AS (
  -- Base case: CEO (no manager)
  SELECT id, name, manager_id, 0 AS depth
  FROM employees WHERE manager_id IS NULL
  UNION ALL
  -- Recursive case: employees with managers
  SELECT e.id, e.name, e.manager_id, ot.depth + 1
  FROM employees e
  JOIN org_tree ot ON e.manager_id = ot.id
)
SELECT * FROM org_tree ORDER BY depth;

Window Functions

Window functions perform calculations across a set of rows related to the current row, without collapsing them (unlike GROUP BY). They are essential for ranking, running totals, and comparative analytics.

-- ROW_NUMBER, RANK, DENSE_RANK comparison
SELECT name, dept_id, salary,
  ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS row_num,
  RANK()       OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rank,
  DENSE_RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS dense_rank
FROM employees;
-- ROW_NUMBER: 1,2,3,4 (always unique)
-- RANK:       1,2,2,4 (gaps after ties)
-- DENSE_RANK: 1,2,2,3 (no gaps after ties)
-- LEAD / LAG: access adjacent rows
SELECT date, revenue,
  LAG(revenue, 1) OVER (ORDER BY date) AS prev_day_revenue,
  revenue - LAG(revenue, 1) OVER (ORDER BY date) AS daily_change
FROM daily_sales;

-- Running total
SELECT date, amount,
  SUM(amount) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM transactions;
-- Top N per group (classic interview question)
WITH ranked AS (
  SELECT name, dept_id, salary,
    ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rn
  FROM employees
)
SELECT * FROM ranked WHERE rn <= 3;

Query Optimization Patterns

The N+1 Query Problem

Occurs when an application executes 1 query to fetch N parent records, then N additional queries to fetch related data for each parent. This results in N+1 total queries instead of 1 or 2.

-- N+1 problem (pseudocode):
-- Query 1: SELECT * FROM orders        → returns 100 orders
-- Query 2-101: SELECT * FROM items WHERE order_id = ?  (×100)
-- Total: 101 queries!

-- Solution: JOIN or batch fetch
SELECT o.*, i.*
FROM orders o
LEFT JOIN order_items i ON o.id = i.order_id;
-- Total: 1 query

-- Or batch with IN:
SELECT * FROM order_items WHERE order_id IN (1,2,3,...,100);
-- Total: 2 queries

Other Optimization Patterns

  • SELECT only needed columns — Avoid SELECT *; it wastes I/O and prevents covering index optimization.
  • Use EXISTS instead of COUNT — WHERE EXISTS (...) stops after the first match; WHERE (SELECT COUNT(...)) > 0 scans all matching rows.
  • Avoid functions on indexed columns — WHERE YEAR(created_at) = 2024 prevents index use; use a range instead: WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01'.
  • Use UNION ALL instead of UNION — UNION removes duplicates (requires a sort); UNION ALL does not.
  • Paginate with keyset, not OFFSET — WHERE id > last_id LIMIT 20 is O(1) vs. OFFSET 10000 LIMIT 20, which is O(n) because the skipped rows must still be scanned and discarded.
Hash Join vs Nested-Loop — same inputs, different access patterns
Module 3

Indexing

B-Trees, B+ Trees, LSM Trees, hash indexes, and composite index strategy — the key to database performance.

B-Tree & B+ Tree

B-Tree

A self-balancing tree data structure that maintains sorted data and allows searches, insertions, and deletions in O(log n) time. Each node can have multiple children (high fan-out), which minimizes tree height and thus disk I/O.

  • Balanced — All leaf nodes are at the same depth, guaranteeing worst-case O(log n) lookups.
  • High fan-out — Each node holds many keys (typically filling one disk page), so tree height is small (3-4 levels for billions of rows).
  • Data in all nodes — Both internal and leaf nodes store actual data pointers.
Figure 3.1 — B-Tree structure showing node splits during insertion

B+ Tree

The B+ Tree is a variant where all data is stored only in leaf nodes, and leaf nodes are linked together in a doubly-linked list. This makes range queries extremely efficient — once you find the starting leaf, you follow pointers to traverse consecutive values.

  • Internal nodes store only keys — More keys per internal node = higher fan-out = shorter tree.
  • Leaf linked list — Enables efficient range scans and ORDER BY without re-traversing the tree.
  • All data at leaf level — Every search requires the same number of disk reads (tree height), making performance predictable.
Figure 3.2 — B+ Tree showing range query traversal via linked leaf nodes
Feature                | B-Tree                           | B+ Tree
Data location          | Internal + leaf nodes            | Leaf nodes only
Internal node capacity | Lower (stores data)              | Higher (keys only)
Range queries          | Requires tree traversal          | Leaf linked list traversal
Point lookups          | Can be faster (data in internal) | Always same depth
Used in                | General purpose                  | Most RDBMS (InnoDB, PostgreSQL)

Hash Index

Hash indexes use a hash function to map keys directly to bucket locations, providing O(1) average-case point lookups. However, they cannot support range queries, ordering, or partial key lookups.

  • Point lookups — Fastest possible: hash the key, go directly to the bucket.
  • No range queries — WHERE price BETWEEN 10 AND 50 cannot use a hash index.
  • No ordering — ORDER BY cannot use a hash index.
  • No partial matching — WHERE name LIKE 'Al%' cannot use a hash index.
-- PostgreSQL hash index (rarely used in practice, B-tree is usually better)
CREATE INDEX idx_users_email ON users USING hash (email);

-- Good for: exact equality lookups
SELECT * FROM users WHERE email = 'alice@example.com';

-- Cannot use hash index for:
SELECT * FROM users WHERE email > 'a';  -- range
SELECT * FROM users ORDER BY email;     -- ordering

LSM Tree (Log-Structured Merge Tree)

LSM Trees are write-optimized data structures used by write-heavy databases like Cassandra, RocksDB, LevelDB, and HBase. They convert random writes into sequential writes.

How LSM Trees Work

  1. Write to memtable — Incoming writes go to an in-memory sorted data structure (red-black tree or skip list). This is extremely fast.
  2. Flush to SSTable — When the memtable reaches a threshold, it is flushed to disk as an immutable, sorted SSTable (Sorted String Table).
  3. Compaction — Background process merges multiple SSTables, removing duplicates and tombstones, producing larger sorted files.
  4. Read path — Check memtable first, then each SSTable level (using Bloom filters to skip SSTables that definitely do not contain the key).
Figure 3.3 — LSM Tree architecture: write path (memtable → SSTable) and compaction flow
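
To make the write and read paths concrete, here is a toy Python sketch (illustrative only, not any real engine's code): a dict stands in for the memtable, plain sorted lists stand in for on-disk SSTables, and compact() plays the role of compaction. Real engines add a WAL, Bloom filters, and leveled compaction.

import bisect

class MiniLSM:
    """Toy LSM tree: in-memory memtable flushed to immutable sorted
    runs ("SSTables") once it exceeds a threshold."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}            # in-memory write buffer
        self.sstables = []            # newest-first list of sorted runs
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value    # step 1: fast in-memory write
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Step 2: flush the memtable as an immutable sorted run
        run = sorted(self.memtable.items())
        self.sstables.insert(0, run)  # newest run is searched first
        self.memtable = {}

    def get(self, key):
        # Step 4 (read path): memtable first, then runs newest to oldest
        if key in self.memtable:
            return self.memtable[key]
        for run in self.sstables:
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # Step 3: merge all runs; newer values win over older duplicates
        merged = {}
        for run in reversed(self.sstables):   # apply oldest first
            merged.update(dict(run))
        self.sstables = [sorted(merged.items())]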
Feature             | B+ Tree                   | LSM Tree
Write performance   | Moderate (random I/O)     | Excellent (sequential I/O)
Read performance    | Excellent (single tree)   | Good (may check multiple levels)
Space amplification | Low                       | Higher (multiple copies during compaction)
Write amplification | Low-moderate              | Higher (compaction rewrites data)
Range queries       | Excellent (leaf list)     | Good (after compaction)
Used in             | PostgreSQL, MySQL/InnoDB  | RocksDB, Cassandra, LevelDB

Specialized Indexes

Inverted Index (Full-Text Search)

Maps terms to the documents/rows containing them. Used by Elasticsearch, PostgreSQL tsvector, and MySQL FULLTEXT. Essential for full-text search.

GIN (Generalized Inverted Index)

PostgreSQL index type for composite values — arrays, JSONB, full-text search (tsvector). Supports operators like @> (contains), && (overlap).

GiST (Generalized Search Tree)

PostgreSQL index for geometric, range, and nearest-neighbor queries. Supports PostGIS spatial queries and range types.

Covering Index (INCLUDE columns)

An index that includes all columns needed by a query, avoiding table lookups entirely. PostgreSQL supports INCLUDE to add non-key columns.

-- Covering index: query answered entirely from index
CREATE INDEX idx_orders_covering
ON orders (customer_id, order_date)
INCLUDE (total_amount, status);

-- This query uses only the index (index-only scan):
SELECT order_date, total_amount, status
FROM orders
WHERE customer_id = 42;

Partial Index

An index on a subset of rows, defined with a WHERE clause. Smaller, faster, and ideal for skewed data.

-- Index only active users (5% of total) — much smaller than full index
CREATE INDEX idx_active_users ON users (email)
WHERE is_active = true;

Composite Index & Index Strategy

Leftmost Prefix Rule

A composite index on (A, B, C) can serve queries that filter on:

  • A alone
  • A and B
  • A, B, and C

But it cannot efficiently serve queries on B alone, C alone, or B, C. The leftmost column must always be present.

Index Selection Strategy

  • Selectivity — Index columns that filter out the most rows. A column with 10 distinct values in 10M rows has low selectivity; one with 9M distinct values has high selectivity.
  • Cardinality — Number of distinct values. Higher cardinality = better index candidate.
  • Query patterns — Index columns that appear in WHERE, JOIN ON, ORDER BY, and GROUP BY clauses.
  • Write overhead — Each index slows writes (inserts, updates, deletes). Only create indexes that are actually used.
Gotcha: Composite index (A,B,C) can serve queries on A; A,B; A,B,C — but NOT B alone. The index is sorted by A first, then by B within each A value, then by C within each (A,B) pair. Querying on B alone would require scanning the entire index (a full index scan, little better than a full table scan). If you frequently query on B alone, you need a separate index on B.
B+ Tree Range Query — page walk for WHERE age BETWEEN 24 AND 32
Clustered vs Non-Clustered Index Layout
LSM-Tree Compaction — level merges over time
Module 4

Transactions & Concurrency

Isolation levels, MVCC, two-phase locking, and deadlock detection — how databases handle concurrent access safely.

Isolation Levels

SQL defines four standard isolation levels, each allowing progressively fewer concurrency anomalies at the cost of reduced throughput.

Isolation Level  | Dirty Read | Non-Repeatable Read | Phantom Read | Performance
Read Uncommitted | Possible   | Possible            | Possible     | Fastest
Read Committed   | Prevented  | Possible            | Possible     | Fast (PostgreSQL default)
Repeatable Read  | Prevented  | Prevented           | Possible     | Moderate (MySQL default)
Serializable     | Prevented  | Prevented           | Prevented    | Slowest
-- Setting isolation level in PostgreSQL
BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
  SELECT balance FROM accounts WHERE id = 1;  -- reads 1000
  -- Another transaction updates balance to 500 and commits
  SELECT balance FROM accounts WHERE id = 1;  -- still reads 1000 (snapshot)
COMMIT;

Read Anomalies

Dirty Read

Transaction T2 reads data written by T1 before T1 commits. If T1 rolls back, T2 has read data that never existed.

Non-Repeatable Read

Transaction T1 reads a row, T2 modifies and commits that row, then T1 reads the same row again and gets a different value.

Phantom Read

Transaction T1 reads a set of rows matching a condition, T2 inserts a new row matching that condition and commits, then T1 re-reads and finds the new "phantom" row.

Figure 4.1 — Timeline diagrams showing dirty read, non-repeatable read, and phantom read anomalies

Multi-Version Concurrency Control (MVCC)

MVCC is the dominant concurrency control mechanism in modern databases. Instead of locking rows, the database keeps multiple versions of each row and serves each transaction the version that was current at the transaction's start time (snapshot).

PostgreSQL MVCC

Each tuple (row version) has two hidden columns: xmin (transaction ID that created it) and xmax (transaction ID that deleted/updated it, or 0 if still alive). When a row is updated, PostgreSQL creates a new tuple with a new xmin and marks the old tuple with the updater's xmax.

MySQL/InnoDB MVCC

InnoDB uses an undo log chain. Each row has a hidden DB_TRX_ID (last modifying transaction) and DB_ROLL_PTR (pointer to the previous version in the undo log). To read an old version, InnoDB walks the undo log chain backwards until it finds a version visible to the reading transaction.

Figure 4.2 — MVCC version chain: each transaction sees a consistent snapshot
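
The visibility rule can be sketched in a few lines of Python. This is a deliberately simplified model (real PostgreSQL consults the commit log and records in-progress transactions in each snapshot), but it shows how xmin/xmax select exactly one version per snapshot; the version chain mirrors Figure 4.2.

def visible(xmin, xmax, snapshot_xid, committed):
    """A tuple version is visible if it was created by a transaction
    that committed before our snapshot, and was not deleted by one
    that committed before our snapshot. `committed` is the set of
    transaction IDs already committed when the snapshot was taken."""
    if xmin not in committed or xmin >= snapshot_xid:
        return False                     # creator not yet visible
    if xmax == 0:
        return True                      # never deleted or updated
    return not (xmax in committed and xmax < snapshot_xid)

# Version chain from Figure 4.2: (xmin, xmax, balance)
versions = [(103, 0, 1500), (101, 103, 1000), (99, 101, 500)]

# Txn 104, snapshot taken after 103 committed: sees balance 1500
print([b for x, m, b in versions if visible(x, m, 104, {99, 101, 103})])
# Txn 102, snapshot taken before 103 committed: sees balance 1000
print([b for x, m, b in versions if visible(x, m, 102, {99, 101})])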
Feature                | Optimistic (MVCC)                            | Pessimistic (Locking)
Approach               | Allow concurrent access, validate at commit  | Lock resources before access
Readers block writers? | No                                           | Yes (shared locks block exclusive)
Writers block readers? | No                                           | Yes (exclusive locks block shared)
Conflicts detected     | At commit time                               | At access time
Best when              | Low contention, read-heavy                   | High contention, write-heavy
Overhead               | Storage for old versions                     | Lock management, potential deadlocks
Used by                | PostgreSQL, MySQL/InnoDB, Oracle             | SQL Server (default), DB2

Locking & Two-Phase Locking (2PL)

Two-Phase Locking is a concurrency control protocol that guarantees serializability. Every transaction follows two phases:

  1. Growing phase — Transaction acquires locks but never releases any.
  2. Shrinking phase — Transaction releases locks but never acquires any new ones.

Lock Types

  • Shared (S) lock — For reading. Multiple transactions can hold shared locks on the same resource simultaneously.
  • Exclusive (X) lock — For writing. Only one transaction can hold an exclusive lock. Blocks both shared and exclusive lock requests.
  • Intent locks — Table-level locks that signal the intent to lock rows. IS (Intent Shared), IX (Intent Exclusive). Used to avoid checking every row for conflicting locks.
-- Explicit locking in PostgreSQL
SELECT * FROM accounts WHERE id = 1 FOR UPDATE;      -- Exclusive row lock
SELECT * FROM accounts WHERE id = 1 FOR SHARE;       -- Shared row lock
SELECT * FROM accounts WHERE id = 1 FOR UPDATE SKIP LOCKED; -- Skip already-locked rows
SELECT * FROM accounts WHERE id = 1 FOR UPDATE NOWAIT;    -- Error immediately if locked

Deadlock Detection & Prevention

A deadlock occurs when two or more transactions are waiting for each other to release locks, forming a cycle. No transaction can proceed.

Figure 4.3 — Wait-for graph showing a deadlock cycle: T1 holds Row A and waits for Row B, while T2 holds Row B and waits for Row A. Prevention strategies: timeout, Wait-Die, Wound-Wait, lock ordering.

Detection & Prevention

  • Wait-for graph — Database periodically builds a graph of transaction waits. If a cycle is detected, one transaction (the victim) is aborted and rolled back. A minimal cycle check is sketched after this list.
  • Timeout — If a lock wait exceeds a timeout, the waiting transaction is aborted. Simple but can cause unnecessary aborts.
  • Wait-Die — Older transactions wait for younger ones; younger transactions are aborted if they request a lock held by an older one.
  • Wound-Wait — Older transactions "wound" (abort) younger ones holding locks they need; younger transactions wait for older ones.
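
A minimal sketch of cycle detection in a wait-for graph, assuming each transaction waits on at most one other (real lock managers handle arbitrary graphs and pick the victim by rollback cost):

def find_deadlock(wait_for):
    """`wait_for` maps a transaction to the transaction it waits on.
    Walk the chain from each start node; revisiting a node means a
    cycle. Returns the cycle as a list, or None."""
    for start in wait_for:
        seen, txn = [], start
        while txn in wait_for:
            if txn in seen:
                return seen[seen.index(txn):]   # cycle found
            seen.append(txn)
            txn = wait_for[txn]
    return None

# T1 holds Row A and waits for B; T2 holds Row B and waits for A
print(find_deadlock({"T1": "T2", "T2": "T1"}))  # ['T1', 'T2'] -> abort one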
Gotcha: MySQL Repeatable Read can still allow phantom reads in certain edge cases. MySQL's InnoDB uses "gap locks" and "next-key locks" to prevent phantoms in most cases under Repeatable Read, but gaps in the locking mechanism mean certain complex queries can still experience phantom-like behavior. For guaranteed phantom prevention, use the Serializable isolation level.
Isolation Anomalies Matrix
2PL vs MVCC — two-transaction timeline on row X
Deadlock Wait-For Graph — cycle detection & victim
Module 5

Query Processing

From SQL text to execution plan — parsing, optimization, join algorithms, and EXPLAIN output interpretation.

Query Pipeline

Every SQL query goes through a multi-stage pipeline before any data is returned. Understanding this pipeline is crucial for debugging performance issues.

Figure 5.1 — Query processing pipeline: Parse → Analyze → Optimize → Execute → Return
  1. Parse — The SQL string is tokenized and parsed into an Abstract Syntax Tree (AST). Syntax errors are caught here.
  2. Analyze — Table and column names are resolved against the catalog. Types are checked. Permissions are verified.
  3. Optimize — The query optimizer generates multiple execution plans and estimates the cost of each using table statistics. Selects the cheapest plan.
  4. Execute — The executor runs the chosen plan, pulling data from storage through the buffer pool, applying filters, joins, sorts, and aggregations.

Cost-Based Optimizer

The cost-based optimizer estimates the I/O and CPU cost of different execution plans and chooses the cheapest. It relies on table statistics maintained by ANALYZE.

  • Row count estimates — How many rows will each filter reduce to?
  • Histograms — Distribution of values in each column, allowing accurate selectivity estimation for range predicates.
  • Most common values (MCV) — The N most frequent values and their frequencies, used for equality predicates.
  • Distinct value count (n_distinct) — Estimated number of unique values per column.
  • Correlation — How well the physical order of rows correlates with the logical (index) order. High correlation makes index scans more efficient.
-- Viewing PostgreSQL table statistics
SELECT
  attname,
  n_distinct,
  most_common_vals,
  most_common_freqs,
  correlation
FROM pg_stats
WHERE tablename = 'orders';

Join Algorithms

The database engine has multiple physical algorithms for executing joins. The optimizer chooses based on table sizes, available indexes, and memory.

Nested Loop Join

For each row in the outer table, scan the inner table for matches. O(M × N) without an index on the inner table. Best when one table is very small or an index exists on the join column of the inner table.

Hash Join

Build a hash table from the smaller table (build phase), then probe it with each row from the larger table (probe phase). O(M + N). Best for large, unsorted tables without indexes. Requires memory for the hash table.

Sort-Merge Join

Sort both tables on the join key, then merge them in a single pass. O(M log M + N log N) for sorting, O(M + N) for merging. Best when both tables are already sorted (or an index provides the sort order) and for range joins.

Figure 5.2 — Three join algorithms: Nested Loop, Hash Join, and Sort-Merge Join
Algorithm   | Complexity           | Memory                  | Best For                          | Requires
Nested Loop | O(M × N)             | Low                     | Small outer table, indexed inner  | Nothing (most general)
Hash Join   | O(M + N)             | High (hash table)       | Large unsorted tables, equi-joins | Equality condition
Sort-Merge  | O(M log M + N log N) | Moderate (sort buffers) | Pre-sorted data, range joins      | Sortable join key
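
The build/probe structure of a hash join fits in a few lines of Python. A sketch with made-up rows (dicts standing in for tuples):

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join in O(M + N): build a hash table on the smaller
    input, then probe it with every row of the larger input."""
    table = {}
    for row in build_rows:                         # build phase
        table.setdefault(row[build_key], []).append(row)
    for row in probe_rows:                         # probe phase
        for match in table.get(row[probe_key], []):
            yield {**match, **row}                 # emit joined row

customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}]
orders = [{"order_id": 10, "customer_id": 1},
          {"order_id": 11, "customer_id": 1}]
print(list(hash_join(customers, orders, "id", "customer_id")))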

EXPLAIN & ANALYZE

EXPLAIN shows the query plan the optimizer chose. EXPLAIN ANALYZE actually executes the query and shows real timings.

-- PostgreSQL EXPLAIN ANALYZE output
EXPLAIN ANALYZE
SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.total > 100;

-- Output:
-- Hash Join  (cost=12.50..45.30 rows=150 width=36)
--           (actual time=0.250..0.890 rows=143 loops=1)
--   Hash Cond: (o.customer_id = c.id)
--   -> Seq Scan on orders o  (cost=0.00..28.50 rows=150 width=12)
--              (actual time=0.010..0.450 rows=143 loops=1)
--        Filter: (total > 100)
--        Rows Removed by Filter: 857
--   -> Hash  (cost=10.00..10.00 rows=200 width=28)
--              (actual time=0.180..0.180 rows=200 loops=1)
--        Buckets: 256  Batches: 1  Memory Usage: 15kB
--        -> Seq Scan on customers c  (cost=0.00..10.00 rows=200 width=28)
--              (actual time=0.005..0.090 rows=200 loops=1)
-- Planning Time: 0.150 ms
-- Execution Time: 1.050 ms

Key Scan Types

  • Seq Scan — Full table scan. Reads every row. Used when no suitable index exists or when a large percentage of rows is needed.
  • Index Scan — Uses B+ Tree index to find matching rows, then fetches row data from heap. Used for selective queries.
  • Index Only Scan — Answers the query entirely from the index without touching the heap table. Fastest for covered queries.
  • Bitmap Index Scan — Scans the index to build a bitmap of matching pages, then fetches those pages. Efficient for medium-selectivity queries.
Gotcha: Fast on small data does not mean fast at scale. A query that runs in 1ms on 1,000 rows might take 30 seconds on 10 million rows because the optimizer chooses a different plan. Always run EXPLAIN ANALYZE on production-sized data or realistic data distributions. The number one cause of production performance issues is testing only on small datasets.
Logical → Physical Plan Tree — SELECT AVG(o.total) FROM users u JOIN orders o ON u.id=o.uid WHERE u.country='US'
Cost-Based Join Order — picking a left-deep plan over A ⋈ B ⋈ C ⋈ D
Module 6

Storage Engine Internals

Pages, buffer pools, Write-Ahead Logging, and the physical reality beneath the SQL abstraction.

Pages & Blocks

Databases do not read or write individual rows. The fundamental unit of I/O is a page (also called a block). Pages are fixed-size units, typically 8KB in PostgreSQL and 16KB in MySQL/InnoDB.

Slotted Page Format

Most databases use a slotted page layout where a page header is followed by a slot array (item pointers) at the beginning, and tuple data is stored from the end of the page growing towards the slot array. Free space exists between the slot array and the tuple data.

Figure 6.1 — Slotted page layout: header at top, slot array grows down, tuples grow up

Buffer Pool / Cache

The buffer pool is a large region of shared memory that caches frequently accessed disk pages. It is the most critical performance component — a warm buffer pool can make the difference between microsecond and millisecond response times.

  • Page table — Hash map from (tablespace, file, page#) to buffer pool frame number.
  • Pin count — Tracks how many concurrent operations are using a page. Pinned pages cannot be evicted.
  • Dirty flag — Marks pages that have been modified in memory but not yet written to disk.
  • LRU eviction — When the buffer pool is full and a new page is needed, the least-recently-used unpinned, clean page is evicted. Dirty pages must be flushed first.
Figure 6.2 — Buffer pool with page table mapping, dirty page tracking, and disk I/O. Eviction when full uses LRU variants: clock sweep (PostgreSQL) or midpoint insertion (MySQL).
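
A toy buffer pool tying together the page table, pin counts, dirty flags, and LRU eviction described above (a sketch only; as the figure notes, PostgreSQL actually uses a clock sweep and InnoDB a midpoint LRU):

from collections import OrderedDict

class BufferPool:
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk                # page_id -> page bytes
        self.frames = OrderedDict()     # page_id -> [data, dirty, pins]

    def get_page(self, page_id):
        if page_id in self.frames:                  # hit
            self.frames.move_to_end(page_id)        # mark recently used
            return self.frames[page_id][0]
        if len(self.frames) >= self.capacity:       # miss: make room
            self._evict()
        self.frames[page_id] = [self.disk[page_id], False, 0]
        return self.frames[page_id][0]

    def _evict(self):
        # Scan from least to most recently used; skip pinned pages
        for page_id, (data, dirty, pins) in self.frames.items():
            if pins == 0:
                if dirty:
                    self.disk[page_id] = data       # flush before evicting
                del self.frames[page_id]
                return
        raise RuntimeError("all pages pinned")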

Write-Ahead Log (WAL)

The WAL is the mechanism that provides durability (the D in ACID) and atomicity (the A in ACID). The fundamental rule: a log record describing a change must be written to stable storage before the data page itself is written.

Figure 6.3 — WAL sequence: write to WAL buffer, fsync to disk, ACK commit, lazy checkpoint to data files

Storage Engine Comparison

Feature           | InnoDB (MySQL)                 | MyISAM (MySQL)
ACID compliant    | Yes                            | No
Row-level locking | Yes (MVCC)                     | Table-level locking only
Foreign keys      | Supported                      | Not supported
Crash recovery    | WAL (redo log)                 | Manual repair required
Full-text search  | Yes (since 5.6)                | Yes
Clustered index   | Yes (PK is clustered)          | No (heap table)
Buffer pool       | Yes (innodb_buffer_pool_size)  | Key cache only (key_buffer_size)
MVCC              | Yes                            | No
Use case          | OLTP, general purpose          | Read-heavy, legacy

RocksDB & LSM Tree Tradeoffs

RocksDB (used by CockroachDB, TiKV, MyRocks) is built on LSM trees. The three fundamental tradeoffs in LSM storage are:

  • Write amplification — Data is written multiple times during compaction. A write amplification of 10x means every byte written by the application results in 10 bytes written to disk.
  • Read amplification — A point read may need to check multiple SSTable levels. Bloom filters mitigate this.
  • Space amplification — Multiple versions of data exist during compaction, temporarily using more space.
Gotcha: InnoDB secondary index stores the primary key, requiring two lookups. When you query via a secondary index in InnoDB, the engine first traverses the secondary B+ Tree to find the primary key value, then traverses the primary key (clustered) B+ Tree to find the actual row data. This "bookmark lookup" doubles the I/O for non-covering secondary index queries. This is why covering indexes (that include all needed columns) are especially valuable in InnoDB.
Row-store (HEAP) vs Column-store Page Layout
WAL Write → Checkpoint → Crash Recovery Timeline
Module 7

Replication

Single-leader, multi-leader, leaderless replication, and consensus protocols — how databases stay in sync.

Single-Leader (Master-Replica) Replication

One node (the leader/primary) accepts all writes. Replicas (followers/secondaries) receive a stream of changes and apply them. All reads can go to any node, but writes only go to the leader.

Synchronous vs Asynchronous

  • Synchronous — Leader waits for at least one replica to confirm it has written the change before acknowledging the client. Guarantees the replica has the data (no data loss on leader failure), but adds latency to every write.
  • Asynchronous — Leader acknowledges the client immediately after its own WAL write. Replicas catch up later. Lower write latency, but replicas can lag behind (stale reads) and data can be lost if the leader crashes before replicating.
  • Semi-synchronous — Leader waits for one replica (but not all). MySQL's semi-sync replication guarantees at least one replica has the data.
Figure 7.1 — Single-leader replication with sync and async replicas showing replication lag

Multi-Leader Replication

Multiple nodes accept writes, each replicating to the others. Used for multi-datacenter deployments where each datacenter has its own leader. The main challenge is conflict resolution when two leaders modify the same data concurrently.

Conflict Resolution Strategies

  • Last Write Wins (LWW) — Use timestamps to pick the "latest" write. Simple but can silently discard data. Used by Cassandra and DynamoDB.
  • Merge function — Application-specific logic merges conflicting values (e.g., set union for tags).
  • Custom resolution — Conflicts are logged and presented to the application or user for resolution.
  • CRDTs — Conflict-free Replicated Data Types that are mathematically guaranteed to converge (e.g., G-Counter, OR-Set); a minimal G-Counter is sketched after this list.
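
A minimal G-Counter shows why CRDTs converge: each replica increments only its own slot, and merge takes the element-wise max, an operation that is commutative, associative, and idempotent, so replicas agree regardless of message order or duplication.

class GCounter:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.counts = [0] * num_nodes    # one slot per replica

    def increment(self):
        self.counts[self.node_id] += 1   # only touch our own slot

    def merge(self, other):
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    def value(self):
        return sum(self.counts)

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment(); b.increment()
a.merge(b); b.merge(a)
print(a.value(), b.value())   # 3 3: both replicas converge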

Leaderless Replication & Quorum

No single leader — any node can accept writes. Uses quorum reads and writes to ensure consistency. Made famous by Amazon's Dynamo paper (2007), used by Cassandra and Riak.

Quorum Formula

With N replicas, if W + R > N, reads are guaranteed to see the latest write (at least one node in the read quorum overlaps with the write quorum); a quick check is sketched after the figure below.

  • N=3, W=2, R=2 — Standard quorum. Tolerates 1 node failure for both reads and writes.
  • N=3, W=3, R=1 — Fast reads (only need 1 node), but writes are slow and cannot tolerate any failures.
  • N=3, W=1, R=3 — Fast writes (only need 1 node), but reads are slow.
Figure 7.2 — Quorum write with W=2, N=3: write succeeds when 2 of 3 nodes acknowledge
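
The overlap argument is plain arithmetic; a quick check over the configurations above, plus one broken one:

def quorum_overlap(n, w, r):
    """W + R > N guarantees at least one node is in both the write
    and the read quorum, so a quorum read sees the latest write."""
    return w + r > n

for w, r in [(2, 2), (3, 1), (1, 3), (1, 1)]:
    print(f"N=3 W={w} R={r}: overlap guaranteed = {quorum_overlap(3, w, r)}")
# N=3 W=1 R=1 fails: the read may hit only nodes the write never reached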

Consensus Protocols

Raft

Raft (Diego Ongaro and John Ousterhout, 2014) is a consensus protocol designed to be understandable. It decomposes consensus into three sub-problems:

  1. Leader election — Nodes start as followers. If a follower does not hear from the leader within an election timeout, it becomes a candidate and requests votes. A candidate with a majority becomes the new leader.
  2. Log replication — The leader receives client requests, appends them to its log, and replicates log entries to followers. Once a majority confirms, the entry is committed.
  3. Safety — Guarantees that if an entry is committed, it will be present in the logs of all future leaders.
Figure 7.3 — Raft consensus state machine: leader election flow
Feature                   | Raft                             | Paxos
Year                      | 2014                             | 1989 (published 1998)
Design goal               | Understandability                | Theoretical correctness
Leader                    | Strong leader required           | Can be leaderless (Multi-Paxos uses leader)
Log ordering              | Strict sequential                | Can have gaps
Membership changes        | Joint consensus (built-in)       | Complex extension
Used by                   | etcd, CockroachDB, TiKV, Consul  | Google Chubby, Spanner
Implementation complexity | Moderate                         | High
Gotcha: Async replication means read replicas can return stale data. If a client writes to the leader and immediately reads from a replica, it may not see its own write. Solutions include: (1) read-after-write consistency by routing the client's reads to the leader for recently-written data, (2) monotonic reads by pinning a client to a specific replica, (3) using synchronous replication for critical paths.
Leader-Follower Log Shipping — replication lag visualized
Module 8

Partitioning & Sharding

Horizontal vs vertical partitioning, shard key selection, consistent hashing, and cross-shard query patterns.

Horizontal vs Vertical Partitioning

Horizontal Partitioning (Sharding)

Splits rows across multiple nodes. Each shard holds a subset of the rows but all columns. For example, users with IDs 1-1M on shard 1, IDs 1M-2M on shard 2.

Vertical Partitioning

Splits columns across multiple tables or nodes. Frequently accessed columns stay in one partition, rarely used or large columns (e.g., BLOBs) in another. This is essentially normalization taken to the physical level.

Figure 8.1 — Horizontal partitioning (sharding by rows) vs vertical partitioning (splitting by columns). Horizontal scales to more machines; vertical improves cache efficiency per query.

Partitioning Strategies

Range Partitioning

Assign contiguous ranges of the shard key to each partition. Example: dates (Jan-Mar → shard 1, Apr-Jun → shard 2). Simple, supports range queries, but can create hot spots (e.g., the current month always gets the most writes).

Hash Partitioning

Apply a hash function to the shard key and assign hash ranges to partitions. Distributes data more evenly, but destroys key ordering — range queries must scatter to all partitions.

Shard Key Selection Criteria

  • High cardinality — The shard key should have many distinct values to distribute evenly.
  • Even distribution — Avoid keys where most data maps to one shard (hot spots).
  • Query affinity — Choose a key that keeps related data on the same shard (e.g., tenant_id for multi-tenant apps).
  • Write distribution — Avoid monotonically increasing keys (like auto-increment IDs) with range partitioning — all writes go to the last partition.
Figure 8.2 — Range vs hash partitioning: distribution characteristics and tradeoffs

Resharding & Consistent Hashing

When adding or removing shards, data must be redistributed. Naive hash partitioning (hash(key) % N) is catastrophic here — changing N remaps almost every key.

Consistent Hashing

Maps both keys and nodes to a hash ring. Each key is assigned to the first node clockwise from its position on the ring. Adding or removing a node only affects keys between the new/removed node and its predecessor. Used by DynamoDB, Cassandra, and Riak.

Virtual Nodes (vnodes)

Each physical node is represented by many "virtual nodes" on the ring. This improves distribution uniformity and makes rebalancing smoother when nodes are added/removed.
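
A compact Python sketch of a hash ring with virtual nodes (the hash function and vnode count here are arbitrary illustrative choices):

import bisect, hashlib

def _h(s):
    # Map a string to a point on the ring
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each physical node appears `vnodes` times on the ring
        self.ring = sorted((_h(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def node_for(self, key):
        # First vnode clockwise from the key's position (wraps around)
        i = bisect.bisect(self.points, _h(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))
# Adding or removing a node remaps only the keys between its vnodes
# and their predecessors, roughly K/N of all keys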

Cross-Shard Queries

Queries that span multiple shards are inherently more complex and slower. Common patterns include:

  • Scatter-Gather — Send the query to all relevant shards, collect partial results, merge them at the coordinator. Used for aggregations and non-shard-key queries.
  • Global indexes — Maintain a secondary index across all shards to enable queries on non-shard-key columns. The index itself must be distributed.
  • Denormalization — Duplicate data across shards to avoid cross-shard joins. Accept write overhead for read performance.
Gotcha: Sharding is a last resort. Before sharding, exhaust these simpler scaling options: (1) read replicas for read scaling, (2) caching for repeated queries, (3) vertical scaling (bigger machine), (4) query optimization and indexing, (5) connection pooling. Sharding introduces operational complexity (cross-shard queries, distributed transactions, resharding), and it is very hard to undo.
Consistent Hashing Ring — node join/leave only moves O(K/N) keys
Module 9

NoSQL Deep Dive

Redis, MongoDB, Cassandra, Neo4j — architecture, data models, and when to choose each.

Redis

Redis is an in-memory data structure server. It is not just a key-value store — it supports rich data structures with atomic operations on each.

Core Data Structures

  • Strings — Binary-safe strings up to 512MB. Supports SET, GET, INCR, APPEND. Can store serialized JSON, counters, or bitmaps.
  • Hashes — Maps of field-value pairs. Think of a hash as a miniature row. HSET user:1 name "Alice" age 30.
  • Lists — Doubly-linked lists. O(1) push/pop from both ends. Used for queues, activity feeds.
  • Sets — Unordered collections of unique strings. Supports UNION, INTERSECT, DIFF. Used for tags, unique visitors.
  • Sorted Sets (ZSets) — Sets where each member has a score. Sorted by score. O(log N) add/remove/lookup. Used for leaderboards, rate limiting, priority queues.
  • Streams — Append-only log structure with consumer groups. Used for event streaming, similar to Kafka topics at smaller scale.

Persistence Options

Feature            | RDB Snapshots                              | AOF (Append-Only File)
Mechanism          | Point-in-time snapshot (fork + serialize)  | Log every write operation
Recovery speed     | Fast (load binary dump)                    | Slower (replay log)
Data loss on crash | Up to last snapshot interval               | At most 1 second (with fsync=everysec)
File size          | Compact                                    | Larger (but supports rewrite/compaction)
CPU impact         | Fork can be expensive with large datasets  | Minimal (sequential writes)
Recommended        | Backups, disaster recovery                 | Durability requirement
Figure 9.1 — Redis internals: single-threaded event loop with epoll/kqueue I/O multiplexing (100K+ ops/sec), in-memory data, and RDB/AOF persistence options.

MongoDB

MongoDB is a document database storing data as BSON (Binary JSON) documents. Each document can have a different structure, enabling schema flexibility within a collection.

Key Concepts

  • Document — A JSON-like record (BSON internally). Can contain nested objects and arrays.
  • Collection — A group of documents (analogous to a table). No schema enforcement by default.
  • Replica Set — A group of MongoDB instances maintaining the same data. One primary, multiple secondaries.
  • Aggregation Pipeline — A series of stages ($match, $group, $sort, $project, $lookup) for data transformation and analysis.
// MongoDB aggregation pipeline example
db.orders.aggregate([
  { $match: { status: "completed", date: { $gte: ISODate("2024-01-01") } } },
  { $group: {
      _id: "$customer_id",
      total_spent: { $sum: "$amount" },
      order_count: { $sum: 1 }
  }},
  { $sort: { total_spent: -1 } },
  { $limit: 10 }
]);

Cassandra

Apache Cassandra is a wide-column distributed database designed for high write throughput and linear scalability. It uses a masterless (peer-to-peer) architecture with tunable consistency.

Data Model

  • Partition key — Determines which node stores the data. Hashed for even distribution.
  • Clustering key — Determines the sort order of rows within a partition. Enables efficient range queries within a partition.
  • Wide rows — A partition can contain up to 2 billion cells, enabling time-series and event log patterns.
-- Cassandra table with composite primary key
CREATE TABLE sensor_data (
  sensor_id   UUID,
  event_time  TIMESTAMP,
  temperature DOUBLE,
  humidity    DOUBLE,
  PRIMARY KEY (sensor_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
-- sensor_id = partition key (which node)
-- event_time = clustering key (sort within partition)
Figure 9.2 — Cassandra data layout: partition key determines node, clustering key sorts within partition. Query-first design: model tables around your query patterns, not your entities.

Neo4j & Graph Databases

Neo4j uses the property graph model where data is stored as nodes (entities), relationships (edges), and properties on both. Its key advantage is index-free adjacency: each node physically stores pointers to its neighbors, making traversals O(1) per hop regardless of total graph size.

// Cypher query: find friends of friends
MATCH (p:Person {name: "Alice"})-[:FRIENDS_WITH]->(friend)-[:FRIENDS_WITH]->(fof)
WHERE fof <> p AND NOT (p)-[:FRIENDS_WITH]->(fof)
RETURN fof.name, count(friend) AS mutual_friends
ORDER BY mutual_friends DESC LIMIT 10;

// Cypher query: shortest path
MATCH path = shortestPath(
  (a:Person {name: "Alice"})-[:KNOWS*]-(b:Person {name: "Bob"})
)
RETURN path;

NoSQL Data Modeling Patterns

Embedding vs Referencing

  • Embedding — Nest related data inside a single document. Faster reads (single query), but can lead to large documents and data duplication. Best for 1:1 and 1:few relationships that are always read together.
  • Referencing — Store related data in separate documents/collections with references (like foreign keys). Avoids duplication but requires multiple queries or application-level joins. Best for 1:many and many:many relationships.

Fan-Out Patterns

  • Fan-out-on-write — When a user posts, push the post to all followers' timelines immediately. High write cost but fast reads (pre-computed timeline). Used by Twitter for most users. A minimal version is sketched after this list.
  • Fan-out-on-read — When a user opens their timeline, pull posts from all followed users and merge. Low write cost but slow reads (compute at read time). Used for celebrity users with millions of followers.
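
A sketch of fan-out-on-write with in-memory dicts standing in for the follower graph and timeline store (a real system would use something like Redis lists per user):

from collections import defaultdict

followers = {"alice": ["bob", "carol"]}   # author -> followers
timelines = defaultdict(list)             # user -> precomputed timeline

def post(author, text):
    """Fan-out-on-write: push the post to every follower's timeline
    now, so reading a timeline later is a single O(1) lookup."""
    for f in followers.get(author, []):
        timelines[f].append((author, text))

post("alice", "hello")
print(timelines["bob"])    # [('alice', 'hello')], no merge at read time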
Feature        | Redis                     | MongoDB                      | Cassandra              | Neo4j
Data model     | Data structures           | Document (BSON)              | Wide-column            | Property graph
Storage        | In-memory                 | Disk (WiredTiger)            | Disk (LSM tree)        | Disk (native graph)
Consistency    | Strong (single node)      | Tunable (write concern)      | Tunable (quorum)       | ACID
Scaling        | Cluster (hash slots)      | Sharding (mongos)            | Peer-to-peer ring      | Causal cluster
Best for       | Caching, queues, sessions | Content, catalogs, profiles  | Time-series, IoT, logs | Social, fraud, recommendations
Query language | Commands                  | MQL / Aggregation            | CQL (SQL-like)         | Cypher
Weak at        | Large datasets (RAM cost) | Multi-doc transactions       | Ad-hoc queries, joins  | Large-scale analytics
Gotcha: MongoDB is schema-flexible, not schema-less. Without schema validation, you can insert documents with any structure into a collection. In practice, this leads to chaos. Always use MongoDB's $jsonSchema validator to enforce the expected document structure at the database level. In interviews, distinguish between "schemaless" (no enforcement) and "schema-flexible" (schema is not rigid but should still be defined).
CAP Triangle — pick two under a network partition
Module 10

Distributed Databases

2PC, Saga, consistency models, Google Spanner's TrueTime, and vector clocks — coordination across nodes.

Two-Phase Commit (2PC)

2PC is a distributed transaction protocol that ensures all participants either commit or abort a transaction atomically.

Phase 1: Prepare

The coordinator sends a PREPARE message to all participants. Each participant writes the transaction to its WAL (making it durable) and responds with VOTE_COMMIT or VOTE_ABORT.

Phase 2: Commit/Abort

If all participants voted COMMIT, the coordinator sends COMMIT to all. If any voted ABORT, the coordinator sends ABORT to all.

Figure 10.1 — Two-Phase Commit sequence: Prepare phase (vote), then Commit phase (final decision)
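
A coordinator sketch in Python. The participant interface (prepare/commit/abort) is hypothetical; real implementations also handle participant timeouts and recover the coordinator's decision from its own log.

class Participant:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
    def prepare(self):
        # Phase 1: write the transaction to the WAL, then vote
        return self.healthy
    def commit(self): print(f"{self.name}: COMMIT")
    def abort(self):  print(f"{self.name}: ABORT")

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # Phase 1: collect votes
    if all(votes):
        for p in participants:                    # Phase 2: unanimous yes
            p.commit()
        return "committed"
    for p in participants:                        # any no (or timeout)
        p.abort()
    return "aborted"

print(two_phase_commit([Participant("A"), Participant("B")]))         # committed
print(two_phase_commit([Participant("A"), Participant("B", False)]))  # aborted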

Three-Phase Commit (3PC)

3PC adds a PRE-COMMIT phase between PREPARE and COMMIT. This allows participants to distinguish between a coordinator crash during Phase 1 (safe to abort) and during Phase 2 (safe to commit). Reduces the blocking window of 2PC but does not fully eliminate it in the presence of network partitions.

Saga Pattern

Sagas are an alternative to distributed transactions for long-running business processes. Instead of locking resources across services, each service executes a local transaction and publishes an event. If a step fails, compensating transactions undo previous steps.

Choreography

Each service listens for events and decides what to do next. There is no central coordinator. This is simple for a few steps but becomes hard to debug and monitor as complexity grows.

Orchestration

A central orchestrator service tells each participant what step to execute. Easier to monitor and debug, but the orchestrator is a single point of logic.

// Saga: Order placement (orchestration)
// Step 1: Reserve inventory       → Compensate: Release inventory
// Step 2: Charge payment           → Compensate: Refund payment
// Step 3: Ship order               → Compensate: Cancel shipment
//
// If Step 2 fails:
//   → Run compensate for Step 1 (release inventory)
//   → Order marked as failed
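
A minimal runnable sketch of the orchestration loop, reduced to (action, compensate) pairs; the print lambdas stand in for real service calls:

def run_saga(steps):
    """steps: list of (action, compensate) callables.
    Run actions in order; on failure, compensate completed steps in reverse."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for comp in reversed(done):
                comp()                      # undo earlier local transactions
            return "FAILED"
    return "COMPLETED"

# Toy order-placement saga (print statements stand in for service calls)
steps = [
    (lambda: print("reserve inventory"), lambda: print("release inventory")),
    (lambda: print("charge payment"),    lambda: print("refund payment")),
    (lambda: print("ship order"),        lambda: print("cancel shipment")),
]
run_saga(steps)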

Consistency Models

Model | Guarantee | Strength | Example
Linearizability | Ops appear instantaneous, in real-time order | Strongest | Spanner (external consistency)
Sequential consistency | Ops appear in some total order consistent with per-process order | Strong | ZooKeeper
Causal consistency | Causally related ops seen in order; concurrent ops may differ | Moderate | MongoDB (causal sessions)
Eventual consistency | All replicas converge given enough time | Weakest | Cassandra (ONE/ONE), DynamoDB

Linearizability vs Serializability:

  • Linearizability is a recency guarantee — once a write is acknowledged, all subsequent reads see it. It applies to single operations on single objects.
  • Serializability is a transaction guarantee — the result of executing concurrent transactions is equivalent to some serial order. It applies to groups of operations.
  • Strict serializability (= "serializable + linearizable") provides both. Google Spanner achieves this via TrueTime.

Google Spanner & TrueTime

Google Spanner is a globally distributed, strongly consistent relational database. Its key innovation is TrueTime — an API that provides bounded clock uncertainty using GPS receivers and atomic clocks in every data center.

TrueTime API

TT.now() returns an interval [earliest, latest] — the true time is guaranteed to be within this interval. Spanner waits out the uncertainty window before committing, ensuring that if transaction T1 commits before T2 starts (in real time), then T1's timestamp is less than T2's timestamp.

Figure 10.2 — TrueTime commit protocol: Spanner waits out clock uncertainty to ensure global ordering
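
A toy model of the commit-wait rule; the 7 ms epsilon is the figure quoted above, and real TrueTime derives its bound from GPS and atomic clock hardware rather than a constant:

import time

EPSILON = 0.007  # assumed uncertainty bound (~7 ms)

def tt_now():
    # Toy TrueTime: an interval [earliest, latest] around the local clock
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit_wait(commit_ts):
    # Spanner-style commit wait: block until commit_ts is definitely in the past,
    # i.e. until even the earliest possible true time has passed it
    while tt_now()[0] <= commit_ts:
        time.sleep(0.001)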

CockroachDB: Hybrid Logical Clocks

CockroachDB cannot assume GPS or atomic clocks (it runs on commodity hardware), so it uses Hybrid Logical Clocks (HLC) that combine physical timestamps with logical counters. HLCs provide causal ordering but not the external consistency of TrueTime; CockroachDB instead enforces a maximum clock-offset bound (nodes that drift beyond it are removed from the cluster) and offers serializable isolation by default, which is weaker than Spanner's strict serializability.
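
A compact sketch of the HLC update rules (millisecond wall clock plus a logical counter to break ties), following the published algorithm:

import time

class HLC:
    def __init__(self):
        self.pt = 0   # highest physical time observed so far
        self.c = 0    # logical counter for events within the same millisecond

    def now(self):
        """Timestamp for a local or send event."""
        wall = int(time.time() * 1000)
        if wall > self.pt:
            self.pt, self.c = wall, 0
        else:
            self.c += 1
        return (self.pt, self.c)

    def recv(self, remote):
        """Merge a timestamp received from another node."""
        wall = int(time.time() * 1000)
        rpt, rc = remote
        new_pt = max(wall, self.pt, rpt)
        if new_pt == self.pt == rpt:
            self.c = max(self.c, rc) + 1
        elif new_pt == self.pt:
            self.c += 1
        elif new_pt == rpt:
            self.c = rc + 1
        else:
            self.c = 0
        self.pt = new_pt
        return (self.pt, self.c)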

Vector Clocks

Vector clocks track causality in distributed systems. Each node maintains a vector of counters (one per node). When a node sends a message, it increments its own counter and attaches the vector. The receiver merges vectors by taking the element-wise max.

Figure 10.3 — Vector clock progression: each node increments its own counter and merges on receive

Vector clocks can distinguish concurrent events (neither causally precedes the other) from causally ordered ones. Amazon's original Dynamo, described in the Dynamo paper, used vector clocks for conflict detection; DynamoDB, its managed descendant, uses Last-Write-Wins for simplicity.
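
A minimal vector clock in Python, enough to show the merge rule and the concurrency check described above:

class VectorClock:
    def __init__(self, node_id, n_nodes):
        self.node = node_id
        self.v = [0] * n_nodes

    def tick(self):
        """Local event (or just before sending a message)."""
        self.v[self.node] += 1
        return list(self.v)

    def merge(self, incoming):
        """On receive: element-wise max with the sender's vector, then tick."""
        self.v = [max(a, b) for a, b in zip(self.v, incoming)]
        self.v[self.node] += 1

def happened_before(a, b):
    """True if a causally precedes b; if neither precedes the other, they are concurrent."""
    return all(x <= y for x, y in zip(a, b)) and a != b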

Gotcha: 2PC is a blocking protocol. If the coordinator fails after sending PREPARE but before sending COMMIT, all participants are stuck holding locks, waiting for the coordinator's decision. They cannot safely commit (maybe the coordinator decided to abort) or abort (maybe it decided to commit). This is the fundamental limitation of 2PC — solved by 3PC (partially) and Paxos/Raft-based commit protocols (fully).
[Interactive diagram: Raft Leader Election + Log Replication across 5 nodes]
Module 11

Caching

Caching strategies, Redis internals, cache invalidation patterns, and failure modes — the art of keeping hot data close.

Caching Strategies

There are five fundamental caching strategies, each with different consistency and performance tradeoffs. Choosing the right one depends on your read/write ratio, consistency requirements, and tolerance for stale data.

1. Cache-Aside (Lazy Loading)

The application manages the cache directly. On read miss, fetch from DB, populate cache, return. On write, update DB, then invalidate cache entry.

def get_user(user_id):
    # 1. Check cache first
    user = cache.get(f"user:{user_id}")
    if user is not None:
        return user  # Cache hit

    # 2. Cache miss — fetch from DB
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)

    # 3. Populate cache with TTL
    cache.set(f"user:{user_id}", user, ttl=300)
    return user

def update_user(user_id, data):
    db.update("UPDATE users SET ... WHERE id = %s", data, user_id)
    cache.delete(f"user:{user_id}")  # Invalidate, don't update

Pros: Only requested data is cached (no wasted memory). Cache failure doesn't break the app — just slower. Cons: Cache miss = three round trips. Stale data window between DB write and cache invalidation.

2. Read-Through

The cache itself loads data from the DB on a miss. The application only talks to the cache, never directly to the DB for reads. The cache library handles the DB fetch logic.

Pros: Simpler application code. Cons: Cache library must know how to query DB. First request is always slow (cold cache).

3. Write-Through

Every write goes to the cache first, and the cache synchronously writes to the DB. Data in cache is always consistent with DB.

Pros: Cache is never stale. Reads are always fast. Cons: Write latency increases (two writes per operation). Caches data that may never be read.

4. Write-Behind (Write-Back)

Write to cache immediately. The cache asynchronously flushes writes to the DB in batches. Significantly faster writes, but risk of data loss if cache crashes before flush.

Pros: Very fast writes. Batching reduces DB load. Cons: Data loss risk. Complex failure handling. DB may be behind cache.

5. Write-Around

Write directly to DB, bypassing the cache entirely. The cache is only populated on read misses (via cache-aside). Best for write-heavy workloads where written data is rarely re-read immediately.

Pros: Cache not polluted with write-heavy data. Cons: Recent writes always cause cache misses.

Figure 11.1 — Five caching strategies: data flow between Application, Cache, and Database
Strategy | Read Path | Write Path | Consistency | Best For
Cache-Aside | App → Cache (miss → DB → fill) | App → DB → invalidate cache | Eventual (stale window) | General purpose, read-heavy
Read-Through | App → Cache (auto-loads from DB) | App → DB (separate) | Eventual | Simplifying app code
Write-Through | App → Cache (always hot) | App → Cache → DB (sync) | Strong | Read-heavy, consistency needed
Write-Behind | App → Cache | App → Cache → DB (async batch) | Weak (lag) | Write-heavy, high throughput
Write-Around | App → Cache (miss → DB → fill) | App → DB only | Eventual | Write-heavy, rarely re-read

Redis Deep Dive

Redis is a single-threaded, in-memory data structure server. Despite being single-threaded for command execution, it handles 100K+ operations/second because it uses non-blocking I/O multiplexing (epoll on Linux, kqueue on macOS).

Architecture: Single-Threaded Event Loop

Redis processes all commands sequentially on a single thread. This eliminates lock contention and context switching. Network I/O is handled via multiplexing — Redis registers file descriptors with epoll and processes events as they become ready.

Figure 11.2 — Redis single-threaded event loop with I/O multiplexing: all clients share one thread

Eviction Policies

When Redis hits its maxmemory limit, it must evict keys. The policy determines which keys are removed:

Policy | Scope | Algorithm | Use Case
noeviction | N/A | Reject writes | When data loss is unacceptable
allkeys-lru | All keys | Approx. LRU | General cache (most common)
volatile-lru | Keys with TTL | Approx. LRU | Mix of persistent + cache keys
allkeys-lfu | All keys | Approx. LFU | Frequency-based (Redis 4.0+)
allkeys-random | All keys | Random | Uniform access pattern
volatile-ttl | Keys with TTL | Shortest TTL first | Expire soon-to-die keys first

Redis uses approximate LRU/LFU — it samples N keys (default 5) and evicts the least recently/frequently used among samples. This is much cheaper than true LRU (no doubly-linked list overhead) and nearly as accurate.
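
The sampling idea fits in a few lines; a sketch with a plain dict standing in for the keyspace (the real Redis tracks access times in each object's header):

import random

def evict_one(store, last_access, samples=5):
    """Approximate LRU: sample a few keys, evict the least recently used of them."""
    candidates = random.sample(list(store), min(samples, len(store)))
    victim = min(candidates, key=lambda k: last_access[k])
    del store[victim], last_access[victim]
    return victim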

Redis Cluster

Redis Cluster partitions data across multiple nodes using 16,384 hash slots. Each key maps to a slot via CRC16(key) mod 16384. Each master node owns a subset of slots.

  • Gossip protocol — Nodes exchange cluster state via periodic PING/PONG messages (cluster bus on port+10000).
  • Resharding — Slots can be migrated between nodes live. During migration, MOVED and ASK redirects guide clients to the correct node.
  • Failover — If a master fails, its replica is promoted via cluster consensus (majority of masters must agree).
  • Multi-key limitation — Multi-key operations (MGET, pipeline) only work if all keys hash to the same slot. Use hash tags {user:123}.name to force co-location; the slot computation is sketched below.
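
A sketch of that computation, using the XMODEM CRC16 polynomial (0x1021) that Redis Cluster specifies, plus the hash-tag rule:

def crc16_xmodem(data: bytes) -> int:
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_hash_slot(key: bytes) -> int:
    # Hash tag rule: if the key contains a non-empty {...}, hash only that part,
    # so {user:123}.name and {user:123}.orders land on the same slot
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end > start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key) % 16384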

Cache Invalidation

Phil Karlton famously said: "There are only two hard things in Computer Science: cache invalidation and naming things." Here are the strategies:

TTL-Based (Time-To-Live)

Set an expiry time on each cache entry. After TTL expires, the next request triggers a fresh fetch. Simple but blunt — data may be stale for the entire TTL window.

cache.set("product:42", product_data, ttl=60)  # Stale for up to 60 seconds

Event-Driven Invalidation

Invalidate cache entries when the underlying data changes. Three approaches:

  • Application-level — Delete cache key after DB write (fragile — can miss code paths).
  • Change Data Capture (CDC) — Tail the database's WAL/binlog (e.g., Debezium for PostgreSQL/MySQL). Guaranteed to capture every change.
  • Webhooks/Events — Publish invalidation events to a message bus (Kafka, RabbitMQ). Cache consumers subscribe and invalidate.

Versioned Keys

Instead of invalidating, increment a version counter and embed it in the cache key:

# Version stored in a separate fast-changing counter
version = cache.incr("product:42:version")
cache.set(f"product:42:v{version}", new_data, ttl=3600)
# Old versions naturally expire via TTL

Write-Invalidate vs Write-Update

  • Write-Invalidate — Delete the cache key on write. Next read re-fetches from DB. Simpler, safer (no race conditions on concurrent writes).
  • Write-Update — Update the cache value on write. Faster subsequent reads, but concurrent writes can leave stale data (last writer wins in cache may differ from DB).

Recommendation: Prefer write-invalidate. Write-update only makes sense if reads vastly outnumber writes AND you can tolerate occasional inconsistency.

Cache Failure Patterns

Three critical failure modes can bring down your database when caching goes wrong:

1. Thundering Herd (Cache Stampede)

A popular cache key expires. Hundreds of concurrent requests all miss the cache simultaneously and hit the database, potentially causing a cascading failure.

Solutions:

  • Mutex/Lock — First request acquires a distributed lock, fetches from DB, populates cache. Other requests wait or return stale data (see the sketch after this list).
  • Request coalescing — Collapse duplicate in-flight requests into one (singleflight pattern in Go).
  • Early refresh — Probabilistically refresh before TTL expiry. At time T, each request has probability P = (now - created) / TTL of triggering a background refresh.
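
A cache-aside read guarded by a distributed lock, assuming redis-py-style calls and a hypothetical db.fetch accessor; a sketch only, since production code would also bound the retries:

import time

def get_with_lock(cache, db, key, ttl=300, lock_ttl=10):
    val = cache.get(key)
    if val is not None:
        return val                              # cache hit

    # SET NX EX is atomic: exactly one caller wins the lock and recomputes
    if cache.set(f"lock:{key}", "1", nx=True, ex=lock_ttl):
        try:
            val = db.fetch(key)                 # hypothetical DB accessor
            cache.set(key, val, ex=ttl)
        finally:
            cache.delete(f"lock:{key}")
        return val

    time.sleep(0.05)                            # losers wait briefly and retry
    return get_with_lock(cache, db, key, ttl, lock_ttl)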

2. Cache Penetration

Requests for data that does not exist in the DB repeatedly bypass the cache (because there is nothing to cache) and hit the database. Can be an attack vector.

Solutions:

  • Cache null results — Store empty/null marker in cache with short TTL (e.g., 30s).
  • Bloom filter — Check a Bloom filter before querying. If the key is definitely not in the DB, return immediately without a DB hit. False positives are acceptable (they only cause an occasional unnecessary DB lookup). A minimal filter is sketched below.
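
A tiny Bloom filter to make the idea concrete; a Python int serves as the bit array, and real deployments would size the filter from the expected key count and target false-positive rate:

import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, hashes=4):
        self.size, self.k, self.bits = size_bits, hashes, 0

    def _positions(self, key: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True means possibly present
        return all(self.bits >> pos & 1 for pos in self._positions(key))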

3. Cache Avalanche

Many cache keys expire at the same time (e.g., all set with the same TTL at startup). The sudden flood of cache misses overwhelms the database.

Solutions:

  • Staggered TTL — Add random jitter to TTL: ttl = base_ttl + random(0, jitter).
  • Warm-up on deploy — Pre-populate cache before routing traffic to new instances.
  • Multi-tier caching — L1 (in-process) + L2 (Redis). L1 has shorter TTL and absorbs some misses.
Figure 11.3 — Three cache failure patterns: thundering herd, cache penetration, and cache avalanche

Cache Warming

A cold cache after deployment or restart can cause a sudden spike in database load. Cache warming strategies mitigate this:

  • Pre-load on deploy — Before routing traffic, populate the cache with the most frequently accessed keys. Query your analytics or access logs to determine the hot set.
  • Background refresh — A background worker continuously refreshes soon-to-expire keys, ensuring the cache never goes cold. Useful for keys with predictable access patterns.
  • Materialized cache — Pre-compute and store aggregated results (e.g., "top 100 products") in the cache. Refresh on a schedule rather than on demand.
  • Shadow traffic — Route a copy of production traffic to the new cache instance before cutover, warming it with real access patterns.
Gotcha: Write-behind can lose data. If the cache crashes before flushing pending writes to the database, those writes are lost permanently. Always log write-behind operations to a durable write-ahead log for recovery. Consider using Redis AOF with appendfsync everysec as a compromise between performance and durability.
[Interactive diagram: LRU vs LFU Eviction — cache capacity = 3, access stream A B C A B D A E]
Module 12

Database for System Design

Connection pooling, read replicas, database proxies, schema migrations, and time-series databases — production database patterns.

Connection Pooling

Every database connection consumes server memory (~10MB in PostgreSQL for the backend process). Opening a connection takes 50-100ms for TCP handshake + TLS + authentication. Connection pooling reuses connections to avoid this overhead.

How Pooling Works

A pool manager (PgBouncer, HikariCP, etc.) maintains a set of pre-established connections. Application threads check out a connection, execute queries, and return it to the pool. The pool handles lifecycle management.
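
Stripped to its essentials, a pool is a thread-safe queue of pre-opened connections; a minimal sketch, where connect is any zero-argument factory (e.g. a psycopg connect call):

import queue

class ConnectionPool:
    def __init__(self, connect, size):
        self.idle = queue.Queue()
        for _ in range(size):
            self.idle.put(connect())            # pre-establish all connections

    def checkout(self, timeout=5):
        return self.idle.get(timeout=timeout)   # blocks if the pool is exhausted

    def checkin(self, conn):
        self.idle.put(conn)                     # return the connection for reuse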

Pool Sizing: Little's Law

The optimal pool size can be derived from Little's Law: L = λW, where L = average connections in use, λ = request rate, W = average query duration.

# Example: 1000 req/sec, avg query time 10ms
L = λ × W = 1000 × 0.01 = 10 connections needed

# HikariCP recommendation (practical formula):
pool_size = (2 × num_cpu_cores) + num_spindle_disks
# For SSD: pool_size = 2 × cores + 1
# A 4-core server with SSD: pool_size = 9

Too few connections: Requests queue up waiting for a free connection. Too many connections: Database server thrashes — each connection consumes memory, and context switching between hundreds of processes degrades throughput.

Pooling Modes

Mode | Connection Assigned | Pros | Cons | Use Case
Session | For entire client session | Full feature support (prepared statements, SET, LISTEN) | Poor multiplexing — one pool conn per client | Long-lived app connections
Transaction | For one transaction only | Great multiplexing — released after COMMIT | No session-level features (SET, prepared stmts) | Most web apps (recommended)
Statement | For one query only | Maximum multiplexing | No transactions, no multi-statement support | Simple SELECT-only workloads
💡 Tip: PgBouncer in transaction mode is the most common production setup for PostgreSQL. It can multiplex 10,000 application connections onto 100 database connections. Note that server_reset_query (e.g. DISCARD ALL) only runs in session mode by default; in transaction mode, avoid session state entirely or enable server_reset_query_always.

Read Replicas

Read replicas scale read throughput by distributing queries across multiple copies of the data. The primary handles all writes; replicas receive changes via replication.

Routing Strategies

  • Application-level routing — Application code explicitly routes writes to primary, reads to replica. Simple but requires code changes everywhere.
  • Proxy-based routing — A database proxy (ProxySQL, pgpool-II) intercepts queries and routes based on query type. Transparent to the application.
  • DNS-based routing — Different DNS entries for write endpoint (primary) and read endpoint (round-robin across replicas). Used by AWS RDS.

Handling Replication Lag

Async replicas can lag behind the primary. Three critical patterns to handle this:

  • Read-your-writes — After a write, route subsequent reads from the same user to the primary (or wait for replica to catch up). Implemented via session affinity or a "last write timestamp" cookie.
  • Monotonic reads — Ensure a user always reads from the same replica, so they never see data go "backwards." Pin sessions to a specific replica.
  • Causal consistency — If user A writes, then tells user B to check, B should see A's write. Track causal dependencies using logical timestamps or version vectors.
# Read-your-writes pattern with timestamp tracking
def route_query(user, query):
    if query.is_write:
        result = primary.execute(query)
        user.last_write_ts = now()
        return result

    # If user wrote recently, read from primary
    if now() - user.last_write_ts < REPLICATION_LAG_THRESHOLD:
        return primary.execute(query)

    # Safe to read from replica
    return replica.execute(query)

Database Proxy

A database proxy sits between application and database servers, providing transparent query routing, load balancing, and connection multiplexing.

Popular Database Proxies

  • ProxySQL — MySQL proxy. Query routing rules, connection pooling, query caching, query rewriting, failover. Used at scale by many companies.
  • Vitess — MySQL sharding middleware (originally YouTube). Horizontal sharding, schema migrations, connection pooling, query routing. Now used by Slack, GitHub, HubSpot.
  • pgpool-II — PostgreSQL proxy. Connection pooling, load balancing, automated failover, parallel query execution.
  • PgBouncer — Lightweight PostgreSQL connection pooler. Does not route queries — purely pooling. Minimal overhead.

Proxy Capabilities

Feature | ProxySQL | Vitess | PgBouncer | pgpool-II
Connection pooling | Yes | Yes | Yes | Yes
Read/write splitting | Yes (rules) | Yes | No | Yes
Sharding | No | Yes | No | No
Query caching | Yes | Row cache | No | Yes
Schema migration | No | Yes (online) | No | No
Failover | Yes | Yes | No | Yes
Database | MySQL | MySQL | PostgreSQL | PostgreSQL

Schema Migration

Changing the schema of a production database with billions of rows and zero downtime is one of the hardest operational challenges. The naive approach (ALTER TABLE) can lock the table for hours.

Online DDL: gh-ost Approach

GitHub's gh-ost performs online schema changes without triggers:

  1. Create shadow table — A new table with the desired schema.
  2. Copy existing data — Bulk-copy rows from the original table to the shadow table in batches (throttled to not overload the server).
  3. Tail binary log — Simultaneously capture all ongoing DML (INSERT/UPDATE/DELETE) from the MySQL binlog and apply them to the shadow table.
  4. Cut-over — When the shadow table is caught up, atomically rename tables: original → _old, shadow → original.
  5. Cleanup — Drop the old table after verification.

Expand-Contract Pattern (Backward-Compatible Migrations)

For zero-downtime deploys, schema changes must be backward compatible with both old and new application code running simultaneously:

  1. Expand — Add new column (nullable or with default). Deploy app code that writes to BOTH old and new columns. Backfill existing rows.
  2. Migrate — Deploy app code that reads from new column. Verify correctness.
  3. Contract — Remove old column. Deploy code that only uses new column.
-- Step 1: Expand — add new column (safe, no lock)
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);

-- Step 2: Backfill (in batches to avoid long transactions)
UPDATE users SET full_name = first_name || ' ' || last_name
WHERE id BETWEEN 1 AND 10000;

-- Step 3: Contract — drop old columns (after all code uses full_name)
ALTER TABLE users DROP COLUMN first_name, DROP COLUMN last_name;
Gotcha: CREATE INDEX on a 100M-row table locks writes. Standard CREATE INDEX in PostgreSQL acquires a SHARE lock on the table, blocking all INSERT/UPDATE/DELETE for the entire build duration (could be hours). Always use CREATE INDEX CONCURRENTLY — it takes 2-3x longer but allows concurrent writes. For MySQL, use pt-online-schema-change or gh-ost.

Time-Series Databases

Time-series data — metrics, IoT sensor readings, financial ticks, logs — has unique characteristics: high write volume, append-mostly, time-ordered queries, natural aging (old data less valuable).

TimescaleDB

TimescaleDB is a PostgreSQL extension that adds time-series optimizations while keeping full SQL compatibility:

  • Hypertables — Automatic time-based partitioning. A hypertable looks like one table but is internally partitioned into chunks (default: 1 week per chunk). Queries touching recent data only scan recent chunks.
  • Compression — Native columnar compression on old chunks (10-20x compression ratio). Compressed chunks are read-only but query-able.
  • Continuous aggregates — Materialized views that automatically refresh as new data arrives. Pre-compute rollups (hourly/daily averages) without manual cron jobs.
  • Data retention — Automated policies to drop or compress chunks older than N days. Much faster than DELETE (just drop the underlying table).
-- Create a hypertable (TimescaleDB)
CREATE TABLE metrics (
    time        TIMESTAMPTZ NOT NULL,
    device_id   INTEGER,
    temperature DOUBLE PRECISION,
    humidity    DOUBLE PRECISION
);
SELECT create_hypertable('metrics', 'time');

-- Continuous aggregate for hourly averages
CREATE MATERIALIZED VIEW metrics_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', time) AS bucket,
    device_id,
    AVG(temperature) AS avg_temp,
    MAX(temperature) AS max_temp
FROM metrics
GROUP BY bucket, device_id;

-- Retention policy: drop data older than 90 days
SELECT add_retention_policy('metrics', INTERVAL '90 days');

-- Compression policy: compress chunks older than 7 days
ALTER TABLE metrics SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id'
);
SELECT add_compression_policy('metrics', INTERVAL '7 days');

Time-Series DB Comparison

Feature | TimescaleDB | InfluxDB | Prometheus | ClickHouse
Query language | Full SQL | InfluxQL / Flux | PromQL | SQL dialect
Storage engine | PostgreSQL + chunks | TSM (custom LSM) | TSDB (custom) | MergeTree (columnar)
Compression | Native columnar | Run-length, delta | Gorilla encoding | LZ4, ZSTD, delta
Best for | IoT, metrics + relational | Metrics, DevOps | Monitoring, alerting | Analytics, OLAP
Joins | Full SQL joins | Limited | No | Yes
Scaling | Single-node + replicas | Cluster (enterprise) | Federation, Thanos | Distributed
"Choose Your Database" Decision Tree
Module 13

Cheat Sheet

Module summaries, rapid-fire Q&A, SQL quick reference, and database selection guide — your interview day companion.

Module Summary Cards

1. Database Fundamentals

  • ACID: Atomicity, Consistency, Isolation, Durability
  • BASE: Basically Available, Soft state, Eventually consistent
  • CAP: Pick 2 of Consistency, Availability, Partition tolerance
  • PACELC: If P then A/C else L/C tradeoff
  • Types: Relational, Document, Key-Value, Wide-Column, Graph, Time-Series

2. Relational Model & SQL

  • Normal forms: 1NF (atomic) → 2NF (no partial deps) → 3NF (no transitive deps) → BCNF
  • Joins: INNER, LEFT, RIGHT, FULL, CROSS, SELF
  • Window functions: ROW_NUMBER, RANK, DENSE_RANK, LEAD/LAG, SUM OVER
  • CTEs with WITH clause; recursive CTEs for hierarchies
  • Denormalize for read-heavy, normalize for write-heavy

3. Indexing

  • B+ Tree: O(log n) search, range queries, ordered traversal
  • Hash Index: O(1) exact match, no range queries
  • LSM Tree: O(1) write (append), used by RocksDB, Cassandra
  • Composite index: leftmost prefix rule
  • Covering index: index-only scan, avoids table lookup
  • GIN for full-text, GiST for spatial, BRIN for sorted data

4. Transactions & Concurrency

  • Isolation: Read Uncommitted → Read Committed → Repeatable Read → Serializable
  • Anomalies: dirty read, non-repeatable read, phantom read, write skew
  • MVCC: multiple row versions, snapshot reads without blocking writes
  • 2PL: growing phase (acquire) + shrinking phase (release)
  • Deadlock: waits-for graph cycle detection or timeout

5. Query Processing

  • Pipeline: Parse → Analyze → Optimize → Execute → Return
  • Cost-based optimizer uses pg_statistic, histograms, selectivity
  • Join algorithms: Nested Loop, Hash Join, Sort-Merge Join
  • EXPLAIN ANALYZE: actual vs estimated rows, sequential vs index scan
  • Always benchmark on production-sized data

6. Storage Engine Internals

  • Pages: 8KB fixed-size blocks, slotted page format
  • Buffer pool: page table, LRU/clock-sweep eviction, dirty tracking
  • WAL: write-ahead log for crash recovery, LSN ordering
  • InnoDB: clustered index, MVCC, row-level locking
  • fsync() forces disk flush, O_DIRECT bypasses OS cache

7. Replication

  • Single-leader: one writer, N followers, sync/async/semi-sync
  • Multi-leader: conflict resolution via LWW, CRDTs, custom rules
  • Leaderless: quorum W+R>N, sloppy quorum, hinted handoff
  • Raft: leader election, log replication, safety guarantee
  • Binlog formats: statement-based, row-based, mixed

8. Partitioning & Sharding

  • Horizontal (rows) vs vertical (columns) partitioning
  • Shard key: high cardinality, even distribution, query-aligned
  • Range (ordered, skew risk) vs hash (uniform, no range)
  • Consistent hashing with virtual nodes for resharding
  • Sharding is a last resort — try replicas, caching, optimization first

9. NoSQL Deep Dive

  • Redis: in-memory data structures, 100K+ ops/sec, RDB/AOF persistence
  • MongoDB: BSON documents, aggregation pipeline, change streams
  • Cassandra: wide-column, gossip protocol, LSM tree, linear scaling
  • Neo4j: property graph, Cypher queries, index-free adjacency
  • Modeling: embed for 1:few, reference for 1:many or many:many

10. Distributed Databases

  • 2PC: prepare/commit, blocking if coordinator fails
  • 3PC: adds pre-commit phase, non-blocking under certain failures
  • Saga: compensating transactions for long-lived workflows
  • Spanner: TrueTime API, GPS + atomic clocks, <7ms uncertainty
  • Vector clocks: detect causality and concurrent events

11. Caching

  • Strategies: cache-aside, read-through, write-through, write-behind, write-around
  • Redis: single-threaded, epoll multiplexing, 16384 hash slots
  • Eviction: allkeys-lru (most common), volatile-lru, allkeys-lfu
  • Invalidation: TTL, CDC/events, versioned keys
  • Failures: thundering herd (mutex), penetration (bloom), avalanche (jitter)

12. Database for System Design

  • Pool sizing: L = λW (Little's Law), or 2×cores + spindles
  • PgBouncer: transaction pooling mode for PostgreSQL
  • Read replicas: read-your-writes, monotonic reads patterns
  • Schema migration: gh-ost (binlog tailing), expand-contract
  • CREATE INDEX CONCURRENTLY to avoid write locks

Top 50 Database Rapid-Fire Q&A

# | Question | Answer
1 | What does ACID stand for? | Atomicity, Consistency, Isolation, Durability
2 | What does BASE stand for? | Basically Available, Soft state, Eventually consistent
3 | State the CAP theorem | A distributed system can guarantee at most 2 of: Consistency, Availability, Partition tolerance
4 | What is PACELC? | If Partition → choose A or C; Else → choose Latency or Consistency
5 | Name the 4 SQL isolation levels | Read Uncommitted, Read Committed, Repeatable Read, Serializable
6 | What is a dirty read? | Reading uncommitted data from another transaction
7 | What is a phantom read? | New rows appear in a repeated range query due to another transaction's insert
8 | What is write skew? | Two transactions read overlapping data, make decisions based on it, then write — violating a constraint neither saw
9 | How does MVCC work? | Each row has multiple versions; readers see a snapshot without blocking writers
10 | What is 2PL? | Two-Phase Locking: growing phase (acquire locks) + shrinking phase (release locks). Guarantees serializability
11 | B-Tree vs B+ Tree? | B+ Tree stores data only in leaves and links leaves for range scans; B-Tree stores data in all nodes
12 | Hash index vs B+ Tree? | Hash: O(1) exact match but no range queries. B+ Tree: O(log n) but supports range, sorting, prefix
13 | What is a covering index? | An index that contains all columns needed by a query — avoids heap/table lookup entirely
14 | Leftmost prefix rule? | Composite index (a, b, c) can satisfy queries on (a), (a,b), (a,b,c) but NOT (b), (c), or (b,c)
15 | What is an LSM Tree? | Log-Structured Merge Tree: writes go to memtable → flush to sorted SSTables → background compaction
16 | LSM vs B+ Tree tradeoffs? | LSM: faster writes, higher write amplification. B+: faster reads, slower writes, in-place updates
17 | What is write amplification? | Ratio of actual bytes written to disk vs logical bytes written. LSM trees suffer from compaction-related amplification
18 | What does EXPLAIN ANALYZE show? | The actual execution plan with real timing, row counts, and I/O stats — not just estimates
19 | Sequential scan vs index scan? | Seq scan reads the entire table. Index scan uses a B+ tree to find specific rows. Index scan wins for selective queries (<5-15% of rows)
20 | When is a sequential scan faster? | When reading most of the table (>15% of rows) — the random I/O cost of index lookups exceeds sequential read cost
21 | Name 3 join algorithms | Nested Loop Join, Hash Join, Sort-Merge Join
22 | When is Hash Join best? | Equi-joins on large tables where one side fits in memory (build hash table on the smaller side)
23 | What is a WAL? | Write-Ahead Log: all changes written to the log before data pages. Enables crash recovery by replaying the log
24 | Page size in PostgreSQL? | 8 KB (fixed). Oracle uses 8 KB default, MySQL/InnoDB uses 16 KB default
25 | What is the buffer pool? | In-memory cache of disk pages. Maps disk pages to memory frames. Uses clock-sweep (Postgres) or LRU (MySQL) for eviction
26 | InnoDB vs MyISAM? | InnoDB: ACID, row locks, MVCC, crash recovery. MyISAM: table locks, no transactions, faster full-text (legacy)
27 | Sync vs async replication? | Sync: leader waits for follower ACK (strong consistency, higher latency). Async: leader doesn't wait (risk of data loss on failover)
28 | What is a quorum? | W + R > N ensures a read sees the latest write. Common: N=3, W=2, R=2
29 | Raft vs Paxos? | Raft: easier to understand, strong leader. Paxos: more general and flexible, but harder to implement correctly
30 | What is a sloppy quorum? | During a network partition, accept writes on available nodes (not necessarily the designated replicas). Hinted handoff transfers data back later
31 | Range vs hash partitioning? | Range: preserves order (good for range queries, risk of hotspots). Hash: uniform distribution (no range queries)
32 | What is consistent hashing? | Hash both keys and nodes onto a ring. Keys map to the next clockwise node. Adding/removing nodes only moves K/N keys
33 | When to shard? | Last resort. First try: read replicas, caching, vertical scaling, query optimization, connection pooling
34 | Redis persistence options? | RDB (point-in-time snapshots) and AOF (append-only log of every write). Use both for best durability
35 | Why is Redis fast? | In-memory, single-threaded (no locks), epoll I/O multiplexing, efficient data structures (skiplists, ziplists)
36 | MongoDB vs PostgreSQL? | MongoDB: flexible schema, horizontal scaling, document model. PostgreSQL: ACID, complex queries, joins, mature ecosystem
37 | Cassandra partition key vs clustering key? | Partition key determines which node stores the data. Clustering key determines sort order within the partition
38 | What is 2PC? | Two-Phase Commit: coordinator asks all participants to PREPARE, then sends COMMIT/ABORT. Blocking if the coordinator fails
39 | Saga pattern? | Break a distributed transaction into local transactions with compensating actions. Choreography (events) or orchestration (coordinator)
40 | Linearizability vs serializability? | Linearizability: real-time ordering of single operations. Serializability: transactions appear to execute sequentially
41 | What is Google Spanner's TrueTime? | API returning a time interval [earliest, latest]. Uses GPS + atomic clocks for <7 ms uncertainty. Enables globally consistent transactions
42 | 5 caching strategies? | Cache-Aside, Read-Through, Write-Through, Write-Behind, Write-Around
43 | Cache-aside vs read-through? | Cache-aside: app manages the cache. Read-through: the cache itself loads from the DB on a miss
44 | What is a thundering herd? | Popular cache key expires → hundreds of requests simultaneously hit the DB. Fix: mutex, request coalescing
45 | What is cache penetration? | Repeated queries for non-existent data bypass the cache and hit the DB. Fix: Bloom filter or cache null results
46 | Connection pool sizing formula? | Little's Law: L = λW. Or HikariCP rule: 2 × cores + disk spindles
47 | PgBouncer pooling modes? | Session (per connection), Transaction (per txn, recommended), Statement (per query)
48 | What is gh-ost? | GitHub's online schema migration tool. Uses binlog tailing instead of triggers. Zero-downtime table alterations
49 | Why CREATE INDEX CONCURRENTLY? | Standard CREATE INDEX locks writes for the entire build. CONCURRENTLY allows concurrent DML at the cost of a 2-3x slower build
50 | TimescaleDB hypertable? | Auto-partitioned table by time. Looks like one table, internally chunked. Enables efficient time-range queries and retention policies

SQL Quick Reference

JOIN Syntax

-- INNER JOIN: rows matching in both tables
SELECT o.id, c.name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id;

-- LEFT JOIN: all rows from left + matching from right (NULL if no match)
SELECT c.name, o.id
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id;

-- FULL OUTER JOIN: all rows from both, NULL where no match
SELECT c.name, o.id
FROM customers c
FULL OUTER JOIN orders o ON c.id = o.customer_id;

-- CROSS JOIN: cartesian product (every combination)
SELECT c.name, p.name
FROM customers c
CROSS JOIN products p;

-- SELF JOIN: table joined to itself (e.g., manager hierarchy)
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;

Window Functions

-- ROW_NUMBER: unique sequential number per partition
SELECT name, department, salary,
       ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
FROM employees;

-- RANK vs DENSE_RANK: RANK skips after ties, DENSE_RANK doesn't
SELECT name, salary,
       RANK() OVER (ORDER BY salary DESC) AS rank,           -- 1, 2, 2, 4
       DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank  -- 1, 2, 2, 3
FROM employees;

-- LEAD/LAG: access next/previous row's value
SELECT date, revenue,
       LAG(revenue, 1) OVER (ORDER BY date) AS prev_revenue,
       revenue - LAG(revenue, 1) OVER (ORDER BY date) AS daily_change
FROM daily_sales;

-- Running total / cumulative sum
SELECT date, amount,
       SUM(amount) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM transactions;

CTEs & Recursive CTEs

-- Common Table Expression (CTE)
WITH top_customers AS (
    SELECT customer_id, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
    HAVING SUM(amount) > 10000
)
SELECT c.name, tc.total_spent
FROM top_customers tc
JOIN customers c ON tc.customer_id = c.id;

-- Recursive CTE: traverse an org hierarchy
WITH RECURSIVE org_tree AS (
    -- Base case: start from the CEO (no manager)
    SELECT id, name, manager_id, 0 AS depth
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive case: find reports of current level
    SELECT e.id, e.name, e.manager_id, ot.depth + 1
    FROM employees e
    JOIN org_tree ot ON e.manager_id = ot.id
)
SELECT REPEAT('  ', depth) || name AS org_chart, depth
FROM org_tree
ORDER BY depth, name;

UPSERT (INSERT ON CONFLICT)

-- PostgreSQL UPSERT
INSERT INTO page_views (url, view_count, last_viewed)
VALUES ('/about', 1, NOW())
ON CONFLICT (url)
DO UPDATE SET
    view_count = page_views.view_count + 1,
    last_viewed = EXCLUDED.last_viewed;

-- MySQL UPSERT
INSERT INTO page_views (url, view_count, last_viewed)
VALUES ('/about', 1, NOW())
ON DUPLICATE KEY UPDATE
    view_count = view_count + 1,
    last_viewed = NOW();

Aggregation with HAVING & Grouping

-- HAVING filters groups (WHERE filters rows before grouping)
SELECT department, COUNT(*) AS emp_count, AVG(salary) AS avg_salary
FROM employees
WHERE status = 'active'
GROUP BY department
HAVING COUNT(*) > 5 AND AVG(salary) > 80000
ORDER BY avg_salary DESC;

-- GROUPING SETS: multiple groupings in one query
SELECT department, region, SUM(salary) AS total
FROM employees
GROUP BY GROUPING SETS (
    (department, region),  -- by dept + region
    (department),          -- subtotal by dept
    ()                     -- grand total
);

Database Selection Decision Guide

Use this table during system design interviews to quickly justify your database choice:

Requirement | Recommended Database | Why
ACID transactions with complex queries | PostgreSQL | Full SQL, MVCC, strong consistency, rich extensions (PostGIS, TimescaleDB, pgvector)
ACID transactions with easy ops | MySQL / Aurora | Battle-tested, excellent replication, Aurora for managed auto-scaling
High write throughput, distributed | Cassandra | LSM tree (sequential writes), linear horizontal scaling, tunable consistency
Flexible schema with rich queries | MongoDB | Document model, aggregation pipeline, horizontal scaling via sharding
Caching / real-time leaderboards | Redis | In-memory, sub-ms latency, sorted sets, pub/sub, 100K+ ops/sec
Session store / rate limiting | Redis | Key expiry (TTL), atomic counters, distributed locks (Redlock)
Graph traversals (social, fraud) | Neo4j | Index-free adjacency, Cypher query language, efficient multi-hop traversals
Time-series data (metrics, IoT) | TimescaleDB / InfluxDB | Time-based partitioning, compression, continuous aggregates, retention policies
Full-text search | Elasticsearch | Inverted index, relevance scoring, fuzzy matching, faceted search
Global distribution with strong consistency | CockroachDB / Spanner | Distributed SQL, serializable isolation, automatic sharding, multi-region
Message queue / event streaming | Kafka (not a DB, but stores) | Durable log, exactly-once semantics, consumer groups, compacted topics
OLAP / analytics on large datasets | ClickHouse / BigQuery | Columnar storage, vectorized execution, massive parallel processing
Embedded database (mobile/edge) | SQLite | Serverless, zero-config, single-file, full SQL, used by every smartphone
Key-value with strong consistency | etcd / ZooKeeper | Consensus-based replication (Raft in etcd, ZAB in ZooKeeper), linearizable reads/writes, service discovery, config storage
Multi-model (document + graph + search) | ArangoDB / FaunaDB | Single engine for multiple data models, reduces operational complexity
💡 Interview tip: In system design, always state your database choice with a reason. Don't just say "I'll use PostgreSQL." Say: "I'll use PostgreSQL because we need ACID transactions for payment processing, and the read:write ratio is 10:1 which suits its MVCC model. For the recommendation feed, I'll use Redis as a cache with a 5-minute TTL to reduce query load." Showing you understand the tradeoff is more important than the specific choice.