Master Notebook:
Database Systems

The definitive interview preparation guide covering every database concept tested at FAANG & Big Tech companies — from storage engines to distributed transactions at scale.

13 Modules • 26 Diagrams • 50+ Interview Q&A • Complete Cheat Sheet

Module 1

Database Fundamentals

Core properties, theorems, and classification — the foundation every database discussion builds upon.

ACID Properties

ACID is the gold standard for transactional databases. Mainstream relational databases guarantee these four properties for every transaction.

Atomicity

A transaction is an indivisible unit — it either completes entirely or has no effect at all. If any part of a transaction fails, the entire transaction is rolled back. The database uses the Write-Ahead Log (WAL) to achieve this: before modifying any data page, the change is first recorded in the WAL. On crash recovery, the WAL is replayed to restore committed changes, while incomplete transactions are rolled back so none of their effects survive.

Consistency

A transaction moves the database from one valid state to another. All constraints (PRIMARY KEY, FOREIGN KEY, UNIQUE, CHECK, NOT NULL) must be satisfied before and after the transaction. This is enforced by the database engine at commit time.

Isolation

Concurrent transactions execute as if they were serial. In practice, databases offer multiple isolation levels (from Read Uncommitted to Serializable) trading off consistency for performance. Isolation is implemented through locks (pessimistic) or MVCC (optimistic).

Durability

Once a transaction commits, its changes persist even through power failures or crashes. Achieved by flushing the WAL to disk before acknowledging the commit. The data pages can be written lazily via background checkpointing.

-- Example: ACID transaction in PostgreSQL
BEGIN;
  UPDATE accounts SET balance = balance - 500 WHERE id = 1;
  UPDATE accounts SET balance = balance + 500 WHERE id = 2;
  -- If either UPDATE fails, both are rolled back (Atomicity)
  -- Constraints checked at commit (Consistency)
  -- Other transactions see old OR new balances, never partial (Isolation)
  -- Once committed, changes survive crashes (Durability)
COMMIT;

BASE Properties

BASE is the alternative consistency model favored by many distributed NoSQL databases. It trades strong consistency for availability and partition tolerance.

  • Basically Available — The system guarantees availability as defined by the CAP theorem. Every request receives a response (not an error), though the response may contain stale data.
  • Soft state — The state of the system may change over time, even without new input, due to eventual consistency mechanisms like anti-entropy and read repair propagating updates.
  • Eventually consistent — Given enough time without new updates, all replicas will converge to the same value. The convergence window depends on the system — it could be milliseconds or seconds.

ACID vs BASE Comparison

Property          | ACID                             | BASE
Consistency model | Strong consistency               | Eventual consistency
Availability      | May sacrifice during partitions  | Prioritizes availability
Focus             | Correctness                      | Performance & availability
Scaling approach  | Vertical (scale up)              | Horizontal (scale out)
Transactions      | Full ACID transactions           | Relaxed, eventual
Use cases         | Banking, inventory, booking      | Social feeds, analytics, IoT
Examples          | PostgreSQL, MySQL, Oracle        | Cassandra, DynamoDB, Riak

CAP Theorem

Proposed by Eric Brewer in 2000 and proven by Gilbert and Lynch in 2002, the CAP theorem states that a distributed data store can provide at most two of three guarantees simultaneously:

  • Consistency (C) — Every read receives the most recent write or an error. All nodes see the same data at the same time.
  • Availability (A) — Every request receives a non-error response, without the guarantee that it contains the most recent write.
  • Partition Tolerance (P) — The system continues to operate despite network partitions (messages being dropped or delayed between nodes).

In practice, network partitions are inevitable in distributed systems, so the real choice is between CP (consistency + partition tolerance) and AP (availability + partition tolerance). A "CA" system only exists in a non-distributed context (single node).

Figure 1.1 — CAP Theorem triangle with database classification: CP (MongoDB, HBase), AP (Cassandra, DynamoDB), CA* (PostgreSQL). *CA is only meaningful for single-node, non-distributed systems.
Gotcha: CAP describes behavior DURING a partition, not during normal operation. During normal operation (no partitions), a system can provide both consistency and availability. CAP only forces a choice when a network partition actually occurs. MongoDB, for instance, is fully consistent and available when there is no partition — but during a partition, it chooses consistency (CP) by making the minority partition unavailable.

PACELC Extension

Daniel Abadi proposed PACELC (2012) to address a limitation of CAP: it says nothing about system behavior when there is no partition. PACELC adds the tradeoff during normal operation:

If Partition (P), choose Availability (A) or Consistency (C); Else (E), choose Latency (L) or Consistency (C).

Database   | During Partition (PAC) | Else (ELC)       | Classification
PostgreSQL | PC (consistency)       | EC (consistency) | PC/EC
Cassandra  | PA (availability)      | EL (latency)     | PA/EL
MongoDB    | PC (consistency)       | EC (consistency) | PC/EC
DynamoDB   | PA (availability)      | EL (latency)     | PA/EL
Cosmos DB  | PA (availability)      | EL (latency)     | PA/EL (tunable)
Spanner    | PC (consistency)       | EC (consistency) | PC/EC

Database Types

Relational (SQL)

Stores data in tables with rows and columns, enforces schemas, supports JOIN operations, and provides ACID transactions. Best for structured data with complex relationships. Examples: PostgreSQL, MySQL, Oracle, SQL Server.

Document

Stores semi-structured data as JSON/BSON documents. Each document can have a different structure, enabling schema flexibility. Best for content management, user profiles, catalogs. Examples: MongoDB, CouchDB, Firestore.

Key-Value

Simplest NoSQL model — stores data as key-value pairs. Extremely fast for point lookups. Best for caching, session management, feature flags. Examples: Redis, Memcached, DynamoDB, etcd.

Wide-Column (Column-Family)

Stores data in column families with flexible columns per row. Optimized for write-heavy workloads and time-series data. Best for IoT, logging, analytics. Examples: Cassandra, HBase, ScyllaDB.

Graph

Stores data as nodes and relationships (edges) with properties on both. Uses index-free adjacency for O(1) traversal. Best for social networks, fraud detection, recommendations. Examples: Neo4j, Amazon Neptune, TigerGraph.

Time-Series

Optimized for time-stamped data with high ingestion rates, automatic compression, and downsampling. Best for monitoring, metrics, IoT sensor data. Examples: TimescaleDB, InfluxDB, Prometheus.

Vector

Stores high-dimensional vectors for similarity search using algorithms like HNSW or IVFFlat. Best for AI/ML embeddings, semantic search, recommendation engines. Examples: Pinecone, Milvus, pgvector, Weaviate.

Figure 1.2 — Data model comparison: same data represented across different database types
DBMS Execution Stack — parser → planner → executor → storage
Module 2

Relational Model & SQL

Normalization, joins, subqueries, window functions, and query optimization — the bread and butter of database interviews.

Normalization

Normalization is the process of organizing a relational database to reduce data redundancy and improve data integrity. Each normal form builds on the previous one.

First Normal Form (1NF)

Each column contains only atomic (indivisible) values. No repeating groups or arrays.

-- Violation of 1NF: multi-valued column
-- | id | name  | phones              |
-- | 1  | Alice | 555-1234, 555-5678  |

-- Fixed (1NF compliant):
CREATE TABLE contacts (
  id    INT,
  name  VARCHAR(100),
  phone VARCHAR(20)
);
-- | id | name  | phone    |
-- | 1  | Alice | 555-1234 |
-- | 1  | Alice | 555-5678 |

Second Normal Form (2NF)

Must be in 1NF, and every non-key attribute must depend on the entire primary key (no partial dependencies). Only relevant for composite primary keys.

-- Violation of 2NF: student_name depends only on student_id, not the full PK
-- PK: (student_id, course_id)
-- | student_id | course_id | student_name | grade |
-- student_name depends only on student_id (partial dependency)

-- Fixed: separate the partial dependency
CREATE TABLE students (student_id INT PRIMARY KEY, student_name VARCHAR(100));
CREATE TABLE enrollments (student_id INT, course_id INT, grade CHAR(2),
  PRIMARY KEY (student_id, course_id));

Third Normal Form (3NF)

Must be in 2NF, and no non-key attribute depends on another non-key attribute (no transitive dependencies).

-- Violation of 3NF: city depends on zip_code, not directly on id
-- | id | name  | zip_code | city       |
-- city depends on zip_code (transitive dependency)

-- Fixed: separate the transitive dependency
CREATE TABLE users (id INT PRIMARY KEY, name VARCHAR(100), zip_code VARCHAR(10));
CREATE TABLE zip_codes (zip_code VARCHAR(10) PRIMARY KEY, city VARCHAR(100));

Boyce-Codd Normal Form (BCNF / 3.5NF)

A stricter version of 3NF. Every determinant must be a candidate key. Handles edge cases where 3NF still permits anomalies — typically tables with multiple overlapping candidate keys, where an attribute depends on a determinant that is not itself a candidate key.

Fourth Normal Form (4NF)

Must be in BCNF with no multi-valued dependencies. If an entity has two or more independent multi-valued facts, they should be in separate tables.

Fifth Normal Form (5NF)

Must be in 4NF with no join dependencies. A table is in 5NF if it cannot be decomposed into smaller tables without loss of data. Rarely needed in practice.

When to Denormalize

  • Read-heavy workloads — Joins are expensive at scale; denormalized data avoids them.
  • Reporting / analytics — Star/snowflake schemas deliberately denormalize for query performance.
  • Caching layers — Pre-computed, denormalized views reduce latency.
  • Microservices — Each service owns its data; cross-service joins are impossible, so denormalization is common.
💡 Interview tip: The most common normalization interview question tests 1NF through 3NF. Know concrete examples for each. For system design, be ready to argue when to denormalize — the answer is usually "when read performance matters more than write consistency."

Joins

SQL joins combine rows from two or more tables based on a related column. Understanding each join type and its performance implications is critical.

INNER JOIN

Returns only rows that have matching values in both tables.

SELECT e.name, d.dept_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.id;

LEFT (OUTER) JOIN

Returns all rows from the left table, with matching rows from the right table. Unmatched right-side columns are NULL.

SELECT e.name, d.dept_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.id;
-- Employees without departments will show dept_name as NULL

RIGHT (OUTER) JOIN

Returns all rows from the right table, with matching rows from the left table. Equivalent to swapping tables in a LEFT JOIN.

FULL OUTER JOIN

Returns all rows from both tables. Where there is no match, NULLs fill in the missing side.

CROSS JOIN

Returns the Cartesian product — every row from the left table paired with every row from the right table. If left has M rows and right has N rows, result has M × N rows.

SELF JOIN

Joins a table to itself. Commonly used for hierarchical data (employee → manager) or comparing rows within the same table.

SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;
INNER LEFT RIGHT FULL OUTER A B A B A B A B
Figure 2.1 — Venn diagram representation of SQL JOIN types
Gotcha: LEFT JOIN with WHERE on right table = INNER JOIN. If you filter on a column from the right table in the WHERE clause (e.g., WHERE d.location = 'NYC'), NULL rows from unmatched left rows are eliminated, effectively converting your LEFT JOIN into an INNER JOIN. Put the condition in the ON clause instead to preserve the LEFT JOIN behavior.

Subqueries & CTEs

Correlated vs Non-Correlated Subqueries

A non-correlated subquery executes once and its result is used by the outer query. A correlated subquery references columns from the outer query and executes once per outer row — much slower for large datasets.

-- Non-correlated: runs once
SELECT name FROM employees
WHERE dept_id IN (SELECT id FROM departments WHERE location = 'NYC');

-- Correlated: runs once per outer row
SELECT e.name, e.salary
FROM employees e
WHERE e.salary > (
  SELECT AVG(salary) FROM employees WHERE dept_id = e.dept_id
);

EXISTS vs IN

EXISTS short-circuits as soon as a match is found — faster for large subquery results. IN materializes the entire subquery result set first — faster when the subquery returns a small result set.

-- EXISTS: faster when subquery table is large
SELECT name FROM employees e
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.employee_id = e.id);

-- IN: faster when subquery returns few values
SELECT name FROM employees
WHERE dept_id IN (1, 2, 3);

Common Table Expressions (CTEs)

CTEs improve readability and enable recursive queries. They are defined with WITH and can reference themselves for hierarchical data.

-- Basic CTE
WITH dept_stats AS (
  SELECT dept_id, AVG(salary) AS avg_salary, COUNT(*) AS cnt
  FROM employees
  GROUP BY dept_id
)
SELECT d.name, ds.avg_salary, ds.cnt
FROM departments d
JOIN dept_stats ds ON d.id = ds.dept_id;

-- Recursive CTE: org chart traversal
WITH RECURSIVE org_tree AS (
  -- Base case: CEO (no manager)
  SELECT id, name, manager_id, 0 AS depth
  FROM employees WHERE manager_id IS NULL
  UNION ALL
  -- Recursive case: employees with managers
  SELECT e.id, e.name, e.manager_id, ot.depth + 1
  FROM employees e
  JOIN org_tree ot ON e.manager_id = ot.id
)
SELECT * FROM org_tree ORDER BY depth;

Window Functions

Window functions perform calculations across a set of rows related to the current row, without collapsing them (unlike GROUP BY). They are essential for ranking, running totals, and comparative analytics.

-- ROW_NUMBER, RANK, DENSE_RANK comparison
SELECT name, dept_id, salary,
  ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS row_num,
  RANK()       OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rank,
  DENSE_RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS dense_rank
FROM employees;
-- ROW_NUMBER: 1,2,3,4 (always unique)
-- RANK:       1,2,2,4 (gaps after ties)
-- DENSE_RANK: 1,2,2,3 (no gaps after ties)
-- LEAD / LAG: access adjacent rows
SELECT date, revenue,
  LAG(revenue, 1) OVER (ORDER BY date) AS prev_day_revenue,
  revenue - LAG(revenue, 1) OVER (ORDER BY date) AS daily_change
FROM daily_sales;

-- Running total
SELECT date, amount,
  SUM(amount) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM transactions;
-- Top N per group (classic interview question)
WITH ranked AS (
  SELECT name, dept_id, salary,
    ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rn
  FROM employees
)
SELECT * FROM ranked WHERE rn <= 3;

Query Optimization Patterns

The N+1 Query Problem

Occurs when an application executes 1 query to fetch N parent records, then N additional queries to fetch related data for each parent. This results in N+1 total queries instead of 1 or 2.

-- N+1 problem (pseudocode):
-- Query 1: SELECT * FROM orders        → returns 100 orders
-- Query 2-101: SELECT * FROM items WHERE order_id = ?  (×100)
-- Total: 101 queries!

-- Solution: JOIN or batch fetch
SELECT o.*, i.*
FROM orders o
LEFT JOIN order_items i ON o.id = i.order_id;
-- Total: 1 query

-- Or batch with IN:
SELECT * FROM order_items WHERE order_id IN (1,2,3,...,100);
-- Total: 2 queries

Other Optimization Patterns

  • SELECT only needed columns — Avoid SELECT *; it wastes I/O and prevents covering index optimization.
  • Use EXISTS instead of COUNT — WHERE EXISTS (...) stops after the first match; WHERE (SELECT COUNT(...)) > 0 scans all matching rows.
  • Avoid functions on indexed columns — WHERE YEAR(created_at) = 2024 prevents index use; use a range instead: WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01'.
  • Use UNION ALL instead of UNION — UNION removes duplicates (requires a sort); UNION ALL does not.
  • Paginate with keyset, not OFFSET — WHERE id > last_id LIMIT 20 is O(1) vs. OFFSET 10000 LIMIT 20, which is O(n) because the skipped rows must still be scanned and discarded.
Hash Join vs Nested-Loop — same inputs, different access patterns
Module 3

Indexing

B-Trees, B+ Trees, LSM Trees, hash indexes, and composite index strategy — the key to database performance.

B-Tree & B+ Tree

B-Tree

A self-balancing tree data structure that maintains sorted data and allows searches, insertions, and deletions in O(log n) time. Each node can have multiple children (high fan-out), which minimizes tree height and thus disk I/O.

  • Balanced — All leaf nodes are at the same depth, guaranteeing worst-case O(log n) lookups.
  • High fan-out — Each node holds many keys (typically filling one disk page), so tree height is small (3-4 levels for billions of rows).
  • Data in all nodes — Both internal and leaf nodes store actual data pointers.
Figure 3.1 — B-Tree structure showing node splits during insertion

B+ Tree

The B+ Tree is a variant where all data is stored only in leaf nodes, and leaf nodes are linked together in a doubly-linked list. This makes range queries extremely efficient — once you find the starting leaf, you follow pointers to traverse consecutive values.

  • Internal nodes store only keys — More keys per internal node = higher fan-out = shorter tree.
  • Leaf linked list — Enables efficient range scans and ORDER BY without re-traversing the tree.
  • All data at leaf level — Every search requires the same number of disk reads (tree height), making performance predictable.
Figure 3.2 — B+ Tree showing range query traversal via linked leaf nodes
Feature                | B-Tree                           | B+ Tree
Data location          | Internal + leaf nodes            | Leaf nodes only
Internal node capacity | Lower (stores data)              | Higher (keys only)
Range queries          | Requires tree traversal          | Leaf linked list traversal
Point lookups          | Can be faster (data in internal) | Always same depth
Used in                | General purpose                  | Most RDBMS (InnoDB, PostgreSQL)

Hash Index

Hash indexes use a hash function to map keys directly to bucket locations, providing O(1) average-case point lookups. However, they cannot support range queries, ordering, or partial key lookups.

  • Point lookups — Fastest possible: hash the key, go directly to the bucket.
  • No range queries — WHERE price BETWEEN 10 AND 50 cannot use a hash index.
  • No ordering — ORDER BY cannot use a hash index.
  • No partial matching — WHERE name LIKE 'Al%' cannot use a hash index.
-- PostgreSQL hash index (rarely used in practice, B-tree is usually better)
CREATE INDEX idx_users_email ON users USING hash (email);

-- Good for: exact equality lookups
SELECT * FROM users WHERE email = 'alice@example.com';

-- Cannot use hash index for:
SELECT * FROM users WHERE email > 'a';  -- range
SELECT * FROM users ORDER BY email;     -- ordering

LSM Tree (Log-Structured Merge Tree)

LSM Trees are write-optimized data structures used by write-heavy databases like Cassandra, RocksDB, LevelDB, and HBase. They convert random writes into sequential writes.

How LSM Trees Work

  1. Write to memtable — Incoming writes go to an in-memory sorted data structure (red-black tree or skip list). This is extremely fast.
  2. Flush to SSTable — When the memtable reaches a threshold, it is flushed to disk as an immutable, sorted SSTable (Sorted String Table).
  3. Compaction — Background process merges multiple SSTables, removing duplicates and tombstones, producing larger sorted files.
  4. Read path — Check memtable first, then each SSTable level (using Bloom filters to skip SSTables that definitely do not contain the key).
Figure 3.3 — LSM Tree architecture: write path (memtable → SSTable) and compaction flow
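
To make the write and read paths concrete, here is a toy Python sketch (illustrative only, not any real engine's code): a dict stands in for the memtable, plain sorted lists stand in for on-disk SSTables, and compact() plays the role of compaction. Real engines add a WAL, Bloom filters, and leveled compaction.

import bisect

class MiniLSM:
    """Toy LSM tree: in-memory memtable flushed to immutable sorted
    runs ("SSTables") once it exceeds a threshold."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}            # in-memory write buffer
        self.sstables = []            # newest-first list of sorted runs
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value    # step 1: fast in-memory write
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Step 2: flush the memtable as an immutable sorted run
        run = sorted(self.memtable.items())
        self.sstables.insert(0, run)  # newest run is searched first
        self.memtable = {}

    def get(self, key):
        # Step 4 (read path): memtable first, then runs newest to oldest
        if key in self.memtable:
            return self.memtable[key]
        for run in self.sstables:
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # Step 3: merge all runs; newer values win over older duplicates
        merged = {}
        for run in reversed(self.sstables):   # apply oldest first
            merged.update(dict(run))
        self.sstables = [sorted(merged.items())]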
Feature             | B+ Tree                   | LSM Tree
Write performance   | Moderate (random I/O)     | Excellent (sequential I/O)
Read performance    | Excellent (single tree)   | Good (may check multiple levels)
Space amplification | Low                       | Higher (multiple copies during compaction)
Write amplification | Low-moderate              | Higher (compaction rewrites data)
Range queries       | Excellent (leaf list)     | Good (after compaction)
Used in             | PostgreSQL, MySQL/InnoDB  | RocksDB, Cassandra, LevelDB

Specialized Indexes

Inverted Index (Full-Text Search)

Maps terms to the documents/rows containing them. Used by Elasticsearch, PostgreSQL tsvector, and MySQL FULLTEXT. Essential for full-text search.

GIN (Generalized Inverted Index)

PostgreSQL index type for composite values — arrays, JSONB, full-text search (tsvector). Supports operators like @> (contains), && (overlap).

GiST (Generalized Search Tree)

PostgreSQL index for geometric, range, and nearest-neighbor queries. Supports PostGIS spatial queries and range types.

Covering Index (INCLUDE columns)

An index that includes all columns needed by a query, avoiding table lookups entirely. PostgreSQL supports INCLUDE to add non-key columns.

-- Covering index: query answered entirely from index
CREATE INDEX idx_orders_covering
ON orders (customer_id, order_date)
INCLUDE (total_amount, status);

-- This query uses only the index (index-only scan):
SELECT order_date, total_amount, status
FROM orders
WHERE customer_id = 42;

Partial Index

An index on a subset of rows, defined with a WHERE clause. Smaller, faster, and ideal for skewed data.

-- Index only active users (5% of total) — much smaller than full index
CREATE INDEX idx_active_users ON users (email)
WHERE is_active = true;

Composite Index & Index Strategy

Leftmost Prefix Rule

A composite index on (A, B, C) can serve queries that filter on:

  • A alone
  • A and B
  • A, B, and C

But it cannot efficiently serve queries on B alone, C alone, or B, C. The leftmost column must always be present.

Index Selection Strategy

  • Selectivity — Index columns that filter out the most rows. A column with 10 distinct values in 10M rows has low selectivity; one with 9M distinct values has high selectivity.
  • Cardinality — Number of distinct values. Higher cardinality = better index candidate.
  • Query patterns — Index columns that appear in WHERE, JOIN ON, ORDER BY, and GROUP BY clauses.
  • Write overhead — Each index slows writes (inserts, updates, deletes). Only create indexes that are actually used.
Gotcha: Composite index (A,B,C) can serve queries on A; A,B; A,B,C — but NOT B alone. The index is sorted by A first, then by B within each A value, then by C within each (A,B) pair. Querying on B alone would require scanning the entire index (a full index scan, little better than a full table scan). If you frequently query on B alone, you need a separate index on B.
B+ Tree Range Query — page walk for WHERE age BETWEEN 24 AND 32
Clustered vs Non-Clustered Index Layout
LSM-Tree Compaction — level merges over time
Module 4

Transactions & Concurrency

Isolation levels, MVCC, two-phase locking, and deadlock detection — how databases handle concurrent access safely.

Isolation Levels

SQL defines four standard isolation levels, each allowing progressively fewer concurrency anomalies at the cost of reduced throughput.

Isolation Level  | Dirty Read | Non-Repeatable Read | Phantom Read | Performance
Read Uncommitted | Possible   | Possible            | Possible     | Fastest
Read Committed   | Prevented  | Possible            | Possible     | Fast (PostgreSQL default)
Repeatable Read  | Prevented  | Prevented           | Possible     | Moderate (MySQL default)
Serializable     | Prevented  | Prevented           | Prevented    | Slowest
-- Setting isolation level in PostgreSQL
BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
  SELECT balance FROM accounts WHERE id = 1;  -- reads 1000
  -- Another transaction updates balance to 500 and commits
  SELECT balance FROM accounts WHERE id = 1;  -- still reads 1000 (snapshot)
COMMIT;

Read Anomalies

Dirty Read

Transaction T2 reads data written by T1 before T1 commits. If T1 rolls back, T2 has read data that never existed.

Non-Repeatable Read

Transaction T1 reads a row, T2 modifies and commits that row, then T1 reads the same row again and gets a different value.

Phantom Read

Transaction T1 reads a set of rows matching a condition, T2 inserts a new row matching that condition and commits, then T1 re-reads and finds the new "phantom" row.

Figure 4.1 — Timeline diagrams showing dirty read, non-repeatable read, and phantom read anomalies

Multi-Version Concurrency Control (MVCC)

MVCC is the dominant concurrency control mechanism in modern databases. Instead of locking rows, the database keeps multiple versions of each row and serves each transaction the version that was current at the transaction's start time (snapshot).

PostgreSQL MVCC

Each tuple (row version) has two hidden columns: xmin (transaction ID that created it) and xmax (transaction ID that deleted/updated it, or 0 if still alive). When a row is updated, PostgreSQL creates a new tuple with a new xmin and marks the old tuple with the updater's xmax.

MySQL/InnoDB MVCC

InnoDB uses an undo log chain. Each row has a hidden DB_TRX_ID (last modifying transaction) and DB_ROLL_PTR (pointer to the previous version in the undo log). To read an old version, InnoDB walks the undo log chain backwards until it finds a version visible to the reading transaction.

Figure 4.2 — MVCC version chain: each transaction sees a consistent snapshot
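
The visibility rule can be sketched in a few lines of Python. This is a deliberately simplified model (real PostgreSQL consults the commit log and records in-progress transactions in each snapshot), but it shows how xmin/xmax select exactly one version per snapshot; the version chain mirrors Figure 4.2.

def visible(xmin, xmax, snapshot_xid, committed):
    """A tuple version is visible if it was created by a transaction
    that committed before our snapshot, and was not deleted by one
    that committed before our snapshot. `committed` is the set of
    transaction IDs already committed when the snapshot was taken."""
    if xmin not in committed or xmin >= snapshot_xid:
        return False                     # creator not yet visible
    if xmax == 0:
        return True                      # never deleted or updated
    return not (xmax in committed and xmax < snapshot_xid)

# Version chain from Figure 4.2: (xmin, xmax, balance)
versions = [(103, 0, 1500), (101, 103, 1000), (99, 101, 500)]

# Txn 104, snapshot taken after 103 committed: sees balance 1500
print([b for x, m, b in versions if visible(x, m, 104, {99, 101, 103})])
# Txn 102, snapshot taken before 103 committed: sees balance 1000
print([b for x, m, b in versions if visible(x, m, 102, {99, 101})])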
Feature                | Optimistic (MVCC)                            | Pessimistic (Locking)
Approach               | Allow concurrent access, validate at commit  | Lock resources before access
Readers block writers? | No                                           | Yes (shared locks block exclusive)
Writers block readers? | No                                           | Yes (exclusive locks block shared)
Conflicts detected     | At commit time                               | At access time
Best when              | Low contention, read-heavy                   | High contention, write-heavy
Overhead               | Storage for old versions                     | Lock management, potential deadlocks
Used by                | PostgreSQL, MySQL/InnoDB, Oracle             | SQL Server (default), DB2

Locking & Two-Phase Locking (2PL)

Two-Phase Locking is a concurrency control protocol that guarantees serializability. Every transaction follows two phases:

  1. Growing phase — Transaction acquires locks but never releases any.
  2. Shrinking phase — Transaction releases locks but never acquires any new ones.

Lock Types

  • Shared (S) lock — For reading. Multiple transactions can hold shared locks on the same resource simultaneously.
  • Exclusive (X) lock — For writing. Only one transaction can hold an exclusive lock. Blocks both shared and exclusive lock requests.
  • Intent locks — Table-level locks that signal the intent to lock rows. IS (Intent Shared), IX (Intent Exclusive). Used to avoid checking every row for conflicting locks.
-- Explicit locking in PostgreSQL
SELECT * FROM accounts WHERE id = 1 FOR UPDATE;      -- Exclusive row lock
SELECT * FROM accounts WHERE id = 1 FOR SHARE;       -- Shared row lock
SELECT * FROM accounts WHERE id = 1 FOR UPDATE SKIP LOCKED; -- Skip already-locked rows
SELECT * FROM accounts WHERE id = 1 FOR UPDATE NOWAIT;    -- Error immediately if locked

Deadlock Detection & Prevention

A deadlock occurs when two or more transactions are waiting for each other to release locks, forming a cycle. No transaction can proceed.

Figure 4.3 — Wait-for graph showing a deadlock cycle: T1 holds Row A and waits for Row B, while T2 holds Row B and waits for Row A. Prevention strategies: timeout, Wait-Die, Wound-Wait, lock ordering.

Detection & Prevention

  • Wait-for graph — Database periodically builds a graph of transaction waits. If a cycle is detected, one transaction (the victim) is aborted and rolled back. A minimal cycle check is sketched after this list.
  • Timeout — If a lock wait exceeds a timeout, the waiting transaction is aborted. Simple but can cause unnecessary aborts.
  • Wait-Die — Older transactions wait for younger ones; younger transactions are aborted if they request a lock held by an older one.
  • Wound-Wait — Older transactions "wound" (abort) younger ones holding locks they need; younger transactions wait for older ones.
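
A minimal sketch of cycle detection in a wait-for graph, assuming each transaction waits on at most one other (real lock managers handle arbitrary graphs and pick the victim by rollback cost):

def find_deadlock(wait_for):
    """`wait_for` maps a transaction to the transaction it waits on.
    Walk the chain from each start node; revisiting a node means a
    cycle. Returns the cycle as a list, or None."""
    for start in wait_for:
        seen, txn = [], start
        while txn in wait_for:
            if txn in seen:
                return seen[seen.index(txn):]   # cycle found
            seen.append(txn)
            txn = wait_for[txn]
    return None

# T1 holds Row A and waits for B; T2 holds Row B and waits for A
print(find_deadlock({"T1": "T2", "T2": "T1"}))  # ['T1', 'T2'] -> abort one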
Gotcha: MySQL Repeatable Read can still allow phantom reads in certain edge cases. MySQL's InnoDB uses "gap locks" and "next-key locks" to prevent phantoms in most cases under Repeatable Read, but gaps in the locking mechanism mean certain complex queries can still experience phantom-like behavior. For guaranteed phantom prevention, use the Serializable isolation level.
Isolation Anomalies Matrix
2PL vs MVCC — two-transaction timeline on row X
Deadlock Wait-For Graph — cycle detection & victim
Module 5

Query Processing

From SQL text to execution plan — parsing, optimization, join algorithms, and EXPLAIN output interpretation.

Query Pipeline

Every SQL query goes through a multi-stage pipeline before any data is returned. Understanding this pipeline is crucial for debugging performance issues.

Figure 5.1 — Query processing pipeline: Parse → Analyze → Optimize → Execute → Return
  1. Parse — The SQL string is tokenized and parsed into an Abstract Syntax Tree (AST). Syntax errors are caught here.
  2. Analyze — Table and column names are resolved against the catalog. Types are checked. Permissions are verified.
  3. Optimize — The query optimizer generates multiple execution plans and estimates the cost of each using table statistics. Selects the cheapest plan.
  4. Execute — The executor runs the chosen plan, pulling data from storage through the buffer pool, applying filters, joins, sorts, and aggregations.

Cost-Based Optimizer

The cost-based optimizer estimates the I/O and CPU cost of different execution plans and chooses the cheapest. It relies on table statistics maintained by ANALYZE.

  • Row count estimates — How many rows will each filter reduce to?
  • Histograms — Distribution of values in each column, allowing accurate selectivity estimation for range predicates.
  • Most common values (MCV) — The N most frequent values and their frequencies, used for equality predicates.
  • Distinct value count (n_distinct) — Estimated number of unique values per column.
  • Correlation — How well the physical order of rows correlates with the logical (index) order. High correlation makes index scans more efficient.
-- Viewing PostgreSQL table statistics
SELECT
  attname,
  n_distinct,
  most_common_vals,
  most_common_freqs,
  correlation
FROM pg_stats
WHERE tablename = 'orders';

Join Algorithms

The database engine has multiple physical algorithms for executing joins. The optimizer chooses based on table sizes, available indexes, and memory.

Nested Loop Join

For each row in the outer table, scan the inner table for matches. O(M × N) without an index on the inner table. Best when one table is very small or an index exists on the join column of the inner table.

Hash Join

Build a hash table from the smaller table (build phase), then probe it with each row from the larger table (probe phase). O(M + N). Best for large, unsorted tables without indexes. Requires memory for the hash table.

Sort-Merge Join

Sort both tables on the join key, then merge them in a single pass. O(M log M + N log N) for sorting, O(M + N) for merging. Best when both tables are already sorted (or an index provides the sort order) and for range joins.

Figure 5.2 — Three join algorithms: Nested Loop, Hash Join, and Sort-Merge Join
Algorithm   | Complexity           | Memory                  | Best For                          | Requires
Nested Loop | O(M × N)             | Low                     | Small outer table, indexed inner  | Nothing (most general)
Hash Join   | O(M + N)             | High (hash table)       | Large unsorted tables, equi-joins | Equality condition
Sort-Merge  | O(M log M + N log N) | Moderate (sort buffers) | Pre-sorted data, range joins      | Sortable join key
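
The build/probe structure of a hash join fits in a few lines of Python. A sketch with made-up rows (dicts standing in for tuples):

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join in O(M + N): build a hash table on the smaller
    input, then probe it with every row of the larger input."""
    table = {}
    for row in build_rows:                         # build phase
        table.setdefault(row[build_key], []).append(row)
    for row in probe_rows:                         # probe phase
        for match in table.get(row[probe_key], []):
            yield {**match, **row}                 # emit joined row

customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}]
orders = [{"order_id": 10, "customer_id": 1},
          {"order_id": 11, "customer_id": 1}]
print(list(hash_join(customers, orders, "id", "customer_id")))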

EXPLAIN & ANALYZE

EXPLAIN shows the query plan the optimizer chose. EXPLAIN ANALYZE actually executes the query and shows real timings.

-- PostgreSQL EXPLAIN ANALYZE output
EXPLAIN ANALYZE
SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.total > 100;

-- Output:
-- Hash Join  (cost=12.50..45.30 rows=150 width=36)
--           (actual time=0.250..0.890 rows=143 loops=1)
--   Hash Cond: (o.customer_id = c.id)
--   -> Seq Scan on orders o  (cost=0.00..28.50 rows=150 width=12)
--              (actual time=0.010..0.450 rows=143 loops=1)
--        Filter: (total > 100)
--        Rows Removed by Filter: 857
--   -> Hash  (cost=10.00..10.00 rows=200 width=28)
--              (actual time=0.180..0.180 rows=200 loops=1)
--        Buckets: 256  Batches: 1  Memory Usage: 15kB
--        -> Seq Scan on customers c  (cost=0.00..10.00 rows=200 width=28)
--              (actual time=0.005..0.090 rows=200 loops=1)
-- Planning Time: 0.150 ms
-- Execution Time: 1.050 ms

Key Scan Types

  • Seq Scan — Full table scan. Reads every row. Used when no suitable index exists or when a large percentage of rows is needed.
  • Index Scan — Uses B+ Tree index to find matching rows, then fetches row data from heap. Used for selective queries.
  • Index Only Scan — Answers the query entirely from the index without touching the heap table. Fastest for covered queries.
  • Bitmap Index Scan — Scans the index to build a bitmap of matching pages, then fetches those pages. Efficient for medium-selectivity queries.
Gotcha: Fast on small data does not mean fast at scale. A query that runs in 1ms on 1,000 rows might take 30 seconds on 10 million rows because the optimizer chooses a different plan. Always run EXPLAIN ANALYZE on production-sized data or realistic data distributions. The number one cause of production performance issues is testing only on small datasets.
Logical → Physical Plan Tree — SELECT AVG(o.total) FROM users u JOIN orders o ON u.id=o.uid WHERE u.country='US'
Cost-Based Join Order — picking a left-deep plan over A ⋈ B ⋈ C ⋈ D
Module 6

Storage Engine Internals

Pages, buffer pools, Write-Ahead Logging, and the physical reality beneath the SQL abstraction.

Pages & Blocks

Databases do not read or write individual rows. The fundamental unit of I/O is a page (also called a block). Pages are fixed-size units, typically 8KB in PostgreSQL and 16KB in MySQL/InnoDB.

Slotted Page Format

Most databases use a slotted page layout where a page header is followed by a slot array (item pointers) at the beginning, and tuple data is stored from the end of the page growing towards the slot array. Free space exists between the slot array and the tuple data.

Figure 6.1 — Slotted page layout: header at top, slot array grows down, tuples grow up

Buffer Pool / Cache

The buffer pool is a large region of shared memory that caches frequently accessed disk pages. It is the most critical performance component — a warm buffer pool can make the difference between microsecond and millisecond response times.

  • Page table — Hash map from (tablespace, file, page#) to buffer pool frame number.
  • Pin count — Tracks how many concurrent operations are using a page. Pinned pages cannot be evicted.
  • Dirty flag — Marks pages that have been modified in memory but not yet written to disk.
  • LRU eviction — When the buffer pool is full and a new page is needed, the least-recently-used unpinned, clean page is evicted. Dirty pages must be flushed first.
Figure 6.2 — Buffer pool with page table mapping, dirty page tracking, and disk I/O. Eviction when full uses LRU variants: clock sweep (PostgreSQL) or midpoint insertion (MySQL).
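
A toy buffer pool tying together the page table, pin counts, dirty flags, and LRU eviction described above (a sketch only; as the figure notes, PostgreSQL actually uses a clock sweep and InnoDB a midpoint LRU):

from collections import OrderedDict

class BufferPool:
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk                # page_id -> page bytes
        self.frames = OrderedDict()     # page_id -> [data, dirty, pins]

    def get_page(self, page_id):
        if page_id in self.frames:                  # hit
            self.frames.move_to_end(page_id)        # mark recently used
            return self.frames[page_id][0]
        if len(self.frames) >= self.capacity:       # miss: make room
            self._evict()
        self.frames[page_id] = [self.disk[page_id], False, 0]
        return self.frames[page_id][0]

    def _evict(self):
        # Scan from least to most recently used; skip pinned pages
        for page_id, (data, dirty, pins) in self.frames.items():
            if pins == 0:
                if dirty:
                    self.disk[page_id] = data       # flush before evicting
                del self.frames[page_id]
                return
        raise RuntimeError("all pages pinned")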

Write-Ahead Log (WAL)

The WAL is the mechanism that provides durability (the D in ACID) and atomicity (the A in ACID). The fundamental rule: a log record describing a change must be written to stable storage before the data page itself is written.

Figure 6.3 — WAL sequence: write to WAL buffer, fsync to disk, ACK commit, lazy checkpoint to data files

Storage Engine Comparison

Feature           | InnoDB (MySQL)                 | MyISAM (MySQL)
ACID compliant    | Yes                            | No
Row-level locking | Yes (MVCC)                     | Table-level locking only
Foreign keys      | Supported                      | Not supported
Crash recovery    | WAL (redo log)                 | Manual repair required
Full-text search  | Yes (since 5.6)                | Yes
Clustered index   | Yes (PK is clustered)          | No (heap table)
Buffer pool       | Yes (innodb_buffer_pool_size)  | Key cache only (key_buffer_size)
MVCC              | Yes                            | No
Use case          | OLTP, general purpose          | Read-heavy, legacy

RocksDB & LSM Tree Tradeoffs

RocksDB (used by CockroachDB, TiKV, MyRocks) is built on LSM trees. The three fundamental tradeoffs in LSM storage are:

  • Write amplification — Data is written multiple times during compaction. A write amplification of 10x means every byte written by the application results in 10 bytes written to disk.
  • Read amplification — A point read may need to check multiple SSTable levels. Bloom filters mitigate this.
  • Space amplification — Multiple versions of data exist during compaction, temporarily using more space.
Gotcha: InnoDB secondary index stores the primary key, requiring two lookups. When you query via a secondary index in InnoDB, the engine first traverses the secondary B+ Tree to find the primary key value, then traverses the primary key (clustered) B+ Tree to find the actual row data. This "bookmark lookup" doubles the I/O for non-covering secondary index queries. This is why covering indexes (that include all needed columns) are especially valuable in InnoDB.
Row-store (HEAP) vs Column-store Page Layout
WAL Write → Checkpoint → Crash Recovery Timeline
Module 7

Replication

Single-leader, multi-leader, leaderless replication, and consensus protocols — how databases stay in sync.

Single-Leader (Master-Replica) Replication

One node (the leader/primary) accepts all writes. Replicas (followers/secondaries) receive a stream of changes and apply them. All reads can go to any node, but writes only go to the leader.

Synchronous vs Asynchronous

  • Synchronous — Leader waits for at least one replica to confirm it has written the change before acknowledging the client. Guarantees the replica has the data (no data loss on leader failure), but adds latency to every write.
  • Asynchronous — Leader acknowledges the client immediately after its own WAL write. Replicas catch up later. Lower write latency, but replicas can lag behind (stale reads) and data can be lost if the leader crashes before replicating.
  • Semi-synchronous — Leader waits for one replica (but not all). MySQL's semi-sync replication guarantees at least one replica has the data.
Figure 7.1 — Single-leader replication with sync and async replicas showing replication lag

Multi-Leader Replication

Multiple nodes accept writes, each replicating to the others. Used for multi-datacenter deployments where each datacenter has its own leader. The main challenge is conflict resolution when two leaders modify the same data concurrently.

Conflict Resolution Strategies

  • Last Write Wins (LWW) — Use timestamps to pick the "latest" write. Simple but can silently discard data. Used by Cassandra and DynamoDB.
  • Merge function — Application-specific logic merges conflicting values (e.g., set union for tags).
  • Custom resolution — Conflicts are logged and presented to the application or user for resolution.
  • CRDTs — Conflict-free Replicated Data Types that are mathematically guaranteed to converge (e.g., G-Counter, OR-Set); a minimal G-Counter is sketched after this list.
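
A minimal G-Counter shows why CRDTs converge: each replica increments only its own slot, and merge takes the element-wise max, an operation that is commutative, associative, and idempotent, so replicas agree regardless of message order or duplication.

class GCounter:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.counts = [0] * num_nodes    # one slot per replica

    def increment(self):
        self.counts[self.node_id] += 1   # only touch our own slot

    def merge(self, other):
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    def value(self):
        return sum(self.counts)

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment(); b.increment()
a.merge(b); b.merge(a)
print(a.value(), b.value())   # 3 3: both replicas converge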

Leaderless Replication & Quorum

No single leader — any node can accept writes. Uses quorum reads and writes to ensure consistency. Made famous by Amazon's Dynamo paper (2007), used by Cassandra and Riak.

Quorum Formula

With N replicas, if W + R > N, reads are guaranteed to see the latest write (at least one node in the read quorum overlaps with the write quorum); a quick check is sketched after the figure below.

  • N=3, W=2, R=2 — Standard quorum. Tolerates 1 node failure for both reads and writes.
  • N=3, W=3, R=1 — Fast reads (only need 1 node), but writes are slow and cannot tolerate any failures.
  • N=3, W=1, R=3 — Fast writes (only need 1 node), but reads are slow.
Figure 7.2 — Quorum write with W=2, N=3: write succeeds when 2 of 3 nodes acknowledge
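
The overlap argument is plain arithmetic; a quick check over the configurations above, plus one broken one:

def quorum_overlap(n, w, r):
    """W + R > N guarantees at least one node is in both the write
    and the read quorum, so a quorum read sees the latest write."""
    return w + r > n

for w, r in [(2, 2), (3, 1), (1, 3), (1, 1)]:
    print(f"N=3 W={w} R={r}: overlap guaranteed = {quorum_overlap(3, w, r)}")
# N=3 W=1 R=1 fails: the read may hit only nodes the write never reached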

Consensus Protocols

Raft

Raft (Diego Ongaro and John Ousterhout, 2014) is a consensus protocol designed to be understandable. It decomposes consensus into three sub-problems:

  1. Leader election — Nodes start as followers. If a follower does not hear from the leader within an election timeout, it becomes a candidate and requests votes. A candidate with a majority becomes the new leader.
  2. Log replication — The leader receives client requests, appends them to its log, and replicates log entries to followers. Once a majority confirms, the entry is committed.
  3. Safety — Guarantees that if an entry is committed, it will be present in the logs of all future leaders.
Figure 7.3 — Raft consensus state machine: leader election flow
Feature                   | Raft                             | Paxos
Year                      | 2014                             | 1989 (published 1998)
Design goal               | Understandability                | Theoretical correctness
Leader                    | Strong leader required           | Can be leaderless (Multi-Paxos uses leader)
Log ordering              | Strict sequential                | Can have gaps
Membership changes        | Joint consensus (built-in)       | Complex extension
Used by                   | etcd, CockroachDB, TiKV, Consul  | Google Chubby, Spanner
Implementation complexity | Moderate                         | High
Gotcha: Async replication means read replicas can return stale data. If a client writes to the leader and immediately reads from a replica, it may not see its own write. Solutions include: (1) read-after-write consistency by routing the client's reads to the leader for recently-written data, (2) monotonic reads by pinning a client to a specific replica, (3) using synchronous replication for critical paths.
Leader-Follower Log Shipping — replication lag visualized
Module 8

Partitioning & Sharding

Horizontal vs vertical partitioning, shard key selection, consistent hashing, and cross-shard query patterns.

Horizontal vs Vertical Partitioning

Horizontal Partitioning (Sharding)

Splits rows across multiple nodes. Each shard holds a subset of the rows but all columns. For example, users with IDs 1-1M on shard 1, IDs 1M-2M on shard 2.

Vertical Partitioning

Splits columns across multiple tables or nodes. Frequently accessed columns stay in one partition, rarely used or large columns (e.g., BLOBs) in another. This is essentially normalization taken to the physical level.

Figure 8.1 — Horizontal partitioning (sharding by rows) vs vertical partitioning (splitting by columns). Horizontal scales to more machines; vertical improves cache efficiency per query.

Partitioning Strategies

Range Partitioning

Assign contiguous ranges of the shard key to each partition. Example: dates (Jan-Mar → shard 1, Apr-Jun → shard 2). Simple, supports range queries, but can create hot spots (e.g., the current month always gets the most writes).

Hash Partitioning

Apply a hash function to the shard key and assign hash ranges to partitions. Distributes data more evenly, but destroys key ordering — range queries must scatter to all partitions.

Shard Key Selection Criteria

  • High cardinality — The shard key should have many distinct values to distribute evenly.
  • Even distribution — Avoid keys where most data maps to one shard (hot spots).
  • Query affinity — Choose a key that keeps related data on the same shard (e.g., tenant_id for multi-tenant apps).
  • Write distribution — Avoid monotonically increasing keys (like auto-increment IDs) with range partitioning — all writes go to the last partition.
Figure 8.2 — Range vs hash partitioning: distribution characteristics and tradeoffs

Resharding & Consistent Hashing

When adding or removing shards, data must be redistributed. Naive hash partitioning (hash(key) % N) is catastrophic here — changing N remaps almost every key.

Consistent Hashing

Maps both keys and nodes to a hash ring. Each key is assigned to the first node clockwise from its position on the ring. Adding or removing a node only affects keys between the new/removed node and its predecessor. Used by DynamoDB, Cassandra, and Riak.

Virtual Nodes (vnodes)

Each physical node is represented by many "virtual nodes" on the ring. This improves distribution uniformity and makes rebalancing smoother when nodes are added/removed.
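
A compact Python sketch of a hash ring with virtual nodes (the hash function and vnode count here are arbitrary illustrative choices):

import bisect, hashlib

def _h(s):
    # Map a string to a point on the ring
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each physical node appears `vnodes` times on the ring
        self.ring = sorted((_h(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def node_for(self, key):
        # First vnode clockwise from the key's position (wraps around)
        i = bisect.bisect(self.points, _h(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))
# Adding or removing a node remaps only the keys between its vnodes
# and their predecessors, roughly K/N of all keys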

Cross-Shard Queries

Queries that span multiple shards are inherently more complex and slower. Common patterns include:

  • Scatter-Gather — Send the query to all relevant shards, collect partial results, merge them at the coordinator. Used for aggregations and non-shard-key queries.
  • Global indexes — Maintain a secondary index across all shards to enable queries on non-shard-key columns. The index itself must be distributed.
  • Denormalization — Duplicate data across shards to avoid cross-shard joins. Accept write overhead for read performance.
Gotcha: Sharding is a last resort. Before sharding, exhaust these simpler scaling options: (1) read replicas for read scaling, (2) caching for repeated queries, (3) vertical scaling (bigger machine), (4) query optimization and indexing, (5) connection pooling. Sharding introduces operational complexity (cross-shard queries, distributed transactions, resharding), and it is very hard to undo.
Consistent Hashing Ring — node join/leave only moves O(K/N) keys
Module 9

NoSQL Deep Dive

Redis, MongoDB, Cassandra, Neo4j — architecture, data models, and when to choose each.

Redis

Redis is an in-memory data structure server. It is not just a key-value store — it supports rich data structures with atomic operations on each.

Core Data Structures

  • Strings — Binary-safe strings up to 512MB. Supports SET, GET, INCR, APPEND. Can store serialized JSON, counters, or bitmaps.
  • Hashes — Maps of field-value pairs. Think of a hash as a miniature row. HSET user:1 name "Alice" age 30.
  • Lists — Doubly-linked lists. O(1) push/pop from both ends. Used for queues, activity feeds.
  • Sets — Unordered collections of unique strings. Supports UNION, INTERSECT, DIFF. Used for tags, unique visitors.
  • Sorted Sets (ZSets) — Sets where each member has a score. Sorted by score. O(log N) add/remove/lookup. Used for leaderboards, rate limiting, priority queues.
  • Streams — Append-only log structure with consumer groups. Used for event streaming, similar to Kafka topics at smaller scale.

Persistence Options

Feature            | RDB Snapshots                              | AOF (Append-Only File)
Mechanism          | Point-in-time snapshot (fork + serialize)  | Log every write operation
Recovery speed     | Fast (load binary dump)                    | Slower (replay log)
Data loss on crash | Up to last snapshot interval               | At most 1 second (with fsync=everysec)
File size          | Compact                                    | Larger (but supports rewrite/compaction)
CPU impact         | Fork can be expensive with large datasets  | Minimal (sequential writes)
Recommended        | Backups, disaster recovery                 | Durability requirement
Figure 9.1 — Redis internals: single-threaded event loop with epoll/kqueue I/O multiplexing (100K+ ops/sec), in-memory data, and RDB/AOF persistence options.

MongoDB

MongoDB is a document database storing data as BSON (Binary JSON) documents. Each document can have a different structure, enabling schema flexibility within a collection.

Key Concepts

  • Document — A JSON-like record (BSON internally). Can contain nested objects and arrays.
  • Collection — A group of documents (analogous to a table). No schema enforcement by default.
  • Replica Set — A group of MongoDB instances maintaining the same data. One primary, multiple secondaries.
  • Aggregation Pipeline — A series of stages ($match, $group, $sort, $project, $lookup) for data transformation and analysis.
// MongoDB aggregation pipeline example
db.orders.aggregate([
  { $match: { status: "completed", date: { $gte: ISODate("2024-01-01") } } },
  { $group: {
      _id: "$customer_id",
      total_spent: { $sum: "$amount" },
      order_count: { $sum: 1 }
  }},
  { $sort: { total_spent: -1 } },
  { $limit: 10 }
]);

Cassandra

Apache Cassandra is a wide-column distributed database designed for high write throughput and linear scalability. It uses a masterless (peer-to-peer) architecture with tunable consistency.

Data Model

  • Partition key — Determines which node stores the data. Hashed for even distribution.
  • Clustering key — Determines the sort order of rows within a partition. Enables efficient range queries within a partition.
  • Wide rows — A partition can contain up to 2 billion cells, enabling time-series and event log patterns.
-- Cassandra table with composite primary key
CREATE TABLE sensor_data (
  sensor_id   UUID,
  event_time  TIMESTAMP,
  temperature DOUBLE,
  humidity    DOUBLE,
  PRIMARY KEY (sensor_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
-- sensor_id = partition key (which node)
-- event_time = clustering key (sort within partition)
Figure 9.2 — Cassandra data layout: partition key determines node, clustering key sorts within partition. Query-first design: model tables around your query patterns, not your entities.

Neo4j & Graph Databases

Neo4j uses the property graph model where data is stored as nodes (entities), relationships (edges), and properties on both. Its key advantage is index-free adjacency: each node physically stores pointers to its neighbors, making traversals O(1) per hop regardless of total graph size.

// Cypher query: find friends of friends
MATCH (p:Person {name: "Alice"})-[:FRIENDS_WITH]->(friend)-[:FRIENDS_WITH]->(fof)
WHERE fof <> p AND NOT (p)-[:FRIENDS_WITH]->(fof)
RETURN fof.name, count(friend) AS mutual_friends
ORDER BY mutual_friends DESC LIMIT 10;

// Cypher query: shortest path
MATCH path = shortestPath(
  (a:Person {name: "Alice"})-[:KNOWS*]-(b:Person {name: "Bob"})
)
RETURN path;

NoSQL Data Modeling Patterns

Embedding vs Referencing

  • Embedding — Nest related data inside a single document. Faster reads (single query), but can lead to large documents and data duplication. Best for 1:1 and 1:few relationships that are always read together.
  • Referencing — Store related data in separate documents/collections with references (like foreign keys). Avoids duplication but requires multiple queries or application-level joins. Best for 1:many and many:many relationships.

Fan-Out Patterns

  • Fan-out-on-write — When a user posts, push the post to all followers' timelines immediately. High write cost but fast reads (pre-computed timeline). Used by Twitter for most users. A minimal version is sketched after this list.
  • Fan-out-on-read — When a user opens their timeline, pull posts from all followed users and merge. Low write cost but slow reads (compute at read time). Used for celebrity users with millions of followers.
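
A sketch of fan-out-on-write with in-memory dicts standing in for the follower graph and timeline store (a real system would use something like Redis lists per user):

from collections import defaultdict

followers = {"alice": ["bob", "carol"]}   # author -> followers
timelines = defaultdict(list)             # user -> precomputed timeline

def post(author, text):
    """Fan-out-on-write: push the post to every follower's timeline
    now, so reading a timeline later is a single O(1) lookup."""
    for f in followers.get(author, []):
        timelines[f].append((author, text))

post("alice", "hello")
print(timelines["bob"])    # [('alice', 'hello')], no merge at read time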
Feature        | Redis                     | MongoDB                      | Cassandra              | Neo4j
Data model     | Data structures           | Document (BSON)              | Wide-column            | Property graph
Storage        | In-memory                 | Disk (WiredTiger)            | Disk (LSM tree)        | Disk (native graph)
Consistency    | Strong (single node)      | Tunable (write concern)      | Tunable (quorum)       | ACID
Scaling        | Cluster (hash slots)      | Sharding (mongos)            | Peer-to-peer ring      | Causal cluster
Best for       | Caching, queues, sessions | Content, catalogs, profiles  | Time-series, IoT, logs | Social, fraud, recommendations
Query language | Commands                  | MQL / Aggregation            | CQL (SQL-like)         | Cypher
Weak at        | Large datasets (RAM cost) | Multi-doc transactions       | Ad-hoc queries, joins  | Large-scale analytics
Gotcha: MongoDB is schema-flexible, not schema-less. Without schema validation, you can insert documents with any structure into a collection. In practice, this leads to chaos. Always use MongoDB's $jsonSchema validator to enforce the expected document structure at the database level. In interviews, distinguish between "schemaless" (no enforcement) and "schema-flexible" (schema is not rigid but should still be defined).
CAP Triangle — pick two under a network partition
Module 10

Distributed Databases

2PC, Saga, consistency models, Google Spanner's TrueTime, and vector clocks — coordination across nodes.

Two-Phase Commit (2PC)

2PC is a distributed transaction protocol that ensures all participants either commit or abort a transaction atomically.

Phase 1: Prepare

The coordinator sends a PREPARE message to all participants. Each participant writes the transaction to its WAL (making it durable) and responds with VOTE_COMMIT or VOTE_ABORT.

Phase 2: Commit/Abort

If all participants voted COMMIT, the coordinator sends COMMIT to all. If any voted ABORT, the coordinator sends ABORT to all.

Figure 10.1 — Two-Phase Commit sequence: Prepare phase (vote), then Commit phase (final decision)
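
A coordinator sketch in Python. The participant interface (prepare/commit/abort) is hypothetical; real implementations also handle participant timeouts and recover the coordinator's decision from its own log.

class Participant:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
    def prepare(self):
        # Phase 1: write the transaction to the WAL, then vote
        return self.healthy
    def commit(self): print(f"{self.name}: COMMIT")
    def abort(self):  print(f"{self.name}: ABORT")

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # Phase 1: collect votes
    if all(votes):
        for p in participants:                    # Phase 2: unanimous yes
            p.commit()
        return "committed"
    for p in participants:                        # any no (or timeout)
        p.abort()
    return "aborted"

print(two_phase_commit([Participant("A"), Participant("B")]))         # committed
print(two_phase_commit([Participant("A"), Participant("B", False)]))  # aborted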

Three-Phase Commit (3PC)

3PC adds a PRE-COMMIT phase between PREPARE and COMMIT. This allows participants to distinguish between a coordinator crash during Phase 1 (safe to abort) and during Phase 2 (safe to commit). Reduces the blocking window of 2PC but does not fully eliminate it in the presence of network partitions.

Saga Pattern

Sagas are an alternative to distributed transactions for long-running business processes. Instead of locking resources across services, each service executes a local transaction and publishes an event. If a step fails, compensating transactions undo previous steps.

Choreography

Each service listens for events and decides what to do next. There is no central coordinator. This is simple for a few steps but becomes hard to debug and monitor as complexity grows.

Orchestration

A central orchestrator service tells each participant what step to execute. Easier to monitor and debug, but the orchestrator is a single point of logic.

// Saga: Order placement (orchestration)
// Step 1: Reserve inventory       → Compensate: Release inventory
// Step 2: Charge payment           → Compensate: Refund payment
// Step 3: Ship order               → Compensate: Cancel shipment
//
// If Step 2 fails:
//   → Run compensate for Step 1 (release inventory)
//   → Order marked as failed
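
A minimal runnable sketch of the orchestration loop, reduced to (action, compensate) pairs; the print lambdas stand in for real service calls:

def run_saga(steps):
    """steps: list of (action, compensate) callables.
    Run actions in order; on failure, compensate completed steps in reverse."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for comp in reversed(done):
                comp()                      # undo earlier local transactions
            return "FAILED"
    return "COMPLETED"

# Toy order-placement saga (print statements stand in for service calls)
steps = [
    (lambda: print("reserve inventory"), lambda: print("release inventory")),
    (lambda: print("charge payment"),    lambda: print("refund payment")),
    (lambda: print("ship order"),        lambda: print("cancel shipment")),
]
run_saga(steps)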

Consistency Models

Model | Guarantee | Strength | Example
Linearizability | Ops appear instantaneous, in real-time order | Strongest | Spanner (external consistency)
Sequential consistency | Ops appear in some total order consistent with per-process order | Strong | ZooKeeper
Causal consistency | Causally related ops seen in order; concurrent ops may differ | Moderate | MongoDB (causal sessions)
Eventual consistency | All replicas converge given enough time | Weakest | Cassandra (ONE/ONE), DynamoDB

Linearizability vs Serializability:

  • Linearizability is a recency guarantee — once a write is acknowledged, all subsequent reads see it. It applies to single operations on single objects.
  • Serializability is a transaction guarantee — the result of executing concurrent transactions is equivalent to some serial order. It applies to groups of operations.
  • Strict serializability (= "serializable + linearizable") provides both. Google Spanner achieves this via TrueTime.

Google Spanner & TrueTime

Google Spanner is a globally distributed, strongly consistent relational database. Its key innovation is TrueTime — an API that provides bounded clock uncertainty using GPS receivers and atomic clocks in every data center.

TrueTime API

TT.now() returns an interval [earliest, latest] — the true time is guaranteed to be within this interval. Spanner waits out the uncertainty window before committing, ensuring that if transaction T1 commits before T2 starts (in real time), then T1's timestamp is less than T2's timestamp.

Figure 10.2 — TrueTime commit protocol: Spanner waits out clock uncertainty to ensure global ordering
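
A toy model of the commit-wait rule; the 7 ms epsilon is the figure quoted above, and real TrueTime derives its bound from GPS and atomic clock hardware rather than a constant:

import time

EPSILON = 0.007  # assumed uncertainty bound (~7 ms)

def tt_now():
    # Toy TrueTime: an interval [earliest, latest] around the local clock
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit_wait(commit_ts):
    # Spanner-style commit wait: block until commit_ts is definitely in the past,
    # i.e. until even the earliest possible true time has passed it
    while tt_now()[0] <= commit_ts:
        time.sleep(0.001)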

CockroachDB: Hybrid Logical Clocks

CockroachDB cannot assume GPS or atomic clocks (it runs on commodity hardware), so it uses Hybrid Logical Clocks (HLC) that combine physical timestamps with logical counters. HLCs provide causal ordering but not the external consistency of TrueTime; CockroachDB instead enforces a maximum clock-offset bound (nodes that drift beyond it are removed from the cluster) and offers serializable isolation by default, which is weaker than Spanner's strict serializability.
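
A compact sketch of the HLC update rules (millisecond wall clock plus a logical counter to break ties), following the published algorithm:

import time

class HLC:
    def __init__(self):
        self.pt = 0   # highest physical time observed so far
        self.c = 0    # logical counter for events within the same millisecond

    def now(self):
        """Timestamp for a local or send event."""
        wall = int(time.time() * 1000)
        if wall > self.pt:
            self.pt, self.c = wall, 0
        else:
            self.c += 1
        return (self.pt, self.c)

    def recv(self, remote):
        """Merge a timestamp received from another node."""
        wall = int(time.time() * 1000)
        rpt, rc = remote
        new_pt = max(wall, self.pt, rpt)
        if new_pt == self.pt == rpt:
            self.c = max(self.c, rc) + 1
        elif new_pt == self.pt:
            self.c += 1
        elif new_pt == rpt:
            self.c = rc + 1
        else:
            self.c = 0
        self.pt = new_pt
        return (self.pt, self.c)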

Vector Clocks

Vector clocks track causality in distributed systems. Each node maintains a vector of counters (one per node). When a node sends a message, it increments its own counter and attaches the vector. The receiver merges vectors by taking the element-wise max.

Figure 10.3 — Vector clock progression: each node increments its own counter and merges on receive

Vector clocks can distinguish concurrent events (neither causally precedes the other) from causally ordered ones. Amazon's original Dynamo, described in the Dynamo paper, used vector clocks for conflict detection; DynamoDB, its managed descendant, uses Last-Write-Wins for simplicity.
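
A minimal vector clock in Python, enough to show the merge rule and the concurrency check described above:

class VectorClock:
    def __init__(self, node_id, n_nodes):
        self.node = node_id
        self.v = [0] * n_nodes

    def tick(self):
        """Local event (or just before sending a message)."""
        self.v[self.node] += 1
        return list(self.v)

    def merge(self, incoming):
        """On receive: element-wise max with the sender's vector, then tick."""
        self.v = [max(a, b) for a, b in zip(self.v, incoming)]
        self.v[self.node] += 1

def happened_before(a, b):
    """True if a causally precedes b; if neither precedes the other, they are concurrent."""
    return all(x <= y for x, y in zip(a, b)) and a != b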

Gotcha: 2PC is a blocking protocol. If the coordinator fails after sending PREPARE but before sending COMMIT, all participants are stuck holding locks, waiting for the coordinator's decision. They cannot safely commit (maybe the coordinator decided to abort) or abort (maybe it decided to commit). This is the fundamental limitation of 2PC — solved by 3PC (partially) and Paxos/Raft-based commit protocols (fully).
[Interactive diagram: Raft Leader Election + Log Replication across 5 nodes]
Module 11

Caching

Caching strategies, Redis internals, cache invalidation patterns, and failure modes — the art of keeping hot data close.

Caching Strategies

There are five fundamental caching strategies, each with different consistency and performance tradeoffs. Choosing the right one depends on your read/write ratio, consistency requirements, and tolerance for stale data.

1. Cache-Aside (Lazy Loading)

The application manages the cache directly. On read miss, fetch from DB, populate cache, return. On write, update DB, then invalidate cache entry.

def get_user(user_id):
    # 1. Check cache first
    user = cache.get(f"user:{user_id}")
    if user is not None:
        return user  # Cache hit

    # 2. Cache miss — fetch from DB
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)

    # 3. Populate cache with TTL
    cache.set(f"user:{user_id}", user, ttl=300)
    return user

def update_user(user_id, data):
    db.update("UPDATE users SET ... WHERE id = %s", data, user_id)
    cache.delete(f"user:{user_id}")  # Invalidate, don't update

Pros: Only requested data is cached (no wasted memory). Cache failure doesn't break the app — just slower. Cons: Cache miss = three round trips. Stale data window between DB write and cache invalidation.

2. Read-Through

The cache itself loads data from the DB on a miss. The application only talks to the cache, never directly to the DB for reads. The cache library handles the DB fetch logic.

Pros: Simpler application code. Cons: Cache library must know how to query DB. First request is always slow (cold cache).

3. Write-Through

Every write goes to the cache first, and the cache synchronously writes to the DB. Data in cache is always consistent with DB.

Pros: Cache is never stale. Reads are always fast. Cons: Write latency increases (two writes per operation). Caches data that may never be read.

4. Write-Behind (Write-Back)

Write to cache immediately. The cache asynchronously flushes writes to the DB in batches. Significantly faster writes, but risk of data loss if cache crashes before flush.

Pros: Very fast writes. Batching reduces DB load. Cons: Data loss risk. Complex failure handling. DB may be behind cache.

5. Write-Around

Write directly to DB, bypassing the cache entirely. The cache is only populated on read misses (via cache-aside). Best for write-heavy workloads where written data is rarely re-read immediately.

Pros: Cache not polluted with write-heavy data. Cons: Recent writes always cause cache misses.

Figure 11.1 — Five caching strategies: data flow between Application, Cache, and Database
Strategy | Read Path | Write Path | Consistency | Best For
Cache-Aside | App → Cache (miss → DB → fill) | App → DB → invalidate cache | Eventual (stale window) | General purpose, read-heavy
Read-Through | App → Cache (auto-loads from DB) | App → DB (separate) | Eventual | Simplifying app code
Write-Through | App → Cache (always hot) | App → Cache → DB (sync) | Strong | Read-heavy, consistency needed
Write-Behind | App → Cache | App → Cache → DB (async batch) | Weak (lag) | Write-heavy, high throughput
Write-Around | App → Cache (miss → DB → fill) | App → DB only | Eventual | Write-heavy, rarely re-read

Redis Deep Dive

Redis is a single-threaded, in-memory data structure server. Despite being single-threaded for command execution, it handles 100K+ operations/second because it uses non-blocking I/O multiplexing (epoll on Linux, kqueue on macOS).

Architecture: Single-Threaded Event Loop

Redis processes all commands sequentially on a single thread. This eliminates lock contention and context switching. Network I/O is handled via multiplexing — Redis registers file descriptors with epoll and processes events as they become ready.

Figure 11.2 — Redis single-threaded event loop with I/O multiplexing: all clients share one thread

Eviction Policies

When Redis hits its maxmemory limit, it must evict keys. The policy determines which keys are removed:

Policy | Scope | Algorithm | Use Case
noeviction | N/A | Reject writes | When data loss is unacceptable
allkeys-lru | All keys | Approx. LRU | General cache (most common)
volatile-lru | Keys with TTL | Approx. LRU | Mix of persistent + cache keys
allkeys-lfu | All keys | Approx. LFU | Frequency-based (Redis 4.0+)
allkeys-random | All keys | Random | Uniform access pattern
volatile-ttl | Keys with TTL | Shortest TTL first | Expire soon-to-die keys first

Redis uses approximate LRU/LFU — it samples N keys (default 5) and evicts the least recently/frequently used among samples. This is much cheaper than true LRU (no doubly-linked list overhead) and nearly as accurate.
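
The sampling idea fits in a few lines; a sketch with a plain dict standing in for the keyspace (the real Redis tracks access times in each object's header):

import random

def evict_one(store, last_access, samples=5):
    """Approximate LRU: sample a few keys, evict the least recently used of them."""
    candidates = random.sample(list(store), min(samples, len(store)))
    victim = min(candidates, key=lambda k: last_access[k])
    del store[victim], last_access[victim]
    return victim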

Redis Cluster

Redis Cluster partitions data across multiple nodes using 16,384 hash slots. Each key maps to a slot via CRC16(key) mod 16384. Each master node owns a subset of slots.

  • Gossip protocol — Nodes exchange cluster state via periodic PING/PONG messages (cluster bus on port+10000).
  • Resharding — Slots can be migrated between nodes live. During migration, MOVED and ASK redirects guide clients to the correct node.
  • Failover — If a master fails, its replica is promoted via cluster consensus (majority of masters must agree).
  • Multi-key limitation — Multi-key operations (MGET, pipeline) only work if all keys hash to the same slot. Use hash tags {user:123}.name to force co-location; the slot computation is sketched below.
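
A sketch of that computation, using the XMODEM CRC16 polynomial (0x1021) that Redis Cluster specifies, plus the hash-tag rule:

def crc16_xmodem(data: bytes) -> int:
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_hash_slot(key: bytes) -> int:
    # Hash tag rule: if the key contains a non-empty {...}, hash only that part,
    # so {user:123}.name and {user:123}.orders land on the same slot
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end > start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key) % 16384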

Cache Invalidation

Phil Karlton famously said: "There are only two hard things in Computer Science: cache invalidation and naming things." Here are the strategies:

TTL-Based (Time-To-Live)

Set an expiry time on each cache entry. After TTL expires, the next request triggers a fresh fetch. Simple but blunt — data may be stale for the entire TTL window.

cache.set("product:42", product_data, ttl=60)  # Stale for up to 60 seconds

Event-Driven Invalidation

Invalidate cache entries when the underlying data changes. Three approaches:

  • Application-level — Delete cache key after DB write (fragile — can miss code paths).
  • Change Data Capture (CDC) — Tail the database's WAL/binlog (e.g., Debezium for PostgreSQL/MySQL). Guaranteed to capture every change.
  • Webhooks/Events — Publish invalidation events to a message bus (Kafka, RabbitMQ). Cache consumers subscribe and invalidate.

Versioned Keys

Instead of invalidating, increment a version counter and embed it in the cache key:

# Version stored in a separate fast-changing counter
version = cache.incr("product:42:version")
cache.set(f"product:42:v{version}", new_data, ttl=3600)
# Old versions naturally expire via TTL

Write-Invalidate vs Write-Update

  • Write-Invalidate — Delete the cache key on write. Next read re-fetches from DB. Simpler, safer (no race conditions on concurrent writes).
  • Write-Update — Update the cache value on write. Faster subsequent reads, but concurrent writes can leave stale data (last writer wins in cache may differ from DB).

Recommendation: Prefer write-invalidate. Write-update only makes sense if reads vastly outnumber writes AND you can tolerate occasional inconsistency.

Cache Failure Patterns

Three critical failure modes can bring down your database when caching goes wrong:

1. Thundering Herd (Cache Stampede)

A popular cache key expires. Hundreds of concurrent requests all miss the cache simultaneously and hit the database, potentially causing a cascading failure.

Solutions:

  • Mutex/Lock — First request acquires a distributed lock, fetches from DB, populates cache. Other requests wait or return stale data (see the sketch after this list).
  • Request coalescing — Collapse duplicate in-flight requests into one (singleflight pattern in Go).
  • Early refresh — Probabilistically refresh before TTL expiry. At time T, each request has probability P = (now - created) / TTL of triggering a background refresh.
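
A cache-aside read guarded by a distributed lock, assuming redis-py-style calls and a hypothetical db.fetch accessor; a sketch only, since production code would also bound the retries:

import time

def get_with_lock(cache, db, key, ttl=300, lock_ttl=10):
    val = cache.get(key)
    if val is not None:
        return val                              # cache hit

    # SET NX EX is atomic: exactly one caller wins the lock and recomputes
    if cache.set(f"lock:{key}", "1", nx=True, ex=lock_ttl):
        try:
            val = db.fetch(key)                 # hypothetical DB accessor
            cache.set(key, val, ex=ttl)
        finally:
            cache.delete(f"lock:{key}")
        return val

    time.sleep(0.05)                            # losers wait briefly and retry
    return get_with_lock(cache, db, key, ttl, lock_ttl)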

2. Cache Penetration

Requests for data that does not exist in the DB repeatedly bypass the cache (because there is nothing to cache) and hit the database. Can be an attack vector.

Solutions:

  • Cache null results — Store empty/null marker in cache with short TTL (e.g., 30s).
  • Bloom filter — Check a Bloom filter before querying. If the key is definitely not in the DB, return immediately without a DB hit. False positives are acceptable (they only cause an occasional unnecessary DB lookup). A minimal filter is sketched below.
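
A tiny Bloom filter to make the idea concrete; a Python int serves as the bit array, and real deployments would size the filter from the expected key count and target false-positive rate:

import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, hashes=4):
        self.size, self.k, self.bits = size_bits, hashes, 0

    def _positions(self, key: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True means possibly present
        return all(self.bits >> pos & 1 for pos in self._positions(key))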

3. Cache Avalanche

Many cache keys expire at the same time (e.g., all set with the same TTL at startup). The sudden flood of cache misses overwhelms the database.

Solutions:

  • Staggered TTL — Add random jitter to TTL: ttl = base_ttl + random(0, jitter).
  • Warm-up on deploy — Pre-populate cache before routing traffic to new instances.
  • Multi-tier caching — L1 (in-process) + L2 (Redis). L1 has shorter TTL and absorbs some misses.
Figure 11.3 — Three cache failure patterns: thundering herd, cache penetration, and cache avalanche

Cache Warming

A cold cache after deployment or restart can cause a sudden spike in database load. Cache warming strategies mitigate this:

  • Pre-load on deploy — Before routing traffic, populate the cache with the most frequently accessed keys. Query your analytics or access logs to determine the hot set.
  • Background refresh — A background worker continuously refreshes soon-to-expire keys, ensuring the cache never goes cold. Useful for keys with predictable access patterns.
  • Materialized cache — Pre-compute and store aggregated results (e.g., "top 100 products") in the cache. Refresh on a schedule rather than on demand.
  • Shadow traffic — Route a copy of production traffic to the new cache instance before cutover, warming it with real access patterns.
Gotcha: Write-behind can lose data. If the cache crashes before flushing pending writes to the database, those writes are lost permanently. Always log write-behind operations to a durable write-ahead log for recovery. Consider using Redis AOF with appendfsync everysec as a compromise between performance and durability.
[Interactive diagram: LRU vs LFU Eviction — cache capacity = 3, access stream A B C A B D A E]
Module 12

Database for System Design

Connection pooling, read replicas, database proxies, schema migrations, and time-series databases — production database patterns.

Connection Pooling

Every database connection consumes server memory (~10MB in PostgreSQL for the backend process). Opening a connection takes 50-100ms for TCP handshake + TLS + authentication. Connection pooling reuses connections to avoid this overhead.

How Pooling Works

A pool manager (PgBouncer, HikariCP, etc.) maintains a set of pre-established connections. Application threads check out a connection, execute queries, and return it to the pool. The pool handles lifecycle management.
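
Stripped to its essentials, a pool is a thread-safe queue of pre-opened connections; a minimal sketch, where connect is any zero-argument factory (e.g. a psycopg connect call):

import queue

class ConnectionPool:
    def __init__(self, connect, size):
        self.idle = queue.Queue()
        for _ in range(size):
            self.idle.put(connect())            # pre-establish all connections

    def checkout(self, timeout=5):
        return self.idle.get(timeout=timeout)   # blocks if the pool is exhausted

    def checkin(self, conn):
        self.idle.put(conn)                     # return the connection for reuse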

Pool Sizing: Little's Law

The optimal pool size can be derived from Little's Law: L = λW, where L = average connections in use, λ = request rate, W = average query duration.

# Example: 1000 req/sec, avg query time 10ms
L = λ × W = 1000 × 0.01 = 10 connections needed

# HikariCP recommendation (practical formula):
pool_size = (2 × num_cpu_cores) + num_spindle_disks
# For SSD: pool_size = 2 × cores + 1
# A 4-core server with SSD: pool_size = 9

Too few connections: Requests queue up waiting for a free connection. Too many connections: Database server thrashes — each connection consumes memory, and context switching between hundreds of processes degrades throughput.

Pooling Modes

Mode | Connection Assigned | Pros | Cons | Use Case
Session | For entire client session | Full feature support (prepared statements, SET, LISTEN) | Poor multiplexing — one pool conn per client | Long-lived app connections
Transaction | For one transaction only | Great multiplexing — released after COMMIT | No session-level features (SET, prepared stmts) | Most web apps (recommended)
Statement | For one query only | Maximum multiplexing | No transactions, no multi-statement support | Simple SELECT-only workloads
💡 Tip: PgBouncer in transaction mode is the most common production setup for PostgreSQL. It can multiplex 10,000 application connections onto 100 database connections. Note that server_reset_query (e.g. DISCARD ALL) only runs in session mode by default; in transaction mode, avoid session state entirely or enable server_reset_query_always.

Read Replicas

Read replicas scale read throughput by distributing queries across multiple copies of the data. The primary handles all writes; replicas receive changes via replication.

Routing Strategies

  • Application-level routing — Application code explicitly routes writes to primary, reads to replica. Simple but requires code changes everywhere.
  • Proxy-based routing — A database proxy (ProxySQL, pgpool-II) intercepts queries and routes based on query type. Transparent to the application.
  • DNS-based routing — Different DNS entries for write endpoint (primary) and read endpoint (round-robin across replicas). Used by AWS RDS.

Handling Replication Lag

Async replicas can lag behind the primary. Three critical patterns to handle this:

  • Read-your-writes — After a write, route subsequent reads from the same user to the primary (or wait for replica to catch up). Implemented via session affinity or a "last write timestamp" cookie.
  • Monotonic reads — Ensure a user always reads from the same replica, so they never see data go "backwards." Pin sessions to a specific replica.
  • Causal consistency — If user A writes, then tells user B to check, B should see A's write. Track causal dependencies using logical timestamps or version vectors.
# Read-your-writes pattern with timestamp tracking
def route_query(user, query):
    if query.is_write:
        result = primary.execute(query)
        user.last_write_ts = now()
        return result

    # If user wrote recently, read from primary
    if now() - user.last_write_ts < REPLICATION_LAG_THRESHOLD:
        return primary.execute(query)

    # Safe to read from replica
    return replica.execute(query)

Database Proxy

A database proxy sits between application and database servers, providing transparent query routing, load balancing, and connection multiplexing.

Popular Database Proxies

  • ProxySQL — MySQL proxy. Query routing rules, connection pooling, query caching, query rewriting, failover. Used at scale by many companies.
  • Vitess — MySQL sharding middleware (originally YouTube). Horizontal sharding, schema migrations, connection pooling, query routing. Now used by Slack, GitHub, HubSpot.
  • pgpool-II — PostgreSQL proxy. Connection pooling, load balancing, automated failover, parallel query execution.
  • PgBouncer — Lightweight PostgreSQL connection pooler. Does not route queries — purely pooling. Minimal overhead.

Proxy Capabilities

Feature | ProxySQL | Vitess | PgBouncer | pgpool-II
Connection pooling | Yes | Yes | Yes | Yes
Read/write splitting | Yes (rules) | Yes | No | Yes
Sharding | No | Yes | No | No
Query caching | Yes | Row cache | No | Yes
Schema migration | No | Yes (online) | No | No
Failover | Yes | Yes | No | Yes
Database | MySQL | MySQL | PostgreSQL | PostgreSQL

Schema Migration

Changing the schema of a production database with billions of rows and zero downtime is one of the hardest operational challenges. The naive approach (ALTER TABLE) can lock the table for hours.

Online DDL: gh-ost Approach

GitHub's gh-ost performs online schema changes without triggers:

  1. Create shadow table — A new table with the desired schema.
  2. Copy existing data — Bulk-copy rows from the original table to the shadow table in batches (throttled to not overload the server).
  3. Tail binary log — Simultaneously capture all ongoing DML (INSERT/UPDATE/DELETE) from the MySQL binlog and apply them to the shadow table.
  4. Cut-over — When the shadow table is caught up, atomically rename tables: original → _old, shadow → original.
  5. Cleanup — Drop the old table after verification.

Expand-Contract Pattern (Backward-Compatible Migrations)

For zero-downtime deploys, schema changes must be backward compatible with both old and new application code running simultaneously:

  1. Expand — Add new column (nullable or with default). Deploy app code that writes to BOTH old and new columns. Backfill existing rows.
  2. Migrate — Deploy app code that reads from new column. Verify correctness.
  3. Contract — Remove old column. Deploy code that only uses new column.
-- Step 1: Expand — add new column (safe, no lock)
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);

-- Step 2: Backfill (in batches to avoid long transactions)
UPDATE users SET full_name = first_name || ' ' || last_name
WHERE id BETWEEN 1 AND 10000;

-- Step 3: Contract — drop old columns (after all code uses full_name)
ALTER TABLE users DROP COLUMN first_name, DROP COLUMN last_name;
Gotcha: CREATE INDEX on a 100M-row table locks writes. Standard CREATE INDEX in PostgreSQL acquires a SHARE lock on the table, blocking all INSERT/UPDATE/DELETE for the entire build duration (could be hours). Always use CREATE INDEX CONCURRENTLY — it takes 2-3x longer but allows concurrent writes. For MySQL, use pt-online-schema-change or gh-ost.

Time-Series Databases

Time-series data — metrics, IoT sensor readings, financial ticks, logs — has unique characteristics: high write volume, append-mostly, time-ordered queries, natural aging (old data less valuable).

TimescaleDB

TimescaleDB is a PostgreSQL extension that adds time-series optimizations while keeping full SQL compatibility:

  • Hypertables — Automatic time-based partitioning. A hypertable looks like one table but is internally partitioned into chunks (default: 1 week per chunk). Queries touching recent data only scan recent chunks.
  • Compression — Native columnar compression on old chunks (10-20x compression ratio). Compressed chunks are read-only but query-able.
  • Continuous aggregates — Materialized views that automatically refresh as new data arrives. Pre-compute rollups (hourly/daily averages) without manual cron jobs.
  • Data retention — Automated policies to drop or compress chunks older than N days. Much faster than DELETE (just drop the underlying table).
-- Create a hypertable (TimescaleDB)
CREATE TABLE metrics (
    time        TIMESTAMPTZ NOT NULL,
    device_id   INTEGER,
    temperature DOUBLE PRECISION,
    humidity    DOUBLE PRECISION
);
SELECT create_hypertable('metrics', 'time');

-- Continuous aggregate for hourly averages
CREATE MATERIALIZED VIEW metrics_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', time) AS bucket,
    device_id,
    AVG(temperature) AS avg_temp,
    MAX(temperature) AS max_temp
FROM metrics
GROUP BY bucket, device_id;

-- Retention policy: drop data older than 90 days
SELECT add_retention_policy('metrics', INTERVAL '90 days');

-- Compression policy: compress chunks older than 7 days
ALTER TABLE metrics SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id'
);
SELECT add_compression_policy('metrics', INTERVAL '7 days');

Time-Series DB Comparison

Feature | TimescaleDB | InfluxDB | Prometheus | ClickHouse
Query language | Full SQL | InfluxQL / Flux | PromQL | SQL dialect
Storage engine | PostgreSQL + chunks | TSM (custom LSM) | TSDB (custom) | MergeTree (columnar)
Compression | Native columnar | Run-length, delta | Gorilla encoding | LZ4, ZSTD, delta
Best for | IoT, metrics + relational | Metrics, DevOps | Monitoring, alerting | Analytics, OLAP
Joins | Full SQL joins | Limited | No | Yes
Scaling | Single-node + replicas | Cluster (enterprise) | Federation, Thanos | Distributed
"Choose Your Database" Decision Tree
Module 13

Cheat Sheet

Module summaries, rapid-fire Q&A, SQL quick reference, and database selection guide — your interview day companion.

Module Summary Cards

1. Database Fundamentals

  • ACID: Atomicity, Consistency, Isolation, Durability
  • BASE: Basically Available, Soft state, Eventually consistent
  • CAP: Pick 2 of Consistency, Availability, Partition tolerance
  • PACELC: If P then A/C else L/C tradeoff
  • Types: Relational, Document, Key-Value, Wide-Column, Graph, Time-Series

2. Relational Model & SQL

  • Normal forms: 1NF (atomic) → 2NF (no partial deps) → 3NF (no transitive deps) → BCNF
  • Joins: INNER, LEFT, RIGHT, FULL, CROSS, SELF
  • Window functions: ROW_NUMBER, RANK, DENSE_RANK, LEAD/LAG, SUM OVER
  • CTEs with WITH clause; recursive CTEs for hierarchies
  • Denormalize for read-heavy, normalize for write-heavy

3. Indexing

  • B+ Tree: O(log n) search, range queries, ordered traversal
  • Hash Index: O(1) exact match, no range queries
  • LSM Tree: O(1) write (append), used by RocksDB, Cassandra
  • Composite index: leftmost prefix rule
  • Covering index: index-only scan, avoids table lookup
  • GIN for full-text, GiST for spatial, BRIN for sorted data

4. Transactions & Concurrency

  • Isolation: Read Uncommitted → Read Committed → Repeatable Read → Serializable
  • Anomalies: dirty read, non-repeatable read, phantom read, write skew
  • MVCC: multiple row versions, snapshot reads without blocking writes
  • 2PL: growing phase (acquire) + shrinking phase (release)
  • Deadlock: waits-for graph cycle detection or timeout

5. Query Processing

  • Pipeline: Parse → Analyze → Optimize → Execute → Return
  • Cost-based optimizer uses pg_statistic, histograms, selectivity
  • Join algorithms: Nested Loop, Hash Join, Sort-Merge Join
  • EXPLAIN ANALYZE: actual vs estimated rows, sequential vs index scan
  • Always benchmark on production-sized data

6. Storage Engine Internals

  • Pages: 8KB fixed-size blocks, slotted page format
  • Buffer pool: page table, LRU/clock-sweep eviction, dirty tracking
  • WAL: write-ahead log for crash recovery, LSN ordering
  • InnoDB: clustered index, MVCC, row-level locking
  • fsync() forces disk flush, O_DIRECT bypasses OS cache

7. Replication

  • Single-leader: one writer, N followers, sync/async/semi-sync
  • Multi-leader: conflict resolution via LWW, CRDTs, custom rules
  • Leaderless: quorum W+R>N, sloppy quorum, hinted handoff
  • Raft: leader election, log replication, safety guarantee
  • Binlog formats: statement-based, row-based, mixed

8. Partitioning & Sharding

  • Horizontal (rows) vs vertical (columns) partitioning
  • Shard key: high cardinality, even distribution, query-aligned
  • Range (ordered, skew risk) vs hash (uniform, no range)
  • Consistent hashing with virtual nodes for resharding
  • Sharding is a last resort — try replicas, caching, optimization first

9. NoSQL Deep Dive

  • Redis: in-memory data structures, 100K+ ops/sec, RDB/AOF persistence
  • MongoDB: BSON documents, aggregation pipeline, change streams
  • Cassandra: wide-column, gossip protocol, LSM tree, linear scaling
  • Neo4j: property graph, Cypher queries, index-free adjacency
  • Modeling: embed for 1:few, reference for 1:many or many:many

10. Distributed Databases

  • 2PC: prepare/commit, blocking if coordinator fails
  • 3PC: adds pre-commit phase, non-blocking under certain failures
  • Saga: compensating transactions for long-lived workflows
  • Spanner: TrueTime API, GPS + atomic clocks, <7ms uncertainty
  • Vector clocks: detect causality and concurrent events

11. Caching

  • Strategies: cache-aside, read-through, write-through, write-behind, write-around
  • Redis: single-threaded, epoll multiplexing, 16384 hash slots
  • Eviction: allkeys-lru (most common), volatile-lru, allkeys-lfu
  • Invalidation: TTL, CDC/events, versioned keys
  • Failures: thundering herd (mutex), penetration (bloom), avalanche (jitter)

12. Database for System Design

  • Pool sizing: L = λW (Little's Law), or 2×cores + spindles
  • PgBouncer: transaction pooling mode for PostgreSQL
  • Read replicas: read-your-writes, monotonic reads patterns
  • Schema migration: gh-ost (binlog tailing), expand-contract
  • CREATE INDEX CONCURRENTLY to avoid write locks

Top 50 Database Rapid-Fire Q&A

# | Question | Answer
1 | What does ACID stand for? | Atomicity, Consistency, Isolation, Durability
2 | What does BASE stand for? | Basically Available, Soft state, Eventually consistent
3 | State the CAP theorem | A distributed system can guarantee at most 2 of: Consistency, Availability, Partition tolerance
4 | What is PACELC? | If Partition → choose A or C; Else → choose Latency or Consistency
5 | Name the 4 SQL isolation levels | Read Uncommitted, Read Committed, Repeatable Read, Serializable
6 | What is a dirty read? | Reading uncommitted data from another transaction
7 | What is a phantom read? | New rows appear in a repeated range query due to another transaction's insert
8 | What is write skew? | Two transactions read overlapping data, make decisions based on it, then write — violating a constraint neither saw
9 | How does MVCC work? | Each row has multiple versions; readers see a snapshot without blocking writers
10 | What is 2PL? | Two-Phase Locking: growing phase (acquire locks) + shrinking phase (release locks). Guarantees serializability
11 | B-Tree vs B+ Tree? | B+ Tree stores data only in leaves and links leaves for range scans; B-Tree stores data in all nodes
12 | Hash index vs B+ Tree? | Hash: O(1) exact match but no range queries. B+ Tree: O(log n) but supports range, sorting, prefix
13 | What is a covering index? | An index that contains all columns needed by a query — avoids heap/table lookup entirely
14 | Leftmost prefix rule? | Composite index (a, b, c) can satisfy queries on (a), (a,b), (a,b,c) but NOT (b), (c), or (b,c)
15 | What is an LSM Tree? | Log-Structured Merge Tree: writes go to memtable → flush to sorted SSTables → background compaction
16 | LSM vs B+ Tree tradeoffs? | LSM: faster writes, higher write amplification. B+: faster reads, slower writes, in-place updates
17 | What is write amplification? | Ratio of actual bytes written to disk vs logical bytes written. LSM trees suffer from compaction-related amplification
18 | What does EXPLAIN ANALYZE show? | The actual execution plan with real timing, row counts, and I/O stats — not just estimates
19 | Sequential scan vs index scan? | Seq scan reads the entire table. Index scan uses a B+ tree to find specific rows. Index scan wins for selective queries (<5-15% of rows)
20 | When is a sequential scan faster? | When reading most of the table (>15% of rows) — the random I/O cost of index lookups exceeds sequential read cost
21 | Name 3 join algorithms | Nested Loop Join, Hash Join, Sort-Merge Join
22 | When is Hash Join best? | Equi-joins on large tables where one side fits in memory (build hash table on the smaller side)
23 | What is a WAL? | Write-Ahead Log: all changes written to the log before data pages. Enables crash recovery by replaying the log
24 | Page size in PostgreSQL? | 8 KB (fixed). Oracle uses 8 KB default, MySQL/InnoDB uses 16 KB default
25 | What is the buffer pool? | In-memory cache of disk pages. Maps disk pages to memory frames. Uses clock-sweep (Postgres) or LRU (MySQL) for eviction
26 | InnoDB vs MyISAM? | InnoDB: ACID, row locks, MVCC, crash recovery. MyISAM: table locks, no transactions, faster full-text (legacy)
27 | Sync vs async replication? | Sync: leader waits for follower ACK (strong consistency, higher latency). Async: leader doesn't wait (risk of data loss on failover)
28 | What is a quorum? | W + R > N ensures a read sees the latest write. Common: N=3, W=2, R=2
29 | Raft vs Paxos? | Raft: easier to understand, strong leader. Paxos: more general and flexible, but harder to implement correctly
30 | What is a sloppy quorum? | During a network partition, accept writes on available nodes (not necessarily the designated replicas). Hinted handoff transfers data back later
31 | Range vs hash partitioning? | Range: preserves order (good for range queries, risk of hotspots). Hash: uniform distribution (no range queries)
32 | What is consistent hashing? | Hash both keys and nodes onto a ring. Keys map to the next clockwise node. Adding/removing nodes only moves K/N keys
33 | When to shard? | Last resort. First try: read replicas, caching, vertical scaling, query optimization, connection pooling
34 | Redis persistence options? | RDB (point-in-time snapshots) and AOF (append-only log of every write). Use both for best durability
35 | Why is Redis fast? | In-memory, single-threaded (no locks), epoll I/O multiplexing, efficient data structures (skiplists, ziplists)
36 | MongoDB vs PostgreSQL? | MongoDB: flexible schema, horizontal scaling, document model. PostgreSQL: ACID, complex queries, joins, mature ecosystem
37 | Cassandra partition key vs clustering key? | Partition key determines which node stores the data. Clustering key determines sort order within the partition
38 | What is 2PC? | Two-Phase Commit: coordinator asks all participants to PREPARE, then sends COMMIT/ABORT. Blocking if the coordinator fails
39 | Saga pattern? | Break a distributed transaction into local transactions with compensating actions. Choreography (events) or orchestration (coordinator)
40 | Linearizability vs serializability? | Linearizability: real-time ordering of single operations. Serializability: transactions appear to execute sequentially
41 | What is Google Spanner's TrueTime? | API returning a time interval [earliest, latest]. Uses GPS + atomic clocks for <7 ms uncertainty. Enables globally consistent transactions
42 | 5 caching strategies? | Cache-Aside, Read-Through, Write-Through, Write-Behind, Write-Around
43 | Cache-aside vs read-through? | Cache-aside: app manages the cache. Read-through: the cache itself loads from the DB on a miss
44 | What is a thundering herd? | Popular cache key expires → hundreds of requests simultaneously hit the DB. Fix: mutex, request coalescing
45 | What is cache penetration? | Repeated queries for non-existent data bypass the cache and hit the DB. Fix: Bloom filter or cache null results
46 | Connection pool sizing formula? | Little's Law: L = λW. Or HikariCP rule: 2 × cores + disk spindles
47 | PgBouncer pooling modes? | Session (per connection), Transaction (per txn, recommended), Statement (per query)
48 | What is gh-ost? | GitHub's online schema migration tool. Uses binlog tailing instead of triggers. Zero-downtime table alterations
49 | Why CREATE INDEX CONCURRENTLY? | Standard CREATE INDEX locks writes for the entire build. CONCURRENTLY allows concurrent DML at the cost of a 2-3x slower build
50 | TimescaleDB hypertable? | Auto-partitioned table by time. Looks like one table, internally chunked. Enables efficient time-range queries and retention policies

SQL Quick Reference

JOIN Syntax

-- INNER JOIN: rows matching in both tables
SELECT o.id, c.name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id;

-- LEFT JOIN: all rows from left + matching from right (NULL if no match)
SELECT c.name, o.id
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id;

-- FULL OUTER JOIN: all rows from both, NULL where no match
SELECT c.name, o.id
FROM customers c
FULL OUTER JOIN orders o ON c.id = o.customer_id;

-- CROSS JOIN: cartesian product (every combination)
SELECT c.name, p.name
FROM customers c
CROSS JOIN products p;

-- SELF JOIN: table joined to itself (e.g., manager hierarchy)
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;

Window Functions

-- ROW_NUMBER: unique sequential number per partition
SELECT name, department, salary,
       ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
FROM employees;

-- RANK vs DENSE_RANK: RANK skips after ties, DENSE_RANK doesn't
SELECT name, salary,
       RANK() OVER (ORDER BY salary DESC) AS rank,           -- 1, 2, 2, 4
       DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank  -- 1, 2, 2, 3
FROM employees;

-- LEAD/LAG: access next/previous row's value
SELECT date, revenue,
       LAG(revenue, 1) OVER (ORDER BY date) AS prev_revenue,
       revenue - LAG(revenue, 1) OVER (ORDER BY date) AS daily_change
FROM daily_sales;

-- Running total / cumulative sum
SELECT date, amount,
       SUM(amount) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM transactions;

CTEs & Recursive CTEs

-- Common Table Expression (CTE)
WITH top_customers AS (
    SELECT customer_id, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
    HAVING SUM(amount) > 10000
)
SELECT c.name, tc.total_spent
FROM top_customers tc
JOIN customers c ON tc.customer_id = c.id;

-- Recursive CTE: traverse an org hierarchy
WITH RECURSIVE org_tree AS (
    -- Base case: start from the CEO (no manager)
    SELECT id, name, manager_id, 0 AS depth
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive case: find reports of current level
    SELECT e.id, e.name, e.manager_id, ot.depth + 1
    FROM employees e
    JOIN org_tree ot ON e.manager_id = ot.id
)
SELECT REPEAT('  ', depth) || name AS org_chart, depth
FROM org_tree
ORDER BY depth, name;

UPSERT (INSERT ON CONFLICT)

-- PostgreSQL UPSERT
INSERT INTO page_views (url, view_count, last_viewed)
VALUES ('/about', 1, NOW())
ON CONFLICT (url)
DO UPDATE SET
    view_count = page_views.view_count + 1,
    last_viewed = EXCLUDED.last_viewed;

-- MySQL UPSERT
INSERT INTO page_views (url, view_count, last_viewed)
VALUES ('/about', 1, NOW())
ON DUPLICATE KEY UPDATE
    view_count = view_count + 1,
    last_viewed = NOW();

Aggregation with HAVING & Grouping

-- HAVING filters groups (WHERE filters rows before grouping)
SELECT department, COUNT(*) AS emp_count, AVG(salary) AS avg_salary
FROM employees
WHERE status = 'active'
GROUP BY department
HAVING COUNT(*) > 5 AND AVG(salary) > 80000
ORDER BY avg_salary DESC;

-- GROUPING SETS: multiple groupings in one query
SELECT department, region, SUM(salary) AS total
FROM employees
GROUP BY GROUPING SETS (
    (department, region),  -- by dept + region
    (department),          -- subtotal by dept
    ()                     -- grand total
);

Database Selection Decision Guide

Use this table during system design interviews to quickly justify your database choice:

Requirement | Recommended Database | Why
ACID transactions with complex queries | PostgreSQL | Full SQL, MVCC, strong consistency, rich extensions (PostGIS, TimescaleDB, pgvector)
ACID transactions with easy ops | MySQL / Aurora | Battle-tested, excellent replication, Aurora for managed auto-scaling
High write throughput, distributed | Cassandra | LSM tree (sequential writes), linear horizontal scaling, tunable consistency
Flexible schema with rich queries | MongoDB | Document model, aggregation pipeline, horizontal scaling via sharding
Caching / real-time leaderboards | Redis | In-memory, sub-ms latency, sorted sets, pub/sub, 100K+ ops/sec
Session store / rate limiting | Redis | Key expiry (TTL), atomic counters, distributed locks (Redlock)
Graph traversals (social, fraud) | Neo4j | Index-free adjacency, Cypher query language, efficient multi-hop traversals
Time-series data (metrics, IoT) | TimescaleDB / InfluxDB | Time-based partitioning, compression, continuous aggregates, retention policies
Full-text search | Elasticsearch | Inverted index, relevance scoring, fuzzy matching, faceted search
Global distribution with strong consistency | CockroachDB / Spanner | Distributed SQL, serializable isolation, automatic sharding, multi-region
Message queue / event streaming | Kafka (not a DB, but stores) | Durable log, exactly-once semantics, consumer groups, compacted topics
OLAP / analytics on large datasets | ClickHouse / BigQuery | Columnar storage, vectorized execution, massive parallel processing
Embedded database (mobile/edge) | SQLite | Serverless, zero-config, single-file, full SQL, used by every smartphone
Key-value with strong consistency | etcd / ZooKeeper | Consensus-based replication (Raft in etcd, ZAB in ZooKeeper), linearizable reads/writes, service discovery, config storage
Multi-model (document + graph + search) | ArangoDB / FaunaDB | Single engine for multiple data models, reduces operational complexity
💡 Interview tip: In system design, always state your database choice with a reason. Don't just say "I'll use PostgreSQL." Say: "I'll use PostgreSQL because we need ACID transactions for payment processing, and the read:write ratio is 10:1 which suits its MVCC model. For the recommendation feed, I'll use Redis as a cache with a 5-minute TTL to reduce query load." Showing you understand the tradeoff is more important than the specific choice.