MODULE 1

Fallacies of Distributed Computing

Eight assumptions that wreck systems. Memorize. Every design review: ask which one you're violating.

The Eight Fallacies

  1. The network is reliable — packets drop, links flap, partitions happen.
  2. Latency is zero — RTT inside region ~1 ms, cross-region 50–200 ms, satellite 600 ms+.
  3. Bandwidth is infinite — 10 Gbps NIC = 1.25 GB/s shared by all flows.
  4. The network is secure — assume hostile until TLS + authn + authz proven.
  5. Topology doesn't change — VMs migrate, DNS shifts, ASGs scale.
  6. There is one administrator — multi-team, multi-cloud reality.
  7. Transport cost is zero — egress is metered; serialization burns CPU.
  8. The network is homogeneous — different vendors, MTUs, jitter, ECMP.

MODULE 2

Failure Models

What kinds of failures must your protocol tolerate? Wrong assumption = unsound system.

Hierarchy

Model              | Allowed faults                                       | Example protocols
Crash-stop         | Process halts; no recovery                           | Simple replication
Crash-recovery     | Process halts, then restarts (loses volatile state)  | Raft, Paxos with stable storage
Omission           | Messages lost (on send or receive)                   | TCP retransmits, idempotent RPC
Network partition  | Subset of nodes can't reach others                   | Quorum systems
Byzantine          | Arbitrary behavior, including malicious              | PBFT, Tendermint, blockchain

FLP Impossibility

In an asynchronous system with even one crash failure, no deterministic consensus algorithm can guarantee progress. Real systems sidestep it with timeouts and failure detectors (partial or eventual synchrony assumptions) or with randomization.

MODULE 3

Consensus: Paxos, Multi-Paxos, Raft

Agree on a single value despite up to f crash failures among 2f+1 nodes.

Basic Paxos

  • Roles: Proposer, Acceptor, Learner. Often colocated.
  • Phase 1 (Prepare): Proposer picks ballot n > any seen. Sends Prepare(n) to majority. Acceptor promises not to accept < n; returns highest accepted (n', v').
  • Phase 2 (Accept): Proposer sends Accept(n, v) where v = highest v' or own value. Acceptor accepts iff hasn't promised higher.
  • Decision: Once majority accept (n, v), value chosen. Learners notified.
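
A minimal, single-value sketch of the acceptor side of both phases (in-memory state only; the class and field names are illustrative, not from any particular implementation):

# Single-value Paxos acceptor: tracks highest promised ballot and last accepted (ballot, value).
class Acceptor:
    def __init__(self):
        self.promised_n = -1      # highest ballot promised so far
        self.accepted_n = -1      # ballot of last accepted proposal
        self.accepted_v = None    # value of last accepted proposal

    def on_prepare(self, n):
        # Phase 1: promise not to accept anything below n; report any prior accepted value.
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_v)
        return ("nack", self.promised_n, None)

    def on_accept(self, n, v):
        # Phase 2: accept iff we have not promised a higher ballot.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_v = v
            return ("accepted", n)
        return ("nack", self.promised_n)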

Raft

  • Strong leader: all writes go through leader; followers replicate.
  • Term = monotonically increasing election epoch.
  • Log entry = (term, index, command). Committed when replicated to majority.
  • Election: follower times out → becomes candidate → requests votes for new term → wins if majority.
  • Log matching: if two logs contain an entry with the same index and term, the logs are identical in all entries up through that index.
  • Safety: leader must contain all committed entries (election restriction).
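
A minimal sketch of the leader-side commit rule: an entry is committed once a majority of matchIndex values reach it and the entry belongs to the leader's current term (function and parameter names are illustrative; the leader's own last index is assumed to be included in match_index):

# Leader-side commit rule: majority replication AND entry from the current term.
def advance_commit_index(commit_index, match_index, log_terms, current_term):
    # match_index: highest replicated index per node (leader included).
    # log_terms: dict index -> term for the leader's log.
    for n in sorted(set(match_index), reverse=True):
        if n <= commit_index:
            break
        replicated_on = sum(1 for m in match_index if m >= n)
        if replicated_on > len(match_index) // 2 and log_terms.get(n) == current_term:
            return n
    return commit_index

# Example: 5 nodes, entry 7 from the current term is replicated on a majority.
print(advance_commit_index(5, [9, 7, 7, 5, 4], {6: 2, 7: 3, 8: 3, 9: 3}, current_term=3))  # -> 7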

ZAB & Variants

ZooKeeper Atomic Broadcast: total-order broadcast for primary-backup. Used in ZooKeeper. Similar safety to Raft, optimized for read-heavy workloads with watches.

MODULE 4

Leader Election & Leases

Single-writer for performance; lease for safety against split-brain.

Leases

  • Time-bounded grant of authority. Lease holder = leader for duration T.
  • If holder fails, others wait T then re-elect. No risk of two writers IF clocks are bounded-skew.
  • Renewable: holder refreshes before expiry.
  • Use case: Chubby/etcd lock service backing GFS master, Bigtable tablet leader.
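
A minimal sketch of a holder checking lease validity before acting, treating the lease as expired slightly early to absorb bounded clock skew (MAX_CLOCK_SKEW and the timestamps are assumptions for illustration):

import time

MAX_CLOCK_SKEW = 0.25   # assumed bound on skew vs. the lock service, in seconds

class Lease:
    def __init__(self, granted_at, duration):
        self.granted_at = granted_at   # local timestamp when the lease was granted
        self.duration = duration       # lease length T granted by the lock service

    def still_valid(self, now=None):
        # Treat the lease as expired MAX_CLOCK_SKEW early, so this holder stops
        # acting before any other node can legitimately become leader.
        now = time.monotonic() if now is None else now
        return now < self.granted_at + self.duration - MAX_CLOCK_SKEW

lease = Lease(granted_at=time.monotonic(), duration=10.0)
if lease.still_valid():
    pass  # safe to act as leader; renew well before expiry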

Fencing Tokens

// Lock service issues monotonically increasing token on lease grant
client A → lock acquired, token=33
client A → GC pause 30s
client B → lease expired, lock acquired, token=34
client A → resumes, writes "hello" with token=33
storage → rejects: max seen token is 34

MODULE 5

Logical Clocks & Ordering

Without global wall clock, how do you order events?

Lamport Clocks

  • Each process keeps counter L. Increment on local event. On send, attach L. On receive, L = max(L, received) + 1.
  • Captures happens-before (→) but not concurrency: a → b ⇒ L(a) < L(b), but L(a) < L(b) does not imply a → b.
  • Use total order via (L, process_id) tiebreak.
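
A minimal Lamport clock sketch following the rules above (sending is treated as a local event):

class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        # Sending counts as a local event; attach the counter to the message.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Merge: jump past both our clock and the sender's.
        self.time = max(self.time, msg_time) + 1
        return self.time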

Vector Clocks

  • Each process i keeps vector V[1..N]. On local event: V[i]++. On send: attach V. On receive: V[j] = max(V[j], received[j]) for all j; then V[i]++.
  • Captures concurrency: a || b iff neither V(a) ≤ V(b) nor V(b) ≤ V(a).
  • Cost: O(N) per message. Variants: dotted version vectors (Riak), interval tree clocks.
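
A minimal vector clock sketch for a fixed set of N processes, including the concurrency test from the bullets above:

class VectorClock:
    def __init__(self, n, i):
        self.v = [0] * n      # one slot per process
        self.i = i            # this process's index

    def local_event(self):
        self.v[self.i] += 1

    def send(self):
        self.local_event()
        return list(self.v)   # attach a copy to the outgoing message

    def receive(self, other):
        # Element-wise max with the incoming vector, then tick our own slot.
        self.v = [max(a, b) for a, b in zip(self.v, other)]
        self.v[self.i] += 1

def happened_before(a, b):
    return all(x <= y for x, y in zip(a, b)) and a != b

def concurrent(a, b):
    # a || b iff neither vector dominates the other.
    return not happened_before(a, b) and not happened_before(b, a)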

Hybrid Logical Clock (HLC)

Combines physical time with a logical counter. Preserves happens-before (e → f ⇒ HLC(e) < HLC(f)) while staying close to wall-clock time. CockroachDB and YugabyteDB use HLC for MVCC ordering.

Spanner TrueTime

Google's API: TT.now() returns an interval [earliest, latest] with bounded uncertainty (typically a few ms, via GPS + atomic clocks). Spanner waits out the uncertainty to provide externally consistent (strict-serializable) transactions. Cost: extra commit latency proportional to the clock uncertainty.
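
A minimal sketch of the commit-wait idea: pick the commit timestamp at the top of the uncertainty interval, then wait until that timestamp is guaranteed to be in the past on every node (tt_now is a stand-in for TT.now(), which is not a public API; epsilon models the uncertainty bound):

import time

def tt_now(epsilon=0.004):
    # Stand-in for TrueTime: an interval [earliest, latest] around local time,
    # where epsilon models the bounded clock uncertainty (a few ms in Spanner).
    now = time.time()
    return (now - epsilon, now + epsilon)

def commit_wait():
    # Choose commit timestamp s = TT.now().latest, then block until TT.now().earliest > s.
    _, s = tt_now()
    while tt_now()[0] <= s:
        time.sleep(0.001)
    return s   # by now, s is in the past on every correctly synchronized node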

MODULE 6

Replication Models

Where copies live, who can write, how updates propagate.

Modes

Mode                                   | Latency                    | Durability                       | Use
Sync (primary waits for all replicas)  | Highest (slowest replica)  | Highest                          | Financial ledger
Quorum (W of N ack)                    | Slowest of W               | Tunable                          | Dynamo, Cassandra
Async (primary acks immediately)       | Lowest                     | Risk of loss on primary failure  | MySQL async replicas
Chain replication                      | O(N) for a write           | Highest                          | FAWN, CRAQ

Quorum Math

R + W > N guarantees that the read quorum overlaps the write quorum in at least one replica, so a read sees the latest acknowledged write (strong consistency). R + W ≤ N gives only eventual consistency. Common: N=3, R=2, W=2.
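
A tiny worked check of the overlap condition (the helper is hypothetical, not any library's API):

def quorums_overlap(n, r, w):
    # R + W > N  <=>  any read quorum intersects any write quorum in at least one replica.
    return r + w > n

print(quorums_overlap(3, 2, 2))   # True  -> reads see the latest acknowledged write
print(quorums_overlap(3, 1, 2))   # False -> only eventual consistency
print(quorums_overlap(5, 2, 4))   # True  -> read-heavy: cheap reads, expensive writes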

Chain Replication

Nodes form a chain: head → middle → tail. Writes enter at the head and propagate down the chain; reads are served by the tail. Properties: strong consistency, write throughput bounded by a single node, simple failure handling. The CRAQ variant allows reads from any node (with a version check) for higher read throughput.

MODULE 7

Consistency Models

Spectrum of guarantees. Pick the weakest that satisfies your invariant.

Spectrum (Strongest → Weakest)

Model                                               | Definition                                                                | Cost
Strict serializable (linearizable + serializable)   | Real-time order + serial schedule                                         | Spanner-class (TrueTime / consensus)
Linearizable                                        | Single object, real-time order                                            | Quorum + consensus
Serializable                                        | Equivalent to some serial order; no real-time guarantee                   | 2PL, SSI
Snapshot isolation                                  | Reads from a consistent snapshot; writes don't conflict                   | MVCC; allows write skew
Sequential                                          | All processes see the same order; not necessarily real-time               | Total-order broadcast
Causal                                              | Causally related ops ordered; concurrent ops may differ across replicas   | Vector clocks
Read-your-writes / monotonic reads / monotonic writes / writes-follow-reads | Session guarantees                | Sticky sessions / version pinning
Eventual                                            | If no new writes, replicas converge eventually                            | Cheapest; no ordering

CAP / PACELC

CAP: in a partition (P), choose Consistency or Availability. PACELC extends: when no partition (Else), choose Latency or Consistency. Example: Cassandra = AP + EL. Spanner = CP + EC.

MODULE 8

CRDTs

Replicated data types that converge automatically without coordination.

Common CRDTs

CRDT          | Ops                        | Merge
G-Counter     | Increment                  | Per-replica vector; merge = element-wise max; value = sum
PN-Counter    | Inc / Dec                  | Two G-Counters: positives - negatives
G-Set         | Add                        | Union
2P-Set        | Add, Remove (once)         | Adds ∪ Tombstones
OR-Set        | Add, Remove (re-addable)   | Tag each add with a unique ID
LWW-Register  | Write                      | Highest timestamp wins
RGA / Logoot  | Insert / Delete chars      | Collaborative text editing
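
A minimal G-Counter sketch matching the first row of the table (per-replica map, element-wise max on merge, value = sum):

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}                      # replica_id -> increments seen from that replica

    def increment(self, amount=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max: commutative, associative, idempotent -> replicas converge.
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

a, b = GCounter("a"), GCounter("b")
a.increment(); a.increment(); b.increment()
a.merge(b); b.merge(a)
assert a.value() == b.value() == 3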

Production Uses

  • Riak — OR-Set for shopping carts.
  • Redis CRDB — counters across regions.
  • Figma / Linear / Notion — text + tree CRDTs for offline-first editing.
  • DynamoDB Streams + custom merge — convergent shopping cart.

MODULE 9

Distributed Transactions

When atomicity must span multiple services / shards.

Two-Phase Commit (2PC)

  • Coordinator → Prepare → all participants vote yes/no.
  • If all yes: Commit. If any no or timeout: Abort.
  • Blocking: if coordinator crashes between phases, participants stuck holding locks.
  • 3PC adds pre-commit phase for non-blocking under crash-stop, but assumes synchrony — rarely used in practice.
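
A minimal sketch of the coordinator's decision logic (the participant interface with prepare/commit/abort is assumed; a real coordinator must also durably log its decision before phase 2):

def two_phase_commit(participants, timeout=1.0):
    # Phase 1: ask every participant to prepare (vote yes/no).
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare(timeout=timeout))   # assumed to return True/False
        except Exception:
            votes.append(False)                        # treat timeout/failure as a "no" vote

    # Phase 2: commit only on unanimous yes; otherwise abort everywhere.
    # If the coordinator crashes between the phases, participants stay blocked holding locks.
    decision = all(votes)
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision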

Sagas

  • Long-running business transaction = sequence of local txns + compensating actions.
  • Forward path: T1 → T2 → T3. If T3 fails, run C2, C1 to undo.
  • Orchestration (central state machine) vs choreography (event-driven). Orchestration easier to debug.
  • Not atomic — observers see partial states. Use semantic locks / status fields.

TCC (Try-Confirm-Cancel)

Reservation pattern: Try reserves resources without final commit. Confirm finalizes. Cancel releases. Used in payment + inventory systems.

Transactional Outbox

To publish event + commit DB row atomically: write event to outbox table in same DB txn. Separate process tails outbox, publishes to message bus, marks rows sent. Eliminates dual-write inconsistency.
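
A minimal outbox sketch, using SQLite as a stand-in for the service's database and print as a stand-in for the message bus; table and function names are illustrative:

import sqlite3, json

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT, sent INTEGER DEFAULT 0)")

def place_order(total):
    # Business row and event are written in the SAME local transaction: both or neither.
    with db:
        cur = db.execute("INSERT INTO orders (total) VALUES (?)", (total,))
        event = json.dumps({"type": "order_placed", "order_id": cur.lastrowid, "total": total})
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (event,))

def relay(publish):
    # Separate process: tail unsent outbox rows, publish, then mark them sent.
    rows = db.execute("SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id").fetchall()
    for row_id, payload in rows:
        publish(payload)                       # delivery is at-least-once; consumers must be idempotent
        db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    db.commit()

place_order(42.0)
relay(publish=print)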

MODULE 10

Anti-Entropy & Gossip

Background reconciliation when synchronous coordination is too costly.

Gossip Protocols

  • Epidemic spread: each round, node picks k random peers, exchanges state.
  • Convergence: O(log N) rounds for full propagation.
  • Resilient to churn; no central coordinator.
  • Used in: Cassandra membership, Consul / HashiCorp Serf, Bitcoin block propagation.
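
A minimal push-gossip round sketch over an in-memory cluster, merging state by per-key version number (node structure, fan-out k, and the versioning scheme are illustrative):

import random

class Node:
    def __init__(self, name):
        self.name = name
        self.state = {}                     # key -> (version, value)

    def merge(self, incoming):
        # Keep whichever copy of each key has the higher version.
        for key, (ver, val) in incoming.items():
            if key not in self.state or ver > self.state[key][0]:
                self.state[key] = (ver, val)

def gossip_round(nodes, k=3):
    # Each round: every node pushes its full state to k random peers.
    for node in nodes:
        for peer in random.sample([n for n in nodes if n is not node], k):
            peer.merge(node.state)

nodes = [Node(f"n{i}") for i in range(10)]
nodes[0].state["leader"] = (1, "n0")
for _ in range(5):                          # O(log N) rounds reach everyone with high probability
    gossip_round(nodes)
print(sum("leader" in n.state for n in nodes), "of", len(nodes), "nodes know the key")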

Merkle Trees for Anti-Entropy

  • Hash leaves of dataset bottom-up. To compare two replicas, exchange root hash. If equal: identical. If different: descend differing subtrees.
  • Bandwidth: O(divergence) instead of O(dataset).
  • Used: Dynamo, Cassandra, IPFS, git pack negotiation.
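
A minimal sketch of Merkle-tree comparison between two replicas: build a binary hash tree over data buckets, then descend only into subtrees whose hashes differ (assumes a power-of-two bucket count; hashing and bucketing choices are illustrative):

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaves):
    # leaves: bucket hashes. Returns a list of levels, leaves first, root last.
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def diff(tree_a, tree_b, level=None, idx=0):
    # Walk both trees top-down, recursing only where hashes differ.
    if level is None:
        level = len(tree_a) - 1
    if tree_a[level][idx] == tree_b[level][idx]:
        return []                              # identical subtree: nothing to sync
    if level == 0:
        return [idx]                           # differing leaf bucket -> exchange its keys
    return diff(tree_a, tree_b, level - 1, 2 * idx) + diff(tree_a, tree_b, level - 1, 2 * idx + 1)

a = build_tree([h(x) for x in [b"k0v0", b"k1v1", b"k2v2", b"k3v3"]])
b = build_tree([h(x) for x in [b"k0v0", b"k1vX", b"k2v2", b"k3v3"]])
print(diff(a, b))   # [1] -> only bucket 1 needs to be exchanged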

Read Repair & Hinted Handoff

  • Read repair: on read, if replicas disagree, write back the latest version to stale replicas.
  • Hinted handoff: when target replica is down, coordinator stores hint locally and replays when target recovers.

MODULE 11

Cheat Sheet

Quick-reference for design discussions.

Pick a Consistency Model

  • Money / inventory → linearizable / serializable
  • Social feed → causal + RYW
  • Counter (likes, views) → eventual + CRDT
  • Collab editor → causal + CRDT
  • Sessions / cache → RYW + monotonic reads

Quorum Defaults

  • Strong: N=3, R=W=2
  • Read-heavy strong: N=5, R=2, W=4
  • Write-heavy strong: N=5, R=4, W=2
  • Cross-region eventual: per-region quorum + async replication

Atomicity Across Services

  • Same DB → ACID txn
  • DB + message bus → outbox
  • Multiple services, business txn → saga
  • Reservations → TCC
  • Avoid 2PC across microservices

Failure Detection

  • Heartbeat + timeout — false positives under load
  • Phi-accrual — adaptive timeout (Cassandra)
  • Gossip + suspicion — eventually consistent membership
  • Lease expiry + fence token — safe leader change

Consensus Choice

  • Need leader election + log → Raft / Multi-Paxos
  • Need k-v store with watches → etcd / ZooKeeper
  • Cross-region strong → Spanner-class (TrueTime / HLC)
  • Byzantine threat → PBFT / Tendermint (3f+1 nodes)

Numbers to Memorize

  • Same-AZ RTT ~0.5 ms
  • Cross-AZ RTT ~1–2 ms
  • Cross-region RTT 50–200 ms
  • Disk fsync ~1–10 ms (SSD)
  • Raft log replication: ~1 RTT to majority