MODULE 1

Fallacies of Distributed Computing

Eight assumptions that wreck systems. Memorize. Every design review: ask which one you're violating.

The Eight Fallacies

  1. The network is reliable — packets drop, links flap, partitions happen.
  2. Latency is zero — RTT inside region ~1 ms, cross-region 50–200 ms, satellite 600 ms+.
  3. Bandwidth is infinite — 10 Gbps NIC = 1.25 GB/s shared by all flows.
  4. The network is secure — assume hostile until TLS + authn + authz proven.
  5. Topology doesn't change — VMs migrate, DNS shifts, ASGs scale.
  6. There is one administrator — multi-team, multi-cloud reality.
  7. Transport cost is zero — egress is metered; serialization burns CPU.
  8. The network is homogeneous — different vendors, MTUs, jitter, ECMP.

MODULE 2

Failure Models

What kinds of failures must your protocol tolerate? Wrong assumption = unsound system.

Hierarchy

Model              | Allowed faults                                       | Example protocols
Crash-stop         | Process halts; no recovery                           | Simple replication
Crash-recovery     | Process halts, then restarts (loses volatile state)  | Raft, Paxos with stable storage
Omission           | Messages lost (on send or receive)                   | TCP retransmits, idempotent RPC
Network partition  | Subset of nodes can't reach others                   | Quorum systems
Byzantine          | Arbitrary behavior, including malicious              | PBFT, Tendermint, blockchain

FLP Impossibility

In an asynchronous system with even one crash failure, no deterministic consensus algorithm can guarantee progress. Real systems sidestep it with timeouts and failure detectors (partial or eventual synchrony assumptions) or with randomization.

MODULE 3

Consensus: Paxos, Multi-Paxos, Raft

Agree on a single value despite up to f crash failures among 2f+1 nodes.

Basic Paxos

  • Roles: Proposer, Acceptor, Learner. Often colocated.
  • Phase 1 (Prepare): Proposer picks ballot n > any seen. Sends Prepare(n) to majority. Acceptor promises not to accept < n; returns highest accepted (n', v').
  • Phase 2 (Accept): Proposer sends Accept(n, v) where v = highest v' or own value. Acceptor accepts iff hasn't promised higher.
  • Decision: Once majority accept (n, v), value chosen. Learners notified.
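
A minimal, single-value sketch of the acceptor side of both phases (in-memory state only; the class and field names are illustrative, not from any particular implementation):

# Single-value Paxos acceptor: tracks highest promised ballot and last accepted (ballot, value).
class Acceptor:
    def __init__(self):
        self.promised_n = -1      # highest ballot promised so far
        self.accepted_n = -1      # ballot of last accepted proposal
        self.accepted_v = None    # value of last accepted proposal

    def on_prepare(self, n):
        # Phase 1: promise not to accept anything below n; report any prior accepted value.
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_v)
        return ("nack", self.promised_n, None)

    def on_accept(self, n, v):
        # Phase 2: accept iff we have not promised a higher ballot.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_v = v
            return ("accepted", n)
        return ("nack", self.promised_n)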

Raft

  • Strong leader: all writes go through leader; followers replicate.
  • Term = monotonically increasing election epoch.
  • Log entry = (term, index, command). Committed when replicated to majority.
  • Election: follower times out → becomes candidate → requests votes for new term → wins if majority.
  • Log matching: if two logs contain an entry with the same index and term, the logs are identical in all entries up through that index.
  • Safety: leader must contain all committed entries (election restriction).
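
A minimal sketch of the leader-side commit rule: an entry is committed once a majority of matchIndex values reach it and the entry belongs to the leader's current term (function and parameter names are illustrative; the leader's own last index is assumed to be included in match_index):

# Leader-side commit rule: majority replication AND entry from the current term.
def advance_commit_index(commit_index, match_index, log_terms, current_term):
    # match_index: highest replicated index per node (leader included).
    # log_terms: dict index -> term for the leader's log.
    for n in sorted(set(match_index), reverse=True):
        if n <= commit_index:
            break
        replicated_on = sum(1 for m in match_index if m >= n)
        if replicated_on > len(match_index) // 2 and log_terms.get(n) == current_term:
            return n
    return commit_index

# Example: 5 nodes, entry 7 from the current term is replicated on a majority.
print(advance_commit_index(5, [9, 7, 7, 5, 4], {6: 2, 7: 3, 8: 3, 9: 3}, current_term=3))  # -> 7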

ZAB & Variants

ZooKeeper Atomic Broadcast: total-order broadcast for primary-backup. Used in ZooKeeper. Similar safety to Raft, optimized for read-heavy workloads with watches.

MODULE 4

Leader Election & Leases

Single-writer for performance; lease for safety against split-brain.

Leases

  • Time-bounded grant of authority. Lease holder = leader for duration T.
  • If holder fails, others wait T then re-elect. No risk of two writers IF clocks are bounded-skew.
  • Renewable: holder refreshes before expiry.
  • Use case: Chubby/etcd lock service backing GFS master, Bigtable tablet leader.
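
A minimal sketch of a holder checking lease validity before acting, treating the lease as expired slightly early to absorb bounded clock skew (MAX_CLOCK_SKEW and the timestamps are assumptions for illustration):

import time

MAX_CLOCK_SKEW = 0.25   # assumed bound on skew vs. the lock service, in seconds

class Lease:
    def __init__(self, granted_at, duration):
        self.granted_at = granted_at   # local timestamp when the lease was granted
        self.duration = duration       # lease length T granted by the lock service

    def still_valid(self, now=None):
        # Treat the lease as expired MAX_CLOCK_SKEW early, so this holder stops
        # acting before any other node can legitimately become leader.
        now = time.monotonic() if now is None else now
        return now < self.granted_at + self.duration - MAX_CLOCK_SKEW

lease = Lease(granted_at=time.monotonic(), duration=10.0)
if lease.still_valid():
    pass  # safe to act as leader; renew well before expiry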

Fencing Tokens

// Lock service issues monotonically increasing token on lease grant
client A → lock acquired, token=33
client A → GC pause 30s
client B → lease expired, lock acquired, token=34
client A → resumes, writes "hello" with token=33
storage → rejects: max seen token is 34

MODULE 5

Logical Clocks & Ordering

Without global wall clock, how do you order events?

Lamport Clocks

  • Each process keeps counter L. Increment on local event. On send, attach L. On receive, L = max(L, received) + 1.
  • Captures happens-before (→) but not concurrency: a → b ⇒ L(a) < L(b), but L(a) < L(b) does not imply a → b.
  • Use total order via (L, process_id) tiebreak.
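
A minimal Lamport clock sketch following the rules above (sending is treated as a local event):

class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        # Sending counts as a local event; attach the counter to the message.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Merge: jump past both our clock and the sender's.
        self.time = max(self.time, msg_time) + 1
        return self.time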

Vector Clocks

  • Each process i keeps vector V[1..N]. On local event: V[i]++. On send: attach V. On receive: V[j] = max(V[j], received[j]) for all j; then V[i]++.
  • Captures concurrency: a || b iff neither V(a) ≤ V(b) nor V(b) ≤ V(a).
  • Cost: O(N) per message. Variants: dotted version vectors (Riak), interval tree clocks.
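
A minimal vector clock sketch for a fixed set of N processes, including the concurrency test from the bullets above:

class VectorClock:
    def __init__(self, n, i):
        self.v = [0] * n      # one slot per process
        self.i = i            # this process's index

    def local_event(self):
        self.v[self.i] += 1

    def send(self):
        self.local_event()
        return list(self.v)   # attach a copy to the outgoing message

    def receive(self, other):
        # Element-wise max with the incoming vector, then tick our own slot.
        self.v = [max(a, b) for a, b in zip(self.v, other)]
        self.v[self.i] += 1

def happened_before(a, b):
    return all(x <= y for x, y in zip(a, b)) and a != b

def concurrent(a, b):
    # a || b iff neither vector dominates the other.
    return not happened_before(a, b) and not happened_before(b, a)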

Hybrid Logical Clock (HLC)

Combines physical time with a logical counter. Preserves happens-before (e → f ⇒ HLC(e) < HLC(f)) while staying close to wall-clock time. CockroachDB and YugabyteDB use HLC for MVCC ordering.

Spanner TrueTime

Google's API: TT.now() returns an interval [earliest, latest] with bounded uncertainty (typically a few ms, via GPS + atomic clocks). Spanner waits out the uncertainty to provide externally consistent (strict-serializable) transactions. Cost: extra commit latency proportional to the clock uncertainty.
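
A minimal sketch of the commit-wait idea: pick the commit timestamp at the top of the uncertainty interval, then wait until that timestamp is guaranteed to be in the past on every node (tt_now is a stand-in for TT.now(), which is not a public API; epsilon models the uncertainty bound):

import time

def tt_now(epsilon=0.004):
    # Stand-in for TrueTime: an interval [earliest, latest] around local time,
    # where epsilon models the bounded clock uncertainty (a few ms in Spanner).
    now = time.time()
    return (now - epsilon, now + epsilon)

def commit_wait():
    # Choose commit timestamp s = TT.now().latest, then block until TT.now().earliest > s.
    _, s = tt_now()
    while tt_now()[0] <= s:
        time.sleep(0.001)
    return s   # by now, s is in the past on every correctly synchronized node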

MODULE 6

Replication Models

Where copies live, who can write, how updates propagate.

Modes

Mode                                   | Latency                    | Durability                       | Use
Sync (primary waits for all replicas)  | Highest (slowest replica)  | Highest                          | Financial ledger
Quorum (W of N ack)                    | Slowest of W               | Tunable                          | Dynamo, Cassandra
Async (primary acks immediately)       | Lowest                     | Risk of loss on primary failure  | MySQL async replicas
Chain replication                      | O(N) for a write           | Highest                          | FAWN, CRAQ

Quorum Math

R + W > N guarantees that the read quorum overlaps the write quorum in at least one replica, so a read sees the latest acknowledged write (strong consistency). R + W ≤ N gives only eventual consistency. Common: N=3, R=2, W=2.
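
A tiny worked check of the overlap condition (the helper is hypothetical, not any library's API):

def quorums_overlap(n, r, w):
    # R + W > N  <=>  any read quorum intersects any write quorum in at least one replica.
    return r + w > n

print(quorums_overlap(3, 2, 2))   # True  -> reads see the latest acknowledged write
print(quorums_overlap(3, 1, 2))   # False -> only eventual consistency
print(quorums_overlap(5, 2, 4))   # True  -> read-heavy: cheap reads, expensive writes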

Chain Replication

Nodes form a chain: head → middle → tail. Writes enter at the head and propagate down the chain; reads are served by the tail. Properties: strong consistency, write throughput bounded by a single node, simple failure handling. The CRAQ variant allows reads from any node (with a version check) for higher read throughput.

MODULE 7

Consistency Models

Spectrum of guarantees. Pick the weakest that satisfies your invariant.

Spectrum (Strongest → Weakest)

Model                                               | Definition                                                                | Cost
Strict serializable (linearizable + serializable)   | Real-time order + serial schedule                                         | Spanner-class (TrueTime / consensus)
Linearizable                                        | Single object, real-time order                                            | Quorum + consensus
Serializable                                        | Equivalent to some serial order; no real-time guarantee                   | 2PL, SSI
Snapshot isolation                                  | Reads from a consistent snapshot; writes don't conflict                   | MVCC; allows write skew
Sequential                                          | All processes see the same order; not necessarily real-time               | Total-order broadcast
Causal                                              | Causally related ops ordered; concurrent ops may differ across replicas   | Vector clocks
Read-your-writes / monotonic reads / monotonic writes / writes-follow-reads | Session guarantees                | Sticky sessions / version pinning
Eventual                                            | If no new writes, replicas converge eventually                            | Cheapest; no ordering

CAP / PACELC

CAP: in a partition (P), choose Consistency or Availability. PACELC extends: when no partition (Else), choose Latency or Consistency. Example: Cassandra = AP + EL. Spanner = CP + EC.

MODULE 8

CRDTs

Replicated data types that converge automatically without coordination.

Common CRDTs

CRDT          | Ops                        | Merge
G-Counter     | Increment                  | Per-replica vector; merge = element-wise max; value = sum
PN-Counter    | Inc / Dec                  | Two G-Counters: positives - negatives
G-Set         | Add                        | Union
2P-Set        | Add, Remove (once)         | Adds ∪ Tombstones
OR-Set        | Add, Remove (re-addable)   | Tag each add with a unique ID
LWW-Register  | Write                      | Highest timestamp wins
RGA / Logoot  | Insert / Delete chars      | Collaborative text editing
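
A minimal G-Counter sketch matching the first row of the table (per-replica map, element-wise max on merge, value = sum):

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}                      # replica_id -> increments seen from that replica

    def increment(self, amount=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max: commutative, associative, idempotent -> replicas converge.
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

a, b = GCounter("a"), GCounter("b")
a.increment(); a.increment(); b.increment()
a.merge(b); b.merge(a)
assert a.value() == b.value() == 3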

Production Uses

  • Riak — OR-Set for shopping carts.
  • Redis CRDB — counters across regions.
  • Figma / Linear / Notion — text + tree CRDTs for offline-first editing.
  • DynamoDB Streams + custom merge — convergent shopping cart.

MODULE 9

Distributed Transactions

When atomicity must span multiple services / shards.

Two-Phase Commit (2PC)

  • Coordinator → Prepare → all participants vote yes/no.
  • If all yes: Commit. If any no or timeout: Abort.
  • Blocking: if coordinator crashes between phases, participants stuck holding locks.
  • 3PC adds pre-commit phase for non-blocking under crash-stop, but assumes synchrony — rarely used in practice.
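
A minimal sketch of the coordinator's decision logic (the participant interface with prepare/commit/abort is assumed; a real coordinator must also durably log its decision before phase 2):

def two_phase_commit(participants, timeout=1.0):
    # Phase 1: ask every participant to prepare (vote yes/no).
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare(timeout=timeout))   # assumed to return True/False
        except Exception:
            votes.append(False)                        # treat timeout/failure as a "no" vote

    # Phase 2: commit only on unanimous yes; otherwise abort everywhere.
    # If the coordinator crashes between the phases, participants stay blocked holding locks.
    decision = all(votes)
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision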

Sagas

  • Long-running business transaction = sequence of local txns + compensating actions.
  • Forward path: T1 → T2 → T3. If T3 fails, run C2, C1 to undo.
  • Orchestration (central state machine) vs choreography (event-driven). Orchestration easier to debug.
  • Not atomic — observers see partial states. Use semantic locks / status fields.

TCC (Try-Confirm-Cancel)

Reservation pattern: Try reserves resources without final commit. Confirm finalizes. Cancel releases. Used in payment + inventory systems.

Transactional Outbox

To publish event + commit DB row atomically: write event to outbox table in same DB txn. Separate process tails outbox, publishes to message bus, marks rows sent. Eliminates dual-write inconsistency.
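
A minimal outbox sketch, using SQLite as a stand-in for the service's database and print as a stand-in for the message bus; table and function names are illustrative:

import sqlite3, json

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT, sent INTEGER DEFAULT 0)")

def place_order(total):
    # Business row and event are written in the SAME local transaction: both or neither.
    with db:
        cur = db.execute("INSERT INTO orders (total) VALUES (?)", (total,))
        event = json.dumps({"type": "order_placed", "order_id": cur.lastrowid, "total": total})
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (event,))

def relay(publish):
    # Separate process: tail unsent outbox rows, publish, then mark them sent.
    rows = db.execute("SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id").fetchall()
    for row_id, payload in rows:
        publish(payload)                       # delivery is at-least-once; consumers must be idempotent
        db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    db.commit()

place_order(42.0)
relay(publish=print)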

MODULE 10

Anti-Entropy & Gossip

Background reconciliation when synchronous coordination is too costly.

Gossip Protocols

  • Epidemic spread: each round, node picks k random peers, exchanges state.
  • Convergence: O(log N) rounds for full propagation.
  • Resilient to churn; no central coordinator.
  • Used in: Cassandra membership, Consul / HashiCorp Serf, Bitcoin block propagation.
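
A minimal push-gossip round sketch over an in-memory cluster, merging state by per-key version number (node structure, fan-out k, and the versioning scheme are illustrative):

import random

class Node:
    def __init__(self, name):
        self.name = name
        self.state = {}                     # key -> (version, value)

    def merge(self, incoming):
        # Keep whichever copy of each key has the higher version.
        for key, (ver, val) in incoming.items():
            if key not in self.state or ver > self.state[key][0]:
                self.state[key] = (ver, val)

def gossip_round(nodes, k=3):
    # Each round: every node pushes its full state to k random peers.
    for node in nodes:
        for peer in random.sample([n for n in nodes if n is not node], k):
            peer.merge(node.state)

nodes = [Node(f"n{i}") for i in range(10)]
nodes[0].state["leader"] = (1, "n0")
for _ in range(5):                          # O(log N) rounds reach everyone with high probability
    gossip_round(nodes)
print(sum("leader" in n.state for n in nodes), "of", len(nodes), "nodes know the key")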

Merkle Trees for Anti-Entropy

  • Hash leaves of dataset bottom-up. To compare two replicas, exchange root hash. If equal: identical. If different: descend differing subtrees.
  • Bandwidth: O(divergence) instead of O(dataset).
  • Used: Dynamo, Cassandra, IPFS, git pack negotiation.
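
A minimal sketch of Merkle-tree comparison between two replicas: build a binary hash tree over data buckets, then descend only into subtrees whose hashes differ (assumes a power-of-two bucket count; hashing and bucketing choices are illustrative):

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaves):
    # leaves: bucket hashes. Returns a list of levels, leaves first, root last.
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def diff(tree_a, tree_b, level=None, idx=0):
    # Walk both trees top-down, recursing only where hashes differ.
    if level is None:
        level = len(tree_a) - 1
    if tree_a[level][idx] == tree_b[level][idx]:
        return []                              # identical subtree: nothing to sync
    if level == 0:
        return [idx]                           # differing leaf bucket -> exchange its keys
    return diff(tree_a, tree_b, level - 1, 2 * idx) + diff(tree_a, tree_b, level - 1, 2 * idx + 1)

a = build_tree([h(x) for x in [b"k0v0", b"k1v1", b"k2v2", b"k3v3"]])
b = build_tree([h(x) for x in [b"k0v0", b"k1vX", b"k2v2", b"k3v3"]])
print(diff(a, b))   # [1] -> only bucket 1 needs to be exchanged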

Read Repair & Hinted Handoff

  • Read repair: on read, if replicas disagree, write back the latest version to stale replicas.
  • Hinted handoff: when target replica is down, coordinator stores hint locally and replays when target recovers.

MODULE 11

Cheat Sheet

Quick-reference for design discussions.

Pick a Consistency Model

  • Money / inventory → linearizable / serializable
  • Social feed → causal + RYW
  • Counter (likes, views) → eventual + CRDT
  • Collab editor → causal + CRDT
  • Sessions / cache → RYW + monotonic reads

Quorum Defaults

  • Strong: N=3, R=W=2
  • Read-heavy strong: N=5, R=2, W=4
  • Write-heavy strong: N=5, R=4, W=2
  • Cross-region eventual: per-region quorum + async replication

Atomicity Across Services

  • Same DB → ACID txn
  • DB + message bus → outbox
  • Multiple services, business txn → saga
  • Reservations → TCC
  • Avoid 2PC across microservices

Failure Detection

  • Heartbeat + timeout — false positives under load
  • Phi-accrual — adaptive timeout (Cassandra)
  • Gossip + suspicion — eventually consistent membership
  • Lease expiry + fence token — safe leader change

Consensus Choice

  • Need leader election + log → Raft / Multi-Paxos
  • Need k-v store with watches → etcd / ZooKeeper
  • Cross-region strong → Spanner-class (TrueTime / HLC)
  • Byzantine threat → PBFT / Tendermint (3f+1 nodes)

Numbers to Memorize

  • Same-AZ RTT ~0.5 ms
  • Cross-AZ RTT ~1–2 ms
  • Cross-region RTT 50–200 ms
  • Disk fsync ~1–10 ms (SSD)
  • Raft log replication: ~1 RTT to majority