The definitive interview preparation guide covering every high-level design pattern tested at FAANG & Big Tech companies — from back-of-envelope math to globally distributed case studies.
13 Modules • 25+ Visualizations • 7 Full Case Studies • Complete Cheat Sheet
Module 1
Foundations & Back-of-Envelope
Latency numbers, storage scales, bandwidth budgets, and the QPS math that turns a whiteboard sketch into a defensible capacity plan.
Latency Numbers Every Programmer Should Know
The single most used mental model in any system design interview is Jeff Dean's latency table. Internalize it to the point where you can reason in your head about whether a design will feel fast or feel laggy. Orders of magnitude matter more than exact digits: an L1 reference is a few nanoseconds, a main memory reference is a hundred nanoseconds, an SSD read is a hundred microseconds, a spinning disk seek is ten milliseconds, and a round trip across the Atlantic is about 150 milliseconds.
Latency numbers on a logarithmic scale
Those numbers are the raw material for every interview argument. When the interviewer says "we need to show the feed in under 200ms," you should immediately think: the transatlantic round trip is 150ms, so you can afford roughly one cross-region call and nothing else; everything else must come from an in-region cache or local memory. When they say "we need to serve a billion photos," you should think: each photo is maybe 300KB, so a billion photos is 300TB — that does not fit on one disk, so you need sharding or object storage.
The canonical latency table, expanded
| Operation | Time | Relative | Interview implication |
| --- | --- | --- | --- |
| L1 cache reference | 0.5 ns | 1× | Free. Every instruction does several. |
| Branch mispredict | 5 ns | 10× | Matters for tight loops, not HLD. |
| L2 cache reference | 7 ns | 14× | Still "effectively free" at HLD. |
| Mutex lock/unlock | 25 ns | 50× | Fine for per-request; bad inside hot inner loops. |
| Main memory reference | 100 ns | 200× | Baseline for "in-memory data structure". |
| Compress 1KB with Snappy | 3 μs | 6,000× | Cheaper than sending the uncompressed bytes over network. |
| Send 1KB over 1Gbps network | 10 μs | 20,000× | Sibling pod inside a datacenter. |
| Read 4KB randomly from SSD | 150 μs | 300,000× | A cache miss that hits disk still costs a round trip. |
| Read 1MB sequentially from memory | 250 μs | 500,000× | Streaming memory is much faster than random. |
| Round trip within same datacenter | 500 μs | 1,000,000× | Interview assumption: intra-DC = 0.5ms. |
| Read 1MB sequentially from SSD | 1 ms | 2,000,000× | Scanning a 1GB table from SSD = ~1s. |
| Disk seek (HDD) | 10 ms | 20,000,000× | Why rotating disks are allergic to random I/O. |
| Read 1MB sequentially from HDD | 20 ms | 40,000,000× | Sequential is fine; random is death. |
| Round trip CA → Netherlands | 150 ms | 300,000,000× | Speed-of-light floor for a transatlantic user. |
Why log scale matters
When you draw a latency bar chart on a linear axis, the cheap operations vanish. On a log axis you can see the whole spectrum at once. Candidates who draw linear latency charts in the interview immediately look like they are reasoning about performance for the first time.
Storage & Bandwidth Scales
Every design problem starts with three numbers: how many users, how many bytes per user, and how often. The rest is multiplication.
Byte scales you should memorize
1 character ≈ 1 byte (UTF-8 Latin).
1 tweet ≈ 200 bytes of text + metadata ≈ 1KB.
1 chat message ≈ 100 bytes text ≈ 500 bytes with envelope.
1 row in a normalized RDBMS table ≈ 100–500 bytes.
1 KV pair in Redis ≈ 100 bytes on the wire, 200 bytes in memory (dict overhead).
Bandwidth scales
| Pipe | Peak bps | Per second | Per day |
| --- | --- | --- | --- |
| Home fiber | 1 Gbps | 125 MB/s | 10.8 TB |
| Datacenter NIC | 25 Gbps | 3.1 GB/s | 270 TB |
| Modern server (bonded) | 100 Gbps | 12.5 GB/s | 1.08 PB |
| Top-of-rack switch | 3.2 Tbps | 400 GB/s | 34.5 PB |
| Cross-region backbone | 100 Gbps | 12.5 GB/s | 1.08 PB |
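Converting between those columns at the whiteboard is just bits → bytes → seconds → day. A minimal sketch, using the pipe names and speeds from the table above:
# Link speed -> sustained MB/s and TB/day
def pipe_budget(gbps):
    bytes_per_sec = gbps * 1e9 / 8                                  # bits to bytes
    return bytes_per_sec / 1e6, bytes_per_sec * 86_400 / 1e12      # MB/s, TB/day
for name, gbps in [("Home fiber", 1), ("Datacenter NIC", 25), ("Bonded server", 100)]:
    mb_s, tb_day = pipe_budget(gbps)
    print(f"{name:15s} {mb_s:>9,.0f} MB/s {tb_day:>9,.1f} TB/day")
# 1 Gbps -> 125 MB/s, ~10.8 TB/day; 100 Gbps -> 12,500 MB/s, ~1,080 TB/day (1.08 PB)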
QPS & Server Sizing
The back-of-envelope flow is always the same: users → requests per user per day → requests per second (QPS) → peak QPS → per-server capacity → server count. Then add headroom for failures and growth.
From DAU to peak QPS
Seconds in a day: 86,400 ≈ 10^5. Memorize this; it turns every daily number into a per-second number by dividing.
Average QPS = DAU × actions per user per day / 86,400.
Peak factor: diurnal traffic is roughly 2–5× the average. Use 3× as a safe default.
Write QPS typically 0.1–1% of read QPS for a social feed; 10–50% for chat; ~100% for IoT telemetry.
Worked example: Twitter-like feed
# Back-of-envelope for a Twitter-style feed
DAU = 300_000_000 # 300M daily active users
feed_opens_per_user = 20 # each opens the feed ~20x/day
read_qps_avg = DAU * feed_opens_per_user / 86_400
# = 300e6 * 20 / 86400 = ~70,000 QPS average reads
peak_qps = read_qps_avg * 3 # diurnal peak factor
# = ~210,000 QPS peak reads
# Per-server capacity
qps_per_server = 10_000 # a cache-backed read path comfortably does 10K QPS
servers = peak_qps / qps_per_server
# = 21 servers
# Add failure headroom: tolerate one AZ down in a 3-AZ region
servers_with_headroom = int(servers * 1.5)
# = 31 servers
# Regional: replicate per region for latency
regions = 5
total = servers_with_headroom * regions
# = 155 servers
print(f"Peak QPS: {peak_qps:,.0f} Total servers: {total}")
QPS → Server count calculator (step-through)
Per-server rules of thumb
Plain HTTP round-trip with light logic: 10,000–30,000 QPS on a modern 8-core VM.
Postgres OLTP: 1,000–5,000 QPS on a well-tuned r5.4xlarge; if queries have to go to disk, expect ~200 QPS.
Redis (single-thread): 100,000 ops/s with small values, no persistence fsync in hot path.
Kafka broker: 500 MB/s sequential write; 1–2 million small messages/s.
Nginx L7 proxy: 50,000 RPS per core with keep-alive.
A quick sanity check keeps your math honest: a rate limiter that does one Redis operation per request can serve the 200k-QPS peak above with ~2 Redis primaries, not 20.
Estimation Playbook
Interviewers are not impressed by precision; they are impressed by structure. Walk through the same four steps every time and the numbers will be whatever they will be.
Users & usage. DAU, MAU, actions per user per day, read/write mix.
Data sizes. Bytes per item, number of items per user, retention period.
QPS & bandwidth. Divide by 86,400; multiply by peak factor; multiply by average bytes.
Capacity. Divide peak by per-server capacity; add headroom; add regions.
They will ask "why 10K QPS per server and not 100K?" Know your answer: a read that hits a local cache and streams 1KB is ~0.1ms of CPU plus a network round trip; 1 core can do ~10K of those per second with keep-alive. If you are hitting the database you should say "per-server QPS drops to ~1K because now we are blocking on disk." Having the reasoning chain ready converts a panic moment into a calm defense.
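The same reasoning chain as arithmetic, a minimal sketch (the 0.1ms and 1ms per-request costs are the assumptions stated above):
# Where "10K QPS per core" and "~1K QPS when you hit the DB" come from
cache_hit_cpu_ms = 0.1                            # parse + cache lookup + serialize ~1KB
qps_per_core_cached = 1000 / cache_hit_cpu_ms     # = 10,000 per core
db_roundtrip_ms = 1.0                             # blocking query to a local database
qps_per_core_db = 1000 / db_roundtrip_ms          # = 1,000 per core
print(qps_per_core_cached, qps_per_core_db)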
The estimation playbook shows up in every case study in Module 12. Before you read those, make sure the flow DAU → QPS → bytes → servers is something you can do at a whiteboard in 90 seconds without your notes.
Module 2
CAP, PACELC & Consistency Models
The theorems that tell you what you cannot have, plus the practical consistency knobs every distributed system exposes.
CAP Theorem
Eric Brewer's CAP theorem says that any networked shared-data system can guarantee at most two out of three: Consistency, Availability, and Partition tolerance. Because network partitions are a fact of life in any multi-machine system, the practical choice is always between CP (consistency + partition tolerance, sacrificing availability during partitions) and AP (availability + partition tolerance, sacrificing consistency during partitions).
CAP triangle with example systems
Reading CAP correctly
The most common misread is "Mongo is CP so it is always consistent" or "Cassandra is AP so it never is." Both are wrong. CAP describes behavior during a partition. In the common case, both types are consistent and available. The label only tells you what happens when the network fails.
CP under partition: the minority side refuses writes (and often reads) rather than diverge. Example: ZooKeeper quorum loss, MongoDB primary election freeze.
AP under partition: both sides keep serving writes, accepting that they will conflict later. Example: DynamoDB multi-region writes, Cassandra LOCAL_ONE.
CA only exists if you assume the network cannot fail, which is true inside a rack but not across a WAN.
PACELC Extension
CAP only describes partition behavior. PACELC (Daniel Abadi, 2010) fills in the common-case gap: if there is a Partition, choose Availability vs Consistency; Else, choose Latency vs Consistency. Every real database has a four-letter code.
| System | On partition | No partition | Code |
| --- | --- | --- | --- |
| Spanner | Prefer consistency | Prefer consistency (TrueTime stall) | PC/EC |
| HBase | Prefer consistency | Prefer consistency | PC/EC |
| BigTable | Prefer consistency | Prefer consistency | PC/EC |
| MongoDB (default) | Prefer consistency | Prefer consistency | PC/EC |
| Cassandra (default) | Prefer availability | Prefer latency | PA/EL |
| DynamoDB | Prefer availability | Prefer latency | PA/EL |
| Riak | Prefer availability | Prefer latency | PA/EL |
| Amazon RDS (multi-AZ) | Prefer consistency | Prefer latency | PC/EL |
PACELC clarifies why Cassandra can feel faster than Spanner even when nothing is broken: Cassandra reads locally, Spanner waits for TrueTime certainty. The latency cost of strong consistency in the good case is what you trade away when you pick CP.
Consistency Models
"Consistency" is an overloaded word. There is a spectrum from strongest (slowest, most expensive) to weakest (fastest, cheapest), and each point has a precise formal meaning.
The spectrum
Linearizable (aka atomic): operations appear to take effect at a single point in time between their invocation and response. There is one global order, and it respects real time. Example: Spanner, single-node Postgres, etcd.
Sequential consistency: there is a single global order but it need not respect wall-clock real time. Example: early distributed systems papers; rarely exposed directly in practice.
Causal consistency: operations that are causally related (A "happens-before" B) are seen in that order by every observer; concurrent operations may be reordered. Example: Riak with dotted version vectors, COPS.
Read-your-writes: after a client writes X, subsequent reads by the same client see X. Important UX property ("did my post actually post?"). Example: DynamoDB session consistency.
Monotonic reads: a client that has read version N never subsequently reads version N−k. Prevents UI flickering backward in time.
Eventual consistency: if writes stop, replicas converge. That is all it guarantees.
Linearizable vs eventual: two clients, one key
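A common way to provide the read-your-writes guarantee above on top of an eventually consistent replica set is to pin a client's reads to the primary for a short window after that client writes. A minimal sketch, assuming a Redis client r; the window length and key names are illustrative:
import random, time
RYW_WINDOW_S = 5   # should exceed p99 replication lag
def record_write(r, user_id):
    r.setex(f"lastwrite:{user_id}", RYW_WINDOW_S, int(time.time()))
def pick_read_node(r, user_id, primary, replicas):
    # Recent writer -> primary, so they always see their own change
    if r.exists(f"lastwrite:{user_id}"):
        return primary
    return random.choice(replicas)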
Quorum math
Dynamo-style systems expose R (read replicas to contact), W (write replicas to acknowledge), and N (replication factor). When R + W > N, you get strong consistency because the read and write quorums always overlap; a new write must be seen by at least one of the nodes the read contacts.
# Quorum consistency examples for N = 3 replicas
N = 3
# Strong consistency options (R + W > N)
configs = [
("W=3 R=1", 3, 1), # write-heavy: every write fans out, reads are fast
("W=2 R=2", 2, 2), # balanced: both reads and writes tolerate 1 node down
("W=1 R=3", 1, 3), # read-heavy: fast writes, slow/expensive reads
]
for name, W, R in configs:
strong = (R + W > N)
tolerates_write_fail = (N - W)
tolerates_read_fail = (N - R)
print(f"{name:10s} strong={strong} "
f"write-fails={tolerates_write_fail} "
f"read-fails={tolerates_read_fail}")
# W=3 R=1 strong=True write-fails=0 read-fails=2
# W=2 R=2 strong=True write-fails=1 read-fails=1 <-- typical
# W=1 R=3 strong=True write-fails=2 read-fails=0
Choosing the Right Model
The choice flows from the business invariant, not from an abstract preference.
| Use case | Needed model | Why |
| --- | --- | --- |
| Bank balance / ledger | Linearizable | Double-spend is a catastrophic failure. Serializable isolation required. |
| Airline seat booking | Linearizable (per seat) | Two customers cannot claim the same seat. |
| Inventory counter | Linearizable or counters with reservations | Overselling is a real dollar loss. |
| User profile edit | Read-your-writes + eventual | The user must see their own change; other viewers can lag. |
| Social feed timeline | Eventual + causal | A reply must never appear before its parent. |
| Like / view counter | Eventual | Off-by-a-few is fine; do not block on consensus. |
| DNS records | Eventual (long TTL) | Convergence is measured in minutes; perf matters more. |
| Distributed config | Linearizable (etcd/ZooKeeper) | All nodes must agree on "who is leader" at all times. |
Notice that a single product often needs multiple models. Instagram's "like" counter can lag for minutes without anyone caring, but the "block list" must be strictly consistent or blocked users will see things they should not.
The practical rule
Default to eventual consistency, then add targeted strong-consistency surfaces where the business needs it. The reverse (start strong, weaken later) is expensive and error-prone because the whole stack already assumes a linearizable substrate. Module 6 shows how this maps to concrete database choices; Module 12 walks through how each case study picks consistency per feature.
Module 3
Scalability
Vertical vs horizontal, stateless services, session affinity, and when to reach for autoscaling.
Vertical vs Horizontal Scaling
Vertical scaling ("scale up") means making one machine bigger: more CPU, more RAM, faster disks. Horizontal scaling ("scale out") means adding more machines. The interview answer is almost always "horizontal, because vertical has a hard ceiling and a terrible failure story," but the real-world answer is often a combination.
| Dimension | Vertical | Horizontal |
| --- | --- | --- |
| Ceiling | Largest single box (a few TB RAM, hundreds of cores) | Effectively unbounded |
| Failure domain | One machine = whole system | One machine = small fraction |
| Complexity | Low — one box, one process | High — distributed coordination |
| Cost curve | Super-linear (big boxes are pricey per core) | Linear (commodity fleet) |
| State | Trivial — in-process | Must shard or externalize state |
| Typical use | Databases, in-memory analytics | Stateless web tier, microservices |
The rule of thumb: always scale out the stateless parts first, then scale up the stateful parts until you hit a wall, then shard. Vertical scaling for the stateless tier is a cost mistake; horizontal scaling for the database as a reflex (without a good sharding key) is a correctness mistake.
Scaling cost vs capacity
Stateless Services
A stateless service keeps no per-client memory between requests. Every request carries (or looks up) everything it needs. The payoff is huge: any instance can serve any request, so load balancing is trivial, failover is instant, and horizontal scaling is a deployment setting.
What "stateless" actually means
No in-process session store. Sessions go to Redis or a signed cookie.
No local caches that affect correctness. Local caches as performance hints are fine.
No local file storage that must survive a deploy. Writes go to object storage.
No sticky state in memory across requests. If it must persist, it is in a database.
# Anti-pattern: in-process session
SESSIONS = {} # dies with the pod; cannot scale out
def handle_request(req):
user = SESSIONS.get(req.cookie("session_id"))
# Every subsequent request must go to the SAME pod. Fragile.
# Better: JWT or Redis session
import json, jwt, redis
r = redis.Redis(host="session-cluster")
def handle_request_stateless(req):
token = req.cookie("session_id")
# Option A: signed JWT - no server state at all
claims = jwt.decode(token, PUBLIC_KEY, algorithms=["RS256"])
# Option B: short opaque token -> Redis lookup
user_json = r.get(f"sess:{token}")
return claims or json.loads(user_json)
Sticky Sessions & State
Sometimes you cannot avoid state. A WebSocket server, a game server, or a transcoding pipeline holds a real conversation with one client. In those cases the load balancer uses sticky sessions (also called session affinity): once a client lands on a server, all subsequent requests from that client go to the same server.
Cookie-based affinity: the LB issues a cookie; the client returns it; LB routes accordingly.
IP-hash affinity: hash(client IP) → backend. Works poorly with NAT pools (many clients, one IP).
Connection-based: for long-lived connections (WebSocket), the TCP connection itself is sticky.
Stickiness is an anti-scale pattern: if one client is hot (a whale user in a chat app), that server is now hot. Use it only when you must, and always combine with draining: when a server is to be removed, stop giving it new sessions but let old ones finish.
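A hedged sketch of cookie-based affinity with draining, assuming framework-style request/response objects; the cookie name and the DRAINING set are illustrative:
import random
DRAINING = set()   # backends scheduled for removal: no new sessions, old ones finish
def route(req, resp, backends):
    sticky = req.cookies.get("lb_backend")
    if sticky in backends and sticky not in DRAINING:
        return sticky                                   # keep the session where it lives
    choice = random.choice([b for b in backends if b not in DRAINING])
    resp.set_cookie("lb_backend", choice)               # LB-issued affinity cookie
    return choice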
Autoscaling
Autoscaling adds or removes instances based on load. The happy path looks simple; the failure modes are where interviewers probe.
Signals
CPU — simple, lagging. A 2-minute average of 70% means you are already hot.
Request rate / QPS — leading indicator if you know your per-server capacity.
Queue depth — the best signal for async workers. Depth per consumer = proxy for backlog time.
p99 latency — business signal, but noisy and high-cardinality.
Gotchas
Boot time. If a new pod takes 60s to boot, autoscaling is always late. Keep warm pools or trim images.
Thrashing. Without a cooldown or hysteresis, the cluster flaps between scale-up and scale-down.
Downstream blowback. Adding web pods can overwhelm the database. Scale the bottleneck, not the symptom.
Cold caches. Fresh pods have empty local caches; scale-up can actually increase p99 while caches warm.
# Simple HPA-style policy
def desired_replicas(current_cpu, target_cpu, current_replicas):
# Kubernetes HPA formula
ratio = current_cpu / target_cpu
return int(current_replicas * ratio + 0.5)
print(desired_replicas(80, 60, 10)) # 13 - scale up
print(desired_replicas(30, 60, 10)) # 5 - scale down
Module 4
Load Balancing
L4 vs L7, the algorithm zoo, health checks, and how modern systems handle regional and global traffic.
L4 vs L7
Load balancers split at two layers of the OSI stack. Know which one you need, because they solve very different problems.
| Feature | L4 (TCP/UDP) | L7 (HTTP/gRPC) |
| --- | --- | --- |
| Layer | Transport | Application |
| Inspects | IP + port + flags | URL, headers, cookies, body |
| Decisions per | Connection | Request (in a kept-alive connection) |
| TLS termination | Passthrough (or SNI-based routing) | Terminates here |
| Latency overhead | 10–50 μs | 200–500 μs |
| Routing rules | Static backend pool | Path, header, cookie, weight, canary |
| Examples | AWS NLB, IPVS, HAProxy L4, Envoy TCP | AWS ALB, nginx, Envoy HTTP, Traefik |
Pick L4 for raw TCP/UDP (game servers, databases, legacy protocols) or when you need absolute throughput. Pick L7 whenever you need HTTP-aware routing: path-based microservices, canary weight, mTLS to backends, rate limiting per URL, WAF integration.
Modern clouds use both in layers: a global Anycast L4 terminates TCP near the user, a regional L7 parses HTTP and routes to services, and often a service mesh L7 sidecar (Envoy) handles mTLS and per-request retries. Each layer adds 100–500 μs; three layers are still under 2ms.
Algorithms
The load-balancer algorithm decides, for each incoming connection or request, which backend gets it. Microseconds per decision is the bar — the LB is on the hot path.
Round-robin vs least-connections vs consistent hash
The canonical algorithms
Round robin. Cycle through backends. Simple, fair if backends are identical, terrible if they are not.
Weighted round robin. Weights let you handle heterogeneous backends or stage canaries.
Least connections. Send to the backend with the fewest active connections. Good for long-lived connections with variable cost.
Least response time / least latency. Like least-conn but weighted by recent response time. Biases toward healthy backends.
IP hash. hash(client_ip) mod N. Sticky but terrible under NAT and when N changes.
Consistent hashing. Maps both keys and servers onto a ring; each key goes to the next server clockwise. Minimizes re-mapping when servers join or leave. Used for cache affinity (Memcached, Cassandra, CDN POPs).
Power of two random choices (P2C). Pick two random backends, route to the lighter. Gets 95% of the benefit of "pick the absolute least loaded" with 1% of the coordination. Used by Envoy, Finagle, YARP.
import random
import hashlib
from bisect import bisect
class RoundRobin:
def __init__(self, backends): self.bs, self.i = backends, 0
def pick(self, key=None):
b = self.bs[self.i % len(self.bs)]; self.i += 1
return b
class LeastConn:
def __init__(self, backends):
self.conn = {b: 0 for b in backends}
def pick(self, key=None):
return min(self.conn, key=self.conn.get)
def on_open(self, b): self.conn[b] += 1
def on_close(self, b): self.conn[b] -= 1
class ConsistentHash:
"""150 virtual nodes per backend balances ring density."""
def __init__(self, backends, vnodes=150):
self.ring = [] # (hash, backend)
for b in backends:
for v in range(vnodes):
h = int(hashlib.md5(f"{b}-{v}".encode()).hexdigest(), 16)
self.ring.append((h, b))
self.ring.sort()
self.hashes = [h for h, _ in self.ring]
def pick(self, key):
h = int(hashlib.md5(key.encode()).hexdigest(), 16)
i = bisect(self.hashes, h) % len(self.ring)
return self.ring[i][1]
class P2C:
def __init__(self, backends):
self.conn = {b: 0 for b in backends}
def pick(self, key=None):
a, b = random.sample(list(self.conn), 2)
return a if self.conn[a] <= self.conn[b] else b
Health Checks
A load balancer is only as good as its view of which backends are healthy. Two kinds of health checks:
Passive. The LB observes real traffic. A backend that returns 5xx or times out gets marked unhealthy (outlier ejection). Free, but slow to react to silent failures.
Active. The LB periodically hits a /healthz endpoint. Configurable interval (1–10s), timeout, and success threshold (e.g., "3 in a row").
Health check design rules
A /healthz that only checks "am I running?" hides deep failures. At minimum, check the most critical dependency.
But checking all dependencies causes correlated removals: if the DB is slow, every backend is unhealthy, and the fleet empties.
Split liveness ("restart me?") from readiness ("send me traffic?"). Kubernetes learned this the hard way.
Add jitter to the interval so checks do not synchronize and hammer a dependency.
# Balanced health check
@app.get("/healthz/live")
def live():
return {"ok": True} # only checks the process is up
@app.get("/healthz/ready")
def ready():
# Ready means: I can serve traffic RIGHT NOW.
checks = {
"db": _ping(db, timeout_ms=100),
"cache": _ping(cache, timeout_ms=50),
}
# Trick: if DB is down, stay ready so that graceful error pages still render.
# Become unready only if we have lost a local dependency we cannot recover from.
return {"ok": checks["cache"]}, 200 if checks["cache"] else 503
Global & Anycast
Global load balancing puts a user on the closest (or best) region. Three techniques:
DNS-based. www.example.com has different A records per region; Route53 / NS1 geolocation picks one based on the resolver's IP. Pros: simple. Cons: TTL-limited (60–300s minimum failover), and resolvers lie.
Anycast. The same IP is advertised from multiple POPs via BGP; packets take the BGP-shortest path. Pros: instant failover, no DNS. Cons: requires running your own AS.
Application-layer routing. A tiny front-door service does a geo-lookup and 302s the client. Pros: flexible. Cons: extra round trip.
Modern CDNs use Anycast on port 443 to land the user at the closest POP, terminate TLS there, and then use L7 to pick a backend region. That is how Cloudflare, Fastly, AWS Global Accelerator, and Google Cloud Load Balancing are wired.
Module 5
Caching Layers
Where caches live, which write pattern to pick, and how to survive invalidation, stampedes, and hot keys.
Cache Hierarchy
A modern request passes through up to five cache layers before it touches the database. Each one takes a bite out of latency and cost; each one also adds complexity and correctness risk.
Browser cache. Controlled by Cache-Control headers. Free egress. Stale risk: can live for days.
CDN / edge cache. Cloudflare, Fastly, CloudFront. Pop-local SSD. Reduces origin QPS by 80–99% for static assets.
Reverse proxy cache. Varnish / Nginx in front of the app. Micro-cache popular dynamic responses (1–5s).
Application cache. Redis / Memcached in the same region. Sub-millisecond.
Database buffer pool. The DB itself caches hot pages. Still one round trip but no disk.
Request flow through cache layers
Hit-rate math
Effective latency = p_hit × t_hit + (1 - p_hit) × t_miss. For a Redis-fronted database with 90% hit rate, 1ms hit, 50ms miss: 0.9 × 1 + 0.1 × 50 = 5.9ms. Improve hit rate to 99% and you get 0.99 × 1 + 0.01 × 50 = 1.49ms. The last few percent of hit rate carry the most latency value.
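The same formula as a helper you can reuse in capacity arguments; the 1ms/50ms figures are the assumptions from the example above:
def effective_latency(hit_rate, t_hit_ms=1.0, t_miss_ms=50.0):
    return hit_rate * t_hit_ms + (1 - hit_rate) * t_miss_ms
for p in (0.90, 0.99, 0.999):
    print(f"hit rate {p:.1%} -> {effective_latency(p):.2f} ms")
# 90.0% -> 5.90 ms, 99.0% -> 1.49 ms, 99.9% -> 1.05 ms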
Write-Through / Back / Around
Every write to cached data forces a choice about who writes first and how.
| Pattern | Write path | Pro | Con |
| --- | --- | --- | --- |
| Cache-aside (lazy) | App writes DB; invalidates cache; next read fills | Simple; resilient if cache dies | First read after write is slow; two round trips |
| Write-through | App writes cache; cache writes DB synchronously | Cache always warm and consistent | Every write pays double latency; DB outage blocks writes |
| Write-behind (write-back) | App writes cache; cache buffers DB write asynchronously | Fastest writes; coalesces updates | Data loss on cache crash; complex operator |
| Write-around | App writes DB; cache is not populated on write | Avoids cache churn on write-once data | Cold reads; only good when reads are rare or deterministic |
| Refresh-ahead | Cache proactively refreshes near TTL expiry | No cold reads; smooth latency | Extra load; needs access prediction |
Cache-aside read/write dance
# Cache-aside pattern - the interview default
import redis, json
r = redis.Redis()
def get_user(user_id, db):
key = f"user:{user_id}"
cached = r.get(key)
if cached:
return json.loads(cached)
row = db.query_one("SELECT * FROM users WHERE id=%s", user_id)
if row:
# 5-minute TTL bounds staleness even if invalidation is missed
r.setex(key, 300, json.dumps(row))
return row
def update_user(user_id, patch, db):
db.execute("UPDATE users SET ... WHERE id=%s", user_id)
# Delete (not SET) - avoids a race where our stale SET beats
# another writer's SET.
r.delete(f"user:{user_id}")
Invalidation Strategies
"There are only two hard things in computer science: cache invalidation and naming things." — Phil Karlton. Five mechanisms, each with a cost.
TTL. Every entry expires after N seconds. Cheapest mechanism. Upper bound on staleness = TTL.
Explicit delete on write. App deletes cache key whenever it updates the DB. Fragile under replication.
Versioned keys. Key contains a version: user:42:v17. On write, bump version. Old entries become unreachable and get evicted naturally.
Tag-based invalidation. Each key carries tags; a write invalidates all keys with a matching tag. Varnish, Fastly, and most CDNs expose this.
CDC-based invalidation. A change-data-capture pipeline (Debezium, DynamoDB Streams) emits a stream of DB changes; a consumer invalidates cache entries. Most robust; most complex.
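A minimal sketch of the versioned-key pattern above; key and table names are illustrative:
import json
def user_key(r, user_id):
    v = int(r.get(f"ver:user:{user_id}") or 0)
    return f"user:{user_id}:v{v}"
def get_user(r, db, user_id):
    key = user_key(r, user_id)
    cached = r.get(key)
    if cached:
        return json.loads(cached)
    row = db.query_one("SELECT * FROM users WHERE id=%s", user_id)
    r.setex(key, 300, json.dumps(row))
    return row
def update_user(r, db, user_id, patch):
    db.execute("UPDATE users SET ... WHERE id=%s", user_id)
    r.incr(f"ver:user:{user_id}")   # old user:{id}:vN entries become unreachable and expire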
Stampedes & Failures
Caches amplify problems as often as they solve them.
Cache stampede (dog-piling)
A hot key expires. Thousands of requests simultaneously miss, all hit the DB, and the DB melts. Solutions:
Single-flight / request coalescing: only one request per key goes to the DB; others wait on the same future.
Soft-TTL + jitter: start refreshing before expiry; add 10–20% random jitter so expiries do not synchronize.
Probabilistic early expiration: each read has a small, rising chance of triggering a refresh as TTL approaches.
# Single-flight (simplified) - Go's golang.org/x/sync/singleflight in Python
import threading
class SingleFlight:
    def __init__(self):
        self.lock = threading.Lock()
        self.in_flight = {}   # key -> Event
        self.results = {}     # key -> value
    def do(self, key, fn):
        with self.lock:
            ev = self.in_flight.get(key)
            leader = ev is None
            if leader:
                ev = threading.Event()
                self.in_flight[key] = ev
        if not leader:
            ev.wait()                  # another caller is already fetching this key
            return self.results[key]
        try:
            val = fn()                 # only the leader goes to the database
            self.results[key] = val
            return val
        finally:
            with self.lock:
                del self.in_flight[key]
            ev.set()                   # wake every waiter (error propagation omitted)
Cache penetration
Queries for keys that do not exist always miss and go to the DB. Mitigation: negative caching (cache the miss with short TTL) and Bloom filters (reject known-non-existent keys before the DB round trip).
Hot key
One celebrity account gets 1M QPS; one Redis shard cannot handle it. Mitigation: key splitting (post:42#shard=random(0..9) replicated to 10 shards, aggregate on read), local in-process cache for ultra-hot entries, read replicas.
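A minimal sketch of key splitting for a read-hot value; the copy count and key format are illustrative:
import json, random
N_COPIES = 10   # replicate one ultra-hot key across 10 cache keys/shards
def write_hot(r, key, value):
    payload = json.dumps(value)
    for i in range(N_COPIES):
        r.set(f"{key}#shard={i}", payload)
def read_hot(r, key):
    i = random.randrange(N_COPIES)          # spread reads across the copies
    return json.loads(r.get(f"{key}#shard={i}"))
# For counters, invert it: increment a random shard on write, sum the shards on read.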
Cache avalanche
Redis cluster loses a node; hit rate drops from 99% to 60%; DB load jumps 30–40×. Mitigation: over-provision the DB for the cache-empty worst case, or at least for a 50% miss rate during failover.
Module 6
Databases in High-Level Design
SQL vs NoSQL vs NewSQL framed as a choice of workload, plus the replication and sharding decisions that follow.
SQL vs NoSQL vs NewSQL
Stop thinking of SQL/NoSQL as "old vs new." Think of them as different shapes of workload, each optimal for different access patterns.
| Family | Shape | Strength | Weakness | Examples |
| --- | --- | --- | --- | --- |
| Relational (OLTP) | Fixed schema, joins, transactions | Correctness, flexibility of query | Horizontal scaling is hard | Postgres, MySQL, SQL Server |
| Key-value | Opaque blobs by key | Blazing fast, trivially sharded | No query beyond the key | DynamoDB, Redis, RocksDB |
| Wide-column | Rows with dynamic columns, clustered by key | Huge write throughput, time-series friendly | Weak secondary indexes, no joins | Cassandra, HBase, Bigtable |
| Document | JSON-ish per document | Schema flex, single-doc ACID | Cross-document consistency is DIY | MongoDB, Couchbase, Firestore |
| Graph | Nodes + edges | Arbitrary-depth relationships | Batch analytics are awkward | Neo4j, Neptune, JanusGraph |
| Columnar / OLAP | Columns stored separately | Aggregations over billions of rows | Single-row lookups are slow | BigQuery, Snowflake, Redshift, ClickHouse |
| Search | Inverted index | Full-text, faceting, ranking | Not a source of truth | Elasticsearch, OpenSearch, Solr |
| NewSQL | SQL API on distributed storage | Scale + ACID + SQL | Heavy to operate, expensive | Spanner, CockroachDB, YugabyteDB, TiDB |
| Time-series | Append-only by time | High ingest, retention policies | Point lookups ok but not the sweet spot | InfluxDB, Prometheus, TimescaleDB |
| Ledger / immutable | Append-only, verifiable | Audit + tamper evidence | Niche, expensive | QLDB, Fauna |
Replication
Replication keeps multiple copies of the data across machines or regions. It buys you durability, read capacity, and regional latency, but costs you write latency and storage.
Topologies
Single-leader (primary-replica). All writes go to the leader; reads fan out to followers. Simple; leader is a single point of failure until failover.
Multi-leader. Several regions can accept writes independently. Must resolve conflicts (LWW, CRDTs, application logic). Examples: CockroachDB multi-region, Active-Active Postgres via BDR.
Leaderless (quorum). Any replica accepts any op; correctness from R + W > N. Dynamo/Cassandra.
Synchronous vs asynchronous
Synchronous. Leader waits for replicas to ack before returning. Zero data loss on failover; write latency = round-trip to slowest required replica.
Asynchronous. Leader returns immediately; replicas catch up. Fast writes; potential data loss if leader dies before replication.
Semi-synchronous. At least one replica must ack. MySQL default. Good compromise.
If an interviewer says "zero RPO" (recovery point objective), that is code for "synchronous replication, minimum 2 replicas." If they say "5-minute RPO," asynchronous with reliable monitoring is fine.
Leader-follower replication timeline
Sharding
Sharding (horizontal partitioning) splits data across multiple databases by key. Each shard is independent, so reads and writes scale linearly — as long as your queries include the shard key.
Shard key strategies
| Strategy | Formula | Pros | Cons |
| --- | --- | --- | --- |
| Range | key < A → shard 1 | Range queries work; ordered iteration | Hotspots (recent time) |
| Hash | hash(key) mod N | Even distribution | No range queries; reshard is expensive |
| Consistent hash | Ring with virtual nodes | Minimal key movement on reshard | Complex; uneven without vnodes |
| Geo / directory | Lookup table | Flexible; compliance-friendly | Directory is a new SPOF |
Cross-shard problems
Queries that do not include the shard key must scatter-gather to all shards. Avoid or maintain a secondary index.
Cross-shard transactions require two-phase commit or sagas. 2PC is slow and locks resources; sagas give up atomicity for availability.
Rebalancing when you add a shard can move enormous data. Consistent hashing and double-writes during migration keep it online.
Hot shards when one key is much more active. Use salted keys or sub-sharding.
# Consistent hashing with virtual nodes to avoid hot shards on reshard
import hashlib
from bisect import bisect
class ShardRing:
def __init__(self, shards, vnodes=256):
self.ring = [] # sorted (hash, shard)
for s in shards:
for v in range(vnodes):
h = int(hashlib.sha1(f"{s}:{v}".encode()).hexdigest(), 16)
self.ring.append((h, s))
self.ring.sort()
self._hashes = [h for h, _ in self.ring]
def shard_for(self, key):
h = int(hashlib.sha1(str(key).encode()).hexdigest(), 16)
i = bisect(self._hashes, h) % len(self.ring)
return self.ring[i][1]
def add_shard(self, new_shard, vnodes=256):
for v in range(vnodes):
h = int(hashlib.sha1(f"{new_shard}:{v}".encode()).hexdigest(), 16)
self.ring.append((h, new_shard))
self.ring.sort()
self._hashes = [h for h, _ in self.ring]
# On add: only ~1/N keys move, not all keys.
Decision Tree
A compact choose-your-DB heuristic for the interview whiteboard:
Database selection decision tree
Ninety percent of whiteboard problems land on Postgres + Redis + Kafka + object storage. Reach for Cassandra when write throughput is measured in millions per second and there is a natural partition key. Reach for NewSQL when SQL matters AND scale matters AND you can afford to operate it.
Module 7
Messaging & Streaming
Queues vs pub/sub, the Kafka log model, delivery semantics, and how backpressure stops cascading failures.
Queues vs Pub/Sub
Message-oriented middleware solves three very different problems, and the wrong choice shows up as mystery data loss six months later.
| Pattern | Delivery | Consumers per message | Typical use | Examples |
| --- | --- | --- | --- | --- |
| Work queue | One of N workers | Exactly one | Background jobs, image processing | SQS, RabbitMQ, Celery |
| Pub/sub topic | Broadcast | Every subscriber gets a copy | Event notifications, cache busting | Redis Pub/Sub, SNS, Pulsar topics |
| Event log | Append + replay | Any consumer can replay from any offset | Event sourcing, analytics, CDC | Kafka, Kinesis, Pulsar |
| Task scheduler | Time-delayed | Exactly one at time T | Cron, retries, reminders | EventBridge, Cloud Scheduler, Temporal |
The big confusion: Kafka is not a queue. Kafka is a durable, replayable, ordered log with consumer offsets. A consumer can replay history, form groups, and share partitions — none of which a classic queue supports.
Kafka Log Model
Kafka is an append-only, partitioned, replicated log. Three words carry all the meaning:
Append-only. Writes are sequential disk I/O — 100× faster than random. Brokers easily sustain 500MB/s per drive.
Partitioned. Each topic is N partitions. Messages with the same key go to the same partition (and therefore are strictly ordered).
Replicated. Each partition has a leader and R−1 followers. A write is acknowledged once min.insync.replicas replicas have it.
Kafka partition + consumer group rebalance
Partitioning discipline
Pick a partition key that balances load. user_id if users are uniform; tenant-salted if one tenant dominates.
Number of partitions caps parallelism: active consumers in a group cannot exceed the partition count. Provision roughly 3× the expected maximum consumer count.
Never change partition count lightly. It reshuffles key-to-partition mapping and breaks ordering guarantees in flight.
Retention matters. Default 7 days. Infinite retention turns Kafka into an event store; plan disk accordingly.
# Kafka producer - synchronous, with required acks
from kafka import KafkaProducer
producer = KafkaProducer(
bootstrap_servers=["broker-1:9092","broker-2:9092","broker-3:9092"],
acks="all", # wait for all ISRs
enable_idempotence=True, # dedup on retry
max_in_flight_requests_per_connection=5,
retries=10,
compression_type="snappy",
)
def publish_signup(user_id, payload):
# Key by user_id so all events for a user land on one partition
# and are consumed in order.
future = producer.send(
"user-signups",
key=str(user_id).encode(),
value=json.dumps(payload).encode(),
)
md = future.get(timeout=5)
# Returning offset lets the caller log "committed at offset 1234"
return (md.topic, md.partition, md.offset)
Delivery Semantics
Three levels. Each is a distinct engineering choice.
At most once: message may be lost, never duplicated. Fire-and-forget fits best for high-volume low-value telemetry where losing 0.1% is fine.
At least once: message always delivered, may be duplicated. The default for most queues. Consumers must be idempotent.
Exactly once: message delivered exactly once end-to-end. Achievable only with atomic commit across broker and consumer store. Kafka transactions + Kafka Streams provide it, but it is slower and more fragile.
Idempotency is the real answer
Even with "exactly once" semantics, network retries, operator reprocessing, and client timeouts will cause duplicates. Build the consumer to tolerate them.
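A minimal sketch of an idempotent consumer, assuming each event carries a stable event_id and a Redis client r is available; the table and field names are illustrative:
def handle_payment_event(r, db, event):
    # First delivery wins: SET NX acts as a dedup guard for retries and replays
    first = r.set(f"processed:{event['event_id']}", 1, nx=True, ex=7 * 86_400)
    if not first:
        return "duplicate - side effect already applied"
    db.execute("INSERT INTO payments (id, amount) VALUES (%s, %s)",
               (event["event_id"], event["amount"]))
    return "applied"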
Backpressure
Backpressure is how a slow downstream tells a fast upstream to slow down. Without it, the fast side fills its queues, spills to disk, runs out of memory, and crashes.
Credit-based. Downstream issues N credits; upstream sends at most N in flight. gRPC flow control, reactive streams.
Queue-depth-based. Upstream checks queue depth; throttles or sheds load when full. Shed earlier to avoid memory blow-ups.
Consumer lag. In Kafka, lag (committed offset vs latest offset) is the honest backpressure signal. Alert when lag exceeds SLA.
Reject with 429 / 503. At the ingress, return "too busy" with Retry-After and let the client back off.
Load shedding patterns
Priority queues: shed low-priority first.
Admission control: reject new work when at capacity; complete in-flight work.
Circuit breakers: stop calling a flaky downstream; fall back.
Bulkheads: isolate thread pools so one bad tenant cannot starve others.
# Token bucket admission control at the ingress
import time
class TokenBucket:
def __init__(self, capacity, refill_per_sec):
self.cap = capacity
self.refill = refill_per_sec
self.tokens = capacity
self.last = time.monotonic()
def take(self, n=1):
now = time.monotonic()
self.tokens = min(self.cap, self.tokens + (now - self.last) * self.refill)
self.last = now
if self.tokens >= n:
self.tokens -= n
return True
return False
tb = TokenBucket(capacity=1000, refill_per_sec=500)
def handle(req):
if not tb.take():
return Response(status=429, headers={"Retry-After": "1"})
return process(req)
Module 8
Storage
Object, block, file. Tiers from hot to glacial. Where the CDN fits and why durability is measured in nines.
Object / Block / File
| Type | Access | Good for | Bad for | Examples |
| --- | --- | --- | --- | --- |
| Object | HTTP GET/PUT by key | Blobs, backups, static assets, data lake | Partial updates, low-latency writes | S3, GCS, Azure Blob |
| Block | Attached volume (SCSI/NVMe-oF) | Databases, filesystems, VMs | Shared access, massive scale | EBS, Persistent Disk, local NVMe |
| File | POSIX over NFS/SMB | Shared mutable files, lift-and-shift apps | Massive scale, high concurrency | EFS, Filestore, Azure Files |
Object storage is the default for anything bigger than ~1MB that does not need byte-level mutation. Block storage is for anything that needs a real filesystem beneath a real process (databases, caches with persistence). File storage is the "we could not refactor this legacy app" tier.
Durability numbers
S3 Standard: 11 nines (99.999999999%). 1 object lost per 10,000 objects per 10 million years.
EBS gp3: roughly 99.8–99.9% annual durability per volume (0.1–0.2% AFR). Snapshot to S3 for long-term durability.
Glacier Deep Archive: 11 nines; retrieval 12+ hours; $1/TB/mo.
Hot / Warm / Cold Tiers
Not all data deserves SSD. Move it down a tier and save 10× in cost per month.
Storage tier cost vs retrieval latency
| Tier | Example (AWS) | $/GB/mo | Retrieval | Use case |
| --- | --- | --- | --- | --- |
| Hot memory | ElastiCache Redis | ~$0.50 | <1ms | Session, top lists |
| Hot block | EBS gp3 | $0.08 | 1ms | Database volumes |
| Hot object | S3 Standard | $0.023 | Tens of ms | Images, web assets |
| Warm | S3 IA / One Zone-IA | $0.0125 | Tens of ms | Backups, older logs |
| Cold | Glacier Instant | $0.004 | Tens of ms (pay to retrieve) | Compliance, rarely read |
| Frozen | Glacier Deep Archive | $0.00099 | 12+ hours | Regulatory, tape replacement |
Lifecycle rules automate the transition. A typical log retention policy: Standard for 30 days → IA for 90 days → Glacier for 7 years → delete. Move 1 PB of logs and you save ~$200k/year compared to keeping everything hot.
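The savings math behind that policy, a rough sketch using the $/GB/mo figures from the table above (the month split is an assumption, not a quoted AWS price):
# 1 PB of logs: keep everything hot vs tier it down over the year
GB = 1_000_000                                    # 1 PB expressed in GB
standard, ia, deep = 0.023, 0.0125, 0.00099       # $/GB/mo from the tier table
all_hot = GB * standard * 12
tiered = GB * (standard * 1 + ia * 2 + deep * 9)  # 1 mo hot, 2 mo IA, 9 mo deep archive
print(f"hot all year ${all_hot:,.0f}  tiered ${tiered:,.0f}  saved ${all_hot - tiered:,.0f}")
# ~$276,000 vs ~$57,000 -> roughly $200K+ saved per year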
CDN & Edge
A CDN (content delivery network) is a globally distributed reverse-proxy + cache. POPs (points of presence) near users serve assets; a cache miss goes back to origin. The economics: origin egress is $0.09/GB, CDN egress is $0.01–0.05/GB, with 99% hit rate the savings dominate.
What belongs at the edge
Static assets (JS, CSS, images, video chunks): cache forever with content-hashed filenames.
API responses that are user-agnostic (catalog, search facets, trending): short TTL micro-cache.
Personalized responses: generally not cacheable, but edge compute (Cloudflare Workers, Lambda@Edge) can do per-user assembly near the user.
Uploads: some CDNs accept writes and copy to origin; reduces upload latency for global users.
Module 9
APIs
REST vs gRPC vs GraphQL, the idempotency discipline, and how rate limiters actually work.
REST
REST is the lingua franca of public APIs: resources as nouns, HTTP verbs as actions, JSON bodies, and a discipline of hypermedia that almost nobody practices. Pragmatic REST is what every SaaS exposes: GET /users/42, POST /users, DELETE /users/42, and a handful of custom verbs where the resource model gets awkward.
HTTP methods: GET/POST/PUT/PATCH/DELETE; PUT is idempotent, POST is not.
Status codes: 200 OK, 201 Created, 204 No Content, 400 bad request, 401 unauth, 403 forbidden, 404 not found, 409 conflict, 429 throttled, 5xx server.
Versioning: path (/v1/), header (Accept: application/vnd.co.v2+json), or query (?v=2). Path is least clever and therefore best.
Pagination: cursor-based (opaque tokens, stable) over offset (breaks under insert/delete).
Auth: OAuth2 for user, API keys for server-to-server, mTLS for internal.
# Cursor-based pagination beats offset for stability
@app.get("/events")
def list_events(cursor: str = None, limit: int = 50):
q = "SELECT * FROM events WHERE id > %s ORDER BY id LIMIT %s"
rows = db.query(q, (decode_cursor(cursor) or 0, limit))
next_cursor = encode_cursor(rows[-1].id) if len(rows) == limit else None
return {"items": rows, "next": next_cursor}
gRPC
gRPC is Google's schema-first RPC framework over HTTP/2 with Protocol Buffers. The trade-offs:
Strongly typed: codegen in 10+ languages, compile-time errors.
Binary wire format: 3–10× smaller than JSON, parses ~10× faster.
Streaming: server streaming, client streaming, bidi streaming all first-class.
HTTP/2 multiplexing: many concurrent calls per connection; no head-of-line blocking.
Not browser-friendly: needs gRPC-Web or Connect proxy for browsers.
Harder to debug: not human-readable; need a gRPC-aware tool (grpcurl, Postman).
Use gRPC for internal service-to-service where you control both ends and throughput matters; use REST for external APIs and browser-facing surfaces. If you need both, define the service in proto and generate a REST gateway.
GraphQL
GraphQL lets the client specify exactly the fields it wants, in one request. It solves the "mobile app needs 6 REST calls to render one screen" problem.
One endpoint, POST queries and mutations.
Typed schema: IDL describes every object and field.
Single round trip for nested data.
N+1 resolver problem: naive resolvers do one DB call per parent, which explodes under load; solved by DataLoader batching.
Caching is harder: no natural URL-keyed cache; responses often unique per client.
Security: queries can be arbitrarily deep; enforce depth limits and query cost budgets.
Idempotency
An operation is idempotent if applying it N times has the same effect as applying it once. Every write API that clients retry must be idempotent or you will get duplicate charges, double bookings, and angry customers.
Idempotency keys
The classic pattern: client generates a UUID per intent; server deduplicates.
# Server-side idempotency with a Redis dedup window
def create_payment(request, idem_key):
# Look up prior result within a 24h window
cached = r.get(f"idem:{idem_key}")
if cached:
return json.loads(cached) # return the original result
# Acquire a short lock to dedup concurrent retries
got_lock = r.set(f"idem:lock:{idem_key}", "1", nx=True, ex=10)
if not got_lock:
# Another worker is processing - wait, poll result
raise Retry(after_ms=100)
try:
result = charge_card(request)
r.setex(f"idem:{idem_key}", 86400, json.dumps(result))
return result
finally:
r.delete(f"idem:lock:{idem_key}")
Stripe's idempotency header is the industry reference: Idempotency-Key: uuid, 24h retention, and a retry of the same request returns the original response (including the status code). Reusing a key with different parameters is rejected as a conflict (409) rather than replayed.
Rate Limiting
Every public API is rate-limited. The algorithms:
Token bucket rate limiter
Algorithms
Fixed window. count(key, minute_bucket). Simple, but allows 2× burst at window boundary.
Sliding window log. Keep last-60s timestamps in a sorted set; count. Accurate; memory-heavy.
Sliding window counter. Weighted average of current and previous window counter. Cheap approximation.
Token bucket. Tokens refill at R/sec up to C; each request takes 1. Allows bursts up to C. The workhorse.
Leaky bucket. Queue with fixed drain rate. Smoothes output (no bursts); may add latency.
# Redis sliding-window counter (approximate, cheap)
def allow(key, limit_per_min):
now = int(time.time())
cur_bucket = now // 60
prev_bucket = cur_bucket - 1
cur = int(r.get(f"rl:{key}:{cur_bucket}") or 0)
prev = int(r.get(f"rl:{key}:{prev_bucket}") or 0)
elapsed = now % 60
# Weighted rate: current counter + (1 - elapsed/60) of previous
rate = cur + prev * ((60 - elapsed) / 60)
if rate >= limit_per_min:
return False
r.incr(f"rl:{key}:{cur_bucket}")
r.expire(f"rl:{key}:{cur_bucket}", 120)
return True
Returning throttled responses
Always set X-RateLimit-Limit, X-RateLimit-Remaining, and Retry-After. Clients can self-pace; your support team stops receiving confused tickets.
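A minimal sketch of attaching those headers, reusing the allow() limiter above; the limit value, req.api_key, and the Response shape are illustrative:
def rate_limit_headers(limit, remaining, retry_after_s=None):
    headers = {"X-RateLimit-Limit": str(limit),
               "X-RateLimit-Remaining": str(max(0, remaining))}
    if retry_after_s is not None:
        headers["Retry-After"] = str(retry_after_s)
    return headers
def handle(req):
    if not allow(req.api_key, limit_per_min=600):
        return Response(status=429, headers=rate_limit_headers(600, 0, retry_after_s=30))
    return process(req)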
Module 10
Search & Recommendations
Inverted indexes, ranking heuristics, vector search, and how modern recommenders glue them together.
Inverted Index
A database's B-tree answers "give me rows where id = 42." An inverted index answers "give me documents that contain the word database," which is a fundamentally different shape.
Term dictionary: sorted list of every term with document frequency.
Posting list: per term, list of (doc_id, term_frequency, positions), usually delta-encoded and compressed.
Skip lists: for fast AND / OR intersections without scanning every posting.
Inverted index lookup
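A toy inverted index makes the structure above concrete. A minimal sketch; real engines add positions, compression, and skip lists:
from collections import defaultdict
def build_index(docs):
    index = defaultdict(set)                   # term -> posting set of doc ids
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index
def search_and(index, query):
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()
docs = {1: "the log structured merge tree",
        2: "the database buffer pool",
        3: "merge sort in the database"}
print(search_and(build_index(docs), "database merge"))   # {3}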
Ranking: TF-IDF and BM25
Relevance = how well the document matches the query. The canonical formulas:
TF (term frequency): how often the term appears in the doc. More = more relevant.
IDF (inverse document frequency): rare terms are more informative than common ones.
BM25: TF-IDF with length normalization and a saturating TF curve. Default scorer in Lucene/Elasticsearch.
import math
def bm25(term_freq, doc_len, avg_doc_len, doc_freq, total_docs, k1=1.5, b=0.75):
idf = math.log((total_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
tf_norm = (term_freq * (k1 + 1)) / (term_freq + k1 * (1 - b + b * doc_len / avg_doc_len))
return idf * tf_norm
# Per-doc score = sum of BM25 over query terms
Ranking & Relevance
Textual match is only the first pass. Production search layers two, sometimes three, ranking stages:
Candidate generation: lexical search returns top ~1000 docs using BM25. Fast, imprecise.
Feature-rich re-ranking: a learned model (GBDT, neural net) scores the 1000 using hundreds of features (freshness, clicks, personalization, authority, time-of-day).
The split is because re-ranking is expensive per doc (100 μs to 1 ms), so you want the candidate set to be small but not so small you lose recall.
Evaluation metrics
Precision@k: fraction of top-k results that are relevant.
Recall@k: fraction of all relevant results that are in top-k.
MRR (mean reciprocal rank): how high is the first relevant result on average.
NDCG (normalized discounted cumulative gain): sum of graded relevance, discounted by log position.
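Minimal reference implementations of the first three metrics above; the input shapes (a ranked list of doc ids and a set of relevant ids) are illustrative:
def precision_at_k(ranked, relevant, k):
    return sum(1 for d in ranked[:k] if d in relevant) / k
def recall_at_k(ranked, relevant, k):
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)
def mrr(ranked, relevant):
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1 / i
    return 0.0
ranked, relevant = ["a", "b", "c", "d"], {"b", "d"}
print(precision_at_k(ranked, relevant, 3),   # 0.33
      recall_at_k(ranked, relevant, 3),      # 0.5
      mrr(ranked, relevant))                 # 0.5 (first relevant doc at rank 2)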
Vector Search
Embed documents and queries into N-dimensional vectors (typically 384–1536 dims) with a learned model; find nearest neighbors by cosine similarity or L2 distance. This catches synonyms and semantic matches that lexical search misses.
ANN (approximate nearest neighbor) algorithms
IVF (inverted file): cluster vectors into K buckets; search the closest M buckets. Good up to ~100M vectors.
HNSW (hierarchical navigable small world): multi-layer graph; O(log N) per query with high recall. The default in Pinecone, Weaviate, Qdrant, pgvector.
ScaNN / Annoy: tree-based, ideal for read-heavy static corpora.
Product Quantization: compress vectors 8–32×; trade recall for memory.
# pgvector query shape
"""
SELECT id, text, embedding <-> %s AS distance
FROM documents
ORDER BY embedding <-> %s
LIMIT 10;
"""
# With an HNSW index: ~5-20ms on 10M vectors, recall >= 95%.
Hybrid search
The best retrieval combines lexical (BM25) and vector (cosine) scores. Reciprocal Rank Fusion (RRF) is the simplest merge: for each result, score = sum(1 / (k + rank_in_list)). RRF works even when you cannot normalize the two scores.
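RRF in a few lines; k=60 is the conventional default constant:
def rrf(result_lists, k=60):
    # result_lists: one ordered list of doc ids per retriever (lexical, vector, ...)
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
lexical = ["d3", "d1", "d7"]     # BM25 order
semantic = ["d1", "d9", "d3"]    # vector-similarity order
print(rrf([lexical, semantic]))  # d1 and d3 surface to the top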
Recommendations
Recommendation systems answer "what should this user see next?" The modern stack is similar to search: candidate generation → re-ranking.
Candidate generators: collaborative filtering ("users like you also liked"), content-based ("items similar to what you already liked"), popularity, trending, recent.
Two-tower model: user tower encodes user features, item tower encodes item features, dot product scores. Trained on clicks; served as an ANN index.
Re-ranker: a full-context model (gradient boosted tree or transformer) with hundreds of features per (user, item) pair.
Exploration: ε-greedy or Thompson sampling to show some non-obvious items so the system keeps learning.
Feedback loop trap
A recommender trained only on clicks over-rewards clickbait. Mix in explicit feedback (likes, follows, purchases, dwell time), negative signals (skips, hides), and diversity constraints.
Module 11
Observability
The three pillars, the SLI/SLO/SLA discipline, and what "good" on-call looks like.
Metrics / Logs / Traces
Observability is the superset of monitoring. You monitor known-unknowns (CPU, QPS, error rate); you observe unknown-unknowns (why did p99 spike for users on iOS 17 in Brazil at 03:15 UTC). The three pillars each answer a different question.
| Pillar | Answers | Cost model | Cardinality | Tools |
| --- | --- | --- | --- | --- |
| Metrics | Is anything wrong right now? | Cheap (aggregated time series) | Low (5–50 series per metric) | Prometheus, DataDog, CloudWatch |
| Logs | What exactly happened on this one event? | Expensive (every event stored) | High (free-form) | Elastic, Splunk, Loki, CloudWatch Logs |
| Traces | Where did this request spend its time? | Medium (sampled) | Medium | Jaeger, Tempo, X-Ray, DataDog APM |
Distributed trace flame graph (span timeline)
Golden signals (SRE book)
Latency: how long requests take. Track p50, p95, p99 separately.
Traffic: QPS, bytes/s, active connections.
Errors: 5xx rate, 4xx rate, exceptions per second.
Saturation: CPU, memory, disk I/O, queue depth, DB pool utilization.
RED vs USE
Two mnemonic dashboards. RED (Rate / Errors / Duration) for request-driven services. USE (Utilization / Saturation / Errors) for resources (CPU, disk, queue). Most production teams wire both.
SLIs, SLOs & SLAs
Three nested concepts. Get them right and your on-call rotation stops firefighting "my CPU is at 70%."
SLI (Service Level Indicator). A measurable thing. Example: "fraction of HTTP 200 responses under 300ms."
SLO (Service Level Objective). A target for the SLI. Example: "99.9% of requests over a 30-day window."
SLA (Service Level Agreement). Contract with customers, with teeth (refunds). Usually weaker than SLO.
Error budget
If your SLO is 99.9%, you are allowed to be "bad" 0.1% of the time — that is 43 minutes per 30-day window. That 43 minutes is your error budget. Spend it on deploys, experiments, and tolerated failures. If you blow it, stop shipping features and invest in reliability until the budget replenishes.
# Error budget burn rate
slo_target = 0.999
window_minutes = 30 * 24 * 60 # 43200
budget_minutes = window_minutes * (1 - slo_target) # 43.2
# A 10% burn in 1 hour means we'll exhaust the budget in 10 hours.
# Alert threshold: "fast burn" = 2% of budget in 1 hour = ~1% error rate for an hour.
Runbooks & On-Call
An alert without a runbook is a 3am puzzle. Every paging alert should link to a page that describes:
What this alert means in business terms ("checkout is broken for US East").
How to confirm (what dashboard, what log query).
How to mitigate (failover command, circuit breaker flip, feature flag off).
How to escalate (who owns this service, what Slack channel).
How to root-cause (trace query, recent deploys link).
Alert quality rules
Alert on symptoms, not causes. "Orders per minute dropped 50%" is a symptom; "db_conn_pool_exhausted=true" is a cause and may be noise.
Page only on things that need human action within 30 min. Everything else is a ticket.
No flapping. Alert hysteresis: require 3 consecutive breaches before paging; 5 consecutive normal to clear.
Postmortem every page. Was the alert justified? Was the runbook enough?
Module 12
Case Studies
Ten canonical interview problems: scope, high-level design, hardest sub-problem, back-of-envelope, and where each one commonly fails.
URL Shortener (bit.ly, TinyURL)
Scope. Accept a long URL, return a short alias (e.g. bit.ly/abc1234). On GET of the short URL, 301 to the original. Optional: per-link analytics, custom aliases, expiry.
Requirements. 100M new URLs/month = ~40 URLs/sec average, ~120/sec peak. Redirects are ~100× that: ~12k RPS peak. 5-year horizon = 6B total URLs. Read/write ratio 100:1. Redirect p99 < 100ms.
URL shortener high-level design
Hardest sub-problem: key generation
How do you generate a unique 7-character base-62 key without a central bottleneck? Three approaches:
Hash of URL (SHA-256 truncated to 7 base-62 chars). Fast, deterministic. Collisions at scale: with 62^7 ≈ 3.5×10^12 possible keys, the birthday bound makes a first collision likely after only a few million URLs, so every insert must detect the conflict and re-hash with a salt.
Counter-based. Central counter → base62-encode. One DB row is a bottleneck. Partition the counter (each shard owns a range).
Pre-generated key pool. Batch job pre-allocates keys in a table; API pops the next free one. Decouples write path from key generation.
The pragmatic answer in production: each write-API shard reserves a 1000-key range from a central atomic counter (so one DB write every 1000 URLs), then hands out keys in-process. Simple, fast, rebalances on pod restart.
# Key generation: range reservation per pod
import threading
BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
class KeyAllocator:
def __init__(self, db):
self.db = db
self.lock = threading.Lock()
self.start = 0
self.end = 0
def _reserve(self, size=1000):
# Atomic: UPDATE counters SET next = next + size RETURNING next - size
row = self.db.fetch_one(
"UPDATE shortener_counter SET next = next + %s RETURNING next - %s",
(size, size))
self.start, self.end = row[0], row[0] + size
def next(self):
with self.lock:
if self.start >= self.end:
self._reserve()
v = self.start
self.start += 1
return self._encode(v)
@staticmethod
def _encode(n):
if n == 0: return "0"
digits = []
while n:
digits.append(BASE62[n % 62])
n //= 62
return "".join(reversed(digits)).rjust(7, "0")
Back-of-envelope
Storage: 6B URLs × 500 bytes (long URL + metadata) = 3 TB. Fits in a sharded Postgres or DynamoDB easily.
Read QPS: 12k peak. One Redis primary does that; replicate across 3 AZs.
Cache hit rate: power-law skew means top 10% of links drive 90% of traffic. Plan 95%+ hit rate.
Write QPS: 120 peak. Trivial for any DB.
Cost: <$500/month on AWS at this scale.
Failure modes
Custom aliases collide with generated keys; reserve a prefix (e.g., all generated keys start with 1, customs with a letter).
Malicious shorts: phishing, malware. Run safe-browsing checks before redirect; allow takedown.
Hot-link bursts: viral tweet points to one short. Cache-aside handles it; otherwise shard the single DB row.
URL shortener: read vs write path latency
Consistent-hash ring for KV shards
Deep dive: collision handling
If keys are SHA-truncated-to-42-bits, the birthday bound says collision probability hits 50% around 2^21 ≈ 2M keys. In practice you don't wait for 50%; you detect collisions on INSERT IF NOT EXISTS and re-hash with a salt. The cost is one extra round-trip on ~0.01% of writes — acceptable when writes are 120 QPS. For counter-based schemes collisions are impossible by construction; the bottleneck becomes the counter itself. The production answer is almost always: counter with range reservation → base62 encode. It sidesteps collision logic entirely and scales linearly with the number of write pods.
Deep dive: custom aliases vs base62
Custom aliases (bit.ly/my-event) share the same namespace as generated keys. Three robust strategies:
Prefix split: generated keys are 7 chars starting with a digit; customs must start with a letter. No runtime collision check.
Separate table: custom_aliases joined at lookup. Slower read but cleanly isolates the two flows.
Unified table with uniqueness constraint: on INSERT of a custom, fail if it collides with a generated key. Cleanest but requires probing.
Validation: customs must avoid the reserved prefix set (admin paths like /login, /api), satisfy length bounds, and pass a profanity filter to prevent brand damage. Case-folding: BitLy/Sale and bitly/sale should collide or not by explicit policy — Twitter shorts are case-preserving but lookup is case-insensitive; bit.ly is case-sensitive. Pick one and stick with it.
Deep dive: analytics sampling
Full per-click analytics at 12k RPS = 1B events/day. Write-every-click into a time-series DB is feasible but costly. Tiered approach: every click writes a Kafka event; a Flink/Spark Streaming job bucketizes into 1-min counts per key + per geo + per referrer; durable daily rollups land in a columnar store (ClickHouse, BigQuery). For long-tail links, track unique visitors with HyperLogLog (~12 KB of state gives <1% error even at 10^9 distinct visitors). Cross-reference Module 11 caching for hot-key detection via Count-Min Sketch.
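A toy, single-process stand-in for the streaming rollup described above; a real deployment runs this logic in Flink keyed by short-link and replaces the per-key visitor sets with HyperLogLog sketches. All names here are illustrative.

# 1-minute click rollup per (key, geo, referrer), plus per-key uniques
import time
from collections import defaultdict

class MinuteRollup:
    def __init__(self):
        self.counts = defaultdict(int)      # (minute, key, geo, referrer) -> clicks
        self.visitors = defaultdict(set)    # (minute, key) -> distinct visitor ids

    def on_click(self, key, geo, referrer, visitor_id, ts=None):
        minute = int((ts or time.time()) // 60)
        self.counts[(minute, key, geo, referrer)] += 1
        self.visitors[(minute, key)].add(visitor_id)    # HLL sketch in production

    def flush(self, minute):
        # Emit finished buckets to the columnar store (returned here for brevity).
        rows = {k: v for k, v in self.counts.items() if k[0] == minute}
        uniques = {k: len(v) for k, v in self.visitors.items() if k[0] == minute}
        return rows, uniques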
Twitter-like Feed
Scope. Users post short messages; each user has followers; home timeline shows posts from people they follow in reverse-chronological (or ranked) order. 300M DAU; average user follows 200 accounts; celebrities followed by 100M+; peak post rate 500/s; peak read rate 200k/s.
Twitter feed HLD (fan-out hybrid)
Hardest sub-problem: fan-out
Two models, each with fatal drawbacks at scale:
Fan-out on write (push). When A posts, copy the post ID to every follower's timeline list. Good: reads are O(1). Bad: if A has 100M followers, one post does 100M list inserts. Celebrities melt the system.
Fan-out on read (pull). On feed load, fetch latest posts from each followed account and merge. Good: writes are O(1). Bad: reads are O(followees) — 200 Redis calls per feed open.
Hybrid (the actual answer): fan-out on write for normal accounts (< 10k followers), fan-out on read for celebrities (>= 10k followers). Each user's feed is the merge of "pushed posts in my inbox" + "pulled posts from celebrities I follow." This is what Twitter and Instagram actually do.
# Feed read: merge pushed inbox with pulled celebrity timelines
def home_timeline(user_id):
    pushed = redis.lrange(f"feed:{user_id}", 0, 200)   # already merged timeline
    celeb_followees = user_service.get_celeb_followees(user_id)
    pulled = []
    for c in celeb_followees:
        # Each celeb has a cached recent list (small, shared by all followers)
        pulled.extend(redis.lrange(f"user_posts:{c}", 0, 50))
    combined = merge_by_time(pushed, pulled)[:200]
    return post_service.hydrate(combined)
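The write path is the mirror image of the read above. A hedged sketch of fan-out on write, assuming a follower_service lookup and the same bare redis client the read path uses; the threshold and trim length are illustrative.

# Feed write: push to follower inboxes, or skip fan-out for celebrities
CELEB_THRESHOLD = 10_000
INBOX_LEN = 800                      # keep a few screens of post IDs per inbox

def on_post_created(author_id, post_id):
    redis.lpush(f"user_posts:{author_id}", post_id)     # author's own timeline
    redis.ltrim(f"user_posts:{author_id}", 0, 50)
    followers = follower_service.get_follower_ids(author_id)
    if len(followers) >= CELEB_THRESHOLD:
        return                       # pull path: followers fetch user_posts at read time
    pipe = redis.pipeline(transaction=False)
    for uid in followers:
        pipe.lpush(f"feed:{uid}", post_id)
        pipe.ltrim(f"feed:{uid}", 0, INBOX_LEN - 1)
    pipe.execute()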
Back-of-envelope
Write amplification: 500 posts/s × average 200 followers = 100k Redis LPUSHes/s. One Redis shard with pipelining handles this; budget extra capacity for heavy accounts sitting just under the 10k threshold, since they still take the push path.
Read QPS: 200k/s feed opens × 1 Redis call per feed = 200k Redis ops/s. Shard by user ID across 8–16 primaries.
Timeline storage: 300M users × 200 post IDs × 8 bytes = 480 GB. Fits in Redis cluster.
Post storage: 500 × 86400 = 43M posts/day × 500 bytes = 21 GB/day = 7.8 TB/year. Cassandra or DynamoDB, partitioned by user.
Failure modes
Celebrity threshold tuning: too low → bad tail reads; too high → bad celeb writes. 10k is typical.
Timeline drift: if a push is lost, a user sees a hole. Occasional full rebuild from source DB.
Ranking: chronological is simple; "algorithmic" adds a re-ranker (see Module 10).
Spam & deletions: deleted posts must be filtered at read time; do not try to chase down 100M timeline entries.
Fan-out strategies contrasted (push vs pull vs hybrid)
Feed generation pipeline (write -> fanout -> read)
Deep dive: celebrity hybrid threshold
The "10k follower" cutoff is the parameter that the entire system pivots around. Below the threshold the user is "push": writes fan-out to every follower inbox. Above, the user is "pull": reads for anyone who follows them must include a celebrity-timeline fetch. The right threshold depends on your Redis write budget. If pushers are bounded by W write-ops/s globally, and normal users average Fnorm followers, the threshold T satisfies posts_per_sec * F_norm < W. Empirically Twitter settled near 10k; Instagram is closer to 5k because stories fan-out is heavier. Below the threshold there are still outliers: a journalist with 9k followers who suddenly goes viral must be promoted to celebrity status without losing in-flight pushes. The operational pattern: maintain a "demotion queue" that freezes new fan-outs, flushes the old ones, then flips a flag that switches read path.
Deep dive: home timeline vs user timeline
Two distinct stores even though the words sound similar:
User timeline = "all posts I authored." Written once on post. Keyed by author_id. Useful for profile views and celebrity fan-out-on-read. Cassandra or DynamoDB, partitioned by author.
Home timeline = "merged feed shown to me." Written 200× per post due to fan-out. Keyed by viewer_id. Usually Redis-only (ephemeral); rebuild on cache miss by walking followees.
Conflating them in an interview is the most common anti-pattern. The home timeline is one-way derived data; it can always be rebuilt; losing a shard is recoverable. The user timeline is source-of-truth; losing it means lost posts.
Deep dive: ranked feed scoring
Once the merge candidate list exists (~500 items), re-rank by engagement prediction. Two-tower retrieval (embeddings) selects candidates; a gradient-boosted or small DNN model scores them. Features: recency decay, author affinity, post engagement velocity, content type (image/video/text), session signals. Budget: ~50ms for 500 candidates on a CPU inference server. Cross-ref Module 10 (Search & Recommendations) for the retrieve-rank-re-rank pattern.
WhatsApp / Chat
Scope. 1-to-1 and group chat. 2B users. 100B messages/day = ~1.2M messages/s average, ~3M/s peak. End-to-end encrypted. Online presence, read receipts, message history. Must work on flaky mobile networks.
Hardest sub-problem: routing and presence
A chat client is online when it holds a TCP (WebSocket/XMPP) connection to a gateway. To route "Alice → Bob," the system must know which gateway Bob is on. That mapping changes every time Bob's phone goes through a tunnel.
Session registry: user_id → gateway_id. Updated on connect/disconnect. Stored in Redis with short TTL, refreshed by heartbeat.
Message routing: sender's gateway looks up recipient's gateway, forwards message. If recipient is offline, enqueue to persistent store and schedule APNs/FCM push.
Ordering: within a chat, messages must appear in send order. Use a per-chat monotonic sequence number (source of truth on the server).
# Simplified message flow
def on_send(sender_conn, chat_id, body):
    seq = chat_seq.incr(chat_id)          # monotonic per chat
    msg = {"chat": chat_id, "from": sender_conn.user_id, "seq": seq,
           "body": body, "ts": time.time()}
    msg_store.append(chat_id, msg)        # durable log per chat
    for uid in chat_members(chat_id):
        gw = session_registry.get(uid)
        if gw and gw != sender_conn.gateway_id:
            router.send(gw, uid, msg)     # forward via internal bus
        elif gw == sender_conn.gateway_id:
            gw_local_push(uid, msg)
        else:
            # offline - queue + push notification
            offline_queue.push(uid, msg)
            push_service.notify(uid, preview(msg))
Back-of-envelope
Connections: 1B concurrent online ÷ 1M connections per gateway = 1000 gateway pods. Each pod 16 cores, 32 GB RAM.
Message storage: 100B msg/day × 200 bytes = 20 TB/day. Retention depends on product policy; WhatsApp stores on client, server only queues for delivery.
Session registry: 2B entries × 50 bytes = 100 GB. Sharded Redis cluster with LRU on stale entries.
Peak send: 3M/s × average 10 recipients (group) = 30M router hops/s. Scale router horizontally; shard by chat_id.
Failure modes
Flaky networks: client must retry with same message-id; server dedups. Classic idempotency.
Group fan-out explosion: a 500-person group with one sender = 500 pushes. Rate-limit admins; cap group size.
Delivery ordering: out-of-order arrivals are resequenced by client using server-assigned seq number.
E2E encryption: server can see metadata (who → who, when) but not content. Complicates abuse reporting.
Session registry + multi-region routing
Message delivery timeline with offline queue
Deep dive: double-tick semantics
Four observable states on a sent message: sending (no network ack), sent (server persisted, single tick), delivered (recipient device acknowledged, double grey tick), read (recipient opened chat, double blue tick). The ticks are just enum fields on the message but the state machine is surprisingly subtle. Delivered ACK must be idempotent: if Bob's device re-connects and re-downloads, it mustn't re-ACK the same message ID and spam Alice's device. Read receipts are privacy-optional; if Bob disables them the server still tracks internally (for unread-badge counts) but suppresses the notification back to Alice. The source of truth is the per-chat max_seq_delivered and max_seq_read pair, stored per-participant, not per-message.
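A minimal sketch of the idempotent delivered-ACK handler implied above, assuming a delivery_state KV store holding max_seq_delivered per (chat, participant) and helper functions chat_members / push_to_devices; all names are illustrative.

# Delivered ACK: advance the per-participant watermark; duplicates are no-ops
def on_delivered_ack(chat_id, participant_id, acked_seq):
    key = (chat_id, participant_id)
    current = delivery_state.get(key, 0)          # max_seq_delivered so far
    if acked_seq <= current:
        return                                    # re-download / retry: already recorded
    delivery_state.set(key, acked_seq)
    # Fan the new watermark back to other members so their ticks update exactly once.
    for uid in chat_members(chat_id):
        if uid != participant_id:
            push_to_devices(uid, {"chat": chat_id,
                                  "participant": participant_id,
                                  "delivered_up_to": acked_seq})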
Deep dive: E2E encryption (Signal protocol)
WhatsApp uses the Signal protocol: X3DH for the initial key exchange, the Double Ratchet for forward secrecy, and Sesame for multi-device session management. Each message is encrypted with a per-message symmetric key derived from a chain, so compromising one key does not compromise history or future. Server sees envelope (sender id, recipient id, timestamp, size) but not plaintext. Implications for server design:
Server cannot dedup message bodies (each ciphertext looks random).
Group messages are still pairwise E2E: the sender encrypts N copies with N recipient keys. This is why a 500-member group does 500 encryptions on the sender device — bandwidth and CPU cost lives on the client.
Abuse / moderation must run on the client (Messages from Meta scans on-device) or on metadata.
Deep dive: group chat fan-out at 500+ members
Naive: sender encrypts 500 copies, uploads 500 ciphertexts, server routes to each recipient. At 500 members × 5 KB = 2.5 MB per send — fine on Wi-Fi, painful on cellular. Optimization: sender keys. Sender distributes a symmetric group key to each member once (pairwise E2E), then encrypts one ciphertext with the group key and uploads that once; server fans out the one ciphertext to all members. Forward secrecy is weaker (compromise of the group key exposes messages until rotation). Rotation: every N messages or when membership changes. This is how WhatsApp handles 1024-member groups without melting sender devices.
Uber Dispatch
Scope. Rider requests a car; system finds the nearest available driver; both get live ETA and map updates. 100M riders, 5M drivers; 15M rides/day; 10k dispatch decisions/s peak; location updates every 4s from each active driver = ~1M location updates/s.
Uber dispatch HLD
Hardest sub-problem: geospatial matching
For every rider request, find available drivers within ~3 km in under 100ms, in a city where there are thousands of drivers. The geospatial index is the whole game.
Grid cells (Uber H3 / Google S2). Earth tiled into hexagonal cells (H3) or square cells (S2). Each driver location maps to one cell ID. Query = "cells within radius R of rider cell."
R-tree / k-d tree. Classic spatial index, but worse for distributed systems because rebalancing is tricky.
Uber's actual approach: dispatcher nodes each own a set of H3 cells. Driver location updates route to the owning dispatcher (consistent hash on cell ID). When a rider requests, the API looks up the rider's cell and routes to the same dispatcher, which has all nearby drivers in local memory.
# H3-style dispatch sketch
import h3

def on_driver_update(driver_id, lat, lng):
    cell = h3.geo_to_h3(lat, lng, resolution=9)      # ~150m cells
    dispatcher = cell_owner(cell)
    dispatcher.upsert(driver_id, (lat, lng, time.time()))

def dispatch(rider_lat, rider_lng, radius_m=3000):
    rider_cell = h3.geo_to_h3(rider_lat, rider_lng, resolution=9)
    # k-ring at resolution 9 (~150m) = ceil(3000 / 150) = 20 rings
    nearby_cells = h3.k_ring(rider_cell, 20)
    candidates = []
    for c in nearby_cells:
        owner = cell_owner(c)
        candidates.extend(owner.drivers_in(c))
    # Rank by ETA (driving distance, not straight line); estimate_eta is a
    # placeholder road-graph routing helper.
    ranked = sorted(candidates, key=lambda d: estimate_eta(rider_lat, rider_lng, d))
    return ranked[:5]
Back-of-envelope
Location writes: 1M/s × 50 bytes = 50 MB/s. One Kafka topic at acks=1 handles it.
Dispatcher fleet: 5M drivers ÷ 10k per pod = 500 pods. Cells sharded by consistent hash.
Deep dive: surge pricing
Surge is a per-cell, per-time multiplier on the base fare. Compute it as surge = clamp(demand / supply, 1.0, 5.0) where demand = rider requests in the last 5 minutes and supply = available drivers. Implementation: streaming aggregation on Flink keyed by H3 cell; 30-second tumbling windows; exponential-moving-average smoothing to avoid flicker. Surge must be quoted up-front — a rider sees the multiplier before confirming. Once confirmed, the multiplier is locked on the trip row (snapshot pricing). If surge drops mid-trip the rider still pays the locked rate; if it rises they don't. Anti-gaming: cap surge per cell per hour; detect drivers collectively going offline to manufacture surge (flagged by correlating offline events with surge spikes).
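A toy per-cell calculator for the formula above, with EMA smoothing standing in for the Flink job; the smoothing factor and window inputs are illustrative.

# Surge = clamp(smoothed demand/supply, 1.0, 5.0), one value per H3 cell
class SurgeCalculator:
    def __init__(self, alpha=0.3, cap=5.0):
        self.alpha = alpha            # EMA factor applied per 30-second window
        self.cap = cap
        self.smoothed = {}            # cell -> smoothed demand/supply ratio

    def on_window(self, cell, requests_5min, available_drivers):
        raw = requests_5min / max(available_drivers, 1)
        ema = self.alpha * raw + (1 - self.alpha) * self.smoothed.get(cell, 1.0)
        self.smoothed[cell] = ema
        return max(1.0, min(ema, self.cap))

# e.g. 120 requests against 40 idle drivers trends the cell toward ~3x (capped at 5x)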
Deep dive: ETA prediction pipeline
Two ETAs: "driver to rider" and "rider to destination." Both computed with a road-graph router (not straight-line). Offline batch builds the routable graph from OSM + proprietary corrections; online inference uses recent 5-minute traffic from GPS pings per edge. Feature vector per request: origin cell, dest cell, time-of-day one-hot, weather, day-of-week, edge-average-speed over last 15 min. Model: XGBoost typically, served as ONNX on CPU, sub-10ms per prediction. Cross-reference database Module 12 for streaming aggregation.
Deep dive: dispatch optimization as bipartite matching
Naive dispatch matches the first rider to the nearest driver greedily; this is the Hungarian algorithm's first iteration. But if you batch N riders and M drivers over a 3–5s window, you can solve min-cost bipartite matching globally — reducing average ETA across the batch by 5–15%. Uber's paper "Matching Statistics" calls this the "dispatching dispatch." Cost = ETA-to-pickup for each (rider, driver) pair; solve with Kuhn-Munkres in O(n^3) — fine for N < 1000 per batch. Failure mode: stale drivers (location ping older than 15s) get excluded to prevent assigning to a driver who's already moved. Cross-link: DSA Module 8 for graph matching.
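A compact sketch of the batched assignment using SciPy's linear_sum_assignment (an optimal-assignment solver); the eta_seconds cost function is an assumed helper.

# Batched dispatch: minimize total pickup ETA over a 3-5s window of riders/drivers
import numpy as np
from scipy.optimize import linear_sum_assignment

def batch_dispatch(riders, drivers, eta_seconds):
    # cost[i][j] = predicted pickup ETA if driver j serves rider i
    cost = np.array([[eta_seconds(r, d) for d in drivers] for r in riders])
    rows, cols = linear_sum_assignment(cost)      # handles rectangular matrices
    return [(riders[i], drivers[j]) for i, j in zip(rows, cols)]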
Google Drive / Dropbox (storage)
Scope. Users upload files of any size; files sync to all their devices; files can be shared with other users. 1B users; average 10 GB stored per user = 10 EB total; 100k file ops/s peak.
File sync HLD
Hardest sub-problem: chunking, dedup, and resumable uploads
Uploading a 10 GB file over a flaky network will fail if you treat it as one request. The design splits into chunks (typically 4 MB); each chunk is stored once system-wide (content-addressed by hash).
Chunking: split file into 4 MB blocks. Compute SHA-256 per chunk. File metadata = ordered list of chunk hashes.
Dedup: blob store keyed by hash. PUT /blobs/<sha256> is a no-op if the blob already exists. A user re-uploading a shared PDF adds zero bytes globally.
Resumable uploads: client POSTs chunk by chunk; server tracks committed chunks; client can resume from first missing chunk.
Delta sync: on update, compute which chunks changed (rolling hash / rsync). Only push changed chunks.
import hashlib

CHUNK = 4 * 1024 * 1024

def upload_file(path, file_id):
    manifest = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            h = hashlib.sha256(chunk).hexdigest()
            manifest.append(h)
            if not blob_store.exists(h):
                blob_store.put(h, chunk)   # idempotent on hash
    metadata_service.put_manifest(file_id, manifest)
Back-of-envelope
Storage: 10 EB nominal; dedup ratio 1.5–3×; realistically 3–6 EB on disk. S3-equivalent at $0.023/GB/mo = roughly $70M–$140M/month just for storage. A cold tier (Glacier-class) cuts that ~10× for old files.
Metadata DB: 1B users × 1000 files × 200 bytes metadata = 200 TB. Sharded Spanner or CockroachDB; strongly consistent (a file must exist when metadata says it does).
Notify fan-out: a file change pushes to all the owner's and shared devices — typically <20. Cheap.
Failure modes
Concurrent edit conflict: two clients edit same file offline, then sync. Either last-write-wins with a conflict copy (Dropbox), or operational transform / CRDT for collaborative editing (Google Docs).
Large file on mobile: resumable upload with exponential backoff; defer upload to Wi-Fi.
Deleted blob referenced by shared link: reference count in metadata; garbage-collect blobs with zero refs after grace period.
Quota enforcement: must be globally consistent (otherwise a user can burst past quota on multiple devices). Pre-reserve bytes in metadata; commit on success.
Chunking + content-addressed dedup + CDN
Conflict resolution for concurrent edits
Deep dive: Merkle tree chunking
File manifest is really a Merkle tree: leaves = chunk hashes, internal nodes = hash of concatenated children, root = file identity. Benefits: (1) verifying a download only requires hashing each received chunk and comparing to the leaf hash; (2) delta sync between two file versions reduces to comparing tree roots, then drilling into subtrees that differ — O(log n) to find changed leaves; (3) cross-file dedup is free because the same chunk hash appears in any file that happens to contain that block. Dropbox uses a 4 MB fixed chunk size; Git (packfiles) uses similar machinery; rsync's rolling hash is a variant. The root hash also makes garbage collection auditable: orphan blobs are those whose hash appears in no file's tree.
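A compact illustration of the tree build, assuming the per-chunk SHA-256 digests are already computed; the delta-sync step is then "compare roots, recurse into subtrees whose hashes differ."

# Merkle root over an ordered list of chunk hashes
import hashlib

def merkle_root(chunk_hashes):
    if not chunk_hashes:
        return hashlib.sha256(b"").digest()
    level = list(chunk_hashes)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])               # duplicate last node at odd levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]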
Deep dive: content-defined chunking (FastCDC)
Fixed 4 MB chunks break down when a user inserts a single byte near the start of a large file — every downstream chunk shifts and gets re-hashed. Content-defined chunking solves this: cut the file at points where a rolling hash (like Rabin-Karp or Gear) of the last few bytes matches a target pattern. Insert a byte near the start → only the chunk containing the insertion boundary changes; all later chunks keep their boundaries and hashes. FastCDC is a 2016 refinement that's ~10× faster than classic Rabin chunking and is what most modern sync clients use. Average chunk size 2–8 MB, min 1 MB, max 16 MB; the tradeoff is dedup effectiveness (smaller chunks = better dedup, more metadata).
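A toy gear-hash chunker to make the idea concrete. The sizes here are KB-scale so it is easy to test; production sync clients use the MB-scale averages described above, and FastCDC adds normalization tricks this sketch omits.

# Content-defined chunking with a rolling gear hash
import random

random.seed(1)
GEAR = [random.getrandbits(64) for _ in range(256)]   # fixed random byte table
MASK = (1 << 13) - 1                                   # ~8 KB average chunk (toy size)
MIN_SIZE, MAX_SIZE = 2 * 1024, 64 * 1024

def cdc_chunks(data: bytes):
    start, h = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            yield data[start:i + 1]                    # cut point chosen by content
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]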
Deep dive: collaborative editing (OT vs CRDT)
Google Docs uses Operational Transformation: ops (insert, delete, format) are sent to the server, which applies them in a single serialized order; late-arriving ops are transformed against ops that landed first so intent is preserved. Needs a central authority (the Docs server). Figma, Yjs, Automerge use CRDTs: ops carry vector clocks or Lamport timestamps; merges are commutative so any peer can apply in any order and converge. CRDTs tolerate arbitrary topologies (P2P, offline) but their data structures are bigger (per-character metadata for text CRDTs like RGA). For a file sync product with diverse clients, CRDTs scale to more offline cases; for a cloud-first editor, OT is simpler. Cross-link: DSA Module 8 (graphs) for dependency-tracking CRDT internals.
YouTube (video)
Scope. Users upload video; others stream it globally. 2B monthly users; 500 hours uploaded per minute; 1B hours watched per day; p99 start-play < 1s.
YouTube HLD
Hardest sub-problem: transcoding pipeline
A 1-hour 4K raw upload is ~20 GB. Viewers watch on everything from a flip phone on 3G to a 4K TV on fiber. The platform must produce dozens of (resolution, bitrate) variants efficiently.
Chunked transcoding: split raw into 10-second segments; transcode each segment in parallel across a worker fleet. A 1-hour video = 360 segments × 10 variants = 3600 tasks; done in minutes on 500 machines.
Adaptive bitrate (ABR): HLS or DASH manifests list variants; player picks based on measured bandwidth; switches per segment without rebuffering.
Content-aware encoding: per-scene bitrate tuning (Netflix calls it "per-title encoding"); same perceived quality at 30–50% fewer bytes.
Stall recovery: player buffers ahead; if buffer drops below threshold, switch to lower bitrate.
# Pipeline orchestration sketch
def on_upload_complete(video_id, raw_blob_url):
    segments = split_ffmpeg(raw_blob_url, segment_seconds=10)
    tasks = []
    variants = [
        (3840, 2160, 15_000_000),   # 4K
        (1920, 1080, 4_500_000),    # 1080p
        (1280, 720, 2_000_000),     # 720p
        (854, 480, 800_000),        # 480p
        (640, 360, 500_000),        # 360p
    ]
    for seg in segments:
        for (w, h, br) in variants:
            tasks.append({"video": video_id, "seg": seg.idx,
                          "w": w, "h": h, "br": br, "in": seg.url})
    queue.enqueue_batch("transcode", tasks)
    # A separate job assembles the manifest when all variant segments are done.
Deep dive: recommendation pipeline (two-tower -> ranking -> re-ranking)
YouTube's recommender has three stages:
Candidate generation (two-tower): a user tower encodes (watch history, demographics, session) into an embedding; a video tower encodes (title, tags, creator, topic) into an embedding. Retrieval = approximate-nearest-neighbor over the video tower index (ScaNN or HNSW). Picks ~1000 candidates from 10^9 videos in <20 ms (a brute-force stand-in is sketched after this list).
Ranking: a deep network scores (user, video, context) → P(watch), P(like), expected watch time. Scores the 1000 candidates in ~100 ms on TPU.
Re-ranking: diversification, freshness boost, policy filters (no re-showing just-watched, enforce creator diversity), bandit exploration mixing. Outputs final ~25 items.
Cross-ref Module 10 for the general retrieve-rank-re-rank pattern and vector DBs.
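A brute-force stand-in for the stage-1 retrieval, assuming user and video embeddings are already trained; at 10^9 videos the exhaustive dot product is replaced by an ANN index (ScaNN, HNSW), but the interface is the same.

# Stage 1: top-k candidate videos by dot-product similarity to the user embedding
import numpy as np

def retrieve_candidates(user_vec, video_matrix, k=1000):
    scores = video_matrix @ user_vec                 # (num_videos,)
    top = np.argpartition(-scores, k)[:k]            # unordered top-k indices
    return top[np.argsort(-scores[top])]             # best-first

rng = np.random.default_rng(0)
videos = rng.standard_normal((100_000, 64)).astype(np.float32)   # toy index
user = rng.standard_normal(64).astype(np.float32)
candidates = retrieve_candidates(user, videos)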
Deep dive: storage tiering (hot / warm / cold)
Viewing follows a power law: top 0.01% of videos drive 50% of watch-hours. Tiers:
Hot: last-90-day views > 100k. Kept on SSD at POPs globally. ~1% of catalog, ~60% of bytes served.
Warm: in regional object storage; fetched to POP on demand with a 200–500 ms origin-fill.
Cold: Glacier-class tape or SMR disk; 1–5 s first-byte; acceptable for the long tail (archive history).
Transitions are automated on view-count decay. Re-heat is triggered by a spike detector — if warm-tier requests cross a threshold in <10 min, promote back to hot.
Deep dive: thumbnail generation
On upload, extract ~100 candidate frames using a shot-boundary detector; score them with a CNN trained on click-through predictions; keep top 3; surface via A/B test bandit (Thompson sampling) so the best-performing thumbnail wins across actual viewers. The model also filters unsafe frames (nudity, violence). Thumbnail bytes live on the same CDN as the video; size ~50 KB per candidate. Cross-link: database Module 9 (NoSQL) for storing per-video thumbnail metadata.
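A minimal Beta-Bernoulli Thompson sampler for the thumbnail bandit described above; stats holds (clicks, impressions) per surviving candidate and every serve samples a CTR from each posterior.

# Thompson sampling over thumbnail candidates
import numpy as np

def pick_thumbnail(stats, rng=None):
    rng = rng or np.random.default_rng()
    # One Beta(clicks+1, misses+1) posterior per candidate; serve the argmax sample.
    samples = [rng.beta(clicks + 1, impressions - clicks + 1)
               for clicks, impressions in stats]
    return int(np.argmax(samples))

# Candidate 1 has the best observed CTR and will be served most, but not always.
print(pick_thumbnail([(12, 400), (30, 410), (18, 395)]))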
Dropbox-style Sync (continuous)
Scope. Like Google Drive but focused on native desktop + mobile clients that keep local folders in sync. Think Dropbox, OneDrive, iCloud Drive. 500M users; sync latency p99 < 5s; must work on LAN-speed and dial-up alike.
Dropbox sync pipeline
Hardest sub-problem: sync coordination
A Dropbox client watches a local folder; when a file changes, it must push the change to the server, then the server must push to every other client of that user. And do this for 500M users without thundering-herd fan-out.
Block-level dedup (same as Drive). Only changed chunks move.
Notification channel: long-poll or WebSocket per device; server pushes "namespace X changed" — client pulls the diff.
Metadata journal: ordered log of operations per namespace (a folder tree). Clients keep a local copy of the journal cursor and replay new operations.
Conflict resolution: last-write-wins with a "conflict copy" file if both sides edited while offline.
# Client-side sync loop (conceptual)
def sync_loop(ns):
    cursor = state.get_cursor(ns)
    while True:
        events = notify_service.long_poll(ns, cursor, timeout=60)
        for ev in events:
            if ev.kind in ("added", "modified"):
                chunks = metadata.get_chunks(ev.file_id)
                for c in chunks:
                    if not local.has(c.hash):
                        blob = blob_store.get(c.hash)
                        local.write_chunk(c.hash, blob)
                local.assemble(ev.file_id, chunks)
            elif ev.kind == "deleted":
                local.remove(ev.file_id)
            cursor = max(cursor, ev.seq)
        state.set_cursor(ns, cursor)
Back-of-envelope
Notify connections: 500M devices online ÷ 1M per long-poll pod = 500 pods. Same as chat.
Metadata DB: sharded by user namespace; journal is append-only so a KV store like FoundationDB fits.
Network savings: block-level dedup typically saves 30–50% bytes; rolling hash delta saves another 50–90% on edit-in-place files. Hugely important when users share 4GB video projects.
Cross-device sync latency: 3–5s p99 realistic; the bottleneck is usually the slower client's last-mile upload.
Failure modes
Sync loops: a client that incorrectly re-emits events can pummel the server. Add fingerprints and rate-limit per device.
Cross-platform filename compatibility: case sensitivity, illegal characters, metadata differences. Normalize on the server; record the original name.
Offline deletes vs remote edits: resolve with "delete wins after T or unless recently edited remotely." Defaults surprise users; make the rule visible.
Large first sync: a new device joining a 500 GB folder. Allow background sync; don't block UI.
Dropbox's "delta sync" splits files into 4 MB blocks (classically fixed-size; newer builds use content-defined chunking). On file save, the client hashes each block and sends only blocks whose hash is new to the server. When downloading, it fetches only the blocks it doesn't already have locally. A 2 GB Photoshop file save that modifies 10 MB of pixels uploads ~10 MB, not 2 GB. The server keeps an immutable block store keyed by content hash (see Google Drive section). Cross-user dedup saves global storage; per-user dedup (common on enterprise) prevents the "I can detect if you have file X" side-channel.
Deep dive: bandwidth throttling
A naive sync client will saturate the user's uplink and ruin their video call. Production clients throttle (a token-bucket sketch follows this list):
Per-connection upload cap (user-configurable: "Don't use more than 1 Mbps").
TCP congestion signal: watch RTT inflation on the control connection; back off uploads if latency climbs — the "LEDBAT" style (BitTorrent's approach).
Activity sensing: pause uploads during video calls (detected via OS APIs on macOS/Windows).
Priority queue: small files (docs, recently-opened) pushed before multi-GB backups.
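A minimal token bucket for the per-connection cap above; the LEDBAT-style RTT back-off would adjust rate_bps dynamically, which this sketch omits.

# Token-bucket upload throttle: never exceed the user-configured cap
import time

class UploadThrottle:
    def __init__(self, rate_bps, burst_bytes=256 * 1024):
        self.rate = rate_bps / 8.0          # bytes per second
        self.burst = float(burst_bytes)
        self.tokens = float(burst_bytes)
        self.last = time.monotonic()

    def acquire(self, nbytes):
        # Block until nbytes of budget has accrued, then spend it.
        while True:
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)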
Deep dive: LAN sync
Two devices of the same user on the same LAN can transfer directly instead of round-tripping to the cloud. Discovery via mDNS/Bonjour; a device broadcasts _dropbox-lansync._tcp and advertises which block hashes it holds. Peer asks "do you have block X?"; peer responds with block bytes over the LAN. Gigabit LAN >> 50 Mbps cable upload; a household that shares a 10 GB video between a laptop and a NAS finishes in ~80 s vs 30 min over cloud. Security: LAN sync uses the same E2E pairwise keys as cloud; a malicious LAN peer can't inject garbage because blocks are content-addressed. Cross-link: networking Module 5 for mDNS and zero-config discovery.
Netflix Streaming / Open Connect
Scope. 260M paying subscribers; 250M hours watched/day; catalog ~15k titles; 4K HDR up to 25 Mbps per stream; p99 start-play < 2 s globally, including subscribers in Jakarta and Lagos. Unlike YouTube, Netflix owns its CDN (Open Connect) end-to-end.
Requirements. Tens of millions of concurrent streams at peak during global Squid-Game-type releases; 200 Tbps aggregate egress at peak; 95%+ traffic must be served from within-ISP cache appliances to keep cost and latency tolerable.
Open Connect push CDN architecture
Adaptive bitrate ladder selection
Deep dive: Open Connect push model
Unlike Cloudflare or Fastly (pull CDNs, lazy fill on miss), Netflix pushes content before it's watched. Each night during the ISP's off-peak window, Netflix's "fill service" uploads the next day's hot-catalog to each Open Connect Appliance (OCA) — a 1U server sitting in an ISP's rack running FreeBSD + NGINX + nvme. Prediction model uses viewer history + time-of-day + day-of-week + global release calendar to pick what 10 TB of content each OCA should hold from a catalog that's ~200 TB encoded. Hit rate at the OCA: 95%+. The 5% misses are filled via IX peering back to AWS; the cost is tail-latency on rare content.
Deep dive: adaptive bitrate
Client player (built on the Netflix DASH spec) measures download time of each 4-second segment and estimates available bandwidth. Picks the highest rung whose bitrate < 80% of measured bandwidth (safety margin). Buffer target: 30 s forward buffer. If buffer drops below 10 s, aggressively step down; if it grows beyond 60 s, step up. The per-title ladder means each title has its own (resolution, bitrate) points — a 60 fps action scene gets 5.5 Mbps while a static talking-head documentary gets the same perceived quality at 2.5 Mbps. Cross-reference Module 4 (load balancing) for how Netflix steers clients to the "best" OCA via manifest URLs returned by the control plane.
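A toy rung selector for the logic above (80% safety margin, hard step-down when the buffer drains); the ladder values are illustrative per-title points, and the step-up-above-60s behavior is omitted.

# Adaptive bitrate: pick the highest rung under 80% of measured bandwidth
def pick_rung(ladder_bps, measured_bps, buffer_s):
    if buffer_s < 10:                         # buffer draining: drop to the floor
        return min(ladder_bps)
    safe = 0.8 * measured_bps
    eligible = [b for b in sorted(ladder_bps) if b <= safe]
    return eligible[-1] if eligible else min(ladder_bps)

ladder = [500_000, 2_000_000, 5_500_000, 16_000_000]   # per-title ladder, bps
print(pick_rung(ladder, measured_bps=8_000_000, buffer_s=35))   # -> 5_500_000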
Deep dive: predictive caching
Prediction features per OCA: (1) historical view counts at this OCA over trailing 7/28 days, (2) global trend velocity (is title X trending?), (3) editorial calendar (a new season of Stranger Things launches tomorrow), (4) recommendation spillover (if a subscriber finishes season 1, P(start season 2) is 90%). Cost function: minimize expected upstream fill bytes subject to per-OCA disk capacity. Solved nightly as a multi-knapsack problem per OCA. Wrong predictions = origin fill = degraded UX; over 10+ years Netflix has tuned this relentlessly.
Back-of-envelope
Peak egress: 200 Tbps. 95% from OCAs = 190 Tbps distributed across ~10,000 appliances = ~20 Gbps/OCA. Each OCA is a 2×100GbE box.
Storage per OCA: 10–30 TB NVMe. Catalog hot-set ~200 titles/day.
Nightly fill: 10 TB/OCA × 10k OCAs = 100 PB/night moved. That's why fills run 3 AM local — ISP backbones are empty.
Transcode cost: per-title ladder with 10 rungs × HDR variant × 3 codecs (AVC/HEVC/AV1) = 60 renders per title. 15k catalog = 900k render jobs. Run offline on spot instances.
Failure modes
OCA dies during prime time: traffic fails over to sibling OCA at same IX or to cloud-fill. Manifest has multiple CDN base URLs for automatic client-side failover.
Hot unpredicted release: "Baby Reindeer" goes viral overnight; prediction didn't flag it. Emergency fill triggered; cloud-fill backs the gap until OCAs converge.
ISP peering dispute: rare but real; bypass via alternative transit.
Codec fragmentation: old Smart TVs don't support AV1. Ship multiple codecs; pay the storage cost.
Stack Overflow-style Q&A
Scope. Public Q&A with reputation, tagging, voting, moderation, accepted answers. 100M MAU; 20M registered users; 25k new questions/day; 100k new answers/day; 2B page views/month; 700 RPS peak reads (heavily cached); full-text search across 25M questions.
Requirements. Strong consistency on reputation and vote counts (money-adjacent to users); tolerable staleness on view counts; full-text search P95 < 200 ms; moderation queue SLA 5 min for flagged content.
Q&A service topologyInverted index lookup for full-text search
Deep dive: reputation calculation
Every upvote on a user's answer = +10 rep; downvote = −2; accepted answer = +15 to answerer, +2 to asker; edit approval = +2. Computed per-vote in a transaction on the user_reputation table — strong consistency matters because reputation unlocks privileges (at 50 rep you can comment anywhere; at 3k you can moderate). Anti-cheat: daily cap (200 rep/day from votes); voting rings detected by correlating vote-graph anomalies offline. Reputation is denormalized onto the user row so that user-page renders don't aggregate every vote. A nightly job recomputes from vote history to catch drift (classic reconciliation pattern).
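A sketch of the per-vote transaction with the daily cap, assuming a db handle with transaction/fetch_one/execute and a user_reputation table (user_id, rep, rep_today, rep_day); names and schema are illustrative.

# Reputation event applied in a transaction, with the 200/day cap on vote-sourced rep
REP_DELTA = {"answer_upvote": 10, "answer_downvote": -2,
             "accept_for_answerer": 15, "accept_for_asker": 2, "edit_approved": 2}
DAILY_VOTE_CAP = 200

def apply_rep_event(db, user_id, event, today):
    delta = REP_DELTA[event]
    with db.transaction():
        rep, rep_today, rep_day = db.fetch_one(
            "SELECT rep, rep_today, rep_day FROM user_reputation "
            "WHERE user_id = %s FOR UPDATE", (user_id,))
        if rep_day != today:
            rep_today, rep_day = 0, today                   # new day resets the cap
        if event == "answer_upvote":
            delta = min(delta, DAILY_VOTE_CAP - rep_today)  # clamp vote-sourced gains
            rep_today += delta
        db.execute("UPDATE user_reputation SET rep = %s, rep_today = %s, rep_day = %s "
                   "WHERE user_id = %s", (rep + delta, rep_today, rep_day, user_id))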
Deep dive: moderation queue
Flags land in mod_queue with priority = flag-weight × user-rep-of-flagger × recency. Automated rules auto-delete (spam ML classifier score > 0.95), auto-approve (rep > 10k user's edit to their own post), or route to human moderators. Humans see a sorted queue with context (post history, author rep, flag text). SLA 5 min because spam posts at high visibility damage brand. Mod actions are themselves logged and reviewable — a bad mod can be un-modded. Cross-link: Module 11 for audit trails.
Deep dive: tag subscriptions and inverted index
Each question carries 1–5 tags. Users subscribe to tags; "unanswered in my tags" is a primary use case. The subscription × tag index is a many-to-many store: tag -> subscribers and user -> tags. Lookup "unanswered questions in my tags" = union over tags of sorted posting lists, filter by is_answered=false. Materialized view + incrementally maintained on new question post. Full-text search uses Elasticsearch; each document is a post with fields (title, body, tags, answers.body); scoring is BM25 + reputation prior + recency decay. Cross-reference DSA Module 5 for Bloom filters (used to quickly rule out missing terms before hitting posting lists) and Module 10 for search stack architecture.
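A minimal stand-in for the "unanswered in my tags" lookup: merge per-tag posting lists (question IDs kept newest-first) and filter out answered questions. In production this is the incrementally maintained materialized view plus Elasticsearch; the data structures here are plain dicts for illustration.

# Union of per-tag posting lists, filtered to unanswered questions
import heapq

def unanswered_in_my_tags(tag_postings, my_tags, is_answered, limit=50):
    # tag_postings: tag -> [question_id, ...] sorted descending (newest first)
    merged = heapq.merge(*(tag_postings.get(t, []) for t in my_tags), reverse=True)
    out, seen = [], set()
    for qid in merged:
        if qid in seen or is_answered.get(qid, False):
            continue
        seen.add(qid)
        out.append(qid)
        if len(out) == limit:
            break
    return out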
Back-of-envelope
Content size: 25M questions × 5 KB avg = 125 GB; 50M answers × 2 KB = 100 GB. Fits on one SQL box (Stack Overflow famously serves the whole site from a handful of beefy SQL Servers).
Cache hit: 95%+ page views are anonymous views of popular pages — entire HTML page caches to CDN for 60s. Real DB reads: ~5% × 700 RPS = 35 RPS. Trivial.
Hot question: a viral post on HackerNews sends 50k concurrent viewers. CDN absorbs it; origin sees only a handful of revalidation requests.
Mod action race: two mods delete the same post. Idempotent action log; second mod sees "already deleted."
Search staleness: new question visible in search only after Elasticsearch indexes (usually <1 min). Acceptable. Read-your-own-writes served from SQL for own posts.
Instagram Stories
Scope. Users post ephemeral photos/videos that auto-expire after 24 hours. Viewers tap through friends' stories in sequence; the poster sees per-story viewer lists. 500M DAU post or view; 1B stories posted per day; peak 20M concurrent viewers; must feel instant (p95 first-frame < 500 ms).
Requirements. TTL data model (24h expiry, no retention beyond); viewer list per story updated in real time; image/video processing (1080p, multiple aspect ratios); stickers and effects pre-rendered server-side or client-side depending on complexity.
Stories data model and processing
Viewer-count aggregation at scale
Deep dive: TTL data model
A story row has a hard 24-hour TTL. Implementation: stored in a datastore that supports TTL natively (Cassandra with USING TTL 86400, or DynamoDB with a TTL attribute). At read time the DB checks expiry and elides rows. Blob store retains bytes for 48 h (grace window) before actual deletion — this protects against late-arriving eventual-consistency reads. Metadata TTL + blob TTL must be independent to avoid a race where metadata expires first and the blob is orphan-GC'd before anyone notices. Archive: if the poster opts into "archive all stories," a nightly job copies expiring rows into a permanent table with different access patterns.
Deep dive: multi-region push
When a poster in Jakarta uploads, her followers span every continent. Two design options:
Lazy pull: poster writes to her home region; readers fetch across regions on first view. Simple but 200–400 ms tail latency for distant readers.
Eager push: upload replicates metadata to all regions via a global pub/sub (Kafka MirrorMaker, Pulsar Geo-Replication, or Meta's proprietary Wormhole). Blob bytes replicate via CDN origin-pull or proactive push.
Instagram uses eager metadata + lazy blob via CDN. Metadata is tiny (a few KB); pushing it to every region is cheap. Blobs are big; let CDN edges pull them on demand. P95 first-frame < 500 ms requires the CDN to have the blob close to the viewer within seconds of upload — proactive CDN warm for followers in bulk regions.
Deep dive: viewer count at scale
A celebrity post can attract 10M viewers in an hour. Writing 10M rows (one per viewer) is fine, but reading back the viewer list (poster sees "Alice, Bob, ... 9.8M others viewed") needs efficient pagination. Solution: write events into Kafka partitioned by story_id; a Flink consumer per partition maintains a sorted Redis set (score = view timestamp). Poster's UI paginates through the set lazily. For the overall count, use HyperLogLog to dedupe repeat views cheaply; for the list, the authoritative source is Kafka log — nightly batch job reconciles against Redis for any drift. Cross-reference Module 7 for Kafka partitioning and DSA Module 5 for HyperLogLog.
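A direct-to-Redis sketch of the state the Flink consumer maintains (sorted set for the paginated viewer list, HyperLogLog for the deduped count); in the text the writes go through Kafka first, which is omitted here.

# Viewer list (sorted set) + unique count (HyperLogLog), both expiring with the story
import redis

r = redis.Redis()
STORY_TTL = 24 * 3600

def on_story_view(story_id, viewer_id, ts):
    r.zadd(f"story_viewers:{story_id}", {viewer_id: ts})   # idempotent per viewer
    r.pfadd(f"story_uniques:{story_id}", viewer_id)
    r.expire(f"story_viewers:{story_id}", STORY_TTL)
    r.expire(f"story_uniques:{story_id}", STORY_TTL)

def viewer_page(story_id, offset=0, page=50):
    viewers = r.zrevrange(f"story_viewers:{story_id}", offset, offset + page - 1)
    total = r.pfcount(f"story_uniques:{story_id}")
    return viewers, total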
Deep dive: image/video processing
On upload, server-side pipeline:
Validation: format sniffing, malformed-file guard (images crafted to exploit decoders).
Resize ladder: 1080p, 720p, 320p (thumbnail); WebP/HEIC encodes for modern clients, JPEG fallback.
Video transcode: 15-second clips at 720p60 and 480p30; H.264 for compatibility, HEVC for efficient streaming.
NSFW / PhotoDNA scan: run image classifiers for policy violations; route to moderation queue if triggered.
Stickers/effects: simple AR effects applied client-side on upload; complex ones (e.g., background blur with a segmentation model) run on-device using CoreML/MLKit; server receives the final render.
Pipeline is asynchronous: upload returns as soon as the original blob is durable (~100 ms) while downstream variants populate within a few seconds. Story can be "viewable" before all variants are ready; the player falls back to the original if a ladder rung isn't yet available.
Failure modes
Celebrity view burst: 10M viewers in minutes. Kafka partition for that story hot-spots; mitigate with sub-partitioning (hash of viewer_id). Cross-link: database Module 8 for hotspot mitigation.
TTL miss: story appears in user's feed but blob already GC'd. Retain blobs 48h (grace). Alert if read-after-expire rate > 0.01%.
Client re-view counting: user scrolls back; must not double-count. Dedupe in Flink with a per-viewer-per-story Bloom filter over the 24h window.
Moderation latency: offensive content spreads for minutes before human review. Use automated pre-screens (CNN classifier) that gate visibility on borderline content until human-cleared.
Module 13
Interview Cheat Sheet
Every table, every number, every one-liner you need to pull up in an interview, in one place.
Latency Reference
Operation | Time | Mnemonic
L1 cache | 0.5 ns | "free"
Branch mispredict | 5 ns | "10 L1s"
Main memory | 100 ns | "RAM is not free after all"
Mutex lock/unlock | 25 ns |
Compress 1KB (Snappy) | 3 μs |
Send 1KB over 1Gbps | 10 μs |
SSD random 4KB read | 150 μs |
Read 1 MB RAM sequential | 250 μs |
Intra-DC round trip | 500 μs | "half a millisecond"
Read 1 MB SSD sequential | 1 ms |
HDD seek | 10 ms | "ten thousandths of a second"
Read 1 MB HDD sequential | 20 ms |
Transcontinental round trip | ~80 ms | "SF ↔ NYC"
Transatlantic round trip | ~150 ms | "CA ↔ NL"
Capacity rules of thumb
Per-server QPS
Nginx static: 50k RPS per core
Node/Go HTTP: 10–30k RPS
Postgres OLTP: 1–5k QPS
Redis: 100k ops/s single-thread
Kafka: 500 MB/s sequential write
MySQL: 2–10k QPS
Bandwidth
Home fiber: 1 Gbps
DC NIC: 25–100 Gbps
TOR switch: 3.2 Tbps
WAN: 10–100 Gbps
Inter-region: 10–40 Gbps typical
Unit sizes
Tweet: ~200 B (1 KB w/ envelope)
Chat msg: ~500 B
URL: ~100 B
Photo: ~300 KB
1-min 1080p H.264 (≈4.5 Mbps): ~35 MB
1-min 4K H.265 (≈10 Mbps): ~75 MB
86,400 seconds in a day
Load Balancing Algorithms
Algorithm | How | Good for | Avoid when
Round robin | Rotate in order | Identical backends, short requests | Variable request cost
Weighted RR | Weights per backend | Heterogeneous fleet, canary | Never; always start here
Least connections | Pick fewest-active | Long-lived/variable requests | Stateless short HTTP
Least latency | Pick fastest recent | Mixed hardware, geographic mix | Slow signal, noisy
IP hash | hash(ip) mod N | Poor man's stickiness | NAT / mobile; rebalance pain
Consistent hash | Ring w/ vnodes | Cache affinity, sharded stateful | Perfect load balance needed
Power-of-two choice | Pick 2, take lighter | Large fleet, no coordination | Tiny fleet (<5)
Random | Any backend | Simplicity, no stats | Anything production
Cache Patterns
Pattern | Read path | Write path | Consistency | Use when
Cache-aside | Cache miss → DB → fill | DB write → cache delete | Eventually consistent | Default; most read-heavy services
Read-through | Always via cache library | Same as cache-aside | Eventually consistent | Framework support; cleaner code
Write-through | Cache hit | Sync cache + DB | Strong within cache | Low-churn reference data
Write-back | Cache hit | Async DB write | Risky on cache loss | Write-heavy hot counters
Write-around | Cache miss → DB → fill | DB only, skip cache | Same as cache-aside | Write-once read-rare
Refresh-ahead | Cache hit | Proactive refresh near TTL | Smoother p99 | Predictable hot keys
Invalidation shortcuts
TTL: bound staleness, no bookkeeping.
Delete on write: fresh after the next read, but races with concurrent writes.