The definitive interview preparation guide covering every high-level design pattern tested at FAANG & Big Tech companies — from back-of-envelope math to globally distributed case studies.
13 Modules • 25+ Visualizations • 7 Full Case Studies • Complete Cheat Sheet
Module 1
Foundations & Back-of-Envelope
Latency numbers, storage scales, bandwidth budgets, and the QPS math that turns a whiteboard sketch into a defensible capacity plan.
Latency Numbers Every Programmer Should Know
The single most used mental model in any system design interview is Jeff Dean's latency table. Internalize it to the point where you can reason in your head about whether a design will feel fast or feel laggy. Orders of magnitude matter more than exact digits: an L1 reference is a few nanoseconds, a main memory reference is a hundred nanoseconds, an SSD read is a hundred microseconds, a spinning disk seek is ten milliseconds, and a round trip across the Atlantic is about 150 milliseconds.
Latency numbers on a logarithmic scale
Those numbers are the raw material for every interview argument. When the interviewer says "we need to show the feed in under 200ms," you should immediately think: the transatlantic round trip is 150ms, so you can afford roughly one cross-region call and nothing else; everything else must come from an in-region cache or local memory. When they say "we need to serve a billion photos," you should think: each photo is maybe 300KB, so a billion photos is 300TB — that does not fit on one disk, so you need sharding or object storage.
The canonical latency table, expanded
| Operation | Time | Relative | Interview implication |
| --- | --- | --- | --- |
| L1 cache reference | 0.5 ns | 1× | Free. Every instruction does several. |
| Branch mispredict | 5 ns | 10× | Matters for tight loops, not HLD. |
| L2 cache reference | 7 ns | 14× | Still "effectively free" at HLD. |
| Mutex lock/unlock | 25 ns | 50× | Fine for per-request; bad inside hot inner loops. |
| Main memory reference | 100 ns | 200× | Baseline for "in-memory data structure". |
| Compress 1KB with Snappy | 3 μs | 6,000× | Cheaper than sending the uncompressed bytes over network. |
| Send 1KB over 1Gbps network | 10 μs | 20,000× | Sibling pod inside a datacenter. |
| Read 4KB randomly from SSD | 150 μs | 300,000× | A cache miss that hits disk still costs a round trip. |
| Read 1MB sequentially from memory | 250 μs | 500,000× | Streaming memory is much faster than random. |
| Round trip within same datacenter | 500 μs | 1,000,000× | Interview assumption: intra-DC = 0.5ms. |
| Read 1MB sequentially from SSD | 1 ms | 2,000,000× | Scanning a 1GB table from SSD = ~1s. |
| Disk seek (HDD) | 10 ms | 20,000,000× | Why rotating disks are allergic to random I/O. |
| Read 1MB sequentially from HDD | 20 ms | 40,000,000× | Sequential is fine; random is death. |
| Round trip CA → Netherlands | 150 ms | 300,000,000× | Speed-of-light floor for a transatlantic user. |
Why log scale matters
When you draw a latency bar chart on a linear axis, the cheap operations vanish. On a log axis you can see the whole spectrum at once. Candidates who draw linear latency charts in the interview immediately look like they are reasoning about performance for the first time.
Storage & Bandwidth Scales
Every design problem starts with three numbers: how many users, how many bytes per user, and how often. The rest is multiplication.
Byte scales you should memorize
1 character ≈ 1 byte (UTF-8 Latin).
1 tweet ≈ 200 bytes of text + metadata ≈ 1KB.
1 chat message ≈ 100 bytes text ≈ 500 bytes with envelope.
1 row in a normalized RDBMS table ≈ 100–500 bytes.
1 KV pair in Redis ≈ 100 bytes on the wire, 200 bytes in memory (dict overhead).
Bandwidth scales
| Pipe | Peak bps | Per second | Per day |
| --- | --- | --- | --- |
| Home fiber | 1 Gbps | 125 MB/s | 10.8 TB |
| Datacenter NIC | 25 Gbps | 3.1 GB/s | 270 TB |
| Modern server (bonded) | 100 Gbps | 12.5 GB/s | 1.08 PB |
| Top-of-rack switch | 3.2 Tbps | 400 GB/s | 34.5 PB |
| Cross-region backbone | 100 Gbps | 12.5 GB/s | 1.08 PB |
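Converting between those columns at the whiteboard is just bits → bytes → seconds → day. A minimal sketch, using the pipe names and speeds from the table above:
# Link speed -> sustained MB/s and TB/day
def pipe_budget(gbps):
    bytes_per_sec = gbps * 1e9 / 8                                  # bits to bytes
    return bytes_per_sec / 1e6, bytes_per_sec * 86_400 / 1e12      # MB/s, TB/day
for name, gbps in [("Home fiber", 1), ("Datacenter NIC", 25), ("Bonded server", 100)]:
    mb_s, tb_day = pipe_budget(gbps)
    print(f"{name:15s} {mb_s:>9,.0f} MB/s {tb_day:>9,.1f} TB/day")
# 1 Gbps -> 125 MB/s, ~10.8 TB/day; 100 Gbps -> 12,500 MB/s, ~1,080 TB/day (1.08 PB)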
QPS & Server Sizing
The back-of-envelope flow is always the same: users → requests per user per day → requests per second (QPS) → peak QPS → per-server capacity → server count. Then add headroom for failures and growth.
From DAU to peak QPS
Seconds in a day: 86,400 ≈ 10^5. Memorize this; it turns every daily number into a per-second number by dividing.
Average QPS = DAU × actions per user per day / 86,400.
Peak factor: diurnal traffic is roughly 2–5× the average. Use 3× as a safe default.
Write QPS typically 0.1–1% of read QPS for a social feed; 10–50% for chat; ~100% for IoT telemetry.
Worked example: Twitter-like feed
# Back-of-envelope for a Twitter-style feed
DAU = 300_000_000 # 300M daily active users
feed_opens_per_user = 20 # each opens the feed ~20x/day
read_qps_avg = DAU * feed_opens_per_user / 86_400
# = 300e6 * 20 / 86400 = ~70,000 QPS average reads
peak_qps = read_qps_avg * 3 # diurnal peak factor
# = ~210,000 QPS peak reads
# Per-server capacity
qps_per_server = 10_000 # a cache-backed read path comfortably does 10K QPS
servers = peak_qps / qps_per_server
# = 21 servers
# Add failure headroom: tolerate one AZ down in a 3-AZ region
servers_with_headroom = int(servers * 1.5)
# = 31 servers
# Regional: replicate per region for latency
regions = 5
total = servers_with_headroom * regions
# = 155 servers
print(f"Peak QPS: {peak_qps:,.0f} Total servers: {total}")
QPS → Server count calculator (step-through)
Per-server rules of thumb
Plain HTTP round-trip with light logic: 10,000–30,000 QPS on a modern 8-core VM.
Postgres OLTP: 1,000–5,000 QPS on a well-tuned r5.4xlarge; if queries have to go to disk, expect ~200 QPS.
Redis (single-thread): 100,000 ops/s with small values, no persistence fsync in hot path.
Kafka broker: 500 MB/s sequential write; 1–2 million small messages/s.
Nginx L7 proxy: 50,000 RPS per core with keep-alive.
A quick sanity check keeps your math honest: a rate limiter that does one Redis operation per request can serve the 200k-QPS peak above with ~2 Redis primaries, not 20.
Estimation Playbook
Interviewers are not impressed by precision; they are impressed by structure. Walk through the same four steps every time and the numbers will be whatever they will be.
Users & usage. DAU, MAU, actions per user per day, read/write mix.
Data sizes. Bytes per item, number of items per user, retention period.
QPS & bandwidth. Divide by 86,400; multiply by peak factor; multiply by average bytes.
Capacity. Divide peak by per-server capacity; add headroom; add regions.
They will ask "why 10K QPS per server and not 100K?" Know your answer: a read that hits a local cache and streams 1KB is ~0.1ms of CPU plus a network round trip; 1 core can do ~10K of those per second with keep-alive. If you are hitting the database you should say "per-server QPS drops to ~1K because now we are blocking on disk." Having the reasoning chain ready converts a panic moment into a calm defense.
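The same reasoning chain as arithmetic, a minimal sketch (the 0.1ms and 1ms per-request costs are the assumptions stated above):
# Where "10K QPS per core" and "~1K QPS when you hit the DB" come from
cache_hit_cpu_ms = 0.1                            # parse + cache lookup + serialize ~1KB
qps_per_core_cached = 1000 / cache_hit_cpu_ms     # = 10,000 per core
db_roundtrip_ms = 1.0                             # blocking query to a local database
qps_per_core_db = 1000 / db_roundtrip_ms          # = 1,000 per core
print(qps_per_core_cached, qps_per_core_db)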
The estimation playbook shows up in every case study in Module 12. Before you read those, make sure the flow DAU → QPS → bytes → servers is something you can do at a whiteboard in 90 seconds without your notes.
Module 2
CAP, PACELC & Consistency Models
The theorems that tell you what you cannot have, plus the practical consistency knobs every distributed system exposes.
CAP Theorem
Eric Brewer's CAP theorem says that any networked shared-data system can guarantee at most two out of three: Consistency, Availability, and Partition tolerance. Because network partitions are a fact of life in any multi-machine system, the practical choice is always between CP (consistency + partition tolerance, sacrificing availability during partitions) and AP (availability + partition tolerance, sacrificing consistency during partitions).
CAP triangle with example systems
Reading CAP correctly
The most common misread is "Mongo is CP so it is always consistent" or "Cassandra is AP so it never is." Both are wrong. CAP describes behavior during a partition. In the common case, both types are consistent and available. The label only tells you what happens when the network fails.
CP under partition: the minority side refuses writes (and often reads) rather than diverge. Example: ZooKeeper quorum loss, MongoDB primary election freeze.
AP under partition: both sides keep serving writes, accepting that they will conflict later. Example: DynamoDB multi-region writes, Cassandra LOCAL_ONE.
CA only exists if you assume the network cannot fail, which is true inside a rack but not across a WAN.
PACELC Extension
CAP only describes partition behavior. PACELC (Daniel Abadi, 2010) fills in the common-case gap: if there is a Partition, choose Availability vs Consistency; Else, choose Latency vs Consistency. Every real database has a four-letter code.
| System | On partition | No partition | Code |
| --- | --- | --- | --- |
| Spanner | Prefer consistency | Prefer consistency (TrueTime stall) | PC/EC |
| HBase | Prefer consistency | Prefer consistency | PC/EC |
| BigTable | Prefer consistency | Prefer consistency | PC/EC |
| MongoDB (default) | Prefer consistency | Prefer consistency | PC/EC |
| Cassandra (default) | Prefer availability | Prefer latency | PA/EL |
| DynamoDB | Prefer availability | Prefer latency | PA/EL |
| Riak | Prefer availability | Prefer latency | PA/EL |
| Amazon RDS (multi-AZ) | Prefer consistency | Prefer latency | PC/EL |
PACELC clarifies why Cassandra can feel faster than Spanner even when nothing is broken: Cassandra reads locally, Spanner waits for TrueTime certainty. The latency cost of strong consistency in the good case is what you trade away when you pick CP.
Consistency Models
"Consistency" is an overloaded word. There is a spectrum from strongest (slowest, most expensive) to weakest (fastest, cheapest), and each point has a precise formal meaning.
The spectrum
Linearizable (aka atomic): operations appear to take effect at a single point in time between their invocation and response. There is one global order, and it respects real time. Example: Spanner, single-node Postgres, etcd.
Sequential consistency: there is a single global order but it need not respect wall-clock real time. Example: early distributed systems papers; rarely exposed directly in practice.
Causal consistency: operations that are causally related (A "happens-before" B) are seen in that order by every observer; concurrent operations may be reordered. Example: Riak with dotted version vectors, COPS.
Read-your-writes: after a client writes X, subsequent reads by the same client see X. Important UX property ("did my post actually post?"). Example: DynamoDB session consistency.
Monotonic reads: a client that has read version N never subsequently reads version N−k. Prevents UI flickering backward in time.
Eventual consistency: if writes stop, replicas converge. That is all it guarantees.
Linearizable vs eventual: two clients, one key
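A common way to provide the read-your-writes guarantee above on top of an eventually consistent replica set is to pin a client's reads to the primary for a short window after that client writes. A minimal sketch, assuming a Redis client r; the window length and key names are illustrative:
import random, time
RYW_WINDOW_S = 5   # should exceed p99 replication lag
def record_write(r, user_id):
    r.setex(f"lastwrite:{user_id}", RYW_WINDOW_S, int(time.time()))
def pick_read_node(r, user_id, primary, replicas):
    # Recent writer -> primary, so they always see their own change
    if r.exists(f"lastwrite:{user_id}"):
        return primary
    return random.choice(replicas)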
Quorum math
Dynamo-style systems expose R (read replicas to contact), W (write replicas to acknowledge), and N (replication factor). When R + W > N, you get strong consistency because the read and write quorums always overlap; a new write must be seen by at least one of the nodes the read contacts.
# Quorum consistency examples for N = 3 replicas
N = 3
# Strong consistency options (R + W > N)
configs = [
("W=3 R=1", 3, 1), # write-heavy: every write fans out, reads are fast
("W=2 R=2", 2, 2), # balanced: both reads and writes tolerate 1 node down
("W=1 R=3", 1, 3), # read-heavy: fast writes, slow/expensive reads
]
for name, W, R in configs:
strong = (R + W > N)
tolerates_write_fail = (N - W)
tolerates_read_fail = (N - R)
print(f"{name:10s} strong={strong} "
f"write-fails={tolerates_write_fail} "
f"read-fails={tolerates_read_fail}")
# W=3 R=1 strong=True write-fails=0 read-fails=2
# W=2 R=2 strong=True write-fails=1 read-fails=1 <-- typical
# W=1 R=3 strong=True write-fails=2 read-fails=0
Choosing the Right Model
The choice flows from the business invariant, not from an abstract preference.
| Use case | Needed model | Why |
| --- | --- | --- |
| Bank balance / ledger | Linearizable | Double-spend is a catastrophic failure. Serializable isolation required. |
| Airline seat booking | Linearizable (per seat) | Two customers cannot claim the same seat. |
| Inventory counter | Linearizable or counters with reservations | Overselling is a real dollar loss. |
| User profile edit | Read-your-writes + eventual | The user must see their own change; other viewers can lag. |
| Social feed timeline | Eventual + causal | A reply must never appear before its parent. |
| Like / view counter | Eventual | Off-by-a-few is fine; do not block on consensus. |
| DNS records | Eventual (long TTL) | Convergence is measured in minutes; perf matters more. |
| Distributed config | Linearizable (etcd/ZooKeeper) | All nodes must agree on "who is leader" at all times. |
Notice that a single product often needs multiple models. Instagram's "like" counter can lag for minutes without anyone caring, but the "block list" must be strictly consistent or blocked users will see things they should not.
The practical rule
Default to eventual consistency, then add targeted strong-consistency surfaces where the business needs it. The reverse (start strong, weaken later) is expensive and error-prone because the whole stack already assumes a linearizable substrate. Module 6 shows how this maps to concrete database choices; Module 12 walks through how each case study picks consistency per feature.
Module 3
Scalability
Vertical vs horizontal, stateless services, session affinity, and when to reach for autoscaling.
Vertical vs Horizontal Scaling
Vertical scaling ("scale up") means making one machine bigger: more CPU, more RAM, faster disks. Horizontal scaling ("scale out") means adding more machines. The interview answer is almost always "horizontal, because vertical has a hard ceiling and a terrible failure story," but the real-world answer is often a combination.
| Dimension | Vertical | Horizontal |
| --- | --- | --- |
| Ceiling | Largest single box (a few TB RAM, hundreds of cores) | Effectively unbounded |
| Failure domain | One machine = whole system | One machine = small fraction |
| Complexity | Low — one box, one process | High — distributed coordination |
| Cost curve | Super-linear (big boxes are pricey per core) | Linear (commodity fleet) |
| State | Trivial — in-process | Must shard or externalize state |
| Typical use | Databases, in-memory analytics | Stateless web tier, microservices |
The rule of thumb: always scale out the stateless parts first, then scale up the stateful parts until you hit a wall, then shard. Vertical scaling for the stateless tier is a cost mistake; horizontal scaling for the database as a reflex (without a good sharding key) is a correctness mistake.
Scaling cost vs capacity
Stateless Services
A stateless service keeps no per-client memory between requests. Every request carries (or looks up) everything it needs. The payoff is huge: any instance can serve any request, so load balancing is trivial, failover is instant, and horizontal scaling is a deployment setting.
What "stateless" actually means
No in-process session store. Sessions go to Redis or a signed cookie.
No local caches that affect correctness. Local caches as performance hints are fine.
No local file storage that must survive a deploy. Writes go to object storage.
No sticky state in memory across requests. If it must persist, it is in a database.
# Anti-pattern: in-process session
SESSIONS = {} # dies with the pod; cannot scale out
def handle_request(req):
user = SESSIONS.get(req.cookie("session_id"))
# Every subsequent request must go to the SAME pod. Fragile.
# Better: JWT or Redis session
import json, jwt, redis
r = redis.Redis(host="session-cluster")
def handle_request_stateless(req):
token = req.cookie("session_id")
# Option A: signed JWT - no server state at all
claims = jwt.decode(token, PUBLIC_KEY, algorithms=["RS256"])
# Option B: short opaque token -> Redis lookup
user_json = r.get(f"sess:{token}")
return claims or json.loads(user_json)
Sticky Sessions & State
Sometimes you cannot avoid state. A WebSocket server, a game server, or a transcoding pipeline holds a real conversation with one client. In those cases the load balancer uses sticky sessions (also called session affinity): once a client lands on a server, all subsequent requests from that client go to the same server.
Cookie-based affinity: the LB issues a cookie; the client returns it; LB routes accordingly.
IP-hash affinity: hash(client IP) → backend. Works poorly with NAT pools (many clients, one IP).
Connection-based: for long-lived connections (WebSocket), the TCP connection itself is sticky.
Stickiness is an anti-scale pattern: if one client is hot (a whale user in a chat app), that server is now hot. Use it only when you must, and always combine with draining: when a server is to be removed, stop giving it new sessions but let old ones finish.
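A hedged sketch of cookie-based affinity with draining, assuming framework-style request/response objects; the cookie name and the DRAINING set are illustrative:
import random
DRAINING = set()   # backends scheduled for removal: no new sessions, old ones finish
def route(req, resp, backends):
    sticky = req.cookies.get("lb_backend")
    if sticky in backends and sticky not in DRAINING:
        return sticky                                   # keep the session where it lives
    choice = random.choice([b for b in backends if b not in DRAINING])
    resp.set_cookie("lb_backend", choice)               # LB-issued affinity cookie
    return choice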
Autoscaling
Autoscaling adds or removes instances based on load. The happy path looks simple; the failure modes are where interviewers probe.
Signals
CPU — simple, lagging. A 2-minute average of 70% means you are already hot.
Request rate / QPS — leading indicator if you know your per-server capacity.
Queue depth — the best signal for async workers. Depth per consumer = proxy for backlog time.
p99 latency — business signal, but noisy and high-cardinality.
Gotchas
Boot time. If a new pod takes 60s to boot, autoscaling is always late. Keep warm pools or trim images.
Thrashing. Without a cooldown or hysteresis, the cluster flaps between scale-up and scale-down.
Downstream blowback. Adding web pods can overwhelm the database. Scale the bottleneck, not the symptom.
Cold caches. Fresh pods have empty local caches; scale-up can actually increase p99 while caches warm.
# Simple HPA-style policy
def desired_replicas(current_cpu, target_cpu, current_replicas):
# Kubernetes HPA formula
ratio = current_cpu / target_cpu
return int(current_replicas * ratio + 0.5)
print(desired_replicas(80, 60, 10)) # 13 - scale up
print(desired_replicas(30, 60, 10)) # 5 - scale down
Module 4
Load Balancing
L4 vs L7, the algorithm zoo, health checks, and how modern systems handle regional and global traffic.
L4 vs L7
Load balancers split at two layers of the OSI stack. Know which one you need, because they solve very different problems.
| Feature | L4 (TCP/UDP) | L7 (HTTP/gRPC) |
| --- | --- | --- |
| Layer | Transport | Application |
| Inspects | IP + port + flags | URL, headers, cookies, body |
| Decisions per | Connection | Request (in a kept-alive connection) |
| TLS termination | Passthrough (or SNI-based routing) | Terminates here |
| Latency overhead | 10–50 μs | 200–500 μs |
| Routing rules | Static backend pool | Path, header, cookie, weight, canary |
| Examples | AWS NLB, IPVS, HAProxy L4, Envoy TCP | AWS ALB, nginx, Envoy HTTP, Traefik |
Pick L4 for raw TCP/UDP (game servers, databases, legacy protocols) or when you need absolute throughput. Pick L7 whenever you need HTTP-aware routing: path-based microservices, canary weight, mTLS to backends, rate limiting per URL, WAF integration.
Modern clouds use both in layers: a global Anycast L4 terminates TCP near the user, a regional L7 parses HTTP and routes to services, and often a service mesh L7 sidecar (Envoy) handles mTLS and per-request retries. Each layer adds 100–500 μs; three layers are still under 2ms.
Algorithms
The load-balancer algorithm decides, for each incoming connection or request, which backend gets it. Microseconds per decision is the bar — the LB is on the hot path.
Round-robin vs least-connections vs consistent hash
The canonical algorithms
Round robin. Cycle through backends. Simple, fair if backends are identical, terrible if they are not.
Weighted round robin. Weights let you handle heterogeneous backends or stage canaries.
Least connections. Send to the backend with the fewest active connections. Good for long-lived connections with variable cost.
Least response time / least latency. Like least-conn but weighted by recent response time. Biases toward healthy backends.
IP hash. hash(client_ip) mod N. Sticky but terrible under NAT and when N changes.
Consistent hashing. Maps both keys and servers onto a ring; each key goes to the next server clockwise. Minimizes re-mapping when servers join or leave. Used for cache affinity (Memcached, Cassandra, CDN POPs).
Power of two random choices (P2C). Pick two random backends, route to the lighter. Gets 95% of the benefit of "pick the absolute least loaded" with 1% of the coordination. Used by Envoy, Finagle, YARP.
import random
import hashlib
from bisect import bisect
class RoundRobin:
def __init__(self, backends): self.bs, self.i = backends, 0
def pick(self, key=None):
b = self.bs[self.i % len(self.bs)]; self.i += 1
return b
class LeastConn:
def __init__(self, backends):
self.conn = {b: 0 for b in backends}
def pick(self, key=None):
return min(self.conn, key=self.conn.get)
def on_open(self, b): self.conn[b] += 1
def on_close(self, b): self.conn[b] -= 1
class ConsistentHash:
"""150 virtual nodes per backend balances ring density."""
def __init__(self, backends, vnodes=150):
self.ring = [] # (hash, backend)
for b in backends:
for v in range(vnodes):
h = int(hashlib.md5(f"{b}-{v}".encode()).hexdigest(), 16)
self.ring.append((h, b))
self.ring.sort()
self.hashes = [h for h, _ in self.ring]
def pick(self, key):
h = int(hashlib.md5(key.encode()).hexdigest(), 16)
i = bisect(self.hashes, h) % len(self.ring)
return self.ring[i][1]
class P2C:
def __init__(self, backends):
self.conn = {b: 0 for b in backends}
def pick(self, key=None):
a, b = random.sample(list(self.conn), 2)
return a if self.conn[a] <= self.conn[b] else b
Health Checks
A load balancer is only as good as its view of which backends are healthy. Two kinds of health checks:
Passive. The LB observes real traffic. A backend that returns 5xx or times out gets marked unhealthy (outlier ejection). Free, but slow to react to silent failures.
Active. The LB periodically hits a /healthz endpoint. Configurable interval (1–10s), timeout, and success threshold (e.g., "3 in a row").
Health check design rules
A /healthz that only checks "am I running?" hides deep failures. At minimum, check the most critical dependency.
But checking all dependencies causes correlated removals: if the DB is slow, every backend is unhealthy, and the fleet empties.
Split liveness ("restart me?") from readiness ("send me traffic?"). Kubernetes learned this the hard way.
Add jitter to the interval so checks do not synchronize and hammer a dependency.
# Balanced health check
@app.get("/healthz/live")
def live():
return {"ok": True} # only checks the process is up
@app.get("/healthz/ready")
def ready():
# Ready means: I can serve traffic RIGHT NOW.
checks = {
"db": _ping(db, timeout_ms=100),
"cache": _ping(cache, timeout_ms=50),
}
# Trick: if DB is down, stay ready so that graceful error pages still render.
# Become unready only if we have lost a local dependency we cannot recover from.
return {"ok": checks["cache"]}, 200 if checks["cache"] else 503
Global & Anycast
Global load balancing puts a user on the closest (or best) region. Three techniques:
DNS-based. www.example.com has different A records per region; Route53 / NS1 geolocation picks one based on the resolver's IP. Pros: simple. Cons: TTL-limited (60–300s minimum failover), and resolvers lie.
Anycast. The same IP is advertised from multiple POPs via BGP; packets take the BGP-shortest path. Pros: instant failover, no DNS. Cons: requires running your own AS.
Application-layer routing. A tiny front-door service does a geo-lookup and 302s the client. Pros: flexible. Cons: extra round trip.
Modern CDNs use Anycast on port 443 to land the user at the closest POP, terminate TLS there, and then use L7 to pick a backend region. That is how Cloudflare, Fastly, AWS Global Accelerator, and Google Cloud Load Balancing are wired.
Module 5
Caching Layers
Where caches live, which write pattern to pick, and how to survive invalidation, stampedes, and hot keys.
Cache Hierarchy
A modern request passes through up to five cache layers before it touches the database. Each one takes a bite out of latency and cost; each one also adds complexity and correctness risk.
Browser cache. Controlled by Cache-Control headers. Free egress. Stale risk: can live for days.
CDN / edge cache. Cloudflare, Fastly, CloudFront. Pop-local SSD. Reduces origin QPS by 80–99% for static assets.
Reverse proxy cache. Varnish / Nginx in front of the app. Micro-cache popular dynamic responses (1–5s).
Application cache. Redis / Memcached in the same region. Sub-millisecond.
Database buffer pool. The DB itself caches hot pages. Still one round trip but no disk.
Request flow through cache layers
Hit-rate math
Effective latency = p_hit × t_hit + (1 - p_hit) × t_miss. For a Redis-fronted database with 90% hit rate, 1ms hit, 50ms miss: 0.9 × 1 + 0.1 × 50 = 5.9ms. Improve hit rate to 99% and you get 0.99 × 1 + 0.01 × 50 = 1.49ms. The last few percent of hit rate carry the most latency value.
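The same formula as a helper you can reuse in capacity arguments; the 1ms/50ms figures are the assumptions from the example above:
def effective_latency(hit_rate, t_hit_ms=1.0, t_miss_ms=50.0):
    return hit_rate * t_hit_ms + (1 - hit_rate) * t_miss_ms
for p in (0.90, 0.99, 0.999):
    print(f"hit rate {p:.1%} -> {effective_latency(p):.2f} ms")
# 90.0% -> 5.90 ms, 99.0% -> 1.49 ms, 99.9% -> 1.05 ms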
Write-Through / Back / Around
Every write to cached data forces a choice about who writes first and how.
| Pattern | Write path | Pro | Con |
| --- | --- | --- | --- |
| Cache-aside (lazy) | App writes DB; invalidates cache; next read fills | Simple; resilient if cache dies | First read after write is slow; two round trips |
| Write-through | App writes cache; cache writes DB synchronously | Cache always warm and consistent | Every write pays double latency; DB outage blocks writes |
| Write-behind (write-back) | App writes cache; cache buffers DB write asynchronously | Fastest writes; coalesces updates | Data loss on cache crash; complex operator |
| Write-around | App writes DB; cache is not populated on write | Avoids cache churn on write-once data | Cold reads; only good when reads are rare or deterministic |
| Refresh-ahead | Cache proactively refreshes near TTL expiry | No cold reads; smooth latency | Extra load; needs access prediction |
Cache-aside read/write dance
# Cache-aside pattern - the interview default
import redis, json
r = redis.Redis()
def get_user(user_id, db):
key = f"user:{user_id}"
cached = r.get(key)
if cached:
return json.loads(cached)
row = db.query_one("SELECT * FROM users WHERE id=%s", user_id)
if row:
# 5-minute TTL bounds staleness even if invalidation is missed
r.setex(key, 300, json.dumps(row))
return row
def update_user(user_id, patch, db):
db.execute("UPDATE users SET ... WHERE id=%s", user_id)
# Delete (not SET) - avoids a race where our stale SET beats
# another writer's SET.
r.delete(f"user:{user_id}")
Invalidation Strategies
"There are only two hard things in computer science: cache invalidation and naming things." — Phil Karlton. Five mechanisms, each with a cost.
TTL. Every entry expires after N seconds. Cheapest mechanism. Upper bound on staleness = TTL.
Explicit delete on write. App deletes cache key whenever it updates the DB. Fragile under replication.
Versioned keys. Key contains a version: user:42:v17. On write, bump version. Old entries become unreachable and get evicted naturally.
Tag-based invalidation. Each key carries tags; a write invalidates all keys with a matching tag. Varnish, Fastly, and most CDNs expose this.
CDC-based invalidation. A change-data-capture pipeline (Debezium, DynamoDB Streams) emits a stream of DB changes; a consumer invalidates cache entries. Most robust; most complex.
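A minimal sketch of the versioned-key pattern above; key and table names are illustrative:
import json
def user_key(r, user_id):
    v = int(r.get(f"ver:user:{user_id}") or 0)
    return f"user:{user_id}:v{v}"
def get_user(r, db, user_id):
    key = user_key(r, user_id)
    cached = r.get(key)
    if cached:
        return json.loads(cached)
    row = db.query_one("SELECT * FROM users WHERE id=%s", user_id)
    r.setex(key, 300, json.dumps(row))
    return row
def update_user(r, db, user_id, patch):
    db.execute("UPDATE users SET ... WHERE id=%s", user_id)
    r.incr(f"ver:user:{user_id}")   # old user:{id}:vN entries become unreachable and expire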
Stampedes & Failures
Caches amplify problems as often as they solve them.
Cache stampede (dog-piling)
A hot key expires. Thousands of requests simultaneously miss, all hit the DB, and the DB melts. Solutions:
Single-flight / request coalescing: only one request per key goes to the DB; others wait on the same future.
Soft-TTL + jitter: start refreshing before expiry; add 10–20% random jitter so expiries do not synchronize.
Probabilistic early expiration: each read has a small, rising chance of triggering a refresh as TTL approaches.
# Single-flight (simplified) - Go's golang.org/x/sync/singleflight in Python
import threading
class SingleFlight:
    def __init__(self):
        self.lock = threading.Lock()
        self.in_flight = {}   # key -> Event
        self.results = {}     # key -> value
    def do(self, key, fn):
        with self.lock:
            ev = self.in_flight.get(key)
            leader = ev is None
            if leader:
                ev = threading.Event()
                self.in_flight[key] = ev
        if not leader:
            ev.wait()                  # another caller is already fetching this key
            return self.results[key]
        try:
            val = fn()                 # only the leader goes to the database
            self.results[key] = val
            return val
        finally:
            with self.lock:
                del self.in_flight[key]
            ev.set()                   # wake every waiter (error propagation omitted)
Cache penetration
Queries for keys that do not exist always miss and go to the DB. Mitigation: negative caching (cache the miss with short TTL) and Bloom filters (reject known-non-existent keys before the DB round trip).
Hot key
One celebrity account gets 1M QPS; one Redis shard cannot handle it. Mitigation: key splitting (post:42#shard=random(0..9) replicated to 10 shards, aggregate on read), local in-process cache for ultra-hot entries, read replicas.
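A minimal sketch of key splitting for a read-hot value; the copy count and key format are illustrative:
import json, random
N_COPIES = 10   # replicate one ultra-hot key across 10 cache keys/shards
def write_hot(r, key, value):
    payload = json.dumps(value)
    for i in range(N_COPIES):
        r.set(f"{key}#shard={i}", payload)
def read_hot(r, key):
    i = random.randrange(N_COPIES)          # spread reads across the copies
    return json.loads(r.get(f"{key}#shard={i}"))
# For counters, invert it: increment a random shard on write, sum the shards on read.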
Cache avalanche
Redis cluster loses a node; hit rate drops from 99% to 60%; DB load jumps 30–40×. Mitigation: over-provision the DB for the cache-empty worst case, or at least for a 50% miss rate during failover.
Module 6
Databases in High-Level Design
SQL vs NoSQL vs NewSQL framed as a choice of workload, plus the replication and sharding decisions that follow.
SQL vs NoSQL vs NewSQL
Stop thinking of SQL/NoSQL as "old vs new." Think of them as different shapes of workload, each optimal for different access patterns.
| Family | Shape | Strength | Weakness | Examples |
| --- | --- | --- | --- | --- |
| Relational (OLTP) | Fixed schema, joins, transactions | Correctness, flexibility of query | Horizontal scaling is hard | Postgres, MySQL, SQL Server |
| Key-value | Opaque blobs by key | Blazing fast, trivially sharded | No query beyond the key | DynamoDB, Redis, RocksDB |
| Wide-column | Rows with dynamic columns, clustered by key | Huge write throughput, time-series friendly | Weak secondary indexes, no joins | Cassandra, HBase, Bigtable |
| Document | JSON-ish per document | Schema flex, single-doc ACID | Cross-document consistency is DIY | MongoDB, Couchbase, Firestore |
| Graph | Nodes + edges | Arbitrary-depth relationships | Batch analytics are awkward | Neo4j, Neptune, JanusGraph |
| Columnar / OLAP | Columns stored separately | Aggregations over billions of rows | Single-row lookups are slow | BigQuery, Snowflake, Redshift, ClickHouse |
| Search | Inverted index | Full-text, faceting, ranking | Not a source of truth | Elasticsearch, OpenSearch, Solr |
| NewSQL | SQL API on distributed storage | Scale + ACID + SQL | Heavy to operate, expensive | Spanner, CockroachDB, YugabyteDB, TiDB |
| Time-series | Append-only by time | High ingest, retention policies | Point lookups ok but not the sweet spot | InfluxDB, Prometheus, TimescaleDB |
| Ledger / immutable | Append-only, verifiable | Audit + tamper evidence | Niche, expensive | QLDB, Fauna |
Replication
Replication keeps multiple copies of the data across machines or regions. It buys you durability, read capacity, and regional latency, but costs you write latency and storage.
Topologies
Single-leader (primary-replica). All writes go to the leader; reads fan out to followers. Simple; leader is a single point of failure until failover.
Multi-leader. Several regions can accept writes independently. Must resolve conflicts (LWW, CRDTs, application logic). Examples: CockroachDB multi-region, Active-Active Postgres via BDR.
Leaderless (quorum). Any replica accepts any op; correctness from R + W > N. Dynamo/Cassandra.
Synchronous vs asynchronous
Synchronous. Leader waits for replicas to ack before returning. Zero data loss on failover; write latency = round-trip to slowest required replica.
Asynchronous. Leader returns immediately; replicas catch up. Fast writes; potential data loss if leader dies before replication.
Semi-synchronous. At least one replica must ack. MySQL default. Good compromise.
If an interviewer says "zero RPO" (recovery point objective), that is code for "synchronous replication, minimum 2 replicas." If they say "5-minute RPO," asynchronous with reliable monitoring is fine.
Leader-follower replication timeline
Sharding
Sharding (horizontal partitioning) splits data across multiple databases by key. Each shard is independent, so reads and writes scale linearly — as long as your queries include the shard key.
Shard key strategies
| Strategy | Formula | Pros | Cons |
| --- | --- | --- | --- |
| Range | key < A → shard 1 | Range queries work; ordered iteration | Hotspots (recent time) |
| Hash | hash(key) mod N | Even distribution | No range queries; reshard is expensive |
| Consistent hash | Ring with virtual nodes | Minimal key movement on reshard | Complex; uneven without vnodes |
| Geo / directory | Lookup table | Flexible; compliance-friendly | Directory is a new SPOF |
Cross-shard problems
Queries that do not include the shard key must scatter-gather to all shards. Avoid or maintain a secondary index.
Cross-shard transactions require two-phase commit or sagas. 2PC is slow and locks resources; sagas give up atomicity for availability.
Rebalancing when you add a shard can move enormous data. Consistent hashing and double-writes during migration keep it online.
Hot shards when one key is much more active. Use salted keys or sub-sharding.
# Consistent hashing with virtual nodes to avoid hot shards on reshard
import hashlib
from bisect import bisect
class ShardRing:
def __init__(self, shards, vnodes=256):
self.ring = [] # sorted (hash, shard)
for s in shards:
for v in range(vnodes):
h = int(hashlib.sha1(f"{s}:{v}".encode()).hexdigest(), 16)
self.ring.append((h, s))
self.ring.sort()
self._hashes = [h for h, _ in self.ring]
def shard_for(self, key):
h = int(hashlib.sha1(str(key).encode()).hexdigest(), 16)
i = bisect(self._hashes, h) % len(self.ring)
return self.ring[i][1]
def add_shard(self, new_shard, vnodes=256):
for v in range(vnodes):
h = int(hashlib.sha1(f"{new_shard}:{v}".encode()).hexdigest(), 16)
self.ring.append((h, new_shard))
self.ring.sort()
self._hashes = [h for h, _ in self.ring]
# On add: only ~1/N keys move, not all keys.
Decision Tree
A compact choose-your-DB heuristic for the interview whiteboard:
Database selection decision tree
Ninety percent of whiteboard problems land on Postgres + Redis + Kafka + object storage. Reach for Cassandra when write throughput is measured in millions per second and there is a natural partition key. Reach for NewSQL when SQL matters AND scale matters AND you can afford to operate it.
Module 7
Messaging & Streaming
Queues vs pub/sub, the Kafka log model, delivery semantics, and how backpressure stops cascading failures.
Queues vs Pub/Sub
Message-oriented middleware solves three very different problems, and the wrong choice shows up as mystery data loss six months later.
| Pattern | Delivery | Consumers per message | Typical use | Examples |
| --- | --- | --- | --- | --- |
| Work queue | One of N workers | Exactly one | Background jobs, image processing | SQS, RabbitMQ, Celery |
| Pub/sub topic | Broadcast | Every subscriber gets a copy | Event notifications, cache busting | Redis Pub/Sub, SNS, Pulsar topics |
| Event log | Append + replay | Any consumer can replay from any offset | Event sourcing, analytics, CDC | Kafka, Kinesis, Pulsar |
| Task scheduler | Time-delayed | Exactly one at time T | Cron, retries, reminders | EventBridge, Cloud Scheduler, Temporal |
The big confusion: Kafka is not a queue. Kafka is a durable, replayable, ordered log with consumer offsets. A consumer can replay history, form groups, and share partitions — none of which a classic queue supports.
Kafka Log Model
Kafka is an append-only, partitioned, replicated log. Three words carry all the meaning:
Append-only. Writes are sequential disk I/O — 100× faster than random. Brokers easily sustain 500MB/s per drive.
Partitioned. Each topic is N partitions. Messages with the same key go to the same partition (and therefore are strictly ordered).
Replicated. Each partition has a leader and R−1 followers. A write is acknowledged once min.insync.replicas replicas have it.
Kafka partition + consumer group rebalance
Partitioning discipline
Pick a partition key that balances load. user_id if users are uniform; tenant-salted if one tenant dominates.
Number of partitions caps parallelism: active consumers in a group cannot exceed the partition count. Provision roughly 3× the expected maximum consumer count.
Never change partition count lightly. It reshuffles key-to-partition mapping and breaks ordering guarantees in flight.
Retention matters. Default 7 days. Infinite retention turns Kafka into an event store; plan disk accordingly.
# Kafka producer - synchronous, with required acks
from kafka import KafkaProducer
producer = KafkaProducer(
bootstrap_servers=["broker-1:9092","broker-2:9092","broker-3:9092"],
acks="all", # wait for all ISRs
enable_idempotence=True, # dedup on retry
max_in_flight_requests_per_connection=5,
retries=10,
compression_type="snappy",
)
def publish_signup(user_id, payload):
# Key by user_id so all events for a user land on one partition
# and are consumed in order.
future = producer.send(
"user-signups",
key=str(user_id).encode(),
value=json.dumps(payload).encode(),
)
md = future.get(timeout=5)
# Returning offset lets the caller log "committed at offset 1234"
return (md.topic, md.partition, md.offset)
Delivery Semantics
Three levels. Each is a distinct engineering choice.
At most once: message may be lost, never duplicated. Fire-and-forget fits best for high-volume low-value telemetry where losing 0.1% is fine.
At least once: message always delivered, may be duplicated. The default for most queues. Consumers must be idempotent.
Exactly once: message delivered exactly once end-to-end. Achievable only with atomic commit across broker and consumer store. Kafka transactions + Kafka Streams provide it, but it is slower and more fragile.
Idempotency is the real answer
Even with "exactly once" semantics, network retries, operator reprocessing, and client timeouts will cause duplicates. Build the consumer to tolerate them.
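A minimal sketch of an idempotent consumer, assuming each event carries a stable event_id and a Redis client r is available; the table and field names are illustrative:
def handle_payment_event(r, db, event):
    # First delivery wins: SET NX acts as a dedup guard for retries and replays
    first = r.set(f"processed:{event['event_id']}", 1, nx=True, ex=7 * 86_400)
    if not first:
        return "duplicate - side effect already applied"
    db.execute("INSERT INTO payments (id, amount) VALUES (%s, %s)",
               (event["event_id"], event["amount"]))
    return "applied"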
Backpressure
Backpressure is how a slow downstream tells a fast upstream to slow down. Without it, the fast side fills its queues, spills to disk, runs out of memory, and crashes.
Credit-based. Downstream issues N credits; upstream sends at most N in flight. gRPC flow control, reactive streams.
Queue-depth-based. Upstream checks queue depth; throttles or sheds load when full. Shed earlier to avoid memory blow-ups.
Consumer lag. In Kafka, lag (committed offset vs latest offset) is the honest backpressure signal. Alert when lag exceeds SLA.
Reject with 429 / 503. At the ingress, return "too busy" with Retry-After and let the client back off.
Load shedding patterns
Priority queues: shed low-priority first.
Admission control: reject new work when at capacity; complete in-flight work.
Circuit breakers: stop calling a flaky downstream; fall back.
Bulkheads: isolate thread pools so one bad tenant cannot starve others.
# Token bucket admission control at the ingress
import time
class TokenBucket:
def __init__(self, capacity, refill_per_sec):
self.cap = capacity
self.refill = refill_per_sec
self.tokens = capacity
self.last = time.monotonic()
def take(self, n=1):
now = time.monotonic()
self.tokens = min(self.cap, self.tokens + (now - self.last) * self.refill)
self.last = now
if self.tokens >= n:
self.tokens -= n
return True
return False
tb = TokenBucket(capacity=1000, refill_per_sec=500)
def handle(req):
if not tb.take():
return Response(status=429, headers={"Retry-After": "1"})
return process(req)
Module 8
Storage
Object, block, file. Tiers from hot to glacial. Where the CDN fits and why durability is measured in nines.
Object / Block / File
| Type | Access | Good for | Bad for | Examples |
| --- | --- | --- | --- | --- |
| Object | HTTP GET/PUT by key | Blobs, backups, static assets, data lake | Partial updates, low-latency writes | S3, GCS, Azure Blob |
| Block | Attached volume (SCSI/NVMe-oF) | Databases, filesystems, VMs | Shared access, massive scale | EBS, Persistent Disk, local NVMe |
| File | POSIX over NFS/SMB | Shared mutable files, lift-and-shift apps | Massive scale, high concurrency | EFS, Filestore, Azure Files |
Object storage is the default for anything bigger than ~1MB that does not need byte-level mutation. Block storage is for anything that needs a real filesystem beneath a real process (databases, caches with persistence). File storage is the "we could not refactor this legacy app" tier.
Durability numbers
S3 Standard: 11 nines (99.999999999%). 1 object lost per 10,000 objects per 10 million years.
EBS gp3: roughly 99.8–99.9% annual durability per volume (0.1–0.2% AFR). Snapshot to S3 for long-term durability.
Glacier Deep Archive: 11 nines; retrieval 12+ hours; $1/TB/mo.
Hot / Warm / Cold Tiers
Not all data deserves SSD. Move it down a tier and save 10× in cost per month.
Storage tier cost vs retrieval latency
| Tier | Example (AWS) | $/GB/mo | Retrieval | Use case |
| --- | --- | --- | --- | --- |
| Hot memory | ElastiCache Redis | ~$0.50 | <1ms | Session, top lists |
| Hot block | EBS gp3 | $0.08 | 1ms | Database volumes |
| Hot object | S3 Standard | $0.023 | Tens of ms | Images, web assets |
| Warm | S3 IA / One Zone-IA | $0.0125 | Tens of ms | Backups, older logs |
| Cold | Glacier Instant | $0.004 | Tens of ms (pay to retrieve) | Compliance, rarely read |
| Frozen | Glacier Deep Archive | $0.00099 | 12+ hours | Regulatory, tape replacement |
Lifecycle rules automate the transition. A typical log retention policy: Standard for 30 days → IA for 90 days → Glacier for 7 years → delete. Move 1 PB of logs and you save ~$200k/year compared to keeping everything hot.
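The savings math behind that policy, a rough sketch using the $/GB/mo figures from the table above (the month split is an assumption, not a quoted AWS price):
# 1 PB of logs: keep everything hot vs tier it down over the year
GB = 1_000_000                                    # 1 PB expressed in GB
standard, ia, deep = 0.023, 0.0125, 0.00099       # $/GB/mo from the tier table
all_hot = GB * standard * 12
tiered = GB * (standard * 1 + ia * 2 + deep * 9)  # 1 mo hot, 2 mo IA, 9 mo deep archive
print(f"hot all year ${all_hot:,.0f}  tiered ${tiered:,.0f}  saved ${all_hot - tiered:,.0f}")
# ~$276,000 vs ~$57,000 -> roughly $200K+ saved per year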
CDN & Edge
A CDN (content delivery network) is a globally distributed reverse-proxy + cache. POPs (points of presence) near users serve assets; a cache miss goes back to origin. The economics: origin egress is $0.09/GB, CDN egress is $0.01–0.05/GB, with 99% hit rate the savings dominate.
What belongs at the edge
Static assets (JS, CSS, images, video chunks): cache forever with content-hashed filenames.
API responses that are user-agnostic (catalog, search facets, trending): short TTL micro-cache.
Personalized responses: generally not cacheable, but edge compute (Cloudflare Workers, Lambda@Edge) can do per-user assembly near the user.
Uploads: some CDNs accept writes and copy to origin; reduces upload latency for global users.
Module 9
APIs
REST vs gRPC vs GraphQL, the idempotency discipline, and how rate limiters actually work.
REST
REST is the lingua franca of public APIs: resources as nouns, HTTP verbs as actions, JSON bodies, and a discipline of hypermedia that almost nobody practices. Pragmatic REST is what every SaaS exposes: GET /users/42, POST /users, DELETE /users/42, and a handful of custom verbs where the resource model gets awkward.
HTTP methods: GET/POST/PUT/PATCH/DELETE; PUT is idempotent, POST is not.
Status codes: 200 OK, 201 Created, 204 No Content, 400 bad request, 401 unauth, 403 forbidden, 404 not found, 409 conflict, 429 throttled, 5xx server.
Versioning: path (/v1/), header (Accept: application/vnd.co.v2+json), or query (?v=2). Path is least clever and therefore best.
Pagination: cursor-based (opaque tokens, stable) over offset (breaks under insert/delete).
Auth: OAuth2 for user, API keys for server-to-server, mTLS for internal.
# Cursor-based pagination beats offset for stability
@app.get("/events")
def list_events(cursor: str = None, limit: int = 50):
q = "SELECT * FROM events WHERE id > %s ORDER BY id LIMIT %s"
rows = db.query(q, (decode_cursor(cursor) or 0, limit))
next_cursor = encode_cursor(rows[-1].id) if len(rows) == limit else None
return {"items": rows, "next": next_cursor}
gRPC
gRPC is Google's schema-first RPC framework over HTTP/2 with Protocol Buffers. The trade-offs:
Strongly typed: codegen in 10+ languages, compile-time errors.
Binary wire format: 3–10× smaller than JSON, parses ~10× faster.
Streaming: server streaming, client streaming, bidi streaming all first-class.
HTTP/2 multiplexing: many concurrent calls per connection; no head-of-line blocking.
Not browser-friendly: needs gRPC-Web or Connect proxy for browsers.
Harder to debug: not human-readable; need a gRPC-aware tool (grpcurl, Postman).
Use gRPC for internal service-to-service where you control both ends and throughput matters; use REST for external APIs and browser-facing surfaces. If you need both, define the service in proto and generate a REST gateway.
GraphQL
GraphQL lets the client specify exactly the fields it wants, in one request. It solves the "mobile app needs 6 REST calls to render one screen" problem.
One endpoint, POST queries and mutations.
Typed schema: IDL describes every object and field.
Single round trip for nested data.
N+1 resolver problem: naive resolvers do one DB call per parent, which explodes under load; solved by DataLoader batching.
Caching is harder: no natural URL-keyed cache; responses often unique per client.
Security: queries can be arbitrarily deep; enforce depth limits and query cost budgets.
Idempotency
An operation is idempotent if applying it N times has the same effect as applying it once. Every write API that clients retry must be idempotent or you will get duplicate charges, double bookings, and angry customers.
Idempotency keys
The classic pattern: client generates a UUID per intent; server deduplicates.
# Server-side idempotency with a Redis dedup window
def create_payment(request, idem_key):
# Look up prior result within a 24h window
cached = r.get(f"idem:{idem_key}")
if cached:
return json.loads(cached) # return the original result
# Acquire a short lock to dedup concurrent retries
got_lock = r.set(f"idem:lock:{idem_key}", "1", nx=True, ex=10)
if not got_lock:
# Another worker is processing - wait, poll result
raise Retry(after_ms=100)
try:
result = charge_card(request)
r.setex(f"idem:{idem_key}", 86400, json.dumps(result))
return result
finally:
r.delete(f"idem:lock:{idem_key}")
Stripe's idempotency header is the industry reference: Idempotency-Key: uuid, 24h retention, and a retry of the same request returns the original response (including the status code). Reusing a key with different parameters is rejected as a conflict (409) rather than replayed.
Rate Limiting
Every public API is rate-limited. The algorithms:
Token bucket rate limiter
Algorithms
Fixed window. count(key, minute_bucket). Simple, but allows 2× burst at window boundary.
Sliding window log. Keep last-60s timestamps in a sorted set; count. Accurate; memory-heavy.
Sliding window counter. Weighted average of current and previous window counter. Cheap approximation.
Token bucket. Tokens refill at R/sec up to C; each request takes 1. Allows bursts up to C. The workhorse.
Leaky bucket. Queue with fixed drain rate. Smoothes output (no bursts); may add latency.
# Redis sliding-window counter (approximate, cheap)
def allow(key, limit_per_min):
now = int(time.time())
cur_bucket = now // 60
prev_bucket = cur_bucket - 1
cur = int(r.get(f"rl:{key}:{cur_bucket}") or 0)
prev = int(r.get(f"rl:{key}:{prev_bucket}") or 0)
elapsed = now % 60
# Weighted rate: current counter + (1 - elapsed/60) of previous
rate = cur + prev * ((60 - elapsed) / 60)
if rate >= limit_per_min:
return False
r.incr(f"rl:{key}:{cur_bucket}")
r.expire(f"rl:{key}:{cur_bucket}", 120)
return True
Returning throttled responses
Always set X-RateLimit-Limit, X-RateLimit-Remaining, and Retry-After. Clients can self-pace; your support team stops receiving confused tickets.
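A minimal sketch of attaching those headers, reusing the allow() limiter above; the limit value, req.api_key, and the Response shape are illustrative:
def rate_limit_headers(limit, remaining, retry_after_s=None):
    headers = {"X-RateLimit-Limit": str(limit),
               "X-RateLimit-Remaining": str(max(0, remaining))}
    if retry_after_s is not None:
        headers["Retry-After"] = str(retry_after_s)
    return headers
def handle(req):
    if not allow(req.api_key, limit_per_min=600):
        return Response(status=429, headers=rate_limit_headers(600, 0, retry_after_s=30))
    return process(req)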
Module 10
Search & Recommendations
Inverted indexes, ranking heuristics, vector search, and how modern recommenders glue them together.
Inverted Index
A database's B-tree answers "give me rows where id = 42." An inverted index answers "give me documents that contain the word database," which is a fundamentally different shape.
Term dictionary: sorted list of every term with document frequency.
Posting list: per term, list of (doc_id, term_frequency, positions), usually delta-encoded and compressed.
Skip lists: for fast AND / OR intersections without scanning every posting.
Inverted index lookup
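A toy inverted index makes the structure above concrete. A minimal sketch; real engines add positions, compression, and skip lists:
from collections import defaultdict
def build_index(docs):
    index = defaultdict(set)                   # term -> posting set of doc ids
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index
def search_and(index, query):
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()
docs = {1: "the log structured merge tree",
        2: "the database buffer pool",
        3: "merge sort in the database"}
print(search_and(build_index(docs), "database merge"))   # {3}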
Ranking: TF-IDF and BM25
Relevance = how well the document matches the query. The canonical formulas:
TF (term frequency): how often the term appears in the doc. More = more relevant.
IDF (inverse document frequency): rare terms are more informative than common ones.
BM25: TF-IDF with length normalization and a saturating TF curve. Default scorer in Lucene/Elasticsearch.
import math
def bm25(term_freq, doc_len, avg_doc_len, doc_freq, total_docs, k1=1.5, b=0.75):
idf = math.log((total_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
tf_norm = (term_freq * (k1 + 1)) / (term_freq + k1 * (1 - b + b * doc_len / avg_doc_len))
return idf * tf_norm
# Per-doc score = sum of BM25 over query terms
Ranking & Relevance
Textual match is only the first pass. Production search layers two, sometimes three, ranking stages:
Candidate generation: lexical search returns top ~1000 docs using BM25. Fast, imprecise.
Feature-rich re-ranking: a learned model (GBDT, neural net) scores the 1000 using hundreds of features (freshness, clicks, personalization, authority, time-of-day).
The split is because re-ranking is expensive per doc (100 μs to 1 ms), so you want the candidate set to be small but not so small you lose recall.
Evaluation metrics
Precision@k: fraction of top-k results that are relevant.
Recall@k: fraction of all relevant results that are in top-k.
MRR (mean reciprocal rank): how high is the first relevant result on average.
NDCG (normalized discounted cumulative gain): sum of graded relevance, discounted by log position.
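Minimal reference implementations of the first three metrics above; the input shapes (a ranked list of doc ids and a set of relevant ids) are illustrative:
def precision_at_k(ranked, relevant, k):
    return sum(1 for d in ranked[:k] if d in relevant) / k
def recall_at_k(ranked, relevant, k):
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)
def mrr(ranked, relevant):
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1 / i
    return 0.0
ranked, relevant = ["a", "b", "c", "d"], {"b", "d"}
print(precision_at_k(ranked, relevant, 3),   # 0.33
      recall_at_k(ranked, relevant, 3),      # 0.5
      mrr(ranked, relevant))                 # 0.5 (first relevant doc at rank 2)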
Vector Search
Embed documents and queries into N-dimensional vectors (typically 384–1536 dims) with a learned model; find nearest neighbors by cosine similarity or L2 distance. This catches synonyms and semantic matches that lexical search misses.
ANN (approximate nearest neighbor) algorithms
IVF (inverted file): cluster vectors into K buckets; search the closest M buckets. Good up to ~100M vectors.
HNSW (hierarchical navigable small world): multi-layer graph; O(log N) per query with high recall. The default in Pinecone, Weaviate, Qdrant, pgvector.
ScaNN / Annoy: tree-based, ideal for read-heavy static corpora.
Product Quantization: compress vectors 8–32×; trade recall for memory.
# pgvector query shape
"""
SELECT id, text, embedding <-> %s AS distance
FROM documents
ORDER BY embedding <-> %s
LIMIT 10;
"""
# With an HNSW index: ~5-20ms on 10M vectors, recall >= 95%.
Hybrid search
The best retrieval combines lexical (BM25) and vector (cosine) scores. Reciprocal Rank Fusion (RRF) is the simplest merge: for each result, score = sum(1 / (k + rank_in_list)). RRF works even when you cannot normalize the two scores.
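RRF in a few lines; k=60 is the conventional default constant:
def rrf(result_lists, k=60):
    # result_lists: one ordered list of doc ids per retriever (lexical, vector, ...)
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
lexical = ["d3", "d1", "d7"]     # BM25 order
semantic = ["d1", "d9", "d3"]    # vector-similarity order
print(rrf([lexical, semantic]))  # d1 and d3 surface to the top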
Recommendations
Recommendation systems answer "what should this user see next?" The modern stack is similar to search: candidate generation → re-ranking.
Candidate generators: collaborative filtering ("users like you also liked"), content-based ("items similar to what you already liked"), popularity, trending, recent.
Two-tower model: user tower encodes user features, item tower encodes item features, dot product scores. Trained on clicks; served as an ANN index.
Re-ranker: a full-context model (gradient boosted tree or transformer) with hundreds of features per (user, item) pair.
Exploration: ε-greedy or Thompson sampling to show some non-obvious items so the system keeps learning.
Feedback loop trap
A recommender trained only on clicks over-rewards clickbait. Mix in explicit feedback (likes, follows, purchases, dwell time), negative signals (skips, hides), and diversity constraints.
Module 11
Observability
The three pillars, the SLI/SLO/SLA discipline, and what "good" on-call looks like.
Metrics / Logs / Traces
Observability is the superset of monitoring. You monitor known-unknowns (CPU, QPS, error rate); you observe unknown-unknowns (why did p99 spike for users on iOS 17 in Brazil at 03:15 UTC). The three pillars each answer a different question.
| Pillar | Answers | Cost model | Cardinality | Tools |
| --- | --- | --- | --- | --- |
| Metrics | Is anything wrong right now? | Cheap (aggregated time series) | Low (5–50 series per metric) | Prometheus, DataDog, CloudWatch |
| Logs | What exactly happened on this one event? | Expensive (every event stored) | High (free-form) | Elastic, Splunk, Loki, CloudWatch Logs |
| Traces | Where did this request spend its time? | Medium (sampled) | Medium | Jaeger, Tempo, X-Ray, DataDog APM |
Distributed trace flame graph (span timeline)
Golden signals (SRE book)
Latency: how long requests take. Track p50, p95, p99 separately.
Traffic: QPS, bytes/s, active connections.
Errors: 5xx rate, 4xx rate, exceptions per second.
Saturation: CPU, memory, disk I/O, queue depth, DB pool utilization.
RED vs USE
Two mnemonic dashboards. RED (Rate / Errors / Duration) for request-driven services. USE (Utilization / Saturation / Errors) for resources (CPU, disk, queue). Most production teams wire both.
SLIs, SLOs & SLAs
Three nested concepts. Get them right and your on-call rotation stops firefighting "my CPU is at 70%."
SLI (Service Level Indicator). A measurable thing. Example: "fraction of HTTP 200 responses under 300ms."
SLO (Service Level Objective). A target for the SLI. Example: "99.9% of requests over a 30-day window."
SLA (Service Level Agreement). Contract with customers, with teeth (refunds). Usually weaker than SLO.
Error budget
If your SLO is 99.9%, you are allowed to be "bad" 0.1% of the time — that is 43 minutes per 30-day window. That 43 minutes is your error budget. Spend it on deploys, experiments, and tolerated failures. If you blow it, stop shipping features and invest in reliability until the budget replenishes.
# Error budget burn rate
slo_target = 0.999
window_minutes = 30 * 24 * 60 # 43200
budget_minutes = window_minutes * (1 - slo_target) # 43.2
# A 10% burn in 1 hour means we'll exhaust the budget in 10 hours.
# Alert threshold: "fast burn" = 2% of budget in 1 hour = ~1% error rate for an hour.
Runbooks & On-Call
An alert without a runbook is a 3am puzzle. Every paging alert should link to a page that describes:
What this alert means in business terms ("checkout is broken for US East").
How to confirm (what dashboard, what log query).
How to mitigate (failover command, circuit breaker flip, feature flag off).
How to escalate (who owns this service, what Slack channel).
How to root-cause (trace query, recent deploys link).
Alert quality rules
Alert on symptoms, not causes. "Orders per minute dropped 50%" is a symptom; "db_conn_pool_exhausted=true" is a cause and may be noise.
Page only on things that need human action within 30 min. Everything else is a ticket.
No flapping. Alert hysteresis: require 3 consecutive breaches before paging; 5 consecutive normal to clear.
Postmortem every page. Was the alert justified? Was the runbook enough?
Module 12
Case Studies
Ten canonical interview problems: scope, high-level design, hardest sub-problem, back-of-envelope, and where each one commonly fails.
URL Shortener (bit.ly, TinyURL)
Scope. Accept a long URL, return a short alias (e.g. bit.ly/abc1234). On GET of the short URL, 301 to the original. Optional: per-link analytics, custom aliases, expiry.
Requirements. 100M new URLs/month = ~40 URLs/sec average, ~120/sec peak. Redirects are ~100× that: ~12k RPS peak. 5-year horizon = 6B total URLs. Read/write ratio 100:1. Redirect p99 < 100ms.
URL shortener high-level design
Hardest sub-problem: key generation
How do you generate a unique 7-character base-62 key without a central bottleneck? Three approaches:
Hash of URL (SHA-256 truncated to 7 base-62 chars). Fast, deterministic. Collisions at scale: with 62^7 ≈ 3.5×10^12 possible keys, the birthday bound makes a first collision likely after only a few million URLs, so every insert must detect the conflict and re-hash with a salt.
Counter-based. Central counter → base62-encode. One DB row is a bottleneck. Partition the counter (each shard owns a range).
Pre-generated key pool. Batch job pre-allocates keys in a table; API pops the next free one. Decouples write path from key generation.
The pragmatic answer in production: each write-API shard reserves a 1000-key range from a central atomic counter (so one DB write every 1000 URLs), then hands out keys in-process. Simple, fast, rebalances on pod restart.
# Key generation: range reservation per pod
import threading
BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
class KeyAllocator:
def __init__(self, db):
self.db = db
self.lock = threading.Lock()
self.start = 0
self.end = 0
def _reserve(self, size=1000):
# Atomic: UPDATE counters SET next = next + size RETURNING next - size
row = self.db.fetch_one(
"UPDATE shortener_counter SET next = next + %s RETURNING next - %s",
(size, size))
self.start, self.end = row[0], row[0] + size
def next(self):
with self.lock:
if self.start >= self.end:
self._reserve()
v = self.start
self.start += 1
return self._encode(v)
@staticmethod
def _encode(n):
if n == 0: return "0"
digits = []
while n:
digits.append(BASE62[n % 62])
n //= 62
return "".join(reversed(digits)).rjust(7, "0")
Back-of-envelope
Storage: 6B URLs × 500 bytes (long URL + metadata) = 3 TB. Fits in a sharded Postgres or DynamoDB easily.
Read QPS: 12k peak. One Redis primary does that; replicate across 3 AZs.
Cache hit rate: power-law skew means top 10% of links drive 90% of traffic. Plan 95%+ hit rate.
Write QPS: 120 peak. Trivial for any DB.
Cost: <$500/month on AWS at this scale.
Failure modes
Custom aliases collide with generated keys; reserve a prefix (e.g., all generated keys start with 1, customs with a letter).
Malicious shorts: phishing, malware. Run safe-browsing checks before redirect; allow takedown.
Hot-link bursts: viral tweet points to one short. Cache-aside handles it; otherwise shard the single DB row.
URL shortener: read vs write path latency
Consistent-hash ring for KV shards
Deep dive: collision handling
If keys are SHA-truncated-to-42-bits, the birthday bound says collision probability hits 50% around 2^21 ≈ 2M keys. In practice you don't wait for 50%; you detect collisions on INSERT IF NOT EXISTS and re-hash with a salt. The cost is one extra round-trip on ~0.01% of writes — acceptable when writes are 120 QPS. For counter-based schemes collisions are impossible by construction; the bottleneck becomes the counter itself. The production answer is almost always: counter with range reservation → base62 encode. It sidesteps collision logic entirely and scales linearly with the number of write pods.
Deep dive: custom aliases vs base62
Custom aliases (bit.ly/my-event) share the same namespace as generated keys. Three robust strategies:
Prefix split: generated keys are 7 chars starting with a digit; customs must start with a letter. No runtime collision check.
Separate table: custom_aliases joined at lookup. Slower read but cleanly isolates the two flows.
Unified table with uniqueness constraint: on INSERT of a custom, fail if it collides with a generated key. Cleanest but requires probing.
Validation: customs must avoid the reserved prefix set (admin paths like /login, /api), satisfy length bounds, and pass a profanity filter to prevent brand damage. Case-folding: BitLy/Sale and bitly/sale should collide or not by explicit policy — Twitter shorts are case-preserving but lookup is case-insensitive; bit.ly is case-sensitive. Pick one and stick with it.
Deep dive: analytics sampling
Full per-click analytics at 12k RPS = 1B events/day. Write-every-click into a time-series DB is feasible but costly. Tiered approach: every click writes a Kafka event; a Flink/Spark Streaming job bucketizes into 1-min counts per key + per geo + per referrer; durable daily rollups land in a columnar store (ClickHouse, BigQuery). For long-tail links, track unique visitors with HyperLogLog (~12 KB of state gives <1% error even at 10^9 distinct visitors). Cross-reference Module 11 caching for hot-key detection via Count-Min Sketch.
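A toy, single-process stand-in for the streaming rollup described above; a real deployment runs this logic in Flink keyed by short-link and replaces the per-key visitor sets with HyperLogLog sketches. All names here are illustrative.

# 1-minute click rollup per (key, geo, referrer), plus per-key uniques
import time
from collections import defaultdict

class MinuteRollup:
    def __init__(self):
        self.counts = defaultdict(int)      # (minute, key, geo, referrer) -> clicks
        self.visitors = defaultdict(set)    # (minute, key) -> distinct visitor ids

    def on_click(self, key, geo, referrer, visitor_id, ts=None):
        minute = int((ts or time.time()) // 60)
        self.counts[(minute, key, geo, referrer)] += 1
        self.visitors[(minute, key)].add(visitor_id)    # HLL sketch in production

    def flush(self, minute):
        # Emit finished buckets to the columnar store (returned here for brevity).
        rows = {k: v for k, v in self.counts.items() if k[0] == minute}
        uniques = {k: len(v) for k, v in self.visitors.items() if k[0] == minute}
        return rows, uniques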
Twitter-like Feed
Scope. Users post short messages; each user has followers; home timeline shows posts from people they follow in reverse-chronological (or ranked) order. 300M DAU; average user follows 200 accounts; celebrities followed by 100M+; peak post rate 500/s; peak read rate 200k/s.
Twitter feed HLD (fan-out hybrid)
Hardest sub-problem: fan-out
Two models, each with fatal drawbacks at scale:
Fan-out on write (push). When A posts, copy the post ID to every follower's timeline list. Good: reads are O(1). Bad: if A has 100M followers, one post does 100M list inserts. Celebrities melt the system.
Fan-out on read (pull). On feed load, fetch latest posts from each followed account and merge. Good: writes are O(1). Bad: reads are O(followees) — 200 Redis calls per feed open.
Hybrid (the actual answer): fan-out on write for normal accounts (< 10k followers), fan-out on read for celebrities (>= 10k followers). Each user's feed is the merge of "pushed posts in my inbox" + "pulled posts from celebrities I follow." This is what Twitter and Instagram actually do.
# Feed read: merge pushed inbox with pulled celebrity timelines
def home_timeline(user_id):
    pushed = redis.lrange(f"feed:{user_id}", 0, 200)   # already merged timeline
    celeb_followees = user_service.get_celeb_followees(user_id)
    pulled = []
    for c in celeb_followees:
        # Each celeb has a cached recent list (small, shared by all followers)
        pulled.extend(redis.lrange(f"user_posts:{c}", 0, 50))
    combined = merge_by_time(pushed, pulled)[:200]
    return post_service.hydrate(combined)
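The write path is the mirror image of the read above. A hedged sketch of fan-out on write, assuming a follower_service lookup and the same bare redis client the read path uses; the threshold and trim length are illustrative.

# Feed write: push to follower inboxes, or skip fan-out for celebrities
CELEB_THRESHOLD = 10_000
INBOX_LEN = 800                      # keep a few screens of post IDs per inbox

def on_post_created(author_id, post_id):
    redis.lpush(f"user_posts:{author_id}", post_id)     # author's own timeline
    redis.ltrim(f"user_posts:{author_id}", 0, 50)
    followers = follower_service.get_follower_ids(author_id)
    if len(followers) >= CELEB_THRESHOLD:
        return                       # pull path: followers fetch user_posts at read time
    pipe = redis.pipeline(transaction=False)
    for uid in followers:
        pipe.lpush(f"feed:{uid}", post_id)
        pipe.ltrim(f"feed:{uid}", 0, INBOX_LEN - 1)
    pipe.execute()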
Back-of-envelope
Write amplification: 500 posts/s × average 200 followers = 100k Redis LPUSHes/s. One Redis shard with pipelining handles this; budget extra capacity for heavy accounts sitting just under the 10k threshold, since they still take the push path.
Read QPS: 200k/s feed opens × 1 Redis call per feed = 200k Redis ops/s. Shard by user ID across 8–16 primaries.
Timeline storage: 300M users × 200 post IDs × 8 bytes = 480 GB. Fits in Redis cluster.
Post storage: 500 × 86400 = 43M posts/day × 500 bytes = 21 GB/day = 7.8 TB/year. Cassandra or DynamoDB, partitioned by user.
Failure modes
Celebrity threshold tuning: too low → bad tail reads; too high → bad celeb writes. 10k is typical.
Timeline drift: if a push is lost, a user sees a hole. Occasional full rebuild from source DB.
Ranking: chronological is simple; "algorithmic" adds a re-ranker (see Module 10).
Spam & deletions: deleted posts must be filtered at read time; do not try to chase down 100M timeline entries.
Fan-out strategies contrasted (push vs pull vs hybrid)
Feed generation pipeline (write -> fanout -> read)
Deep dive: celebrity hybrid threshold
The "10k follower" cutoff is the parameter that the entire system pivots around. Below the threshold the user is "push": writes fan-out to every follower inbox. Above, the user is "pull": reads for anyone who follows them must include a celebrity-timeline fetch. The right threshold depends on your Redis write budget. If pushers are bounded by W write-ops/s globally, and normal users average Fnorm followers, the threshold T satisfies posts_per_sec * F_norm < W. Empirically Twitter settled near 10k; Instagram is closer to 5k because stories fan-out is heavier. Below the threshold there are still outliers: a journalist with 9k followers who suddenly goes viral must be promoted to celebrity status without losing in-flight pushes. The operational pattern: maintain a "demotion queue" that freezes new fan-outs, flushes the old ones, then flips a flag that switches read path.
Deep dive: home timeline vs user timeline
Two distinct stores even though the words sound similar:
User timeline = "all posts I authored." Written once on post. Keyed by author_id. Useful for profile views and celebrity fan-out-on-read. Cassandra or DynamoDB, partitioned by author.
Home timeline = "merged feed shown to me." Written 200× per post due to fan-out. Keyed by viewer_id. Usually Redis-only (ephemeral); rebuild on cache miss by walking followees.
Conflating them in an interview is the most common anti-pattern. The home timeline is one-way derived data; it can always be rebuilt; losing a shard is recoverable. The user timeline is source-of-truth; losing it means lost posts.
Deep dive: ranked feed scoring
Once the merge candidate list exists (~500 items), re-rank by engagement prediction. Two-tower retrieval (embeddings) selects candidates; a gradient-boosted or small DNN model scores them. Features: recency decay, author affinity, post engagement velocity, content type (image/video/text), session signals. Budget: ~50ms for 500 candidates on a CPU inference server. Cross-ref Module 10 (Search & Recommendations) for the retrieve-rank-re-rank pattern.
WhatsApp / Chat
Scope. 1-to-1 and group chat. 2B users. 100B messages/day = ~1.2M messages/s average, ~3M/s peak. End-to-end encrypted. Online presence, read receipts, message history. Must work on flaky mobile networks.
Hardest sub-problem: routing and presence
A chat client is online when it holds a TCP (WebSocket/XMPP) connection to a gateway. To route "Alice → Bob," the system must know which gateway Bob is on. That mapping changes every time Bob's phone goes through a tunnel.
Session registry: user_id → gateway_id. Updated on connect/disconnect. Stored in Redis with short TTL, refreshed by heartbeat.
Message routing: sender's gateway looks up recipient's gateway, forwards message. If recipient is offline, enqueue to persistent store and schedule APNs/FCM push.
Ordering: within a chat, messages must appear in send order. Use a per-chat monotonic sequence number (source of truth on the server).
# Simplified message flow
def on_send(sender_conn, chat_id, body):
    seq = chat_seq.incr(chat_id)          # monotonic per chat
    msg = {"chat": chat_id, "from": sender_conn.user_id, "seq": seq,
           "body": body, "ts": time.time()}
    msg_store.append(chat_id, msg)        # durable log per chat
    for uid in chat_members(chat_id):
        gw = session_registry.get(uid)
        if gw and gw != sender_conn.gateway_id:
            router.send(gw, uid, msg)     # forward via internal bus
        elif gw == sender_conn.gateway_id:
            gw_local_push(uid, msg)
        else:
            # offline - queue + push notification
            offline_queue.push(uid, msg)
            push_service.notify(uid, preview(msg))
Back-of-envelope
Connections: 1B concurrent online ÷ 1M connections per gateway = 1000 gateway pods. Each pod 16 cores, 32 GB RAM.
Message storage: 100B msg/day × 200 bytes = 20 TB/day. Retention depends on product policy; WhatsApp stores on client, server only queues for delivery.
Session registry: 2B entries × 50 bytes = 100 GB. Sharded Redis cluster with LRU on stale entries.
Peak send: 3M/s × average 10 recipients (group) = 30M router hops/s. Scale router horizontally; shard by chat_id.
Failure modes
Flaky networks: client must retry with same message-id; server dedups. Classic idempotency.
Group fan-out explosion: a 500-person group with one sender = 500 pushes. Rate-limit admins; cap group size.
Delivery ordering: out-of-order arrivals are resequenced by client using server-assigned seq number.
E2E encryption: server can see metadata (who → who, when) but not content. Complicates abuse reporting.
Session registry + multi-region routing
Message delivery timeline with offline queue
Deep dive: double-tick semantics
Four observable states on a sent message: sending (no network ack), sent (server persisted, single tick), delivered (recipient device acknowledged, double grey tick), read (recipient opened chat, double blue tick). The ticks are just enum fields on the message but the state machine is surprisingly subtle. Delivered ACK must be idempotent: if Bob's device re-connects and re-downloads, it mustn't re-ACK the same message ID and spam Alice's device. Read receipts are privacy-optional; if Bob disables them the server still tracks internally (for unread-badge counts) but suppresses the notification back to Alice. The source of truth is the per-chat max_seq_delivered and max_seq_read pair, stored per-participant, not per-message.
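A minimal sketch of the idempotent delivered-ACK handler implied above, assuming a delivery_state KV store holding max_seq_delivered per (chat, participant) and helper functions chat_members / push_to_devices; all names are illustrative.

# Delivered ACK: advance the per-participant watermark; duplicates are no-ops
def on_delivered_ack(chat_id, participant_id, acked_seq):
    key = (chat_id, participant_id)
    current = delivery_state.get(key, 0)          # max_seq_delivered so far
    if acked_seq <= current:
        return                                    # re-download / retry: already recorded
    delivery_state.set(key, acked_seq)
    # Fan the new watermark back to other members so their ticks update exactly once.
    for uid in chat_members(chat_id):
        if uid != participant_id:
            push_to_devices(uid, {"chat": chat_id,
                                  "participant": participant_id,
                                  "delivered_up_to": acked_seq})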
Deep dive: E2E encryption (Signal protocol)
WhatsApp uses the Signal protocol: X3DH for the initial key exchange, the Double Ratchet for forward secrecy, and Sesame for multi-device session management. Each message is encrypted with a per-message symmetric key derived from a chain, so compromising one key does not compromise history or future. Server sees envelope (sender id, recipient id, timestamp, size) but not plaintext. Implications for server design:
Server cannot dedup message bodies (each ciphertext looks random).
Group messages are still pairwise E2E: the sender encrypts N copies with N recipient keys. This is why a 500-member group does 500 encryptions on the sender device — bandwidth and CPU cost lives on the client.
Abuse / moderation must run on the client (Messages from Meta scans on-device) or on metadata.
Deep dive: group chat fan-out at 500+ members
Naive: sender encrypts 500 copies, uploads 500 ciphertexts, server routes to each recipient. At 500 members × 5 KB = 2.5 MB per send — fine on Wi-Fi, painful on cellular. Optimization: sender keys. Sender distributes a symmetric group key to each member once (pairwise E2E), then encrypts one ciphertext with the group key and uploads that once; server fans out the one ciphertext to all members. Forward secrecy is weaker (compromise of the group key exposes messages until rotation). Rotation: every N messages or when membership changes. This is how WhatsApp handles 1024-member groups without melting sender devices.
Uber Dispatch
Scope. Rider requests a car; system finds the nearest available driver; both get live ETA and map updates. 100M riders, 5M drivers; 15M rides/day; 10k dispatch decisions/s peak; location updates every 4s from each active driver = ~1M location updates/s.
Uber dispatch HLD
Hardest sub-problem: geospatial matching
For every rider request, find available drivers within ~3 km in under 100ms, in a city where there are thousands of drivers. The geospatial index is the whole game.
Grid cells (Uber H3 / Google S2). Earth tiled into hexagonal cells (H3) or square cells (S2). Each driver location maps to one cell ID. Query = "cells within radius R of rider cell."
R-tree / k-d tree. Classic spatial index, but worse for distributed systems because rebalancing is tricky.
Uber's actual approach: dispatcher nodes each own a set of H3 cells. Driver location updates route to the owning dispatcher (consistent hash on cell ID). When a rider requests, the API looks up the rider's cell and routes to the same dispatcher, which has all nearby drivers in local memory.
# H3-style dispatch sketch
import h3

def on_driver_update(driver_id, lat, lng):
    cell = h3.geo_to_h3(lat, lng, resolution=9)      # ~150m cells
    dispatcher = cell_owner(cell)
    dispatcher.upsert(driver_id, (lat, lng, time.time()))

def dispatch(rider_lat, rider_lng, radius_m=3000):
    rider_cell = h3.geo_to_h3(rider_lat, rider_lng, resolution=9)
    # k-ring at resolution 9 (~150m) = ceil(3000 / 150) = 20 rings
    nearby_cells = h3.k_ring(rider_cell, 20)
    candidates = []
    for c in nearby_cells:
        owner = cell_owner(c)
        candidates.extend(owner.drivers_in(c))
    # Rank by ETA (driving distance, not straight line); estimate_eta is a
    # placeholder road-graph routing helper.
    ranked = sorted(candidates, key=lambda d: estimate_eta(rider_lat, rider_lng, d))
    return ranked[:5]
Back-of-envelope
Location writes: 1M/s × 50 bytes = 50 MB/s. One Kafka topic at acks=1 handles it.
Dispatcher fleet: 5M drivers ÷ 10k per pod = 500 pods. Cells sharded by consistent hash.
Deep dive: surge pricing
Surge is a per-cell, per-time multiplier on the base fare. Compute it as surge = clamp(demand / supply, 1.0, 5.0) where demand = rider requests in the last 5 minutes and supply = available drivers. Implementation: streaming aggregation on Flink keyed by H3 cell; 30-second tumbling windows; exponential-moving-average smoothing to avoid flicker. Surge must be quoted up-front — a rider sees the multiplier before confirming. Once confirmed, the multiplier is locked on the trip row (snapshot pricing). If surge drops mid-trip the rider still pays the locked rate; if it rises they don't. Anti-gaming: cap surge per cell per hour; detect drivers collectively going offline to manufacture surge (flagged by correlating offline events with surge spikes).
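A toy per-cell calculator for the formula above, with EMA smoothing standing in for the Flink job; the smoothing factor and window inputs are illustrative.

# Surge = clamp(smoothed demand/supply, 1.0, 5.0), one value per H3 cell
class SurgeCalculator:
    def __init__(self, alpha=0.3, cap=5.0):
        self.alpha = alpha            # EMA factor applied per 30-second window
        self.cap = cap
        self.smoothed = {}            # cell -> smoothed demand/supply ratio

    def on_window(self, cell, requests_5min, available_drivers):
        raw = requests_5min / max(available_drivers, 1)
        ema = self.alpha * raw + (1 - self.alpha) * self.smoothed.get(cell, 1.0)
        self.smoothed[cell] = ema
        return max(1.0, min(ema, self.cap))

# e.g. 120 requests against 40 idle drivers trends the cell toward ~3x (capped at 5x)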
Deep dive: ETA prediction pipeline
Two ETAs: "driver to rider" and "rider to destination." Both computed with a road-graph router (not straight-line). Offline batch builds the routable graph from OSM + proprietary corrections; online inference uses recent 5-minute traffic from GPS pings per edge. Feature vector per request: origin cell, dest cell, time-of-day one-hot, weather, day-of-week, edge-average-speed over last 15 min. Model: XGBoost typically, served as ONNX on CPU, sub-10ms per prediction. Cross-reference database Module 12 for streaming aggregation.
Deep dive: dispatch optimization as bipartite matching
Naive dispatch matches the first rider to the nearest driver greedily; this is the Hungarian algorithm's first iteration. But if you batch N riders and M drivers over a 3–5s window, you can solve min-cost bipartite matching globally — reducing average ETA across the batch by 5–15%. Uber's paper "Matching Statistics" calls this the "dispatching dispatch." Cost = ETA-to-pickup for each (rider, driver) pair; solve with Kuhn-Munkres in O(n^3) — fine for N < 1000 per batch. Failure mode: stale drivers (location ping older than 15s) get excluded to prevent assigning to a driver who's already moved. Cross-link: DSA Module 8 for graph matching.
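A compact sketch of the batched assignment using SciPy's linear_sum_assignment (an optimal-assignment solver); the eta_seconds cost function is an assumed helper.

# Batched dispatch: minimize total pickup ETA over a 3-5s window of riders/drivers
import numpy as np
from scipy.optimize import linear_sum_assignment

def batch_dispatch(riders, drivers, eta_seconds):
    # cost[i][j] = predicted pickup ETA if driver j serves rider i
    cost = np.array([[eta_seconds(r, d) for d in drivers] for r in riders])
    rows, cols = linear_sum_assignment(cost)      # handles rectangular matrices
    return [(riders[i], drivers[j]) for i, j in zip(rows, cols)]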
Google Drive / Dropbox (storage)
Scope. Users upload files of any size; files sync to all their devices; files can be shared with other users. 1B users; average 10 GB stored per user = 10 EB total; 100k file ops/s peak.
File sync HLD
Hardest sub-problem: chunking, dedup, and resumable uploads
Uploading a 10 GB file over a flaky network will fail if you treat it as one request. The design splits into chunks (typically 4 MB); each chunk is stored once system-wide (content-addressed by hash).
Chunking: split file into 4 MB blocks. Compute SHA-256 per chunk. File metadata = ordered list of chunk hashes.
Dedup: blob store keyed by hash. PUT /blobs/<sha256> is a no-op if the blob already exists. A user re-uploading a shared PDF adds zero bytes globally.
Resumable uploads: client POSTs chunk by chunk; server tracks committed chunks; client can resume from first missing chunk.
Delta sync: on update, compute which chunks changed (rolling hash / rsync). Only push changed chunks.
import hashlib

CHUNK = 4 * 1024 * 1024

def upload_file(path, file_id):
    manifest = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            h = hashlib.sha256(chunk).hexdigest()
            manifest.append(h)
            if not blob_store.exists(h):
                blob_store.put(h, chunk)   # idempotent on hash
    metadata_service.put_manifest(file_id, manifest)
Back-of-envelope
Storage: 10 EB nominal; dedup ratio 1.5–3×; realistically 3–6 EB on disk. S3-equivalent at $0.023/GB/mo = roughly $70M–$140M/month just for storage. A cold tier (Glacier-class) cuts that ~10× for old files.
Metadata DB: 1B users × 1000 files × 200 bytes metadata = 200 TB. Sharded Spanner or CockroachDB; strongly consistent (a file must exist when metadata says it does).
Notify fan-out: a file change pushes to all the owner's and shared devices — typically <20. Cheap.
Failure modes
Concurrent edit conflict: two clients edit same file offline, then sync. Either last-write-wins with a conflict copy (Dropbox), or operational transform / CRDT for collaborative editing (Google Docs).
Large file on mobile: resumable upload with exponential backoff; defer upload to Wi-Fi.
Deleted blob referenced by shared link: reference count in metadata; garbage-collect blobs with zero refs after grace period.
Quota enforcement: must be globally consistent (otherwise a user can burst past quota on multiple devices). Pre-reserve bytes in metadata; commit on success.
Chunking + content-addressed dedup + CDN
Conflict resolution for concurrent edits
Deep dive: Merkle tree chunking
File manifest is really a Merkle tree: leaves = chunk hashes, internal nodes = hash of concatenated children, root = file identity. Benefits: (1) verifying a download only requires hashing each received chunk and comparing to the leaf hash; (2) delta sync between two file versions reduces to comparing tree roots, then drilling into subtrees that differ — O(log n) to find changed leaves; (3) cross-file dedup is free because the same chunk hash appears in any file that happens to contain that block. Dropbox uses a 4 MB fixed chunk size; Git (packfiles) uses similar machinery; rsync's rolling hash is a variant. The root hash also makes garbage collection auditable: orphan blobs are those whose hash appears in no file's tree.
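A compact illustration of the tree build, assuming the per-chunk SHA-256 digests are already computed; the delta-sync step is then "compare roots, recurse into subtrees whose hashes differ."

# Merkle root over an ordered list of chunk hashes
import hashlib

def merkle_root(chunk_hashes):
    if not chunk_hashes:
        return hashlib.sha256(b"").digest()
    level = list(chunk_hashes)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])               # duplicate last node at odd levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]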
Deep dive: content-defined chunking (FastCDC)
Fixed 4 MB chunks break down when a user inserts a single byte near the start of a large file — every downstream chunk shifts and gets re-hashed. Content-defined chunking solves this: cut the file at points where a rolling hash (like Rabin-Karp or Gear) of the last few bytes matches a target pattern. Insert a byte near the start → only the chunk containing the insertion boundary changes; all later chunks keep their boundaries and hashes. FastCDC is a 2016 refinement that's ~10× faster than classic Rabin chunking and is what most modern sync clients use. Average chunk size 2–8 MB, min 1 MB, max 16 MB; the tradeoff is dedup effectiveness (smaller chunks = better dedup, more metadata).
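A toy gear-hash chunker to make the idea concrete. The sizes here are KB-scale so it is easy to test; production sync clients use the MB-scale averages described above, and FastCDC adds normalization tricks this sketch omits.

# Content-defined chunking with a rolling gear hash
import random

random.seed(1)
GEAR = [random.getrandbits(64) for _ in range(256)]   # fixed random byte table
MASK = (1 << 13) - 1                                   # ~8 KB average chunk (toy size)
MIN_SIZE, MAX_SIZE = 2 * 1024, 64 * 1024

def cdc_chunks(data: bytes):
    start, h = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            yield data[start:i + 1]                    # cut point chosen by content
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]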
Deep dive: collaborative editing (OT vs CRDT)
Google Docs uses Operational Transformation: ops (insert, delete, format) are sent to the server, which applies them in a single serialized order; late-arriving ops are transformed against ops that landed first so intent is preserved. Needs a central authority (the Docs server). Figma, Yjs, Automerge use CRDTs: ops carry vector clocks or Lamport timestamps; merges are commutative so any peer can apply in any order and converge. CRDTs tolerate arbitrary topologies (P2P, offline) but their data structures are bigger (per-character metadata for text CRDTs like RGA). For a file sync product with diverse clients, CRDTs scale to more offline cases; for a cloud-first editor, OT is simpler. Cross-link: DSA Module 8 (graphs) for dependency-tracking CRDT internals.
YouTube (video)
Scope. Users upload video; others stream it globally. 2B monthly users; 500 hours uploaded per minute; 1B hours watched per day; p99 start-play < 1s.
YouTube HLD
Hardest sub-problem: transcoding pipeline
A 1-hour 4K raw upload is ~20 GB. Viewers watch on everything from a flip phone on 3G to a 4K TV on fiber. The platform must produce dozens of (resolution, bitrate) variants efficiently.
Chunked transcoding: split raw into 10-second segments; transcode each segment in parallel across a worker fleet. A 1-hour video = 360 segments × 10 variants = 3600 tasks; done in minutes on 500 machines.
Adaptive bitrate (ABR): HLS or DASH manifests list variants; player picks based on measured bandwidth; switches per segment without rebuffering.
Content-aware encoding: per-scene bitrate tuning (Netflix calls it "per-title encoding"); same perceived quality at 30–50% fewer bytes.
Stall recovery: player buffers ahead; if buffer drops below threshold, switch to lower bitrate.
# Pipeline orchestration sketch
def on_upload_complete(video_id, raw_blob_url):
    segments = split_ffmpeg(raw_blob_url, segment_seconds=10)
    tasks = []
    variants = [
        (3840, 2160, 15_000_000),   # 4K
        (1920, 1080, 4_500_000),    # 1080p
        (1280, 720, 2_000_000),     # 720p
        (854, 480, 800_000),        # 480p
        (640, 360, 500_000),        # 360p
    ]
    for seg in segments:
        for (w, h, br) in variants:
            tasks.append({"video": video_id, "seg": seg.idx,
                          "w": w, "h": h, "br": br, "in": seg.url})
    queue.enqueue_batch("transcode", tasks)
    # A separate job assembles the manifest when all variant segments are done.
Deep dive: recommendation pipeline (two-tower -> ranking -> re-ranking)
YouTube's recommender has three stages:
Candidate generation (two-tower): a user tower encodes (watch history, demographics, session) into an embedding; a video tower encodes (title, tags, creator, topic) into an embedding. Retrieval = approximate-nearest-neighbor over the video tower index (ScaNN or HNSW). Picks ~1000 candidates from 10^9 videos in <20 ms (a brute-force stand-in is sketched after this list).
Ranking: a deep network scores (user, video, context) → P(watch), P(like), expected watch time. Scores the 1000 candidates in ~100 ms on TPU.
Re-ranking: diversification, freshness boost, policy filters (no re-showing just-watched, enforce creator diversity), bandit exploration mixing. Outputs final ~25 items.
Cross-ref Module 10 for the general retrieve-rank-re-rank pattern and vector DBs.
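A brute-force stand-in for the stage-1 retrieval, assuming user and video embeddings are already trained; at 10^9 videos the exhaustive dot product is replaced by an ANN index (ScaNN, HNSW), but the interface is the same.

# Stage 1: top-k candidate videos by dot-product similarity to the user embedding
import numpy as np

def retrieve_candidates(user_vec, video_matrix, k=1000):
    scores = video_matrix @ user_vec                 # (num_videos,)
    top = np.argpartition(-scores, k)[:k]            # unordered top-k indices
    return top[np.argsort(-scores[top])]             # best-first

rng = np.random.default_rng(0)
videos = rng.standard_normal((100_000, 64)).astype(np.float32)   # toy index
user = rng.standard_normal(64).astype(np.float32)
candidates = retrieve_candidates(user, videos)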
Deep dive: storage tiering (hot / warm / cold)
Viewing follows a power law: top 0.01% of videos drive 50% of watch-hours. Tiers:
Hot: last-90-day views > 100k. Kept on SSD at POPs globally. ~1% of catalog, ~60% of bytes served.
Warm: in regional object storage; fetched to POP on demand with a 200–500 ms origin-fill.
Cold: Glacier-class tape or SMR disk; 1–5 s first-byte; acceptable for the long tail (archive history).
Transitions are automated on view-count decay. Re-heat is triggered by a spike detector — if warm-tier requests cross a threshold in <10 min, promote back to hot.
Deep dive: thumbnail generation
On upload, extract ~100 candidate frames using a shot-boundary detector; score them with a CNN trained on click-through predictions; keep top 3; surface via A/B test bandit (Thompson sampling) so the best-performing thumbnail wins across actual viewers. The model also filters unsafe frames (nudity, violence). Thumbnail bytes live on the same CDN as the video; size ~50 KB per candidate. Cross-link: database Module 9 (NoSQL) for storing per-video thumbnail metadata.
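A minimal Beta-Bernoulli Thompson sampler for the thumbnail bandit described above; stats holds (clicks, impressions) per surviving candidate and every serve samples a CTR from each posterior.

# Thompson sampling over thumbnail candidates
import numpy as np

def pick_thumbnail(stats, rng=None):
    rng = rng or np.random.default_rng()
    # One Beta(clicks+1, misses+1) posterior per candidate; serve the argmax sample.
    samples = [rng.beta(clicks + 1, impressions - clicks + 1)
               for clicks, impressions in stats]
    return int(np.argmax(samples))

# Candidate 1 has the best observed CTR and will be served most, but not always.
print(pick_thumbnail([(12, 400), (30, 410), (18, 395)]))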
Dropbox-style Sync (continuous)
Scope. Like Google Drive but focused on native desktop + mobile clients that keep local folders in sync. Think Dropbox, OneDrive, iCloud Drive. 500M users; sync latency p99 < 5s; must work on LAN-speed and dial-up alike.
Dropbox sync pipeline
Hardest sub-problem: sync coordination
A Dropbox client watches a local folder; when a file changes, it must push the change to the server, then the server must push to every other client of that user. And do this for 500M users without thundering-herd fan-out.
Block-level dedup (same as Drive). Only changed chunks move.
Notification channel: long-poll or WebSocket per device; server pushes "namespace X changed" — client pulls the diff.
Metadata journal: ordered log of operations per namespace (a folder tree). Clients keep a local copy of the journal cursor and replay new operations.
Conflict resolution: last-write-wins with a "conflict copy" file if both sides edited while offline.
# Client-side sync loop (conceptual)
def sync_loop(ns):
    cursor = state.get_cursor(ns)
    while True:
        events = notify_service.long_poll(ns, cursor, timeout=60)
        for ev in events:
            if ev.kind in ("added", "modified"):
                chunks = metadata.get_chunks(ev.file_id)
                for c in chunks:
                    if not local.has(c.hash):
                        blob = blob_store.get(c.hash)
                        local.write_chunk(c.hash, blob)
                local.assemble(ev.file_id, chunks)
            elif ev.kind == "deleted":
                local.remove(ev.file_id)
            cursor = max(cursor, ev.seq)
        state.set_cursor(ns, cursor)
Back-of-envelope
Notify connections: 500M devices online ÷ 1M per long-poll pod = 500 pods. Same as chat.
Metadata DB: sharded by user namespace; journal is append-only so a KV store like FoundationDB fits.
Network savings: block-level dedup typically saves 30–50% bytes; rolling hash delta saves another 50–90% on edit-in-place files. Hugely important when users share 4GB video projects.
Cross-device sync latency: 3–5s p99 realistic; the bottleneck is usually the slower client's last-mile upload.
Failure modes
Sync loops: a client that incorrectly re-emits events can pummel the server. Add fingerprints and rate-limit per device.
Cross-platform filename compatibility: case sensitivity, illegal characters, metadata differences. Normalize on the server; record the original name.
Offline deletes vs remote edits: resolve with "delete wins after T or unless recently edited remotely." Defaults surprise users; make the rule visible.
Large first sync: a new device joining a 500 GB folder. Allow background sync; don't block UI.
Dropbox's "delta sync" splits files into 4 MB blocks (classically fixed-size; newer builds use content-defined chunking). On file save, the client hashes each block and sends only blocks whose hash is new to the server. When downloading, it fetches only the blocks it doesn't already have locally. A 2 GB Photoshop file save that modifies 10 MB of pixels uploads ~10 MB, not 2 GB. The server keeps an immutable block store keyed by content hash (see Google Drive section). Cross-user dedup saves global storage; per-user dedup (common on enterprise) prevents the "I can detect if you have file X" side-channel.
Deep dive: bandwidth throttling
A naive sync client will saturate the user's uplink and ruin their video call. Production clients throttle (a token-bucket sketch follows this list):
Per-connection upload cap (user-configurable: "Don't use more than 1 Mbps").
TCP congestion signal: watch RTT inflation on the control connection; back off uploads if latency climbs — the "LEDBAT" style (BitTorrent's approach).
Activity sensing: pause uploads during video calls (detected via OS APIs on macOS/Windows).
Priority queue: small files (docs, recently-opened) pushed before multi-GB backups.
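A minimal token bucket for the per-connection cap above; the LEDBAT-style RTT back-off would adjust rate_bps dynamically, which this sketch omits.

# Token-bucket upload throttle: never exceed the user-configured cap
import time

class UploadThrottle:
    def __init__(self, rate_bps, burst_bytes=256 * 1024):
        self.rate = rate_bps / 8.0          # bytes per second
        self.burst = float(burst_bytes)
        self.tokens = float(burst_bytes)
        self.last = time.monotonic()

    def acquire(self, nbytes):
        # Block until nbytes of budget has accrued, then spend it.
        while True:
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)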
Deep dive: LAN sync
Two devices of the same user on the same LAN can transfer directly instead of round-tripping to the cloud. Discovery via mDNS/Bonjour; a device broadcasts _dropbox-lansync._tcp and advertises which block hashes it holds. Peer asks "do you have block X?"; peer responds with block bytes over the LAN. Gigabit LAN >> 50 Mbps cable upload; a household that shares a 10 GB video between a laptop and a NAS finishes in ~80 s vs 30 min over cloud. Security: LAN sync uses the same E2E pairwise keys as cloud; a malicious LAN peer can't inject garbage because blocks are content-addressed. Cross-link: networking Module 5 for mDNS and zero-config discovery.
Netflix Streaming / Open Connect
Scope. 260M paying subscribers; 250M hours watched/day; catalog ~15k titles; 4K HDR up to 25 Mbps per stream; p99 start-play < 2 s globally, including subscribers in Jakarta and Lagos. Unlike YouTube, Netflix owns its CDN (Open Connect) end-to-end.
Requirements. Tens of millions of concurrent streams at peak during global Squid-Game-type releases; 200 Tbps aggregate egress at peak; 95%+ traffic must be served from within-ISP cache appliances to keep cost and latency tolerable.
Open Connect push CDN architecture
Adaptive bitrate ladder selection
Deep dive: Open Connect push model
Unlike Cloudflare or Fastly (pull CDNs, lazy fill on miss), Netflix pushes content before it's watched. Each night during the ISP's off-peak window, Netflix's "fill service" uploads the next day's hot-catalog to each Open Connect Appliance (OCA) — a 1U server sitting in an ISP's rack running FreeBSD + NGINX + nvme. Prediction model uses viewer history + time-of-day + day-of-week + global release calendar to pick what 10 TB of content each OCA should hold from a catalog that's ~200 TB encoded. Hit rate at the OCA: 95%+. The 5% misses are filled via IX peering back to AWS; the cost is tail-latency on rare content.
Deep dive: adaptive bitrate
Client player (built on the Netflix DASH spec) measures download time of each 4-second segment and estimates available bandwidth. Picks the highest rung whose bitrate < 80% of measured bandwidth (safety margin). Buffer target: 30 s forward buffer. If buffer drops below 10 s, aggressively step down; if it grows beyond 60 s, step up. The per-title ladder means each title has its own (resolution, bitrate) points — a 60 fps action scene gets 5.5 Mbps while a static talking-head documentary gets the same perceived quality at 2.5 Mbps. Cross-reference Module 4 (load balancing) for how Netflix steers clients to the "best" OCA via manifest URLs returned by the control plane.
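A toy rung selector for the logic above (80% safety margin, hard step-down when the buffer drains); the ladder values are illustrative per-title points, and the step-up-above-60s behavior is omitted.

# Adaptive bitrate: pick the highest rung under 80% of measured bandwidth
def pick_rung(ladder_bps, measured_bps, buffer_s):
    if buffer_s < 10:                         # buffer draining: drop to the floor
        return min(ladder_bps)
    safe = 0.8 * measured_bps
    eligible = [b for b in sorted(ladder_bps) if b <= safe]
    return eligible[-1] if eligible else min(ladder_bps)

ladder = [500_000, 2_000_000, 5_500_000, 16_000_000]   # per-title ladder, bps
print(pick_rung(ladder, measured_bps=8_000_000, buffer_s=35))   # -> 5_500_000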
Deep dive: predictive caching
Prediction features per OCA: (1) historical view counts at this OCA over trailing 7/28 days, (2) global trend velocity (is title X trending?), (3) editorial calendar (a new season of Stranger Things launches tomorrow), (4) recommendation spillover (if a subscriber finishes season 1, P(start season 2) is 90%). Cost function: minimize expected upstream fill bytes subject to per-OCA disk capacity. Solved nightly as a multi-knapsack problem per OCA. Wrong predictions = origin fill = degraded UX; over 10+ years Netflix has tuned this relentlessly.
Back-of-envelope
Peak egress: 200 Tbps. 95% from OCAs = 190 Tbps distributed across ~10,000 appliances = ~20 Gbps/OCA. Each OCA is a 2×100GbE box.
Storage per OCA: 10–30 TB NVMe. Catalog hot-set ~200 titles/day.
Nightly fill: 10 TB/OCA × 10k OCAs = 100 PB/night moved. That's why fills run 3 AM local — ISP backbones are empty.
Transcode cost: per-title ladder with 10 rungs × HDR variant × 3 codecs (AVC/HEVC/AV1) = 60 renders per title. 15k catalog = 900k render jobs. Run offline on spot instances.
Failure modes
OCA dies during prime time: traffic fails over to sibling OCA at same IX or to cloud-fill. Manifest has multiple CDN base URLs for automatic client-side failover.
Hot unpredicted release: "Baby Reindeer" goes viral overnight; prediction didn't flag it. Emergency fill triggered; cloud-fill backs the gap until OCAs converge.
ISP peering dispute: rare but real; bypass via alternative transit.
Codec fragmentation: old Smart TVs don't support AV1. Ship multiple codecs; pay the storage cost.
Stack Overflow-style Q&A
Scope. Public Q&A with reputation, tagging, voting, moderation, accepted answers. 100M MAU; 20M registered users; 25k new questions/day; 100k new answers/day; 2B page views/month; 700 RPS peak reads (heavily cached); full-text search across 25M questions.
Requirements. Strong consistency on reputation and vote counts (money-adjacent to users); tolerable staleness on view counts; full-text search P95 < 200 ms; moderation queue SLA 5 min for flagged content.
Q&A service topologyInverted index lookup for full-text search
Deep dive: reputation calculation
Every upvote on a user's answer = +10 rep; downvote = −2; accepted answer = +15 to answerer, +2 to asker; edit approval = +2. Computed per-vote in a transaction on the user_reputation table — strong consistency matters because reputation unlocks privileges (at 50 rep you can comment anywhere; at 3k you can moderate). Anti-cheat: daily cap (200 rep/day from votes); voting rings detected by correlating vote-graph anomalies offline. Reputation is denormalized onto the user row so that user-page renders don't aggregate every vote. A nightly job recomputes from vote history to catch drift (classic reconciliation pattern).
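A sketch of the per-vote transaction with the daily cap, assuming a db handle with transaction/fetch_one/execute and a user_reputation table (user_id, rep, rep_today, rep_day); names and schema are illustrative.

# Reputation event applied in a transaction, with the 200/day cap on vote-sourced rep
REP_DELTA = {"answer_upvote": 10, "answer_downvote": -2,
             "accept_for_answerer": 15, "accept_for_asker": 2, "edit_approved": 2}
DAILY_VOTE_CAP = 200

def apply_rep_event(db, user_id, event, today):
    delta = REP_DELTA[event]
    with db.transaction():
        rep, rep_today, rep_day = db.fetch_one(
            "SELECT rep, rep_today, rep_day FROM user_reputation "
            "WHERE user_id = %s FOR UPDATE", (user_id,))
        if rep_day != today:
            rep_today, rep_day = 0, today                   # new day resets the cap
        if event == "answer_upvote":
            delta = min(delta, DAILY_VOTE_CAP - rep_today)  # clamp vote-sourced gains
            rep_today += delta
        db.execute("UPDATE user_reputation SET rep = %s, rep_today = %s, rep_day = %s "
                   "WHERE user_id = %s", (rep + delta, rep_today, rep_day, user_id))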
Deep dive: moderation queue
Flags land in mod_queue with priority = flag-weight × user-rep-of-flagger × recency. Automated rules auto-delete (spam ML classifier score > 0.95), auto-approve (rep > 10k user's edit to their own post), or route to human moderators. Humans see a sorted queue with context (post history, author rep, flag text). SLA 5 min because spam posts at high visibility damage brand. Mod actions are themselves logged and reviewable — a bad mod can be un-modded. Cross-link: Module 11 for audit trails.
Deep dive: tag subscriptions and inverted index
Each question carries 1–5 tags. Users subscribe to tags; "unanswered in my tags" is a primary use case. The subscription × tag index is a many-to-many store: tag -> subscribers and user -> tags. Lookup "unanswered questions in my tags" = union over tags of sorted posting lists, filter by is_answered=false. Materialized view + incrementally maintained on new question post. Full-text search uses Elasticsearch; each document is a post with fields (title, body, tags, answers.body); scoring is BM25 + reputation prior + recency decay. Cross-reference DSA Module 5 for Bloom filters (used to quickly rule out missing terms before hitting posting lists) and Module 10 for search stack architecture.
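A minimal stand-in for the "unanswered in my tags" lookup: merge per-tag posting lists (question IDs kept newest-first) and filter out answered questions. In production this is the incrementally maintained materialized view plus Elasticsearch; the data structures here are plain dicts for illustration.

# Union of per-tag posting lists, filtered to unanswered questions
import heapq

def unanswered_in_my_tags(tag_postings, my_tags, is_answered, limit=50):
    # tag_postings: tag -> [question_id, ...] sorted descending (newest first)
    merged = heapq.merge(*(tag_postings.get(t, []) for t in my_tags), reverse=True)
    out, seen = [], set()
    for qid in merged:
        if qid in seen or is_answered.get(qid, False):
            continue
        seen.add(qid)
        out.append(qid)
        if len(out) == limit:
            break
    return out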
Back-of-envelope
Content size: 25M questions × 5 KB avg = 125 GB; 50M answers × 2 KB = 100 GB. Fits on one SQL box (Stack Overflow famously serves the whole site from a handful of beefy SQL Servers).
Cache hit: 95%+ page views are anonymous views of popular pages — entire HTML page caches to CDN for 60s. Real DB reads: ~5% × 700 RPS = 35 RPS. Trivial.
Hot question: a viral post on HackerNews sends 50k concurrent viewers. CDN absorbs it; origin sees only a handful of revalidation requests.
Mod action race: two mods delete the same post. Idempotent action log; second mod sees "already deleted."
Search staleness: new question visible in search only after Elasticsearch indexes (usually <1 min). Acceptable. Read-your-own-writes served from SQL for own posts.
Instagram Stories
Scope. Users post ephemeral photos/videos that auto-expire after 24 hours. Viewers tap through friends' stories in sequence; the poster sees per-story viewer lists. 500M DAU post or view; 1B stories posted per day; peak 20M concurrent viewers; must feel instant (p95 first-frame < 500 ms).
Requirements. TTL data model (24h expiry, no retention beyond); viewer list per story updated in real time; image/video processing (1080p, multiple aspect ratios); stickers and effects pre-rendered server-side or client-side depending on complexity.
Stories data model and processing
Viewer-count aggregation at scale
Deep dive: TTL data model
A story row has a hard 24-hour TTL. Implementation: stored in a datastore that supports TTL natively (Cassandra with USING TTL 86400, or DynamoDB with a TTL attribute). At read time the DB checks expiry and elides rows. Blob store retains bytes for 48 h (grace window) before actual deletion — this protects against late-arriving eventual-consistency reads. Metadata TTL + blob TTL must be independent to avoid a race where metadata expires first and the blob is orphan-GC'd before anyone notices. Archive: if the poster opts into "archive all stories," a nightly job copies expiring rows into a permanent table with different access patterns.
Deep dive: multi-region push
When a poster in Jakarta uploads, her followers span every continent. Two design options:
Lazy pull: poster writes to her home region; readers fetch across regions on first view. Simple but 200–400 ms tail latency for distant readers.
Eager push: upload replicates metadata to all regions via a global pub/sub (Kafka MirrorMaker, Pulsar Geo-Replication, or Meta's proprietary Wormhole). Blob bytes replicate via CDN origin-pull or proactive push.
Instagram uses eager metadata + lazy blob via CDN. Metadata is tiny (a few KB); pushing it to every region is cheap. Blobs are big; let CDN edges pull them on demand. P95 first-frame < 500 ms requires the CDN to have the blob close to the viewer within seconds of upload — proactive CDN warm for followers in bulk regions.
Deep dive: viewer count at scale
A celebrity post can attract 10M viewers in an hour. Writing 10M rows (one per viewer) is fine, but reading back the viewer list (poster sees "Alice, Bob, ... 9.8M others viewed") needs efficient pagination. Solution: write events into Kafka partitioned by story_id; a Flink consumer per partition maintains a sorted Redis set (score = view timestamp). Poster's UI paginates through the set lazily. For the overall count, use HyperLogLog to dedupe repeat views cheaply; for the list, the authoritative source is Kafka log — nightly batch job reconciles against Redis for any drift. Cross-reference Module 7 for Kafka partitioning and DSA Module 5 for HyperLogLog.
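A direct-to-Redis sketch of the state the Flink consumer maintains (sorted set for the paginated viewer list, HyperLogLog for the deduped count); in the text the writes go through Kafka first, which is omitted here.

# Viewer list (sorted set) + unique count (HyperLogLog), both expiring with the story
import redis

r = redis.Redis()
STORY_TTL = 24 * 3600

def on_story_view(story_id, viewer_id, ts):
    r.zadd(f"story_viewers:{story_id}", {viewer_id: ts})   # idempotent per viewer
    r.pfadd(f"story_uniques:{story_id}", viewer_id)
    r.expire(f"story_viewers:{story_id}", STORY_TTL)
    r.expire(f"story_uniques:{story_id}", STORY_TTL)

def viewer_page(story_id, offset=0, page=50):
    viewers = r.zrevrange(f"story_viewers:{story_id}", offset, offset + page - 1)
    total = r.pfcount(f"story_uniques:{story_id}")
    return viewers, total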
Deep dive: image/video processing
On upload, server-side pipeline:
Validation: format sniffing, malformed-file guard (images crafted to exploit decoders).
Resize ladder: 1080p, 720p, 320p (thumbnail); WebP/HEIC encodes for modern clients, JPEG fallback.
Video transcode: 15-second clips at 720p60 and 480p30; H.264 for compatibility, HEVC for efficient streaming.
NSFW / PhotoDNA scan: run image classifiers for policy violations; route to moderation queue if triggered.
Stickers/effects: simple AR effects applied client-side on upload; complex ones (e.g., background blur with a segmentation model) run on-device using CoreML/MLKit; server receives the final render.
Pipeline is asynchronous: upload returns as soon as the original blob is durable (~100 ms) while downstream variants populate within a few seconds. Story can be "viewable" before all variants are ready; the player falls back to the original if a ladder rung isn't yet available.
Failure modes
Celebrity view burst: 10M viewers in minutes. Kafka partition for that story hot-spots; mitigate with sub-partitioning (hash of viewer_id). Cross-link: database Module 8 for hotspot mitigation.
TTL miss: story appears in user's feed but blob already GC'd. Retain blobs 48h (grace). Alert if read-after-expire rate > 0.01%.
Client re-view counting: user scrolls back; must not double-count. Dedupe in Flink with a per-viewer-per-story Bloom filter over the 24h window.
Moderation latency: offensive content spreads for minutes before human review. Use automated pre-screens (CNN classifier) that gate visibility on borderline content until human-cleared.
Module 13
Interview Cheat Sheet
Every table, every number, every one-liner you need to pull up in an interview, in one place.
Latency Reference
Operation | Time | Mnemonic
L1 cache | 0.5 ns | "free"
Branch mispredict | 5 ns | "10 L1s"
Main memory | 100 ns | "RAM is not free after all"
Mutex lock/unlock | 25 ns |
Compress 1KB (Snappy) | 3 μs |
Send 1KB over 1Gbps | 10 μs |
SSD random 4KB read | 150 μs |
Read 1 MB RAM sequential | 250 μs |
Intra-DC round trip | 500 μs | "half a millisecond"
Read 1 MB SSD sequential | 1 ms |
HDD seek | 10 ms | "ten thousandths of a second"
Read 1 MB HDD sequential | 20 ms |
Transcontinental round trip | ~80 ms | "SF ↔ NYC"
Transatlantic round trip | ~150 ms | "CA ↔ NL"
Capacity rules of thumb
Per-server QPS
Nginx static: 50k RPS per core
Node/Go HTTP: 10–30k RPS
Postgres OLTP: 1–5k QPS
Redis: 100k ops/s single-thread
Kafka: 500 MB/s sequential write
MySQL: 2–10k QPS
Bandwidth
Home fiber: 1 Gbps
DC NIC: 25–100 Gbps
TOR switch: 3.2 Tbps
WAN: 10–100 Gbps
Inter-region: 10–40 Gbps typical
Unit sizes
Tweet: ~200 B (1 KB w/ envelope)
Chat msg: ~500 B
URL: ~100 B
Photo: ~300 KB
1-min 1080p H.264 (≈4.5 Mbps): ~35 MB
1-min 4K H.265 (≈10 Mbps): ~75 MB
86,400 seconds in a day
Load Balancing Algorithms
Algorithm | How | Good for | Avoid when
Round robin | Rotate in order | Identical backends, short requests | Variable request cost
Weighted RR | Weights per backend | Heterogeneous fleet, canary | Never; always start here
Least connections | Pick fewest-active | Long-lived/variable requests | Stateless short HTTP
Least latency | Pick fastest recent | Mixed hardware, geographic mix | Slow signal, noisy
IP hash | hash(ip) mod N | Poor man's stickiness | NAT / mobile; rebalance pain
Consistent hash | Ring w/ vnodes | Cache affinity, sharded stateful | Perfect load balance needed
Power-of-two choice | Pick 2, take lighter | Large fleet, no coordination | Tiny fleet (<5)
Random | Any backend | Simplicity, no stats | Anything production
Cache Patterns
Pattern | Read path | Write path | Consistency | Use when
Cache-aside | Cache miss → DB → fill | DB write → cache delete | Eventually consistent | Default; most read-heavy services
Read-through | Always via cache library | Same as cache-aside | Eventually consistent | Framework support; cleaner code
Write-through | Cache hit | Sync cache + DB | Strong within cache | Low-churn reference data
Write-back | Cache hit | Async DB write | Risky on cache loss | Write-heavy hot counters
Write-around | Cache miss → DB → fill | DB only, skip cache | Same as cache-aside | Write-once read-rare
Refresh-ahead | Cache hit | Proactive refresh near TTL | Smoother p99 | Predictable hot keys
Invalidation shortcuts
TTL: bound staleness, no bookkeeping.
Delete on write: fresh after the next read, but races with concurrent writes.