Master Notebook:
ML Systems

Production ML infrastructure and case-study-driven design for FAANG SWE + ML engineer interviews — feature stores, embeddings, recommenders, ranking, LLM serving, and training at scale.

10 Modules • 20+ Visualizations • 7 Case Studies • Complete Cheat Sheet

Module 1

ML Fundamentals Recap

A fast re-grounding in the three things every ML systems interview assumes: tasks & targets, loss functions, and the training loop. No calculus derivations — just the moving parts you must be able to draw on a whiteboard.

Tasks & Targets

An ML system exists to predict a target from a feature vector. The interviewer's very first clarifying question — "what are we predicting?" — is not rhetorical; the answer determines the loss, the metric, and whether you can even collect labels. There are four shapes of prediction problem that cover 95% of interview scenarios.

  • Binary classification — the label is 0/1. Examples: click, spam, churn, fraud. Output layer is a single sigmoid; loss is binary cross-entropy.
  • Multi-class — the label is exactly one of K classes. Examples: MNIST, language ID, intent routing. Softmax head; categorical cross-entropy.
  • Regression — the label is a real number. Examples: ETA, price, ad CPC, latency. Linear output; MSE / MAE / Huber.
  • Ranking / retrieval — there is no single "answer"; you return an ordered list. Examples: search, recommendation, ads. Loss acts on pairs or lists of items; metric is NDCG / MRR / recall@k.

Each of these maps to a different system shape. A click predictor looks like "feature store → DLRM → logistic head → calibration → bidder." A ranker looks like "candidate generator → scorer → re-ranker." If you do not name the task shape in the first two minutes of the interview, the rest of your design will drift.

Losses You Must Know

Losses are the contract between the task and the gradient. Memorize these seven; almost every production system is a composition of them.

| Loss | Task | Form | When to reach for it |
| --- | --- | --- | --- |
| Binary cross-entropy | Click / fraud / CTR | -y·log(p) - (1-y)·log(1-p) | Default for 0/1 labels; probabilistic interpretation needed for bidding. |
| Categorical cross-entropy | Language ID, image class | -Σ y_k·log(p_k) | Exclusive K classes, softmax head. |
| MSE / L2 | ETA, price | (y - ŷ)² | Symmetric errors, penalizes outliers; watch for label skew. |
| Huber | ETA with outliers | L2 near 0, L1 at tails | MSE's smoothness with L1's robustness. |
| Hinge | SVMs, margin classifiers | max(0, 1 - y·ŷ) | Sparse support vectors; rarely used in production now. |
| Contrastive / NCE / InfoNCE | Embeddings, retrieval | -log(e^s⁺ / Σ e^s) | Trains two-towers; pulls positive pairs together, pushes negatives apart. |
| Listwise (LambdaRank, ListNet) | Search ranking | Δ-NDCG weighted pairwise | Optimizes the ranking metric directly. |

Minimal PyTorch reference

import torch
import torch.nn.functional as F

# 1) Binary cross-entropy (logit form is numerically stable)
logits = model(x)                        # shape [B]
loss = F.binary_cross_entropy_with_logits(logits, y.float())

# 2) Categorical (softmax over K)
logits = model(x)                        # shape [B, K]
loss = F.cross_entropy(logits, y.long())

# 3) Contrastive InfoNCE (batch negatives, temperature 0.07)
q = F.normalize(query_tower(x_q), dim=-1)    # [B, D]
k = F.normalize(item_tower(x_k),  dim=-1)    # [B, D]
logits = q @ k.T / 0.07                      # [B, B]
labels = torch.arange(q.size(0), device=q.device)
loss   = F.cross_entropy(logits, labels)     # diagonals are positives

Gradient-Based Training

Every model in this notebook — from two-tower to 70B transformer — is trained by the same five-line loop: sample a batch, forward, compute loss, backward, step the optimizer. What varies is the optimizer, the batch size, and the parallelism. The core loop is non-negotiable.

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad(set_to_none=True)
        logits = model(x)
        loss   = loss_fn(logits, y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # stability
        optimizer.step()
    scheduler.step()

Optimizer picks, by era

  • SGD + momentum — still the academic default for ResNets on ImageNet; cheap memory.
  • Adam / AdamW — default for transformers and anything with sparse features. Costs 2× weights in state (first/second moment).
  • Adafactor / 8-bit Adam — when optimizer state memory is the bottleneck on 10B+ params.
  • Lion / Sophia — newer, faster convergence claims; not yet standard in interview answers.
Loss landscape (grid heatmap, lower = better)

Learning rate: the one hyperparameter that matters

If you have one hour to tune a model, spend it on the learning rate. Rule of thumb: Adam likes 1e-4 to 1e-3 for transformers; SGD likes 0.1 with momentum 0.9 for conv nets. A cosine schedule with 1–5% warmup is the modern default. Gradient clipping at norm 1.0 makes transformer training not blow up on the first few steps.
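
A warmup-plus-cosine schedule stepped per batch is a few lines with LambdaLR. A minimal sketch, assuming the `model` from the loop above; the warmup and total step counts are illustrative:

import math
import torch

def warmup_cosine(step, warmup=1_000, total=100_000):
    if step < warmup:
        return step / max(1, warmup)                     # linear ramp 0 → 1
    progress = (step - warmup) / max(1, total - warmup)  # 0 → 1 after warmup
    return 0.5 * (1.0 + math.cos(math.pi * progress))    # cosine decay 1 → 0

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)
# call scheduler.step() once per optimizer step (per batch, not per epoch)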

Overfitting & Regularization

Overfitting is when the model memorizes the training set and generalizes poorly. Every regularization technique in production ML is one of three ideas: shrink the hypothesis space, add noise, or stop early.

  • L2 (weight decay) — penalize ||w||². Shrinks weights toward zero; built into AdamW. Default 1e-2 for transformers.
  • L1 — penalize ||w||₁. Produces sparse weights; useful when you want feature selection.
  • Dropout — randomly zero p fraction of activations during training. p=0.1 is default for transformers; p=0.5 for older FC nets.
  • Label smoothing — replace one-hot targets with (1-ε, ε/(K-1)...). Prevents overconfident logits; hurts calibration slightly.
  • Data augmentation — random crops/flips for images, token masking for text. Free regularization via more effective samples.
  • Early stopping — watch the val loss; stop when it starts climbing. Free, always works.
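
A minimal sketch wiring several of these knobs together — dropout, decoupled weight decay, label smoothing, and a patience-based early stop. `train_loader`, `val_loader`, `evaluate`, and `max_epochs` are placeholders, not part of the notebook's earlier code:

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(),
                      nn.Dropout(p=0.1),                  # dropout
                      nn.Linear(256, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              weight_decay=1e-2)          # L2 via decoupled weight decay

best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(max_epochs):
    for x, y in train_loader:
        optimizer.zero_grad(set_to_none=True)
        loss = F.cross_entropy(model(x), y, label_smoothing=0.1)  # label smoothing
        loss.backward()
        optimizer.step()
    val = evaluate(model, val_loader)                     # placeholder: mean val loss
    if val < best_val:
        best_val, bad_epochs = val, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                        # early stopping
            break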

Bias-Variance

Every interview eventually asks "what's bias-variance?" The short answer is: bias is how wrong your model's best guess is on average; variance is how much that guess wobbles when you resample the training set. Total error ≈ bias² + variance + irreducible noise.

| Symptom | Diagnosis | Fix |
| --- | --- | --- |
| Train loss high, val loss high, similar | High bias (underfit) | Bigger model, more features, less regularization, train longer. |
| Train loss low, val loss high | High variance (overfit) | More data, stronger regularization, simpler model, early stop. |
| Val loss low, online metric bad | Distribution shift / leakage | Check train/serve skew, feature freshness, label definition. |
| Loss NaN after step 50 | Exploding gradients | Clip grad norm, drop lr 10×, check for division by zero. |
Module 2

Feature Engineering & Feature Stores

Features are where ML systems actually earn their keep. A feature store is the production plumbing that makes sure training and serving compute the same numbers — the single most common silent-failure mode in real ML.

Feature Types

Before encoding, classify every field. The classification dictates storage, memory, and which encoder to reach for.

  • Numeric continuous — price, age, latency. Needs normalization (z-score, min-max, or quantile).
  • Numeric discrete / count — num_clicks_7d, session_length. Log1p transform is the default.
  • Categorical low-card — country, device type. Cardinality < 10k → one-hot or learned embedding table.
  • Categorical high-card — user_id, item_id, url. Cardinality 10⁶+ → hashing trick or embedding table with row-eviction.
  • Text — titles, descriptions. Tokenize (BPE/WordPiece) → token ids → embedding.
  • Embedding / dense — already a vector (e.g., pretrained image CLIP features). Pass through as-is.
  • Sequence — last-N clicks, last-N searches. Feed through attention, mean-pool, or GRU.

Encoding

Every category must become a number before a model can consume it. The four canonical encoders, in order of when to reach for them:

  • One-hot — cardinality ≤ ~500. Simple, but size grows with cardinality.
  • Label encoding + embedding table — assign each category an integer id, look up a D-dim vector. Standard for anything 1k–100M cardinality. D typically 16–256.
  • Hashing trick — h(category) mod B. Collisions, but constant memory. Used by Vowpal Wabbit, ads click models. B = 2²⁰ is common.
  • Target encoding — replace each category with the mean of the target for that category. Powerful but leaks labels — must be computed per-fold with smoothing.
# Hashing trick — O(1) memory per feature.
# Python's built-in hash() is salted per process, so offline and online would
# disagree; use a stable hash so the same category always maps to the same bucket.
import hashlib
import torch
import torch.nn as nn

HASH_BUCKETS = 1 << 20          # 1,048,576 buckets

def encode(category: str) -> int:
    digest = hashlib.md5(category.encode()).digest()
    return int.from_bytes(digest[:8], "little") % HASH_BUCKETS

# Embedding lookup for this id
emb = nn.Embedding(HASH_BUCKETS, 32)      # 32-dim embedding, ~128 MB in fp32
vec = emb(torch.tensor([encode("us-west-2")]))
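
Target encoding (the last item in the list above) only stays leak-free if each row's encoding is computed from the other folds and smoothed toward the global prior. A minimal pandas / scikit-learn sketch; the column names `cat` and `y` and the smoothing strength are placeholders:

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode(df: pd.DataFrame, cat="cat", y="y", alpha=20.0, n_splits=5):
    out = np.zeros(len(df))
    prior = df[y].mean()
    for tr_idx, va_idx in KFold(n_splits, shuffle=True, random_state=0).split(df):
        stats = df.iloc[tr_idx].groupby(cat)[y].agg(["mean", "count"])
        # Shrink category means toward the global prior; rare categories shrink more.
        smoothed = (stats["mean"] * stats["count"] + prior * alpha) / (stats["count"] + alpha)
        out[va_idx] = df.iloc[va_idx][cat].map(smoothed).fillna(prior).to_numpy()
    return out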

Normalization

Numeric features must live on comparable scales or gradient descent will follow the dominant feature and ignore the rest. Four canonical choices:

  • Z-score: (x - μ) / σ. Default. Assumes roughly Gaussian distribution.
  • Min-max: (x - min) / (max - min). Bounded to [0, 1]; fragile to outliers.
  • Quantile / rank: replace x with its empirical percentile. Robust; destroys scale info.
  • Log1p: log(1 + x) for heavy-tailed counts (session time, clicks).

Rule: always fit normalizer stats on train only, persist them, and reuse at serve time. Computing μ,σ from the live request stream will silently leak future data.
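
A minimal sketch of that rule with scikit-learn: fit on the training split, persist the scaler next to the model, and load the same artifact online. `X_train`, `x_request`, and the file name are placeholders:

import joblib
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)        # μ, σ computed on train only
joblib.dump(scaler, "scaler_v7.joblib")       # version alongside the model

# --- serving path ---
scaler = joblib.load("scaler_v7.joblib")      # same stats the model was trained on
x_norm = scaler.transform(x_request)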

Train/Serve Skew

Train/serve skew is the #1 silent killer of real ML systems. The model was trained on features computed one way, but production computes them slightly differently — so the live numbers are out-of-distribution and the model silently degrades.

The four causes, in order of frequency

  • Code duplication — offline feature code in Spark, online in Python. The two implementations drift. Fix: share one codebase, or use a feature store that runs both.
  • Time travel — offline feature uses "last 7 days of clicks as of now" during training, but "as of now" means the label time, not the prediction time. Fix: point-in-time joins.
  • Missing data handling — offline imputes missing with 0; online passes raw null. Fix: one imputation policy.
  • Normalizer drift — stats recomputed on new data without retraining the model. Fix: bind normalizer version to model version.

Feature Store Architecture

A feature store is a versioned database of ML features that serves both training and serving from the same code path. The canonical architecture (Feast, Tecton, Michelangelo) has four components.

Feature store: offline / online dual-write

Offline vs online stores

| Aspect | Offline store | Online store |
| --- | --- | --- |
| Medium | Parquet on S3 / GCS, warehouse (BQ, Snowflake) | Redis, DynamoDB, Cassandra, ScyllaDB |
| Access pattern | Batch scan, point-in-time joins | Key lookup by entity id |
| Latency | Seconds to minutes | p50 1 ms, p99 5 ms |
| Size | Petabytes, full history | GB to low TB, latest snapshot |
| Use | Training, backfills, analytics | Serving at prediction time |

Online serve SLO

The online store must return a batch of features (for one user + N candidates) in under 10 ms p99, because the feature fetch is one hop in a request that must finish in ~200 ms total. Redis / ScyllaDB with multi-get pipelining is the standard answer.
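
A sketch of that multi-get pattern with redis-py — one pipelined round trip for the user hash plus one hash per candidate. The key layout and field names are assumptions, not a specific feature store's schema:

import redis

r = redis.Redis(host="feature-store", port=6379, decode_responses=True)

def fetch_features(user_id: str, item_ids: list[str]) -> dict:
    pipe = r.pipeline(transaction=False)
    pipe.hgetall(f"user:{user_id}")                 # user feature hash
    for item_id in item_ids:
        pipe.hgetall(f"item:{item_id}")             # one hash per candidate
    results = pipe.execute()                        # single network round trip
    return {"user": results[0], "items": dict(zip(item_ids, results[1:]))}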

# Feast-style point-in-time join (pseudocode)
#   label_df  has columns: user_id, ts, clicked
#   feature_view 'user_activity' has rows (user_id, ts, n_clicks_7d, ...)
#
# The join returns the latest feature row where feature_ts <= label_ts.
from feast import FeatureStore
store = FeatureStore(repo_path=".")

training = store.get_historical_features(
    entity_df=label_df,
    features=[
        "user_activity:n_clicks_7d",
        "user_activity:avg_session_min",
        "item_stats:ctr_30d",
    ],
).to_df()
Module 3

Embeddings & Vector DBs

Every modern retrieval system — search, recommendations, RAG — runs on two primitives: turn everything into a vector, then find the nearest neighbors fast. This module is about both halves.

What Is An Embedding

An embedding is a D-dimensional vector that represents a discrete object (word, user, item, image, document) such that similar objects are close in vector space. D is typically 32 for small-scale recs, 128 for items, 384–768 for sentence embeddings, and 1024–4096 for state-of-the-art text retrieval.

  • Two objects being "similar" means their cosine similarity is high (or their L2 distance is small).
  • The embedding table is just an nn.Embedding(V, D) — a V×D matrix. Row i is the embedding for token i.
  • Embeddings are learned: start random, train under some loss that pulls similar objects together and pushes dissimilar ones apart.

How Embeddings Are Trained

There are four training regimes you will be asked about.

Word2Vec / skip-gram (2013)

Predict context words from a center word. Use negative sampling: for each (center, context) positive, sample K=5–20 random words as negatives. Loss is binary cross-entropy. Trains in minutes to hours on CPU, even on 10⁹ tokens. Still the baseline for cheap text embeddings.

Matrix factorization (2007)

Given a user-item rating matrix R (sparse), factor it as R ≈ U · Vᵀ. User embeddings are rows of U, item embeddings are rows of V. Loss: MSE over observed entries + L2 regularization. Solve with alternating least squares or SGD. Still the right baseline for a recommendation system interview question.

Two-tower contrastive (2019+)

Two neural networks — a query tower and an item tower — each produce a D-dim embedding. Pull (query, positive-item) pairs close, push (query, negative-item) apart with InfoNCE loss. Negatives are sampled from the batch (each row's items are positives for that row and negatives for every other row). This is the modern retrieval training recipe.

SimCLR / contrastive for images (2020)

Take an image, apply two random augmentations → x, x'. Train a network so that f(x) and f(x') are close, while f(x) and f(y) (another image) are far. InfoNCE loss with a large batch (4k–32k). Produces image embeddings without labels.

# Two-tower training loop — the retrieval-system interview special
import torch, torch.nn as nn, torch.nn.functional as F

class Tower(nn.Module):
    def __init__(self, vocab, d_in=64, d_out=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_in)
        self.mlp = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(),
                                 nn.Linear(256, d_out))
    def forward(self, x):
        return F.normalize(self.mlp(self.emb(x).mean(1)), dim=-1)

query_tower, item_tower = Tower(10_000), Tower(1_000_000)
opt = torch.optim.AdamW(list(query_tower.parameters()) +
                        list(item_tower.parameters()), lr=1e-3)

for q_batch, pos_item_batch in loader:           # both shape [B, seq]
    q = query_tower(q_batch)                     # [B, 128]
    k = item_tower(pos_item_batch)               # [B, 128]
    logits = q @ k.T / 0.07                      # [B, B]
    labels = torch.arange(q.size(0))
    loss = F.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()

ANN Indexes

Given a query vector q and N item vectors, finding the exact nearest neighbors is O(N·D) — unacceptable at N = 10⁸. Approximate Nearest Neighbor (ANN) indexes buy ~100× speedup for 1–5% recall loss. The three you must know:

HNSW — Hierarchical Navigable Small World

A layered graph: top layer is sparse with long edges, bottom layer is dense. Search greedily descends from top. Build cost high, query cost O(log N). Recall@10 > 0.95 with default params. The current production default (used by Weaviate, Milvus, Qdrant, pgvector).

HNSW layered graph (search descends top-down)

IVF — Inverted File Index

K-means cluster all vectors into n_list centroids (typically √N, so ~10k for 100M vectors). At query, compute distance to every centroid, visit the n_probe closest (e.g. 16), and exhaustively scan their cluster. Recall tunable by n_probe. Memory-efficient; the default in Faiss.

PQ — Product Quantization

Split each D-dim vector into M chunks of D/M dims. K-means each chunk independently with 256 centroids → each chunk becomes 1 byte. A 768-dim float32 vector (3 KB) compresses to M=96 bytes. 32× compression. Usually combined with IVF → IVF-PQ.

| Index | Build | Query | Memory | Recall | Use |
| --- | --- | --- | --- | --- | --- |
| Flat (brute force) | Free | O(N·D) | 4 B × N × D | 1.0 | N < 10⁵ or ground truth. |
| HNSW | O(N log N) | O(log N) | ~8× flat | 0.95+ | 10⁶–10⁸, recall priority. |
| IVF | K-means | O(n_probe · N/n_list) | ~flat | 0.85–0.95 | Tunable, moderate scale. |
| IVF-PQ | Double quantization | ~O(n_probe · N/n_list) | ~1/32 flat | 0.80–0.90 | 10⁹+ vectors, RAM-bound. |
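
A minimal Faiss sketch of the IVF-PQ recipe in the last row. `xb` and `xq` are assumed float32 numpy matrices of item and query vectors; the dim, cluster count, and sub-vector count are illustrative:

import faiss

d, nlist, m = 768, 4096, 96                      # dim, IVF clusters, PQ sub-vectors
quantizer = faiss.IndexFlatL2(d)                 # coarse quantizer
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)   # 8 bits → 1 byte per sub-vector

index.train(xb)                                  # k-means on a training sample
index.add(xb)                                    # encode + add the full corpus
index.nprobe = 16                                # clusters scanned per query
scores, ids = index.search(xq, 10)               # top-10 neighbors per query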

Vector DB Choice

| System | Index | Scale | Sweet spot |
| --- | --- | --- | --- |
| pgvector | IVF + HNSW | < 10⁷ | Already have Postgres; one fewer service to run. |
| Pinecone (managed) | Proprietary graph | 10⁹ | Zero-ops, SaaS, fast go-to-market. |
| Weaviate | HNSW | 10⁹ | Hybrid search (keyword + vector), GraphQL API. |
| Milvus / Zilliz | HNSW, IVF, DiskANN | 10¹⁰ | Largest scale, most indexes, K8s-native. |
| Qdrant | HNSW | 10⁹ | Rust, fast, rich payload filtering. |
| Faiss (library) | Anything | Any | Build-your-own; no distributed management. |
Module 4

Recommenders

The archetypal FAANG ML systems question: "design YouTube recommendations." Every good answer has the same bone structure — candidate generation, ranking, re-ranking — filled with different algorithms per stage.

Collaborative Filtering

Collaborative filtering assumes "users who liked similar things will like similar new things." The signal is the user-item interaction matrix, no content features. Two flavors:

  • User-based: for user u, find the top-K most similar users (cosine over rating vectors), recommend their top items.
  • Item-based: for each item i, precompute the top-K most similar items. At serve time, aggregate the similar items of u's recent interactions. This is what Amazon famously used in the 2000s ("people who bought this also bought").

Item-based wins in practice because you precompute item-item similarities offline (expensive but batched), and online lookup is just a hash join. User-based requires recomputing per user at query time.
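
A toy sketch of that split: item-item cosine similarity computed offline, score aggregation online. Dense numpy for clarity only — real systems work on sparse interaction counts, and the matrix sizes here are placeholders:

import numpy as np

# R[u, i] = 1.0 if user u interacted with item i (toy dense version)
R = (np.random.rand(1000, 500) > 0.95).astype(np.float32)

# Offline: cosine similarity between item columns
norms = np.linalg.norm(R, axis=0, keepdims=True) + 1e-9
sim = (R.T @ R) / (norms.T @ norms)
np.fill_diagonal(sim, 0.0)

def recommend(recent_items, k=10):
    # Online: sum similarity to the user's recent interactions, mask seen items
    scores = sim[recent_items].sum(axis=0)
    scores[recent_items] = -np.inf
    return np.argsort(-scores)[:k]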

Matrix Factorization

Assume each user u has a latent vector p_u ∈ ℝᴰ, each item i has q_i ∈ ℝᴰ, and rating r_ui ≈ p_u · q_i. Solve min Σ (r_ui - p_u · q_i)² + λ(||p||² + ||q||²). ALS and SGD both work; ALS parallelizes better (Spark MLlib).

# Matrix factorization with SGD — the Netflix Prize recipe
import numpy as np
U, I, D = 100_000, 500_000, 64
P = np.random.randn(U, D) * 0.01
Q = np.random.randn(I, D) * 0.01
lr, reg = 0.01, 0.02

for u, i, r in interactions:             # (user, item, rating)
    pred = P[u] @ Q[i]
    err  = r - pred
    P[u] += lr * (err * Q[i] - reg * P[u])
    Q[i] += lr * (err * P[u] - reg * Q[i])

Two-Tower Retrieval

The modern drop-in for MF: one neural net per side. Query tower consumes user features + recent context and emits a D-dim vector. Item tower consumes item features and emits a D-dim vector. Train with InfoNCE. Serve by indexing every item's tower output in an ANN index and doing one ANN lookup per query.

  • Input to query tower: user_id embedding, recent N clicks (sequence), country, device, time-of-day.
  • Input to item tower: item_id embedding, category id, creator id, title text embedding, image embedding.
  • Output: L2-normalized 128-dim vector on both sides. Similarity = dot product.
  • Item tower output is precomputed offline for the full catalog and pushed to the ANN index nightly.
  • Query tower runs at request time per user.

Two-tower trades some accuracy for enormous serving efficiency: a single ANN lookup replaces scoring millions of items.

Deep Retrieval & DCN

Once candidates are retrieved, you rank them with a much richer model that can consume the full cross-product of features. This is where Deep & Cross Networks (DCN v2), DLRM, and Wide & Deep live.

  • Wide & Deep (Google, 2016): Wide arm is a linear model over crossed features (memorization); deep arm is a 3-layer MLP over embeddings (generalization). Sum the logits, apply sigmoid.
  • DCN / DCN-v2: Explicitly learn feature crosses to arbitrary order via a cross layer: x_{l+1} = x_0 ⊙ (W·x_l + b) + x_l, where ⊙ is elementwise multiplication. Replaces hand-crafted crosses (see the sketch after this list).
  • DLRM (Meta, 2019): Dense features through bottom MLP, sparse features through embedding tables, all pairwise dot products, concat, top MLP. Standard for ads.
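
The DCN-v2 cross layer from the bullet above, as a minimal PyTorch sketch; depth and width are illustrative:

import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.w = nn.Linear(d, d, bias=True)
    def forward(self, x0, xl):
        # x_{l+1} = x0 ⊙ (W·xl + b) + xl — elementwise product keeps dimension d
        return x0 * self.w(xl) + xl

class CrossNet(nn.Module):
    def __init__(self, d, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(CrossLayer(d) for _ in range(n_layers))
    def forward(self, x0):
        x = x0
        for layer in self.layers:
            x = layer(x0, x)
        return x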

3-Stage Pipeline

Candidate generation → Ranking → Re-ranking

Why three stages, not one

A single model that scores all 10⁹ items for every request would have to do 10⁹ feature fetches and forward passes per request, multiplied by ~10⁶ QPS — computationally infeasible. The three-stage funnel trades accuracy for throughput by using progressively more expensive models on progressively smaller candidate sets.

| Stage | Input size | Output size | Latency budget | Model |
| --- | --- | --- | --- | --- |
| Candidate generation | 10⁹ items | ~1000 | 10 ms | Two-tower + ANN (HNSW) |
| Ranking | ~1000 | ~100 | 20 ms | DLRM / DCN-v2 / GBDT |
| Re-ranking | ~100 | ~10 | 5 ms | Heuristics + LambdaMART + rules |

Common candidate-gen sources (union them)

  • Two-tower ANN — personalized 500 items.
  • Item-to-item co-visitation — 200 items similar to what user just watched.
  • Popularity in user's country/language — 100 items, combats cold start.
  • Recent items from subscribed channels — explicit follow signal.
  • Fresh items (uploaded last 24h) — exploration slot.
Module 5

Ranking Systems

Ranking is the middle stage of the funnel: given ~1000 candidates, score them and return the top ~100. Interviewers test whether you know which loss goes with which problem, and why position bias ruins naive models.

Pointwise / Pairwise / Listwise

Three ways to supervise a ranker, in increasing sophistication.

  • Pointwise — treat each (query, item) as an independent classification problem. Predict P(click | q, i). Loss is binary cross-entropy. Simple, fast to train, but the model does not know about ranking structure.
  • Pairwise — sample pairs (q, i⁺, i⁻) where i⁺ is clicked and i⁻ is not. Train the model so score(q, i⁺) > score(q, i⁻). Losses: RankNet (logistic over score difference), pairwise hinge. Matches ranking intuition better.
  • Listwise — loss operates on an entire ranked list. LambdaRank / LambdaMART weights each pair by the Δ-NDCG it would cause. ListNet applies softmax over the list. Highest accuracy but expensive to train.
# Pairwise RankNet loss — four lines
# s1 = score of item1, s2 = score of item2, y = 1.0 if item1 preferred else 0.0
import torch
import torch.nn.functional as F

def ranknet(s1, s2, y):
    # P(item1 ranked above item2) = sigmoid(s1 - s2); the logit form of BCE is
    # numerically stable and equivalent to sigmoid followed by binary_cross_entropy
    return F.binary_cross_entropy_with_logits(s1 - s2, y)

Feature Crosses

A feature cross is a synthetic feature made by combining two raw features (e.g., user_country × item_language). Cross features model interactions that a linear model or simple MLP cannot learn at reasonable width.

  • Wide models (logistic regression, FTRL) rely entirely on hand-crafted crosses. Production ads systems have crossed 10⁴–10⁶ features.
  • DCN / DCN-v2 learn cross interactions up to arbitrary order with a stack of cross layers. No manual crossing.
  • DLRM: pairwise dot products between all embedding-table outputs implicitly produce second-order crosses.

A second-order cross of features A and B has |A| × |B| cells; above ~10⁶ cells you need the hashing trick.

LambdaMART, DLRM, MMoE

LambdaMART

A GBDT (LightGBM, XGBoost) trained with pairwise gradients weighted by Δ-NDCG. For years the dominant search ranker (Bing, Yandex). Fast at training time, interpretable feature importances. Weakness: can't ingest raw sequences or embeddings directly.

DLRM

Facebook's open-source click-through ranker. Two-part architecture:

  • Bottom MLP over dense features (age, CTR history, prices).
  • Embedding tables for sparse features (user_id, ad_id, page_id). Tables are enormous (100s of GB) and embedding-parallel across GPUs.
  • All embeddings + dense_bottom are pairwise-dotted, concatenated, fed through top MLP → sigmoid.

MMoE — Multi-gate Mixture-of-Experts

A multi-task architecture used by YouTube & Google Ads to jointly predict multiple outcomes (click, watch time, like, share). Shared experts produce N representations; each task has a gate that weights them. Lets correlated tasks benefit each other while avoiding negative transfer on uncorrelated ones.

MMoE shared experts with per-task gates

Position Bias & Exploration

Position bias: the higher an item appears, the more likely the user is to click it, independent of relevance. A naive click model trained on logged clicks learns "ranked first → high CTR," which collapses to a self-fulfilling feedback loop.

Three remedies

  • Position feature at train, zeroed at serve — include the position as an input at training time so the model attributes some CTR to position; at serve, set position = 0 (or a reference value) so the model doesn't use it for ranking.
  • Inverse Propensity Scoring — reweight each training sample by 1/P(observed | position).
  • Counterfactual / bandit approaches — log probabilities at serve time, use doubly robust estimators.
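
A sketch of the IPS idea from the second remedy: weight each logged click by 1/P(examined | position). The propensity values below are illustrative — in practice they come from a randomization experiment or a position-bias model; `clicks` is a float 0/1 tensor and `positions` a 0-indexed long tensor:

import torch
import torch.nn.functional as F

propensity = torch.tensor([0.68, 0.41, 0.29, 0.21, 0.16])   # P(examined | pos 1..5)

def ips_loss(logits, clicks, positions):
    w = 1.0 / propensity[positions]                          # per-example weight
    per_example = F.binary_cross_entropy_with_logits(logits, clicks, reduction="none")
    return (w * per_example).mean()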

Exploration

An always-greedy ranker never collects data on items it ranks low. Three exploration strategies:

  • ε-greedy — with probability ε=1–5%, serve a random candidate in the top slot. Simple.
  • Thompson sampling — score = model prediction + noise from posterior uncertainty. Decays as confidence grows.
  • Dedicated exploration slots — reserve 1-of-10 result slots for fresh/cold-start items.
Module 6

LLM Serving Infrastructure

Serving a 70B-parameter transformer at thousands of QPS is not a math problem, it's a memory-bandwidth problem. Every optimization — KV cache, paged attention, continuous batching, speculative decoding, quantization — exists to squeeze more tokens per second out of a bandwidth-bound GPU.

KV Cache & Paged Attention

Decoder-only transformers generate one token at a time. At each step, the model computes attention over all previous tokens. Naively, that means re-running the full attention every step — O(L³) for a sequence of length L.

The KV cache is the standard fix. During the prefill pass you compute the Key and Value tensors for every input token and store them. During decoding, each new token only computes its own Q and attends against the cached K,V. Decoding drops to O(L) per step.
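
A back-of-envelope check on why the cache dominates GPU memory — per-sequence KV bytes, same formula as the cheat sheet in Module 10. The layer/head numbers are illustrative for a 70B-class dense model with grouped-query attention, not any specific checkpoint:

def kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                   seq_len=4096, dtype_bytes=2):
    # 2 accounts for storing both K and V at every layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

print(kv_cache_bytes() / 1e9, "GB per sequence")   # ≈ 1.3 GB at 4k context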

Paged Attention (vLLM, 2023)

Classical KV cache allocates one contiguous buffer per request, sized to max context length. That wastes memory: a request producing 100 tokens reserves space for 4096. Paged Attention splits KV cache into fixed-size blocks (typically 16 tokens) and keeps a per-request page table. Requests grab blocks on demand; fragmentation drops from ~60% to ~4%. This is what lets vLLM run 2–4× more concurrent requests than HuggingFace TGI of 2022.

Continuous (In-Flight) Batching

Static batching stalls because sequences finish at different times. If you batch 8 requests and one wants 4000 tokens while seven want 100, the GPU idles for those seven after step 100. Continuous batching (also called in-flight batching) evicts finished sequences from the batch and admits waiting requests every iteration.

Static vs continuous batching over time

Continuous batching typically raises throughput 3–5× vs static batching at identical latency, especially under mixed prompt lengths.

Speculative Decoding

Autoregressive decoding is serial: you must generate token N before N+1. Speculative decoding uses a small cheap draft model to guess K tokens, then runs the big model once to verify all K in parallel. If the draft model agrees with the big model on k tokens, you get k tokens in one big-model forward pass.

  • Draft model: a 1B or 7B variant of the big model family.
  • Target model: 70B or larger.
  • Speedup: 2–3× on average; zero quality loss (the big model rejects any incorrect draft token).
# Speculative decoding outline
def spec_decode(target, draft, prompt, K=4):
    x = prompt
    while not finished(x):
        # Draft produces K tokens cheaply
        draft_tokens = draft.generate(x, K)
        # Target verifies all K in ONE forward pass
        target_logits = target(x + draft_tokens)
        # Accept longest prefix where target agrees with draft
        k = longest_accept(draft_tokens, target_logits)
        x = x + draft_tokens[:k]
        # Always append one extra sample from the target's own distribution
        x = x + sample(target_logits[k])
    return x

Quantization

Weights are stored in fp16 or bf16 during training (16 bits each). At serve time, you can drop precision without much accuracy loss — moving less memory per token is the fastest way to speed up a bandwidth-bound GPU.

| Format | Bits | Memory for 70B | Quality | Notes |
| --- | --- | --- | --- | --- |
| FP16 / BF16 | 16 | 140 GB | Baseline | Training + inference default. |
| INT8 (W8A16) | 8 (weights) | 70 GB | ~0.1 pt drop | LLM.int8, SmoothQuant. |
| INT4 (W4A16) | 4 (weights) | 35 GB | ~0.5 pt drop | GPTQ, AWQ. Fits 70B on one 80 GB H100. |
| FP8 | 8 (weights + activations) | 70 GB | ~0 drop | H100 native; ideal if hardware supports it. |

Serving Stacks

  • vLLM — open-source; paged attention; continuous batching; tensor-parallel multi-GPU. Default for self-hosting in 2024+.
  • TGI (Hugging Face) — similar features; tighter integration with the HF model hub; Rust router.
  • NVIDIA Triton + TensorRT-LLM — maximum throughput on NVIDIA hardware; proprietary kernels; in-flight batching; complex to configure.
  • Ray Serve + vLLM backend — when you also need complex routing and scaling.
Module 7

Training Pipelines

A production training pipeline is a dataflow graph — data lake, processing, training, eval, registry, promotion — that runs on a schedule and emits a new model blob that gets canary-promoted into serving. Every box is a failure point.

Lake → Registry

Training pipeline: 7 stages
  • Data lake: S3/GCS Parquet, partitioned by date. Append-only, immutable; the source of truth.
  • Processing: Spark, Beam, or Flink jobs; compute features + labels; write tfrecords or parquet for trainer consumption.
  • Training: orchestrated by Airflow / Kubeflow / Flyte; runs on GPU cluster; writes to object store.
  • Eval: offline metrics on held-out data; slice metrics (new users vs tenured users, each locale, etc.).
  • Model registry: MLflow / SageMaker / Vertex. Stores weights + metadata + lineage (data version, code sha).
  • Canary: deploy to 1% traffic; automated guardrails on error rate, latency, online metric delta.
  • Promotion: if guardrails clean for 24–72h, ramp to 100%. Otherwise rollback.

Hardware

| Accelerator | HBM | FP16 TFLOPs | Interconnect | Workload |
| --- | --- | --- | --- | --- |
| A100 80GB | 80 GB | 312 | NVLink 600 GB/s, InfiniBand HDR 200 Gbps | 7B–30B dense, any recs training. |
| H100 80GB | 80 GB HBM3 | 1000 (FP16), 2000 (FP8) | NVLink 900 GB/s, IB NDR 400 Gbps | Frontier LLM training. |
| H200 | 141 GB HBM3e | 1000 | Same as H100 | Long-context inference. |
| TPU v4 / v5 | 32 / 96 GB | 275 / 459 | 3D torus ICI | Google-only; large batches. |
| MI300X (AMD) | 192 GB | 1300 | Infinity Fabric | HBM-rich inference. |

DDP / FSDP / ZeRO

Data-parallel (DDP)

Replicate the whole model on every GPU. Each GPU sees a different batch slice. After backward, AllReduce the gradients. Works until the model fits in one GPU's memory.

FSDP / ZeRO

Shard the model parameters, gradients, and optimizer state across GPUs. During forward, AllGather the needed weights; during backward, ReduceScatter the gradients. Memory drops linearly with world size. ZeRO has three stages:

  • ZeRO-1: shard optimizer state only (Adam state is 12 B/param vs weights at 2 B/param fp16).
  • ZeRO-2: shard gradients too.
  • ZeRO-3 (= FSDP): shard parameters too. Equivalent to fully-sharded data-parallel.

Tensor / Pipeline parallel

Very large models (70B+) don't fit even under FSDP. Tensor parallel splits matmuls across GPUs (Megatron-LM). Pipeline parallel splits layers across GPUs, introducing bubble overhead. Combine with DP as 3D parallelism.

# FSDP in PyTorch — the one-line upgrade from DDP
import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# transformer_auto_wrap_policy must be bound to the model's transformer block class;
# TransformerBlock below stands in for whatever layer class your model uses.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerBlock},
)

model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.float32,     # keep reductions in fp32 for stability
        buffer_dtype=torch.bfloat16,
    ),
)

Mixed Precision & Checkpointing

  • Mixed precision (bf16 / fp16): keep a master copy of weights in fp32, cast to bf16 for forward/backward. Throughput doubles; bf16 has the same exponent range as fp32 so it rarely overflows (fp16 needs loss scaling).
  • Gradient accumulation: effective batch size = micro_batch × accumulation_steps × world_size. Lets you match a target batch size on small hardware.
  • Gradient / activation checkpointing: recompute forward activations during backward instead of storing them. Trades ~30% extra compute for ~50% less activation memory. Mandatory past ~7B params.
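
The accumulation arithmetic from the list above in loop form — a minimal sketch reusing the `model` / `loader` / `loss_fn` / `optimizer` names from the Module 1 training loop; the accumulation count is illustrative:

import torch

accum = 8                                      # effective batch = micro_batch × accum × world_size
optimizer.zero_grad(set_to_none=True)
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum        # scale so accumulated gradients average correctly
    loss.backward()                            # gradients accumulate in .grad
    if (step + 1) % accum == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)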
Module 8

Evaluation & A/B Testing

Offline metrics tell you whether a model is better on old logs. Online A/B tests tell you whether users actually react. Every FAANG launch gets gated on an A/B test — know how to design one, not just how to read one.

Offline Metrics

| Metric | Range | Task | What it measures |
| --- | --- | --- | --- |
| AUC / ROC-AUC | 0.5–1.0 | Binary classification | P(score(pos) > score(neg)). Ranking quality, ignores calibration. |
| Log-loss | 0–∞ | Probabilistic classification | Both ranking and calibration. Required for bidding models. |
| NDCG@k | 0–1 | Ranking | Discounted gain with logarithmic position decay. Default for search. |
| MRR | 0–1 | Single correct answer | Reciprocal rank of the first relevant item. QA, nav queries. |
| MAP | 0–1 | Multi-label ranking | Mean average precision. Document retrieval. |
| Recall@k | 0–1 | Retrieval | Fraction of relevant items in top-k. Candidate-gen eval. |
| RMSE / MAE | 0–∞ | Regression | Scale-dependent; report alongside a baseline. |

Online Metrics

  • CTR (click-through rate) — short-term signal; easy to game via clickbait.
  • Watch time / session length — medium-term engagement; the canonical YouTube metric.
  • Retention / DAU / WAU — long-term; slow and noisy; the ultimate product metric.
  • Revenue per mille (RPM) — ads; directly ties model quality to dollars.
  • Guardrail metrics — latency p99, error rate, complaint rate, blocked-content rate. Must not regress.

A/B Design

A well-formed A/B test has six pieces:

  • Unit of randomization — user_id for most ML (session_id for logged-out). Never per-request; within-user spillover contaminates both arms.
  • Hypothesis — "new ranker increases watch-time by ≥ 0.5% with p < 0.01, no guardrail regression."
  • Primary metric — one pre-registered metric. Adjust for multiple comparisons if you're tracking more.
  • Sample size / power — computed from MDE (minimum detectable effect) and baseline variance. Typical FAANG A/B: 7–14 days, millions of users.
  • Traffic split — 50/50 if no risk; 95/5 with 1% canary first if guardrails unknown.
  • Stopping rule — fixed horizon or sequential (always-valid p-values, mSPRT).

Sample size, back-of-envelope

# For a continuous metric, two-sided test, α=0.05, power=0.80:
# n per arm ≈ 16 × σ² / Δ²
# σ = stdev of metric per user
# Δ = minimum detectable effect (absolute)
#
# For watch time: σ ≈ 30 min, Δ = 0.6 min → n ≈ 16 × 900 / 0.36 = 40,000/arm

CUPED & Interference

CUPED (Controlled-experiment Using Pre-Existing Data)

Variance reduction via pre-experiment covariates. Compute ŷ = y - θ · (x_pre - E[x_pre]) where x_pre is each user's metric value before the experiment. Lowers variance 30–50%, which cuts required sample size 30–50%. Standard at Microsoft, Netflix, Booking.
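
A minimal numpy sketch of that adjustment: θ is the regression slope of the metric on its pre-experiment value, and the adjusted metric keeps the same mean with lower variance. In practice θ is estimated pooled across both arms:

import numpy as np

def cuped_adjust(y, x_pre):
    theta = np.cov(x_pre, y)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())

# Usage: compare arms on the adjusted metric instead of the raw one
# y_treat_adj = cuped_adjust(y_treat, x_pre_treat)
# y_ctrl_adj  = cuped_adjust(y_ctrl,  x_pre_ctrl)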

Interference

Classical A/B assumes units are independent. ML violates this in three ways:

  • Two-sided marketplaces (Uber, Airbnb): treating some riders affects driver availability for control. Fix: cluster randomization (by city / day).
  • Social spillover (Facebook feed): treated user's post seen by control user. Fix: ego-network randomization.
  • Learning loops: both arms write to the same candidate-gen training data → pollution. Fix: separate log streams.
Module 9

ML System Design Case Studies

Production pipelines at FAANG scale.

YouTube Recommendations

Scale: 2B+ logged-in monthly users, billions of watch hours per day, > 500 hrs uploaded per minute. Candidate corpus 10⁸ videos.

Pipeline: Two-stage candidate generation → ranker → re-ranker. Candidate gen: two-tower (user tower + video tower) + ANN (HNSW) retrieval ~O(10³) candidates. Ranker: heavy DNN (Deep & Cross) on ~10³ cands returns score per item. Re-ranker: diversity / fairness / policy filters.

Features: user demographics, watch history embeddings, context (device, geo, time-of-day, query if searched), video side features (category, uploader, length, freshness). Training data: implicit (watch time > threshold) and explicit (like, subscribe).

YouTube 2-stage recommendation pipeline

Google Search Ranking

Lexical + semantic hybrid. BM25 on inverted index for initial top-K (10⁴), neural ranker (BERT-like cross-encoder) for re-rank top 10². Quality signals: PageRank, user engagement clicks, freshness, authority.

Netflix Personalization

Row-level selection: each shelf ("Trending", "Because you watched X", "New releases") has own candidate gen + ranker. Artwork personalization: multi-armed bandit picks per-user thumbnail.

TikTok For-You Page

Dense retrieval + heavy engagement feedback loop. Candidate generation uses user/video two-tower embeddings + content-based + social graph. Ranker predicts finish-watch probability + like + share. Tight feedback: < 1 hr model refresh on recent engagement.

Ads Ranking

Expected revenue = bid × pCTR × pCVR. Second-price auction on ranked candidates. pCTR model is logistic regression on feature crosses (historically) → DCN / DLRM (modern). Budget pacing: smooth spend across day via throttling.

Ads ranking — expected revenue auction

GitHub Copilot Serving

Client keystroke → context extraction (cursor + nearby files + imports) → serverless LLM call with KV cache + speculative decoding. Latency target: < 300ms first token. Quality gates: license filtering, secrets detection on output.

Module 10

Cheat Sheet

ML interview framework + memorizable numbers.

ML System Design Framework

  1. Clarify — business metric, user segment, scale (QPS, corpus size), latency budget, fairness / policy constraints.
  2. Data — label source (implicit/explicit), training window, sampling bias, feedback loop.
  3. Features — user, item, context; feature store (online vs offline); crosses.
  4. Model — baseline (logistic / GBDT) → deep (two-tower / DCN / Transformer). Justify upgrade with offline lift.
  5. Serving — batch vs online, ANN vs exact, KV cache, quantization, GPU vs CPU.
  6. Evaluation — offline (AUC, NDCG@k), online A/B (primary metric + guardrails).
  7. Monitoring — feature drift, model decay, train/serve skew, engagement anomalies.

Numbers to Memorize

  • Embedding dim: 64–256 common for retrieval; 768 for BERT-base; 4096 for LLaMA-2-7B (8192 for the 70B).
  • ANN latency: HNSW M=16, ef=200 → P99 < 10ms on 10⁸ vectors.
  • Transformer FLOPs: ≈ 2 × params per generated token at inference; ≈ 6 × params per token for training (forward + backward).
  • KV cache size: 2 × L × H × d_h × seq_len × batch × bytes_per_val.
  • GPU utilization: well-tuned training should hit > 50% FLOP utilization.
  • A/B test sample size: ~10⁴–10⁶ users per arm for most product metrics (with CUPED lowering this 30–50%).