ML Fundamentals Recap
A fast re-grounding in the things every ML systems interview assumes: tasks & targets, loss functions, the training loop, and the failure modes (overfitting, bias-variance). No calculus derivations — just the moving parts you must be able to draw on a whiteboard.
Tasks & Targets
An ML system exists to predict a target from a feature vector. The interviewer's very first clarifying question — "what are we predicting?" — is not rhetorical; the answer determines the loss, the metric, and whether you can even collect labels. There are four shapes of prediction problem that cover 95% of interview scenarios.
- Binary classification — the label is 0/1. Examples: click, spam, churn, fraud. Output layer is a single sigmoid; loss is binary cross-entropy.
- Multi-class — the label is exactly one of K classes. Examples: MNIST, language ID, intent routing. Softmax head; categorical cross-entropy.
- Regression — the label is a real number. Examples: ETA, price, ad CPC, latency. Linear output; MSE / MAE / Huber.
- Ranking / retrieval — there is no single "answer"; you return an ordered list. Examples: search, recommendation, ads. Loss acts on pairs or lists of items; metric is NDCG / MRR / recall@k.
Each of these maps to a different system shape. A click predictor looks like "feature store → DLRM → logistic head → calibration → bidder." A ranker looks like "candidate generator → scorer → re-ranker." If you do not name the task shape in the first two minutes of the interview, the rest of your design will drift.
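To make the mapping concrete, here is a minimal sketch of how the task shape changes only the output head and the loss on top of a shared representation. Everything in it (dimensions, dummy tensors, inline layers) is illustrative, not a recipe; in a real system the representation would come from a trained backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, D, K = 32, 128, 10                          # batch size, hidden dim, number of classes (illustrative)
h = torch.randn(B, D)                          # stand-in for a shared backbone's output

# Binary classification: single logit, BCE-with-logits; sigmoid only at serving time
y_bin = torch.randint(0, 2, (B,)).float()
bce = F.binary_cross_entropy_with_logits(nn.Linear(D, 1)(h).squeeze(-1), y_bin)

# Multi-class: K logits, softmax cross-entropy
y_cls = torch.randint(0, K, (B,))
ce = F.cross_entropy(nn.Linear(D, K)(h), y_cls)

# Regression: linear output, MSE (or Huber if outliers dominate)
y_reg = torch.randn(B)
mse = F.mse_loss(nn.Linear(D, 1)(h).squeeze(-1), y_reg)

# Ranking / retrieval: score query-item pairs; the loss acts on pairs or lists (see InfoNCE below)
q, item = torch.randn(B, D), torch.randn(B, D)
scores = (F.normalize(q, dim=-1) * F.normalize(item, dim=-1)).sum(-1)   # [B] dot-product relevance scores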
Losses You Must Know
Losses are the contract between the task and the gradient. Memorize these seven; almost every production system is a composition of them.
| Loss | Task | Form | When to reach for it |
|---|---|---|---|
| Binary cross-entropy | Click / fraud / CTR | -y·log(p) - (1-y)·log(1-p) | Default for 0/1 labels; probabilistic interpretation needed for bidding. |
| Categorical cross-entropy | Language ID, image class | -Σ y_k·log(p_k) | Exclusive K classes, softmax head. |
| MSE / L2 | ETA, price | (y - ŷ)² | Symmetric errors; the squared term heavily penalizes outliers, so watch for label skew and heavy tails. |
| Huber | ETA with outliers | L2 near 0, L1 at tails | MSE's smoothness with L1's robustness. |
| Hinge | SVMs, margin classifiers | max(0, 1 - y·ŷ) | Sparse support vectors; rarely used in production now. |
| Contrastive / NCE / InfoNCE | Embeddings, retrieval | -log(e^s⁺ / Σ e^s) | Trains two-tower models; pulls positive pairs together, pushes negatives apart. |
| Listwise (LambdaRank, ListNet) | Search ranking | Δ-NDCG weighted pairwise | Optimizes the ranking metric directly. |
Minimal PyTorch reference
import torch
import torch.nn.functional as F
# 1) Binary cross-entropy (logit form is numerically stable)
logits = model(x) # shape [B]
loss = F.binary_cross_entropy_with_logits(logits, y.float())
# 2) Categorical (softmax over K)
logits = model(x) # shape [B, K]
loss = F.cross_entropy(logits, y.long())
# 3) Contrastive InfoNCE (batch negatives, temperature 0.07)
q = F.normalize(query_tower(x_q), dim=-1) # [B, D]
k = F.normalize(item_tower(x_k), dim=-1) # [B, D]
logits = q @ k.T / 0.07 # [B, B]
labels = torch.arange(q.size(0), device=q.device)
loss = F.cross_entropy(logits, labels) # diagonals are positives
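The table also lists Huber and pairwise/listwise ranking losses. Here is a sketch of those two in the same placeholder style as above: model, x, and y are the same stand-ins, while scorer, x_pos, and x_neg are hypothetical names for a ranking scorer and its positive/negative inputs.
# 4) Huber for regression with outliers (delta controls the L2-to-L1 transition)
preds = model(x).squeeze(-1)                   # shape [B]
loss = F.huber_loss(preds, y.float(), delta=1.0)
# 5) Pairwise ranking (RankNet-style logistic loss on score differences)
s_pos = scorer(x_pos)                          # scores for clicked/relevant items, [B]
s_neg = scorer(x_neg)                          # scores for sampled negatives, [B]
loss = F.softplus(s_neg - s_pos).mean()        # equals -log sigmoid(s_pos - s_neg)
# LambdaRank additionally weights each pair by the |ΔNDCG| of swapping the two items.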
Gradient-Based Training
Every model in this notebook — from two-tower to 70B transformer — is trained by the same five-line loop: sample a batch, forward, compute loss, backward, step the optimizer. What varies is the optimizer, the batch size, and the parallelism. The core loop is non-negotiable.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
for epoch in range(epochs):
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad(set_to_none=True)
        logits = model(x)
        loss = loss_fn(logits, y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # stability
        optimizer.step()
    scheduler.step()
Optimizer picks, by era
- SGD + momentum — still the academic default for ResNets on ImageNet; cheap memory.
- Adam / AdamW — default for transformers and anything with sparse features. Keeps two extra moment tensors, so optimizer state costs ~2× the weights (memory arithmetic after this list).
- Adafactor / 8-bit Adam — when optimizer state memory is the bottleneck on 10B+ params.
- Lion / Sophia — newer, faster convergence claims; not yet standard in interview answers.
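To make the memory trade-off in the second and third bullets concrete, here is back-of-the-envelope arithmetic, assuming fp32 weights, fp32 Adam moments, and no sharding; mixed precision and ZeRO-style sharding change the exact numbers.
params = 10e9                                    # 10B-parameter model
weights_gb   = params * 4 / 1e9                  # fp32 weights:             ~40 GB
adamw_state  = params * 2 * 4 / 1e9              # two fp32 moments:         ~80 GB (the "2x weights")
adam8bit     = params * 2 * 1 / 1e9              # 8-bit quantized moments:  ~20 GB
sgd_momentum = params * 1 * 4 / 1e9              # single momentum buffer:   ~40 GB
print(weights_gb, adamw_state, adam8bit, sgd_momentum)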
Learning rate: the one hyperparameter that matters
If you have one hour to tune a model, spend it on the learning rate. Rule of thumb: Adam likes 1e-4 to 1e-3 for transformers; SGD likes 0.1 with momentum 0.9 for conv nets. A cosine schedule with 1–5% warmup is the modern default. Gradient clipping at norm 1.0 keeps transformer training from blowing up in the first few steps.
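One way to get the warmup-plus-cosine shape with stock PyTorch schedulers is sketched below; the total step count and the 5% warmup fraction are illustrative, and optimizer is the AdamW instance from the loop above.
total_steps  = 10_000                           # illustrative
warmup_steps = int(0.05 * total_steps)          # 5% linear warmup

warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.01, end_factor=1.0, total_iters=warmup_steps)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps - warmup_steps)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])
# With a per-step schedule like this, call scheduler.step() after every optimizer.step(),
# not once per epoch as in the loop above.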
Overfitting & Regularization
Overfitting is when the model memorizes the training set and generalizes poorly. Every regularization technique in production ML is one of three ideas: shrink the hypothesis space, add noise, or stop early.
- L2 (weight decay) — penalize ||w||². Shrinks weights toward zero; built into AdamW. Default 1e-2 for transformers.
- L1 — penalize ||w||₁. Produces sparse weights; useful when you want feature selection.
- Dropout — randomly zero p fraction of activations during training. p=0.1 is default for transformers; p=0.5 for older FC nets.
- Label smoothing — replace one-hot targets with (1-ε, ε/(K-1)...). Prevents overconfident logits; hurts calibration slightly.
- Data augmentation — random crops/flips for images, token masking for text. Free regularization via more effective samples.
- Early stopping — watch the val loss; stop when it has not improved for a few evaluations (patience sketch after this list). Cheap and almost always worth doing.
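A minimal early-stopping sketch with a patience of 3 evaluations, assuming hypothetical train_one_epoch and validate helpers that run one training pass and return the current val loss:
best_val, patience, bad_evals = float("inf"), 3, 0
for epoch in range(epochs):
    train_one_epoch(model, loader)              # hypothetical training helper
    val_loss = validate(model)                  # hypothetical eval helper
    if val_loss < best_val - 1e-4:              # small tolerance to ignore noise
        best_val, bad_evals = val_loss, 0
        torch.save(model.state_dict(), "best.pt")   # keep the best checkpoint
    else:
        bad_evals += 1
        if bad_evals >= patience:
            break                               # val loss stopped improving: stop early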
Bias-Variance
Every interview eventually asks "what's bias-variance?" The short answer is: bias is how wrong your model's best guess is on average; variance is how much that guess wobbles when you resample the training set. Total error ≈ bias² + variance + irreducible noise.
| Symptom | Diagnosis | Fix |
|---|---|---|
| Train loss high, val loss high, similar | High bias (underfit) | Bigger model, more features, less regularization, train longer. |
| Train loss low, val loss high | High variance (overfit) | More data, stronger regularization, simpler model, early stop. |
| Val loss low, online metric bad | Distribution shift / leakage | Check train/serve skew, feature freshness, label definition. |
| Loss NaN after step 50 | Exploding gradients | Clip grad norm, drop lr 10×, check for division by zero. |
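If you want to see the wobble rather than just define it, a small numpy simulation is enough: refit an underfit and an overfit model on resampled training sets and compare how far the predictions sit from the truth (bias²) versus how much they move across resamples (variance). The degrees, sample sizes, and noise level below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
x_test = 0.25                                   # fixed query point; sin(2*pi*0.25) = 1

def fit_and_predict(degree):
    preds = []
    for _ in range(200):                        # resample the training set 200 times
        x = rng.uniform(0, 1, 20)
        y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 20)    # noisy ground truth
        coefs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coefs, x_test))
    preds = np.array(preds)
    truth = np.sin(2 * np.pi * x_test)
    return (preds.mean() - truth) ** 2, preds.var()           # bias², variance

print("degree 1  (underfit):", fit_and_predict(1))    # expect: larger bias², smaller variance
print("degree 12 (overfit): ", fit_and_predict(12))   # expect: smaller bias², larger variance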