Roadmap · 22 modules · First principles
Backend Engineering — From First Principles
End-to-end backend curriculum: how a request travels the wire, hits HTTP, gets parsed, validated, routed, authorized, processed by business logic, queried against DB/cache/search, observed, and shipped. Each module: concepts → patterns → code → gotchas → interview lens.
MODULE 01 — FOUNDATIONS
Request Flow End-to-End
Browser → DNS → TCP → TLS → HTTP → LB → server → response. Every hop matters.
The full path
browser
  │
  ├─ DNS lookup (recursive resolver → root → TLD → authoritative)
  │
  ├─ TCP 3-way handshake (SYN → SYN-ACK → ACK)          [1 RTT]
  ├─ TLS 1.3 handshake (ClientHello → cert → Finished)  [1 RTT]
  │
  ├─ HTTP request bytes ──► public internet ──► ISP ──► transit ──► cloud edge
  │                                                                     │
  │                                                                     ▼
  │                                                        CDN / Cloudflare / AWS edge
  │                                                                     │
  │                                                                     ▼
  │                                                            Load Balancer (L7)
  │                                                                     │
  │                                                                     ▼
  │                                                            Application server
  │                                                           (routing → middleware
  │                                                            → controller → service
  │                                                            → DB / cache / queue)
  │                                                                     │
  ◄────────────────── HTTP response (status, headers, body) ◄───────────┘
Hops & what each does
Hop              Layer    Job
DNS              App      Resolve api.example.com → IP. Cached at OS, browser, resolver.
Firewall / NAT   L3/L4    NAT between private and public IPs (SNAT outbound, DNAT inbound). Drops disallowed traffic.
CDN edge         L7       Serve static/cached content. Terminate TLS close to the user.
WAF              L7       OWASP rules: block SQLi/XSS patterns, geo blocks.
Load balancer    L4 / L7  Spread load across servers. Health-check pools. Sticky or stateless.
API gateway      L7       Auth, rate limit, transform, route to upstream services.
App server       L7       Run your code: middleware chain → handler → response.
The response: structure
HTTP/2 200
content-type: application/json; charset=utf-8
content-length: 47
cache-control: private, max-age=60
x-request-id: 7f3a-1c2b
date: Mon, 11 May 2026 09:14:22 GMT

{"id": 42, "name": "alice", "email": "a@x.com"}
Interview lens: "User clicks button — what happens?" Walk all hops. Mention TLS 1.3 = 1-RTT handshake (vs 2 in 1.2). Mention happy-eyeballs (IPv6 then IPv4 fallback). Mention HTTP/3 over QUIC removes TCP head-of-line blocking.
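The hop list translates into a latency budget. A rough sketch, assuming one RTT per handshake round trip, a fresh DNS lookup, and no packet loss, TCP slow start, or TLS 0-RTT; `time_to_first_byte_ms` is a hypothetical helper, not a standard formula.

```python
def time_to_first_byte_ms(rtt_ms: float, dns_ms: float, server_ms: float,
                          tls_version: str = "1.3", reuse_connection: bool = False) -> float:
    """Rough TTFB estimate for one HTTPS request over TCP (HTTP/1.1 or HTTP/2)."""
    if reuse_connection:
        # Warm keep-alive connection: DNS, TCP and TLS are already done.
        return rtt_ms + server_ms
    tls_rtts = {"1.3": 1, "1.2": 2}[tls_version]  # full handshake, no session resumption
    tcp_rtts = 1                                  # SYN / SYN-ACK (the ACK rides with data)
    request_rtts = 1                              # request out, first response byte back
    return dns_ms + (tcp_rtts + tls_rtts + request_rtts) * rtt_ms + server_ms


if __name__ == "__main__":
    # 50 ms RTT, 20 ms DNS, 10 ms of server work
    print(time_to_first_byte_ms(50, 20, 10))                         # → 180 (cold, TLS 1.3)
    print(time_to_first_byte_ms(50, 20, 10, tls_version="1.2"))      # → 230 (cold, TLS 1.2)
    print(time_to_first_byte_ms(50, 20, 10, reuse_connection=True))  # → 60 (warm connection)
```

With a 50 ms RTT, a cold TLS 1.3 connection spends 150 ms on round trips before the 10 ms of server work; reusing the connection skips DNS and two of those three round trips.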
MODULE 02 — PROTOCOL
HTTP Protocol
Message structure, headers, methods, CORS, status codes, caching, versions, TLS.
Raw message format
# request (an empty line ends the headers; Content-Length counts body bytes)
POST /api/users HTTP/1.1
Host: api.example.com
Content-Type: application/json
Authorization: Bearer eyJ...
Content-Length: 40

{"email":"a@x.com","password":"hunter2"}
# response
HTTP/1.1 201 Created
Content-Type: application/json
Location: /api/users/42

{"id":42,"email":"a@x.com"}
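A minimal sketch of producing and splitting these raw messages by hand, assuming a JSON body and no chunked transfer encoding; `build_request` and `parse_response` are hypothetical helpers. It shows the two framing details: an empty line (CRLF CRLF) ends the headers, and Content-Length counts encoded body bytes.

```python
import json


def build_request(method: str, path: str, host: str, payload: dict) -> bytes:
    """Serialize an HTTP/1.1 request with a compact JSON body."""
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    head = (
        f"{method} {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(body)}\r\n"  # byte count of the encoded body, not characters
        "\r\n"                              # empty line ends the header section
    )
    return head.encode("ascii") + body


def parse_response(raw: bytes) -> tuple[int, dict[str, str], bytes]:
    """Split a raw HTTP/1.1 response into status code, headers, body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    status = int(status_line.split(" ", 2)[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return status, headers, body
```

Non-ASCII text makes the byte/character difference visible: "é" is one character but two UTF-8 bytes.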
Header families
Family          Examples                                              Purpose
Request         Host, User-Agent, Accept, Authorization               Describe sender + intent
Representation  Content-Type, Content-Encoding, Content-Length, ETag  Describe body bytes
General         Date, Connection, Cache-Control, Via                  Apply in both directions
Security        Strict-Transport-Security, X-Frame-Options,           Browser hardening
                Content-Security-Policy, X-Content-Type-Options
Methods & semantics
Method   Safe  Idempotent  Body  Use
GET      ✓     ✓           -     Read resource
HEAD     ✓     ✓           -     Headers only (existence/size check)
OPTIONS  ✓     ✓           -     CORS pre-flight, capability discovery
POST     ✗     ✗           ✓     Create / non-idempotent action
PUT      ✗     ✓           ✓     Full replace at known URI
PATCH    ✗     ✗*          ✓     Partial update (*idempotent with JSON Merge Patch)
DELETE   ✗     ✓           -     Remove resource
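Why JSON Merge Patch makes PATCH idempotent: per RFC 7396, a patch object sets members, `null` deletes them, and any non-object value replaces the target wholesale, so applying the same patch twice yields the same document. A sketch (`merge_patch` is an illustrative name):

```python
import copy
from typing import Any


def merge_patch(target: Any, patch: Any) -> Any:
    """Apply a JSON Merge Patch (RFC 7396) without mutating either input."""
    if not isinstance(patch, dict):
        return copy.deepcopy(patch)          # non-object patch replaces the target outright
    result = copy.deepcopy(target) if isinstance(target, dict) else {}
    for name, value in patch.items():
        if value is None:
            result.pop(name, None)           # null means "remove this member"
        else:
            result[name] = merge_patch(result.get(name), value)
    return result


user = {"name": "alice", "email": "a@x.com", "prefs": {"theme": "dark", "lang": "en"}}
patch = {"email": None, "prefs": {"theme": "light"}}
once = merge_patch(user, patch)
twice = merge_patch(once, patch)             # same result: the operation is idempotent
```

By contrast, a JSON Patch (RFC 6902) `add` operation targeting `/tags/-` appends on every application, so it is not idempotent.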
CORS — Cross-Origin Resource Sharing
Browser enforces same-origin policy. Server opts other origins in via headers.
Simple request
Methods GET/HEAD/POST only.
Only "safelisted" request headers (Accept, Accept-Language, Content-Language, and Content-Type limited to text/plain | application/x-www-form-urlencoded | multipart/form-data).
Browser sends it directly with Origin: https://app.x.com. The response must carry a matching Access-Control-Allow-Origin, or the browser hides it from the page.
Pre-flight
# browser sends first:
OPTIONS /api/users HTTP/1.1
Origin: https://app.x.com
Access-Control-Request-Method: PUT
Access-Control-Request-Headers: authorization, content-type
# server replies:
HTTP/1.1 204 No Content
Access-Control-Allow-Origin: https://app.x.com
Access-Control-Allow-Methods: GET, POST, PUT, DELETE
Access-Control-Allow-Headers: authorization, content-type
Access-Control-Allow-Credentials: true
Access-Control-Max-Age: 86400
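A sketch of the server side of that exchange, assuming a fixed origin allow-list and request headers whose names are already lower-cased; `preflight` and the constants are hypothetical. It echoes the exact origin (required when credentials are allowed) and adds `Vary: Origin` so shared caches don't serve one origin's answer to another.

```python
ALLOWED_ORIGINS = {"https://app.x.com"}       # hypothetical allow-list
ALLOWED_METHODS = {"GET", "POST", "PUT", "DELETE"}
ALLOWED_HEADERS = {"authorization", "content-type"}


def preflight(headers: dict[str, str]) -> tuple[int, dict[str, str]]:
    """Answer a CORS pre-flight (OPTIONS) request."""
    origin = headers.get("origin", "")
    method = headers.get("access-control-request-method", "").upper()
    requested = {h.strip().lower()
                 for h in headers.get("access-control-request-headers", "").split(",")
                 if h.strip()}
    vary = {"Vary": "Origin"}                 # the answer depends on Origin, so caches must key on it
    allowed = (origin in ALLOWED_ORIGINS
               and method in ALLOWED_METHODS
               and requested <= ALLOWED_HEADERS)
    if not allowed:
        return 403, vary                      # no Access-Control-* headers: the browser blocks the call
    return 204, {
        **vary,
        "Access-Control-Allow-Origin": origin,  # exact origin; "*" is invalid with credentials
        "Access-Control-Allow-Methods": ", ".join(sorted(ALLOWED_METHODS)),
        "Access-Control-Allow-Headers": ", ".join(sorted(ALLOWED_HEADERS)),
        "Access-Control-Allow-Credentials": "true",
        "Access-Control-Max-Age": "86400",      # browsers may cap this value
    }
```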
Status codes — ones that matter
Range  Code · meaning
2xx    200 OK · 201 Created · 202 Accepted (async) · 204 No Content · 206 Partial Content (range)
3xx    301 Moved Permanently · 302 Found · 304 Not Modified (validator hit, e.g. ETag) · 307/308 redirect preserving method and body
4xx    400 Bad Request · 401 Unauthorized (really: unauthenticated) · 403 Forbidden · 404 Not Found · 409 Conflict · 422 Unprocessable Content (formerly Entity) · 429 Too Many Requests
5xx    500 Internal Server Error · 502 Bad Gateway · 503 Service Unavailable · 504 Gateway Timeout
Caching: ETag vs max-age
# first response carries ETag
HTTP/1.1 200 OK
ETag: "v1-7f3a"
Cache-Control: max-age=60, must-revalidate
# subsequent request — conditional
GET /api/users/42
If-None-Match: "v1-7f3a"
# server unchanged → no body
HTTP/1.1 304 Not Modified
ETag: "v1-7f3a"
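Validators: ETag (strong = byte-exact; W/"..." = weak, semantically equivalent), Last-Modified (1-second precision, so weaker). Clients send them back as If-None-Match / If-Modified-Since.
Freshness: max-age=N = fresh for N seconds, served from cache with no server hit.
private = only the end user's browser may cache. public = shared caches (CDN, proxies) may store it.
no-store = never persist (sensitive data). no-cache = may store, but must revalidate before every use.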
HTTP versions
Version   Transport           Multiplexing                              Head-of-line blocking                 Header compression
HTTP/1.1  TCP, text framing   1 in-flight req/conn (pipelining unused)  At app layer (per connection)         None
HTTP/2    TCP, binary frames  Streams over 1 conn                       TCP level (1 lost packet stalls all)  HPACK
HTTP/3    QUIC over UDP       Independent streams                       None at transport (per-stream loss)   QPACK
Content negotiation & compression
Accept: application/json;q=0.9, application/xml;q=0.5
Accept-Encoding: gzip, br, zstd
Accept-Language: en-US, en;q=0.8
# server picks best match, replies:
Content-Type: application/json
Content-Encoding: br
TLS / HTTPS
TLS provides confidentiality (encryption), integrity (AEAD tag in TLS 1.3), and authentication (cert chain).
TLS 1.3 handshake: 1 RTT (vs 2 in 1.2). 0-RTT early data is possible on session resumption, but it is replayable, so only use it for idempotent requests.
Cert chain: leaf → intermediate → root (trust anchor in the OS / browser store).
SNI lets one IP host many TLS sites (the client sends the hostname in ClientHello).
HSTS header Strict-Transport-Security: max-age=31536000; includeSubDomains; preload forces HTTPS for one year (31,536,000 s).
Gotcha: 401 Unauthorized is a misnomer; it means unauthenticated (missing or invalid credentials, sent with a WWW-Authenticate header). Use 403 when the caller is authenticated but lacks permission.
MODULE 03 — DISPATCH
Routing
URL → handler. Method-aware. Versioned. Grouped. Fast.
Route components
GET /api/v1/users/:userId/posts?status=published&limit=20
     │   │  │     │            │
     │   │  │     │            └─ query params (filters, paging)
     │   │  │     └─ path param (resource id)
     │   │  └─ resource (collection)
     │   └─ version
     └─ namespace
Route types
Type                    Example                          Notes
Static                  /health                          O(1) hash lookup possible.
Dynamic                 /users/:id                       Param capture. Most frameworks use a radix tree / trie.
Nested / hierarchical   /orgs/:org/teams/:team/members   Authorization often cascades.
Catchall / wildcard     /files/*path                     Greedy; matched last.
Regex                   /users/{id:\d+}                  Type-narrowed. Powerful, slower.
API versioning strategies
Strategy    Example                            Pros / Cons
URI         /v1/users                          + Visible, cache-friendly. − Many URLs.
Header      API-Version: 2                     + Clean URL. − Hidden in tooling; caches need Vary: API-Version.
Query       ?v=2                               + Trivial. − Easy to omit; CDNs that ignore query strings in cache keys mix versions.
Media type  Accept: application/vnd.x.v2+json  + RESTful (content negotiation). − Hardest to test; caches need Vary: Accept.
Deprecation pattern
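HTTP/1.1 200 OK
Deprecation: @1767225600
Sunset: Fri, 01 Jan 2027 00:00:00 GMT
Link: <https://api.x.com/v2/users>; rel="successor-version"
Warning: 299 - "v1 deprecated; migrate to v2 by 2027-01-01"
# Deprecation (RFC 9745) carries the Unix time of deprecation (illustrative: 2026-01-01T00:00:00Z); early drafts used "true".
# Warning is obsoleted by RFC 9111 and clients may ignore it; rely on Deprecation + Sunset + Link.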
Route grouping
# pseudo-framework
group("/api/v1", middleware=[logger, requestId]) {
group("/auth", middleware=[rateLimit(5, "1m")]) {
POST("/login", loginHandler)
POST("/refresh", refreshHandler)
}
group("/admin", middleware=[requireAuth, requireRole("admin")]) {
GET("/users", listUsers)
DELETE("/users/:id", deleteUser)
}
}
Matching perf: radix-tree routers (httprouter, chi) match in O(path length). Regex routers try patterns one by one, O(number of routes). Match priority: static → dynamic → wildcard.
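A simplified segment trie illustrating the static → dynamic → wildcard priority. It is a sketch, not httprouter's compressed radix tree; `Router`, `add`, and `match` are hypothetical names, and handlers are plain values.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Node:
    static: dict[str, Node] = field(default_factory=dict)   # exact-segment children
    param: Optional[tuple[str, Node]] = None                # ":name" child
    wildcard: Optional[tuple[str, Node]] = None             # "*name" catch-all child
    handlers: dict[str, Any] = field(default_factory=dict)  # HTTP method -> handler


class Router:
    """Segment trie: one node per path segment; tries static, then dynamic,
    then wildcard children, backtracking only when a branch dead-ends."""

    def __init__(self) -> None:
        self.root = Node()

    def add(self, method: str, pattern: str, handler: Any) -> None:
        node = self.root
        for seg in filter(None, pattern.split("/")):
            if seg.startswith(":"):
                if node.param is None:
                    node.param = (seg[1:], Node())
                node = node.param[1]
            elif seg.startswith("*"):
                if node.wildcard is None:
                    node.wildcard = (seg[1:], Node())
                node = node.wildcard[1]
                break                                       # catch-all must be the last segment
            else:
                node = node.static.setdefault(seg, Node())
        node.handlers[method] = handler

    def match(self, method: str, path: str) -> Optional[tuple[Any, dict[str, str]]]:
        return self._match(self.root, [s for s in path.split("/") if s], 0, method, {})

    def _match(self, node, segs, i, method, params):
        if i == len(segs):
            handler = node.handlers.get(method)
            return (handler, params) if handler is not None else None
        seg = segs[i]
        child = node.static.get(seg)                        # 1. static wins
        if child is not None:
            hit = self._match(child, segs, i + 1, method, params)
            if hit:
                return hit
        if node.param is not None:                          # 2. then a path parameter
            name, child = node.param
            hit = self._match(child, segs, i + 1, method, {**params, name: seg})
            if hit:
                return hit
        if node.wildcard is not None:                       # 3. wildcard last; it is greedy
            name, child = node.wildcard
            handler = child.handlers.get(method)
            if handler is not None:
                return handler, {**params, name: "/".join(segs[i:])}
        return None
```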
MODULE 04 — DATA ON THE WIRE
Serialization & Deserialization
Native ↔ wire bytes. Pick format by audience + perf budget.
Text vs binary
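               JSON       XML             Protobuf           MessagePack    Avro
Readable       ✓          ✓               ✗                  ✗              ✗
Schema         optional   XSD (optional)  required (.proto)  none           required
Size           baseline   1.5–2×          0.2–0.5×           0.5×           0.3×
Parse speed    baseline   slower          ~10–20× faster     ~5× faster     ~10× faster
Use            web APIs   legacy / SOAP   gRPC, internal     cache, RPC     Kafka, big data
(Size and speed are rough ratios vs JSON; they vary with payload shape and library.)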
JSON deep-dive
{
"string": "hello",
"int": 42,
"float": 3.14,
"bool": true,
"null": null,
"array": [1, 2, 3],
"nested": { "k": "v" },
"date": "2026-05-11T09:14:22Z"
}
The date is ISO-8601 with an explicit offset (Z = UTC). JSON has no comment syntax, so notes belong outside the payload.
Native mapping
JSON    Python       Go                       JS/TS
object  dict         struct / map[string]any  object
array   list         []T                      Array
number  int/float    float64 (or typed)       number
null    None         nil / zero / pointer     null
Edge cases
Missing fields — apply defaults; use Optional[T]/pointers to distinguish absent vs null.
Extra fields — strict mode reject, lenient ignore. Default to reject for inbound user data.
Numbers — JS number = float64, loses precision past 2^53. Send big ints as strings.
Dates — ISO-8601 with timezone offset (Z = UTC). Never plain "2026-05-11 09:14:22".
Floats — money in cents (integer) or decimal-string. Never float for currency.
Null vs absent — for PATCH, "absent" = leave alone, "null" = clear field.
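A sketch of an outbound serializer applying those rules, assuming money travels as a decimal string and timestamps must be timezone-aware; `to_wire` and `safe_ints` are hypothetical helpers built on `json.dumps(default=...)`.

```python
import json
from datetime import datetime, timezone
from decimal import Decimal

MAX_SAFE_INT = 2**53 - 1          # largest integer a JS number represents exactly


def to_wire(value):
    """json.dumps default= hook for types JSON can't represent safely."""
    if isinstance(value, datetime):
        if value.tzinfo is None:
            raise ValueError("refusing to serialize a naive datetime")
        return value.astimezone(timezone.utc).isoformat().replace("+00:00", "Z")
    if isinstance(value, Decimal):
        return str(value)          # money as a decimal string, never a float
    raise TypeError(f"not JSON serializable: {type(value).__name__}")


def safe_ints(obj):
    """Recursively turn integers beyond JS's safe range into strings."""
    if isinstance(obj, bool):      # bool is a subclass of int; leave it alone
        return obj
    if isinstance(obj, int) and abs(obj) > MAX_SAFE_INT:
        return str(obj)
    if isinstance(obj, dict):
        return {k: safe_ints(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [safe_ints(v) for v in obj]
    return obj


order = {
    "id": 9007199254740993,        # 2^53 + 1: a JS float64 would round it
    "total": Decimal("19.99"),
    "createdAt": datetime(2026, 5, 11, 9, 14, 22, tzinfo=timezone.utc),
}
print(json.dumps(safe_ints(order), default=to_wire))
# → {"id": "9007199254740993", "total": "19.99", "createdAt": "2026-05-11T09:14:22Z"}
```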
Schema validation (JSON Schema)
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"required": ["email", "age"],
"additionalProperties": false,
"properties": {
"email": { "type": "string", "format": "email", "maxLength": 254 },
"age": { "type": "integer", "minimum": 0, "maximum": 150 },
"tags": { "type": "array", "items": { "type": "string" }, "maxItems": 20 }
}
}
Insecure deserialization: language-native binary formats (Python's pickle, Java ObjectInputStream, PHP unserialize) can execute attacker-supplied code on parse. Never deserialize untrusted input with binary native formats. JSON-only at external boundaries.
MODULE 05 — TRUST
Authentication, Authorization, Security
Who you are, what you can do, how attackers try to get past it.
Authentication mechanisms
Mechanism        State              How                                         Use
Basic auth       stateless          Authorization: Basic base64(user:pass)      Internal/dev. HTTPS mandatory.
API key          stateless          Long random string per client               Server-to-server, partner APIs.
Session cookie   stateful (server)  Random ID → server lookup                   Classic web apps.
JWT (bearer)     stateless          Signed claims in token                      SPAs, mobile, microservices.
OAuth 2.0        delegated          Authorization code (+ PKCE) → access token  Third-party app access.
OIDC             OAuth + identity   OAuth + id_token (a JWT)                    SSO / "Login with Google".
MFA              + factor           TOTP, WebAuthn, SMS (weak)                  High-value accounts.
JWT anatomy
header.payload.signature
# header
{ "alg": "RS256", "typ": "JWT", "kid": "key-2026-q2" }
# payload (claims)
{
"sub": "user_42",
"iss": "https://auth.x.com",
"aud": "api.x.com",
"exp": 1715420000,
"iat": 1715416400,
"scope": "read:users write:posts"
}
# signature = sign(base64url(header) + "." + base64url(payload), private_key)   (base64url without padding)
JWT pitfalls
Algorithm confusion: validate alg against an allow-list. Reject none. Don't trust kid blindly (resolve it only within your own key set).
No revocation: tokens stay valid until exp. Use a short TTL (5–15 min) + refresh tokens, and revoke the refresh token on logout.
Payload is public: signed, not encrypted (anyone can base64url-decode it). Don't put secrets in claims.
Clock skew: allow ±60 s of leeway on exp/nbf.
Password storage
# NEVER: plaintext, MD5, SHA-1, SHA-256 plain
# CORRECT: slow KDF with per-user salt
hash = argon2id(password, salt, m=64MB, t=3, p=1)
# alternatives: bcrypt (cost ≥ 12; input truncated at 72 bytes), scrypt, PBKDF2-HMAC-SHA256 (≥ 600k iterations)
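argon2id and bcrypt need third-party packages, so this sketch uses the PBKDF2 alternative from Python's `hashlib`, with a per-user random salt and the iteration count stored alongside the hash so it can be raised later. The `pbkdf2_sha256$iterations$salt$hash` storage format is an assumption.

```python
import base64
import hashlib
import hmac
import os

ITERATIONS = 600_000   # PBKDF2-HMAC-SHA256 work factor; raise it as hardware gets faster


def hash_password(password: str) -> str:
    salt = os.urandom(16)                                          # unique per user
    dk = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    # Store algorithm + parameters with the hash so they can be upgraded later.
    return "pbkdf2_sha256${}${}${}".format(
        ITERATIONS, base64.b64encode(salt).decode(), base64.b64encode(dk).decode())


def verify_password(password: str, stored: str) -> bool:
    algo, iterations, salt_b64, hash_b64 = stored.split("$")
    if algo != "pbkdf2_sha256":
        return False
    dk = hashlib.pbkdf2_hmac("sha256", password.encode(),
                             base64.b64decode(salt_b64), int(iterations))
    return hmac.compare_digest(dk, base64.b64decode(hash_b64))    # constant-time compare
```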
Authorization models
Model                       Decision input                              Example
RBAC (role-based)           (user, role) → perms                        admin, editor, viewer.
ABAC (attribute-based)      (user.attrs, resource.attrs, env) → allow?  "engineer in same dept can read".
ReBAC (relationship-based)  graph: user → owns → doc                    Google Docs sharing. Zanzibar.
OWASP-style attacks & defenses
Attack                    Mechanism                                 Defense
SQL injection             Untrusted input concatenated into SQL     Parameterized queries / prepared statements. Never string-interpolate.
NoSQL injection           Object-shaped input replaces operators    Schema-validate. Reject objects where strings are expected.
XSS                       Untrusted input rendered as HTML/JS       Context-aware escaping. CSP header. HttpOnly cookies.
CSRF                      Browser auto-sends cookies                SameSite=Lax/Strict, CSRF tokens, double-submit, Origin check.
MITM                      Network attacker reads traffic            TLS everywhere, HSTS, cert pinning for mobile.
Insecure deserialization  Native binary parsers on untrusted input  JSON only for untrusted input; signed payloads for internal.
SSRF                      Server fetches an attacker-chosen URL     Allow-list URLs. Block private, loopback and link-local ranges (incl. metadata IP 169.254.169.254); re-check after DNS and redirects.
IDOR                      Predictable IDs without authz check       Check ownership server-side on every request. UUIDs help only as defense-in-depth.
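For the SSRF row, a sketch of an outbound-URL guard, assuming the caller then connects to one of the returned IPs (not the hostname) to avoid DNS rebinding; `check_outbound_url` is hypothetical and `resolve` is injectable so it can be tested without network access.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_ip(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    # is_global is False for loopback, RFC 1918, link-local (incl. 169.254.169.254), CGNAT, ...
    return addr.is_global and not addr.is_multicast


def check_outbound_url(url: str, resolve=socket.getaddrinfo) -> list[str]:
    """Resolve the URL's host and return its IPs only if every one is public."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("scheme or host not allowed")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    infos = resolve(parts.hostname, port, proto=socket.IPPROTO_TCP)
    ips = sorted({info[4][0] for info in infos})
    if not ips or not all(is_public_ip(ip) for ip in ips):
        raise ValueError(f"{parts.hostname} resolves to a non-public address")
    return ips
```

Re-run the check on every redirect hop; a public URL can 302 to 169.254.169.254.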
Secure design principles
Least privilege — give each subject minimum it needs.
Defense in depth — overlapping layers (WAF, app validation, DB constraints).
Fail secure — when in doubt, deny. Don't open-default on errors.
Separation of duties — same human can't approve and execute payouts.
CSP — Content-Security-Policy: default-src 'self'; script-src 'self' 'nonce-xyz'.
SameSite cookies — Set-Cookie: session=...; HttpOnly; Secure; SameSite=Lax.
Attack prevention practices
Audit-log failed logins, privilege escalations, admin actions. Tamper-evident store.
Generic error messages on auth ("invalid credentials" — don't leak which part wrong).
Rate limit per-IP + per-account. Exponential backoff. Lock after N failures.
Constant-time compare for tokens/HMAC (hmac.compare_digest) — avoid timing attacks.
Interview lens: "How do you store passwords?" → argon2id with per-user salt + pepper. "Why salt?" → defeats rainbow tables. "Why slow KDF?" → makes brute force infeasible per guess.
MODULE 06 — INPUT HYGIENE
Validation, Transformation, Normalization
Fail fast on bad input. Normalize before processing. Sanitize before storing.
Three validation types
Type       What it checks  Examples
Type       Right shape     String not array; integer not string.
Syntactic  Right format    Email regex, UUID, ISO date, phone.
Semantic   Right meaning   Age 0–150; endDate > startDate; SKU exists in catalog.
Client vs server
Client-side validation = UX (instant feedback). Cannot be trusted.
Server-side validation = security. Always re-validate.
Fail fast: validate at the edge, before any expensive work (DB writes, external calls) happens.
Transform & normalize
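import re
email   = email.strip().lower()        # trim + lowercase (local parts may be case-sensitive; most providers ignore case)
phone   = re.sub(r'\D', '', phone)     # digits only
country = country.strip().upper()      # "us" → "US"
name    = ' '.join(name.split())       # collapse whitespace
slug    = re.sub(r'[^a-z0-9]+', '-', title.lower()).strip('-')   # "Hello World!" → "hello-world"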
Sanitization (escape, don't trust)
# HTML
clean_html = bleach.clean(user_html, tags=['p','b','i'], strip=True)
# SQL — never string-format
cur.execute("SELECT * FROM users WHERE id = %s", (user_id,))
# Shell: avoid. Pass an argument list to subprocess.run (no shell=True); if a shell string is unavoidable, shlex.quote every argument
Complex rules
Relationship: password == confirmPassword.
Conditional: if type == "business", then taxId is required.
Chained: parse → type-check → range-check → cross-field check.
Error aggregation
HTTP/1.1 422 Unprocessable Entity
Content-Type: application/problem+json
{
"type": "https://x.com/errors/validation",
"title": "Validation failed",
"status": 422,
"errors": [
{ "field": "email", "code": "invalid_format", "message": "not a valid email" },
{ "field": "age", "code": "out_of_range", "message": "must be 0–150" }
]
}
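A hand-rolled sketch that produces that aggregated shape, walking type → syntactic → semantic checks plus the conditional and relationship rules without stopping at the first error. Field names and rules are illustrative; in practice a schema library does this.

```python
import re
from typing import Any

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")   # deliberately loose syntactic check


def validate_signup(data: dict[str, Any]) -> list[dict[str, str]]:
    """Collect every error instead of failing on the first one."""
    errors: list[dict[str, str]] = []

    def err(field: str, code: str, message: str) -> None:
        errors.append({"field": field, "code": code, "message": message})

    email = data.get("email")
    if not isinstance(email, str):
        err("email", "invalid_type", "must be a string")
    elif not EMAIL_RE.match(email.strip()):
        err("email", "invalid_format", "not a valid email")

    age = data.get("age")
    if not isinstance(age, int) or isinstance(age, bool):   # bool is a subclass of int
        err("age", "invalid_type", "must be an integer")
    elif not 0 <= age <= 150:
        err("age", "out_of_range", "must be 0–150")

    # conditional rule: business accounts need a tax ID
    if data.get("type") == "business" and not data.get("taxId"):
        err("taxId", "required", "required for business accounts")

    # relationship rule
    if data.get("password") != data.get("confirmPassword"):
        err("confirmPassword", "mismatch", "must match password")
    return errors


def problem_response(errors: list[dict[str, str]]) -> tuple[int, dict]:
    """Shape errors as an application/problem+json body with status 422."""
    return 422, {
        "type": "https://x.com/errors/validation",
        "title": "Validation failed",
        "status": 422,
        "errors": errors,
    }
```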
Pro: use a schema library (pydantic, zod, joi, valibot). It codifies type + syntactic + semantic rules in one place, and many can emit JSON Schema / OpenAPI (e.g. pydantic models via FastAPI).
MODULE 07 — PIPELINE
Middleware
Cross-cutting logic in chain. Order matters more than content.
What middleware does
Run code before handler (parse, auth, log start).
Run code after handler (log status, add headers, compress).
Short-circuit — return early (401, 429, 404) without calling next.
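A minimal sketch of the onion model behind those three behaviors, assuming requests and responses are plain dicts; `compose`, `logger`, `auth`, and `hello` are hypothetical.

```python
from typing import Callable

Request = dict
Response = dict
Handler = Callable[[Request], Response]
Middleware = Callable[[Request, Handler], Response]


def compose(middlewares: list[Middleware], handler: Handler) -> Handler:
    """middlewares[0] runs first on the way in and last on the way out."""
    def wrap(mw: Middleware, nxt: Handler) -> Handler:
        return lambda req: mw(req, nxt)

    app = handler
    for mw in reversed(middlewares):
        app = wrap(mw, app)
    return app


def logger(req: Request, nxt: Handler) -> Response:
    req.setdefault("trace", []).append("log:start")       # before the handler
    resp = nxt(req)
    req["trace"].append(f"log:finish {resp['status']}")   # after the handler
    return resp


def auth(req: Request, nxt: Handler) -> Response:
    if req.get("headers", {}).get("authorization") is None:
        return {"status": 401, "body": "unauthenticated"}  # short-circuit: next never runs
    return nxt(req)


def hello(req: Request) -> Response:
    req["trace"].append("handler")
    return {"status": 200, "body": "hello"}


app = compose([logger, auth], hello)
```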
Canonical ordering
request ──► recovery (panic/exception catcher)
        ──► requestId / traceId
        ──► access log start
        ──► CORS
        ──► security headers (HSTS, X-Content-Type-Options, CSP)
        ──► body parser (json, urlencoded, multipart)
        ──► compression negotiation
        ──► rate limiter
        ──► authentication
        ──► authorization
        ──► validation
        ──► route ──► handler ──► response
        ◄── log finish (status, duration)
        ◄── error handler (if thrown)
Common middlewares
Type         Examples
Security     helmet (sets headers), CSRF, CORS
Parsing      JSON, urlencoded, multipart (file upload)
Auth         JWT verify, session lookup, API-key check
Rate limit   token bucket per IP/user/route
Logging      access log, request-id propagation
Compression  gzip/br based on Accept-Encoding
Error        centralized handler: maps exceptions to status codes
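The rate-limit row, sketched as an in-process token bucket with an injectable clock. This single-process version is an assumption for illustration; behind multiple instances the counters would live in a shared store such as Redis. `TokenBucket` and `is_rate_limited` are hypothetical names.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity` requests, refilling at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = capacity          # start full
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # refill in proportion to the time elapsed, never above capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


buckets: dict[str, TokenBucket] = {}


def is_rate_limited(key: str) -> bool:
    """key = client IP, user ID, or route + user; 5-request burst, 1 request/s sustained."""
    if key not in buckets:
        buckets[key] = TokenBucket(capacity=5, rate=1.0)
    return not buckets[key].allow()
```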
Keep middleware lightweight
Global middleware runs on every request. 1 ms each × 10 middlewares = 10 ms added to every response.
Heavy work (image processing, external API calls) belongs in the handler or a background job, not middleware.
Cache expensive lookups (e.g., JWKS public keys); don't refetch them per request.
Gotcha: a body parser after JWT auth is fine, since the token lives in a header. But if the rate limiter keys on body content (e.g., per-mutation limits), the parser must run before it. Think about what each layer needs.
MODULE 08 — STATE
Request Context
Per-request scratch space that flows with the call — without leaking across requests.
What lives in context
Metadata: URL, headers, method, remote IP, start time.
Identity: userId, orgId, scopes (set by auth middleware).
Tracing: requestId, traceId, spanId for correlation.
Cancellation: timeout / abort signal for downstream calls.
DB conn / transaction: scoped per request; all reads share one snapshot only at REPEATABLE READ or higher.
Patterns
Language  Pattern
Go        context.Context as first arg: context.WithValue(ctx, key, val), ctx.Done(), ctx.Deadline().
Node      AsyncLocalStorage (avoids passing context through every layer).
Python    contextvars.ContextVar (async-safe; each asyncio Task gets its own copy).
Java      ThreadLocal (thread-per-request) / Reactor Context in reactive frameworks.
Request ID propagation
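// inbound middleware (Express-style; als = new AsyncLocalStorage() from 'node:async_hooks')
const reqId = req.headers['x-request-id'] ?? randomUUID()   // randomUUID from 'node:crypto'
res.setHeader('x-request-id', reqId)
const log = logger.child({ reqId })                          // e.g. a pino child logger
als.run({ reqId, log }, () => next())                        // downstream code reads als.getStore()
// outbound HTTP calls
fetch(url, { headers: { 'x-request-id': als.getStore().reqId } })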
Timeouts & cancellation
// Go
ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
defer cancel()
row := db.QueryRowContext(ctx, "SELECT ...") // aborted when ctx times out or the client disconnects
// Node (17.3+)
await fetch(url, { signal: AbortSignal.timeout(2000) }) // rejects with a TimeoutError after 2 s
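Memory leak: storing big payloads (bodies, files) in a context that outlives the request, e.g. one captured by a background task or cache. Keep context small and drop references when the response completes.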
MODULE 09 — STRUCTURE
MVC, Controllers, REST APIs
Separation of concerns inside request path.
Layered responsibility
Layer                     Owns                                               Doesn't touch
Handler / Controller      Parse req, validate, call service, shape response  SQL, business rules
Service (business logic)  Use cases: placeOrder, cancelSubscription          HTTP, DB driver specifics
Repository / DAO          Persistence, queries, ORM calls                    Business rules, HTTP
Model                     Entity definition, invariants                      I/O
CRUD ↔ HTTP mapping
POST /users → create
GET /users → list (paginated)
GET /users/:id → read
PUT /users/:id → full replace
PATCH /users/:id → partial update
DELETE /users/:id → remove
POST /users/:id/reset → action (non-CRUD verb)
Standard list response
{
"data": [ { "id": 1, "name": "alice" }, ... ],
"meta": {
"page": 2,
"limit": 20,
"total": 137,
"hasMore": true
},
"links": {
"self": "/users?page=2&limit=20",
"next": "/users?page=3&limit=20",
"prev": "/users?page=1&limit=20"
}
}
Pagination styles
Style                    Pros                                    Cons
Offset (?page=N)         Simple, jump to page                    Slow on big tables (DB walks skipped rows); pages shift with concurrent writes
Cursor (?after=cursor)   Stable, fast, infinite-scroll friendly  No "jump to page N"
Keyset (WHERE id > ?)    Same as cursor; index-friendly          Requires a sortable, unique key (add id as a tie-breaker)
Search / sort / filter
GET /products?q=phone&category=electronics&minPrice=100&sort=-price,name&page=2
# parsed:
{
q: "phone",
filters: { category: "electronics", price: { gte: 100 } },
sort: [{ field: "price", dir: "desc" }, { field: "name", dir: "asc" }],
page: 2
}
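A sketch of turning that URL into the parsed structure, assuming allow-lists for filterable and sortable fields; all names are hypothetical. The allow-list matters because sort fields usually end up in an ORDER BY clause.

```python
from urllib.parse import parse_qs, urlsplit

FILTER_FIELDS = {"category", "brand"}                   # hypothetical allow-lists
RANGE_PARAMS = {"minPrice": ("price", "gte"), "maxPrice": ("price", "lte")}
SORT_FIELDS = {"price", "name", "created_at"}
MAX_LIMIT = 100


def parse_list_query(url: str) -> dict:
    qs = {k: v[-1] for k, v in parse_qs(urlsplit(url).query).items()}   # last value wins
    filters: dict = {k: qs[k] for k in sorted(FILTER_FIELDS) if k in qs}
    for param, (field, op) in RANGE_PARAMS.items():
        if param in qs:
            filters.setdefault(field, {})[op] = float(qs[param])
    sort = []
    for token in filter(None, qs.get("sort", "").split(",")):
        name = token.lstrip("-")
        if name not in SORT_FIELDS:        # never pass raw client input into ORDER BY
            raise ValueError(f"cannot sort by {name!r}")
        sort.append({"field": name, "dir": "desc" if token.startswith("-") else "asc"})
    return {
        "q": qs.get("q"),
        "filters": filters,
        "sort": sort,
        "page": max(1, int(qs.get("page", "1"))),
        "limit": max(1, min(MAX_LIMIT, int(qs.get("limit", "20")))),
    }
```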
REST principles
Resource-oriented — nouns in URLs, verbs in methods.
Stateless — every request carries enough auth/context.
Cacheable — GETs use ETags / max-age.
Uniform interface — consistent shape across resources.
HATEOAS (optional) — responses embed links to next actions.
Redact sensitive fields: never serialize password_hash, tokens, or internal flags; map entities to explicit response DTOs.
OpenAPI spec — define contract first; generate client/server stubs.
MODULE 10 — PERSISTENCE
Databases
Storage shape, consistency, indexing, query plans, ORMs.
Relational vs non-relational
          Relational (Postgres, MySQL)    Document (Mongo)                     Key-value (Redis, DynamoDB)  Wide-column (Cassandra)
Schema    fixed                           flexible                             none                         row-flexible
Joins     strong                          weak ($lookup / aggregate)           none                         none
Txn       full ACID                       per-doc; multi-doc (4.0+, costlier)  per-key                      per-row
Scale     vertical + read replicas        shard by key                         horizontal                   horizontal
Use       transactional, complex queries  nested objects, agile schema         cache, hot keys              time-series, massive writes
ACID
Atomicity — all-or-nothing within txn.
Consistency — txn moves DB between valid states (constraints hold).
Isolation: concurrent txns appear not to interfere; the level decides which anomalies remain possible (dirty reads, non-repeatable reads, phantoms). Levels: Read Uncommitted → Read Committed → Repeatable Read → Serializable.
Durability — committed data survives crash (fsync to disk).
CAP theorem
Under a network Partition (P), a system must choose Consistency (C: refuse requests it can't answer with the latest data) or Availability (A: answer, possibly with stale data). The choice only bites during a partition; partitions are rare, so most of the time you get both.
CP: HBase, MongoDB (default config), etcd, ZooKeeper.
AP: Cassandra, DynamoDB (tunable), Riak.
PACELC extension: if Partition, trade Availability vs Consistency; Else (no partition), trade Latency vs Consistency.
Indexing — rules
B-tree indexes power range + equality on leading column(s).
Composite index (a, b, c) serves WHERE a=?, WHERE a=? AND b=?, not WHERE b=? alone.
Covering index: include all SELECT columns → "index-only scan", no heap fetch.
Each index costs writes (update on insert/update/delete). Audit unused.
Hash indexes: equality only. GIN/GiST: full-text, JSONB, arrays. BRIN: huge append-only tables whose rows are naturally ordered (e.g. by time).
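The leftmost-prefix and covering rules can be observed with SQLite's `EXPLAIN QUERY PLAN`. SQLite stands in for Postgres here; planners differ, and the exact plan wording varies by SQLite version, so the sketch prints plans rather than claiming their text.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a INT, b INT, c INT, d TEXT)")
db.execute("CREATE INDEX idx_abc ON t (a, b, c)")


def plan(sql: str) -> str:
    # the last column of each EXPLAIN QUERY PLAN row is a human-readable step description
    return " | ".join(str(row[-1]) for row in db.execute("EXPLAIN QUERY PLAN " + sql))


queries = {
    "leading column":        "SELECT * FROM t WHERE a = 1",
    "leading two columns":   "SELECT * FROM t WHERE a = 1 AND b = 2",
    "non-leading column":    "SELECT * FROM t WHERE b = 2",          # index can't be used for seek
    "covering (index-only)": "SELECT b, c FROM t WHERE a = 1",       # no table lookup needed
}
for label, sql in queries.items():
    print(f"{label:<22} {plan(sql)}")
```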
Query optimization workflow
EXPLAIN ANALYZE   -- actually executes the query; wrap writes in BEGIN ... ROLLBACK
SELECT u.id, u.name, COUNT(o.id)
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.country = 'US' AND o.created_at > now() - interval '30 days'
GROUP BY u.id, u.name;
-- look for:
-- Seq Scan on a big table → missing index
-- high "Rows Removed by Filter" → predicate not served by an index
-- Sort spilled to disk ("external merge") → work_mem too low
-- Nested Loop over big row counts → row estimates are off; expected Hash/Merge Join
Connection pooling
Opening Postgres conn ≈ 5–50 ms. Pool to reuse.
Pool size ≈ min((DB server cores × 2) + effective spindles, db max_connections / app instances), a common starting heuristic (HikariCP's sizing guide); measure and adjust.
For serverless / many instances → use PgBouncer in transaction mode.
Always set acquire timeout + max-lifetime to recycle stale conns.
Constraints & transactions
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = $1;
UPDATE accounts SET balance = balance + 100 WHERE id = $2;
INSERT INTO transfers (from_id, to_id, amount) VALUES ($1, $2, 100);
COMMIT;
-- table constraints catch invariants:
CHECK (balance >= 0)
UNIQUE (email)
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE
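The same transfer against SQLite (standing in for Postgres, with `?` placeholders instead of `$1`), showing that a constraint violation anywhere in the transaction rolls back every statement. A sketch; the `amount > 0` check and the function names are added for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL CHECK (balance >= 0));
CREATE TABLE transfers (id INTEGER PRIMARY KEY, from_id INT, to_id INT,
                        amount INT NOT NULL CHECK (amount > 0));
INSERT INTO accounts VALUES (1, 150), (2, 0);
""")


def transfer(from_id: int, to_id: int, amount: int) -> bool:
    try:
        with db:  # connection as context manager: COMMIT on success, ROLLBACK on exception
            db.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, from_id))
            db.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, to_id))
            db.execute("INSERT INTO transfers (from_id, to_id, amount) VALUES (?, ?, ?)",
                       (from_id, to_id, amount))
        return True
    except sqlite3.IntegrityError:   # a CHECK constraint rejected some statement
        return False


def balances() -> dict[int, int]:
    return dict(db.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

A negative amount passes both UPDATEs but fails the INSERT's check, and the rollback also undoes the two balance changes.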
ORMs & migrations
ORMs (Prisma, SQLAlchemy, GORM, Hibernate) trade SQL control for ergonomics.
Watch for N+1: users.all() then user.posts in a loop issues 1 + N queries → use eager loading / a JOIN.
Migrations: forward-only, idempotent, reviewed in code. Tools: flyway, alembic, knex, goose, prisma migrate.
Online schema changes for big tables: gh-ost or pt-online-schema-change (MySQL), or the expand-contract pattern (any DB).
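A sketch of the N+1 pattern with plain SQL on SQLite, counting executed statements via `Connection.set_trace_callback`; an ORM's lazy loading emits the same query shape. Table names and functions are illustrative.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INT REFERENCES users(id), title TEXT);
""")
db.executemany("INSERT INTO users VALUES (?, ?)", [(i, f"user{i}") for i in range(1, 51)])
db.executemany("INSERT INTO posts (user_id, title) VALUES (?, ?)",
               [(i, f"post by {i}") for i in range(1, 51) for _ in range(2)])

statements: list[str] = []
db.set_trace_callback(statements.append)   # record every SQL statement SQLite runs


def n_plus_one() -> dict[str, list[str]]:
    result = {}
    for uid, name in db.execute("SELECT id, name FROM users").fetchall():      # 1 query
        rows = db.execute("SELECT title FROM posts WHERE user_id = ?", (uid,))  # + 1 per user
        result[name] = [title for (title,) in rows]
    return result


def eager_join() -> dict[str, list[str]]:
    result: dict[str, list[str]] = {}
    rows = db.execute("SELECT u.name, p.title FROM users u "
                      "LEFT JOIN posts p ON p.user_id = u.id ORDER BY u.id, p.id")
    for name, title in rows:                                                  # 1 query total
        result.setdefault(name, [])
        if title is not None:
            result[name].append(title)
    return result
```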
Numbers: Postgres generally works best below ~100 active connections (each one is a separate backend process). PgBouncer can multiplex thousands of clients onto that pool.
MODULE 11 — DOMAIN
Business Logic Layer
Where rules live. Independent of HTTP and DB drivers.
Three-layer architecture
┌──────────────────────────────────┐
│ Presentation │ routes, controllers, DTOs,
│ (HTTP / gRPC / CLI) │ validation, serialization
└────────────┬─────────────────────┘
│ calls
┌────────────▼─────────────────────┐
│ Business Logic │ use-case services, domain models,
│ (pure, framework-agnostic) │ rules, invariants, orchestration
└────────────┬─────────────────────┘
│ uses ports
┌────────────▼─────────────────────┐
│ Data Access │ repositories, ORM, SQL, cache
│ (Postgres / Redis / S3 / 3rd p.) │ adapters, external clients
└──────────────────────────────────┘
Why split
Testability — unit-test business rules without spinning up HTTP/DB.
Reuse — same service from REST, gRPC, CLI, cron.
Swap adapters — Postgres → DynamoDB; HTTP → message queue. Only boundary changes.
SOLID applied
Principle               How it shows up
Single responsibility   One service = one use case: RegisterUserService.handle().
Open/closed             New auth provider = new adapter implementing AuthPort; existing code unchanged.
Liskov substitution     Any UserRepository impl must honor the contract (same shapes/errors).
Interface segregation   ReadOnlyUserRepo vs full repo for handlers that only read.
Dependency inversion    Service depends on EmailSenderPort (interface), not SendgridClient (concrete).
Error propagation pattern
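# BLL raises domain errors
class DomainError(Exception): ...
class NotFound(DomainError): ...
class Forbidden(DomainError): ...
class Conflict(DomainError): ...
class ValidationFailed(DomainError): ...
# presentation layer maps to HTTP at the edge
HTTP_STATUS = {NotFound: 404, Forbidden: 403, Conflict: 409, ValidationFailed: 422, DomainError: 500}
def http_status(e: Exception) -> int:
    # walk the MRO so subclasses (e.g. UserNotFound(NotFound)) inherit their parent's status
    for cls in type(e).__mro__:
        if cls in HTTP_STATUS:
            return HTTP_STATUS[cls]
    return 500  # not a domain error at all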
Pro: domain errors carry semantic codes (USER_NOT_FOUND), not HTTP codes. HTTP mapping happens at edge only.
MODULE 12 — SPEED
Caching
Trade staleness for latency. Choose layer, strategy, eviction with intent.
Caching layers
Layer           Latency   Scope        Example
CPU L1/L2/L3    ns        Per-core     -
App in-memory   µs        Per-process  LRU map, Caffeine
Distributed     0.5–2 ms  Cluster      Redis, Memcached
CDN edge        1–30 ms   Region       Cloudflare, CloudFront
Browser         ~0 ms     Per-client   HTTP cache (no network hop)
Strategies
Strategy            Read                                               Write                                      Use
Cache-aside (lazy)  app checks cache → on miss reads DB → fills cache  app writes DB → invalidates cache          Default. Most common.
Read-through        cache library loads from DB on miss                same as cache-aside                        Cleaner code, library-dependent.
Write-through       -                                                  app writes cache → cache writes DB (sync)  Consistent cache, slower writes.
Write-behind        -                                                  app writes cache; DB written async         Fast writes; data-loss risk on crash.
Eviction policies
LRU — discard least-recently used. Good general default.
LFU — least-frequently used. Better for skewed access.
TTL — time-based expiry. Combine with LRU.
FIFO — simple ring; ignores access patterns.
Manual — explicit cache.delete(key) on write.
Event-based — pub/sub invalidates across instances.
Invalidation patterns
# by key
cache.delete(f"user:{id}")
# by tag (app-maintained tag → keys set, Varnish xkey, Fastly surrogate keys); delete_by_tag is illustrative
cache.delete_by_tag(f"user:{id}")
# fan-out (pub/sub)
pubsub.publish("cache:invalidate", {"keys": [f"user:{id}"]})
# versioned key: never need to delete
cache.set(f"user:{id}:v{version}", data, ttl=3600)
# bump version on write → readers switch to the new key; the old key expires via its TTL
Use-case recipes
Hot read DB joins → store materialized join in Redis hash; TTL 60s.
API responses → cache JSON by URL + auth-scope; vary on user.
Session → Redis with TTL = session lifetime; sliding refresh.
Rate limit counters → Redis INCR with EXPIRE.
Idempotency keys → Redis 24h to dedupe POST retries.
Thundering herd: hot key expires → 10k clients hit DB simultaneously. Fix: stale-while-revalidate, request coalescing (singleflight), or jittered TTLs.
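A sketch of request coalescing in the spirit of Go's `golang.org/x/sync/singleflight`, written for asyncio: concurrent misses on one key await a single loader call. `SingleFlight` is a hypothetical name, and it only coalesces within one process; across instances you still need jittered TTLs or stale-while-revalidate.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class SingleFlight:
    """Coalesce concurrent loads of the same key into one in-flight call."""

    def __init__(self) -> None:
        self._inflight: dict[str, asyncio.Future] = {}

    async def do(self, key: str, loader: Callable[[], Awaitable[T]]) -> T:
        existing = self._inflight.get(key)
        if existing is not None:
            # shield: a cancelled waiter must not cancel the shared future
            return await asyncio.shield(existing)
        fut: asyncio.Future = asyncio.get_running_loop().create_future()
        self._inflight[key] = fut
        try:
            result = await loader()
        except asyncio.CancelledError:
            fut.cancel()                   # waiters see cancellation instead of hanging
            raise
        except Exception as exc:
            fut.set_exception(exc)
            fut.exception()                # mark retrieved to avoid "never retrieved" warnings
            raise
        else:
            fut.set_result(result)
            return result
        finally:
            del self._inflight[key]        # the next miss after completion starts a fresh load
```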
MODULE 13 — ASYNC WORK
Queues, Background Jobs, Emails
Don't make user wait. Hand off to workers.
What belongs off request path
Email / SMS / push notifications.
Image / video transcoding, thumbnail generation.
Third-party API calls (especially slow/unreliable ones).
Heavy DB aggregations, report generation.
Webhook delivery to customers (with retries).
Periodic maintenance: backups, cleanups, log rotation.
Architecture
┌─────────┐    enqueue     ┌────────┐    pull     ┌────────┐
│Producer │ ─────────────► │ Broker │ ───────────►│Consumer│
│  (API)  │                │ (Redis,│             │(worker)│
└─────────┘                │  SQS,  │◄─── ack ────└────┬───┘
                           │ Kafka) │                  │
                           └────────┘                  ▼
                                                 side effects:
                                              DB, email, S3, API
Broker comparison
            Redis (BullMQ, Sidekiq)   RabbitMQ              SQS                           Kafka
Model       list / stream             AMQP exchanges        distributed queue             partitioned log
Order       FIFO per queue            FIFO per queue        best-effort; FIFO queue type  FIFO per partition
Durability  RDB/AOF (fsync ~1 s)      persistent queues     multi-AZ                      replicated log
Replay      Streams only              Streams only (3.9+)   -                             full (within retention)
Best for    web apps                  complex routing       AWS-native, simple            event-sourcing, analytics
Job semantics
At-least-once delivery is the realistic default → jobs must be idempotent.
Idempotency key set by the producer (a stable business ID or a hash of the payload) → the consumer dedupes on it.
Retries with exponential backoff + jitter; cap attempts; route exhausted jobs → DLQ.
Dead Letter Queue: failures land here for human inspection / replay.
Visibility timeout (SQS) / ack-window: if consumer crashes mid-job, message re-appears.
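A consumer-side sketch of those semantics: an idempotency-key check, exponential backoff with full jitter, and DLQ routing. The in-memory `processed` set and `dead_letters` list stand in for durable storage; all names are hypothetical.

```python
import random
import time
from typing import Callable, Optional


class TransientError(Exception):
    """A failure worth retrying (timeout, 503, lock contention)."""


def backoff_delays(attempts: int, base_s: float = 0.5, cap_s: float = 30.0,
                   rng: Optional[random.Random] = None) -> list[float]:
    """Full-jitter backoff: before retry n, sleep uniform(0, min(cap, base * 2**n))."""
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap_s, base_s * 2 ** n)) for n in range(attempts - 1)]


processed: set[str] = set()    # stand-in for a durable store of handled idempotency keys
dead_letters: list[dict] = []  # stand-in for the DLQ


def consume(job: dict, handler: Callable[[dict], None], max_attempts: int = 5,
            sleep: Callable[[float], None] = time.sleep) -> str:
    key = job["idempotency_key"]
    if key in processed:
        return "duplicate"                    # redelivered message: safe no-op
    delays = backoff_delays(max_attempts)
    for attempt in range(max_attempts):
        try:
            handler(job)
        except TransientError:
            if attempt < max_attempts - 1:
                sleep(delays[attempt])
            continue
        except Exception:
            break                             # permanent error: retrying won't help
        processed.add(key)                    # mark done only after the side effect succeeded
        return "done"
    dead_letters.append(job)                  # exhausted or permanent: park for inspection / replay
    return "dead-lettered"
```

Marking the key only after the side effect leaves a window where a crash causes a repeat, so the handler itself should still be idempotent (e.g. an upsert keyed on the same ID).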
Chaining & concurrency
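// BullMQ flow
import { FlowProducer } from 'bullmq'
const flow = new FlowProducer({ connection: { host: 'localhost', port: 6379 } })
await flow.add({
  name: 'order-complete',
  queueName: 'orders',
  children: [                    // children are independent jobs and may run in parallel
    { name: 'charge-card',      queueName: 'payments' },
    { name: 'send-receipt',     queueName: 'email' },
    { name: 'update-warehouse', queueName: 'inventory' },
  ],
})
// the parent job is processed only after all children complete successfully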
Transactional email anatomy
Subject: Your order #4582 is confirmed
Preheader: Track shipping below • Need help? Reply to this email.
Body:
Hi Alice, thanks for your order...
[ Track shipment ] ← single CTA
Footer: Unsubscribe • Address (CAN-SPAM)
Templating with merge vars ({{firstName}}); HTML + plaintext multipart.
SPF + DKIM + DMARC on sending domain — else inbox spam.
Track bounce/complaint webhooks; suppress repeats.
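A stdlib sketch of the multipart part: `email.message.EmailMessage` with a plain-text body plus an HTML alternative. `string.Template` stands in for a template engine with `{{firstName}}`-style merge vars; the sender address and function name are hypothetical. Values are HTML-escaped before going into the HTML part.

```python
import html
from email.message import EmailMessage
from string import Template

TEXT_TPL = Template("Hi $first_name, thanks for your order #$order_id.\n"
                    "Track shipment: $track_url\n")
HTML_TPL = Template("<p>Hi $first_name, thanks for your order #$order_id.</p>"
                    '<p><a href="$track_url">Track shipment</a></p>')


def build_receipt(to_addr: str, first_name: str, order_id: int, track_url: str) -> EmailMessage:
    """Transactional email: text/plain part plus a text/html alternative."""
    values = {"first_name": first_name, "order_id": order_id, "track_url": track_url}
    html_values = {k: html.escape(str(v)) for k, v in values.items()}   # never trust merge vars in HTML

    msg = EmailMessage()
    msg["Subject"] = f"Your order #{order_id} is confirmed"
    msg["From"] = "Shop <orders@x.com>"                  # hypothetical sending address
    msg["To"] = to_addr
    msg.set_content(TEXT_TPL.substitute(values))         # plain-text part first
    msg.add_alternative(HTML_TPL.substitute(html_values), subtype="html")  # preferred by HTML clients
    return msg
```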
Scheduling
Cron (system / Kubernetes CronJob) — periodic. Use UTC, deal with DST in app code.
Delayed jobs — schedule for future timestamp; broker handles dispatch.
Distributed lock for "run on exactly one instance": Redis SET key token NX PX <ttl> (release only if the token still matches), or an etcd lease.
Interview lens: "User signs up — what happens after 201?" → enqueue: welcome email, analytics event, CRM upsert, search-index insert. Request returns immediately.
MODULE 14 — SEARCH
Elasticsearch
Inverted index for full-text + analytics at scale.
Internals
Inverted index: term → list of docs containing it. Built from tokenized + analyzed text.
Segment: immutable Lucene chunk on disk. Writes create new segments, which are merged periodically.
Shard: a Lucene index. An ES index is N shards distributed across nodes.
Replica: shard copy for HA + read scaling.
Term frequency (TF) + IDF + length norm = BM25 relevance score (default).
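A toy inverted index with BM25 scoring to make the internals concrete. It assumes a naive lowercase tokenizer and Lucene's default parameters (k1 = 1.2, b = 0.75); `TinyIndex` is hypothetical and ignores segments, shards, and stemming.

```python
import math
import re
from collections import Counter, defaultdict


def analyze(text: str) -> list[str]:
    # Minimal analyzer: lowercase + split on non-alphanumerics (real analyzers also stem, etc.)
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    def __init__(self, k1: float = 1.2, b: float = 0.75) -> None:
        self.k1, self.b = k1, b
        self.postings: dict[str, dict[int, int]] = defaultdict(dict)  # term -> {doc_id: tf}
        self.doc_len: dict[int, int] = {}

    def add(self, doc_id: int, text: str) -> None:
        terms = analyze(text)
        self.doc_len[doc_id] = len(terms)
        for term, tf in Counter(terms).items():
            self.postings[term][doc_id] = tf

    def search(self, query: str) -> list[tuple[int, float]]:
        n = len(self.doc_len)
        if n == 0:
            return []
        avgdl = sum(self.doc_len.values()) / n
        scores: dict[int, float] = defaultdict(float)
        for term in analyze(query):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            idf = math.log(1 + (n - len(docs) + 0.5) / (len(docs) + 0.5))  # rare terms weigh more
            for doc_id, tf in docs.items():
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / avgdl  # length normalization
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```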
Use cases
Type-ahead / autocomplete (edge n-grams or completion suggester).
Full-text product / article search with relevance.
Log analytics — ELK / OpenSearch ingesting JSON logs.
Fuzzy matching (typos, "did you mean").
Aggregations: top-N, time-series buckets, percentiles.
Query patterns
POST /products/_search
{
"query": {
"bool": {
"must": [{ "match": { "title": "wireless headphones" } }],
"filter": [
{ "term": { "category": "audio" } },
{ "range": { "price": { "lte": 200 } } }
],
"should": [
{ "match": { "brand": "sony" } }
]
}
},
"aggs": {
"by_brand": { "terms": { "field": "brand.keyword" } },
"price_p": { "percentiles": { "field": "price" } }
},
"size": 20,
"from": 0
}
With a must clause present, should clauses are optional: a matching brand only raises the score (a boost).
Field mapping rules
Need                             Mapping
Full-text search                 "type": "text" with an analyzer
Exact match / sort / aggregate   "type": "keyword"
Range queries                    integer, long, date, double
Both (common)                    text with a fields.keyword multi-field
Geo                              geo_point for lat/lon
Tuning
Define explicit mappings up front: dynamic mapping can create unbounded new fields (mapping explosion) and guess wrong types.
Use filter context (bool.filter) for yes/no — skips scoring, cacheable.
Shard count: hard to change after creation. Aim ~10–50 GB per shard.
Kibana for ad-hoc exploration; not for prod query path.
Gotcha: ES is near-real-time. The default refresh interval is 1s, so a just-indexed doc may not be searchable yet. Index with ?refresh=wait_for if you must read your own write.
MODULE 15 — FAILURES
Error Handling
Errors are first-class output. Plan them.
Error categories
Type                 When                          Strategy
Syntax               Compile / parse time          Lint + CI catch.
Runtime (transient)  Network blip, DB locked       Retry with backoff + circuit breaker.
Runtime (permanent)  Bad input, missing record     Fail fast, return 4xx.
Logical / business   Insufficient funds, conflict  Domain error → 4xx with code.
System               Out of memory, disk full      Crash + restart + alert.
Strategies
Fail fast — invalid input rejected at edge; cheaper than half-applied work.
Fail safe — on unknown error, deny access (auth failures); for non-critical features, degrade.
Graceful degradation — recs unavailable? Show empty section, not 500.
Circuit breaker — after N failures to dependency, open circuit for cool-down. Prevents cascading.
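A single-threaded circuit-breaker sketch with closed → open → half-open states and an injectable clock; thresholds and names are hypothetical, and a production version would also limit concurrent trial calls while half-open.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpen(Exception):
    pass


class CircuitBreaker:
    """closed → (N consecutive failures) → open → (cool-down) → half-open → trial call."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: float | None = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_after_s:
            return "half-open"
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpen("dependency unavailable; failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # (re)open and restart the cool-down
            raise
        self.failures = 0
        self.opened_at = None                   # success closes the circuit
        return result
```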
Custom error types
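class AppError(Exception):
    status: int = 500
    code: str = "INTERNAL"

    def __init__(self, message: str, cause: Exception | None = None):
        super().__init__(message)
        self.message = message
        self.cause = cause

class NotFound(AppError):
    status = 404
    code = "NOT_FOUND"

class RateLimited(AppError):
    status = 429
    code = "RATE_LIMITED"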
Global handler
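import logging
from flask import g, jsonify
from werkzeug.exceptions import HTTPException
log = logging.getLogger(__name__)
@app.errorhandler(Exception)
def handle(e):
    request_id = g.get("request_id")
    if isinstance(e, HTTPException):
        return e  # routing 404/405 etc.: let Flask render them, not a 500
    if isinstance(e, AppError):
        log.warning("app_error", extra={"code": e.code, "rid": request_id})
        return jsonify(error=e.code, message=e.message, requestId=request_id), e.status
    log.error("unhandled_error", extra={"rid": request_id}, exc_info=True)  # full trace to logs only
    return jsonify(error="INTERNAL", message="something went wrong", requestId=request_id), 500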
User-facing error response
{
"error": "INSUFFICIENT_FUNDS",
"message": "Balance $20.00 below $50.00 needed.",
"requestId": "7f3a-1c2b",
"docsUrl": "https://x.com/docs/errors#insufficient_funds"
}
Monitoring & alerting
Sentry / Bugsnag / Rollbar — exception capture with stack + breadcrumbs.
ELK / Loki / Datadog — log aggregation, search by request-id.
PagerDuty / Opsgenie — paging for SLO violations.
Alert on symptoms (latency, error rate) not causes (CPU). Causes are runbook context.
MODULE 16 — CONFIG
Config Management
Separate config from code. Environment-aware. Secrets isolated.
Config types
Type Examples Where
Static retry counts, page size, timeouts YAML / JSON in repo
Environment-specific DB URL, Redis host, log level env vars / per-env file
Sensitive API keys, signing secrets, DB creds secret manager (Vault, AWS SM, GCP SM, sops)
Dynamic feature flags, kill switches LaunchDarkly, Unleash, Flagsmith, ConfigCat
Precedence (12-factor)
defaults (in code)
↓ overridden by
config file (config.yaml)
↓ overridden by
env vars (DATABASE_URL=...)
↓ overridden by
command-line flags (--port 8080)
.env workflow
# .env (NEVER commit)
DATABASE_URL=postgres://app:secret@localhost/app
JWT_SECRET=hunter2
# .env.example (commit this)
DATABASE_URL=
JWT_SECRET=
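# loading (python-dotenv)
import os
from dotenv import load_dotenv
load_dotenv()  # reads .env into os.environ; does not override vars already set
db_url = os.environ["DATABASE_URL"]  # KeyError at startup if missing: crash fast
log_level = os.environ.get("LOG_LEVEL", "info")  # optional, with a default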
Feature flags
if flags.enabled("new-checkout", user_id=user.id):
return new_checkout_flow(order)
return legacy_checkout_flow(order)
# rollout patterns:
# percentage: 10% of users
# targeting: users in cohort "beta"
# kill switch: instantly disable broken feature without redeploy
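Percentage rollouts are commonly implemented by hashing a stable key into a fixed range of buckets. The same user then always gets the same answer, and raising the percentage only adds users. The sketch below shows that technique and is not any vendor's SDK. The flag name is part of the hash, so different flags pick different 10% slices.

```python
import hashlib

def in_rollout(flag: str, user_id: str, percent: float) -> bool:
    """True if user_id falls inside the first `percent`% of buckets for this flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000   # stable bucket 0..9999
    return bucket < percent * 100                          # percent=10 -> buckets 0..999
```

Because the bucket never changes, going from 10% to 25% keeps every user who was already in the rollout.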
Secret rotation
Read secret at startup; cache. Reload on SIGHUP or scheduled refresh.
Support two valid versions during rotation (overlap window).
Audit-log every secret access.
Never: hardcode prod secrets; commit .env; print env in logs; ship secrets in container images.
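The overlap window means verifiers accept every currently valid secret while signers use only the newest. The sketch below applies this to HMAC-signed payloads; the use case and function names are assumptions. Rotation then has three steps:

1. Deploy `[new, old]` to verifiers.
2. Switch signers to `new`.
3. Drop `old` after the window.

```python
import hashlib
import hmac

def sign(secret: bytes, payload: bytes) -> str:
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def verify(payload: bytes, signature: str, secrets: list[bytes]) -> bool:
    """Accept a signature made with any currently valid secret (newest first)."""
    # each comparison is constant-time; `any` stops at the first match
    return any(hmac.compare_digest(sign(s, payload), signature) for s in secrets)
```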
MODULE 17 — OBSERVABILITY
Logging, Monitoring, Tracing
Three pillars: logs (events), metrics (aggregates), traces (causal chains).
Logs
Levels
Level Use Alert?
DEBUG Dev troubleshooting No
INFO Lifecycle: startup, shutdown, important business events No
WARN Recoverable, degraded Trend
ERROR Failed request, exception Yes if rate spikes
FATAL Process can't continue Page
Structured logging
# DON'T
log.info(f"user {uid} placed order {oid} for ${amt}")
# DO — JSON keys are queryable
log.info("order_placed", extra={
"user_id": uid,
"order_id": oid,
"amount": amt,
"request_id": rid,
"trace_id": tid,
})
What NOT to log
Passwords, tokens, full credit card numbers, PII without need.
Full request body on auth routes (passwords in body).
Stack traces in anything user-visible (responses, client-side logs); they leak internals. Keep them in server-side logs.
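One way to enforce the list above is a logging filter that scrubs records before any handler writes them. In this sketch, the key list and the card-number regex are illustrative assumptions, and both will miss things. Allow-listing fields is stricter than deny-listing. Attach the filter to handlers: a filter on a logger does not see records propagated from child loggers.

```python
import logging
import re

SENSITIVE_KEYS = {"password", "token", "authorization", "card_number", "secret"}  # assumed
_CARD_LIKE = re.compile(r"\b\d{13,19}\b")   # crude: bare 13-19 digit runs

class RedactFilter(logging.Filter):
    """Mask sensitive `extra=` fields and card-like numbers in the message."""
    def filter(self, record: logging.LogRecord) -> bool:
        for key in vars(record).keys() & SENSITIVE_KEYS:
            setattr(record, key, "[REDACTED]")
        # render msg % args first so values passed as args are scrubbed too
        record.msg = _CARD_LIKE.sub("[REDACTED]", record.getMessage())
        record.args = ()
        return True   # keep the record, just scrubbed
```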
Rotation & retention
Rotate by size or daily; compress; ship to central store.
Retention by class: access logs 30d, audit 1–7y depending on compliance.
Metrics
Type Use Example
Counter Monotonic count http_requests_total{route="/users",code="200"}
Gauge Point-in-time value db_pool_in_use, queue_depth
Histogram Distribution http_duration_seconds_bucket
Summary Pre-computed quantiles p50 / p95 / p99 latency
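Histograms export cumulative bucket counters (Prometheus labels them `le`, "less than or equal"). These can be summed across instances before a quantile is estimated, whereas summaries' pre-computed quantiles cannot be meaningfully averaged. The sketch below shows the estimate, roughly how PromQL's `histogram_quantile()` interpolates linearly inside the bucket that contains the requested rank. It assumes non-negative observations and skips the real function's edge-case handling.

```python
import math

def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets are sorted by bound and the last bound is math.inf (le="+Inf").
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total                        # observations at or below the answer
    lower_bound, lower_count = 0.0, 0       # assumes non-negative values (latencies)
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound          # can't interpolate into +Inf: highest finite bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            # assume observations are spread evenly inside the bucket
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, count
    return math.nan
```

The estimate is only as precise as the bucket layout, so choose bounds around the latencies your SLO cares about.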
RED / USE / Four Golden Signals
RED (request-oriented): Rate, Errors, Duration.
USE (resource-oriented): Utilization, Saturation, Errors.
Four golden signals: latency, traffic, errors, saturation.
Tracing
trace_id: 4f3a... (one per request, propagated across services)
├─ span A: api-gateway (1.2 ms)
├─ span B: auth-service (3.0 ms)
└─ span C: user-service (15.4 ms)
├─ span D: postgres (8.1 ms)
└─ span E: redis (0.3 ms)
Use OpenTelemetry SDK to emit spans; export to Jaeger, Tempo, Honeycomb.
W3C traceparent header propagates across HTTP / gRPC / queues.
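Sample (head/tail) — tracing 100% of requests is usually too expensive. Head-based sampling decides when a trace starts; tail-based sampling decides after it completes, so it can keep every error and the slow tail.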
Alert hygiene: page only what human must act on now. Everything else → ticket. Alert fatigue = missed incidents.
MODULE 18 — SHUTDOWN
Graceful Shutdown
Stop without dropping in-flight work.
Signals
Signal Number Catchable Sender
SIGTERM 15 ✓ k8s/systemd normal stop
SIGINT 2 ✓ Ctrl-C
SIGHUP 1 ✓ Reload config (convention)
SIGKILL 9 ✗ Force kill — no chance to clean up
Shutdown sequence
Mark unhealthy — readiness probe returns 503. LB stops sending new traffic.
Drain — keep serving in-flight requests; reject new ones (or 503).
Wait grace period — typically 10–30s.
Close external resources — DB pool drain, flush log buffers, close file handles, ack pending queue messages.
Exit 0.
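The first three steps can be sketched as a small in-flight counter. `Drainer` and its methods are hypothetical names:

- Request handlers call `try_begin()` and `end()`.
- The readiness probe reports `ready`.
- After SIGTERM, the main thread calls `shutdown(grace)`, then closes pools and flushes logs.

Servers such as gunicorn, or Go's `srv.Shutdown` below, implement the HTTP part of this themselves.

```python
import threading

class Drainer:
    """Tracks in-flight requests so shutdown can wait for them."""
    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self.ready = True              # what the readiness probe reports

    def try_begin(self) -> bool:
        with self._cond:
            if not self.ready:
                return False           # draining: caller should answer 503
            self._in_flight += 1
            return True

    def end(self) -> None:
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def shutdown(self, grace: float) -> bool:
        """Mark unhealthy, refuse new work, wait up to `grace` s; True if fully drained."""
        with self._cond:
            self.ready = False
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=grace)
```

If `shutdown` returns False the grace period ran out with work still in flight. Log that before exiting, because the platform's SIGKILL is next.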
Pattern (Go)
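srv := &http.Server{Addr: ":8080", Handler: mux}
go func() {
	// ErrServerClosed is what ListenAndServe returns after Shutdown: not a failure
	if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
		log.Fatal(err)
	}
}()
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)
<-sigCh
// keep below k8s terminationGracePeriodSeconds (default 30s)
ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
defer cancel()
if err := srv.Shutdown(ctx); err != nil { // stop accepting, wait for in-flight
	log.Printf("forced shutdown: %v", err)
}
db.Close()    // close pools only after in-flight requests have finished
logger.Sync() // flush buffered logs (e.g. a zap logger; stdlib log has no Sync)
// return from main (exit 0): os.Exit would skip deferred calls such as cancel()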
Kubernetes specifics
K8s sends SIGTERM, waits terminationGracePeriodSeconds (default 30), then SIGKILL.
Add a preStop hook that sleeps a few seconds; it gives the LB time to remove the pod before SIGTERM arrives.
Pod removal from Service endpoints is eventually consistent — that sleep helps.
MODULE 19 — PERFORMANCE
Scaling, Performance, Concurrency
Find bottleneck → fix smallest → measure → repeat.
Find bottleneck
Measure first. Profile (flame graph) before optimizing.
Response time breakdown: wait queue + compute + DB + downstream + serialization.
USE method: utilization, saturation, errors per resource (CPU, RAM, disk, net).
Top tools: pprof (Go), cProfile/py-spy (Python), Chrome DevTools (Node), async-profiler (Java).
DB optimization
N+1 — fetching N children individually after listing parents. Fix: eager-load / JOIN / dataloader pattern.
Indexes for read-heavy paths; check query plans with EXPLAIN ANALYZE on realistic data volumes (planners choose differently on tiny tables).
Batching — replace per-item INSERTs with bulk insert; reduce round-trips.
Read replicas — fan reads off primary; mind replication lag.
Sharding / partitioning when single node hits write ceiling.
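The N+1 shape and its batched fix, sketched with the standard-library `sqlite3` module and a made-up users/orders schema. The batched version is the dataloader idea: collect the parent IDs, then fetch all children with one `IN (...)` query. ORMs expose the same thing as eager loading (for example SQLAlchemy's `selectinload`).

```python
import sqlite3

def setup() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
        CREATE INDEX orders_user_id ON orders(user_id);
    """)
    db.executemany("INSERT INTO users VALUES (?, ?)", [(i, f"user{i}") for i in range(1, 4)])
    db.executemany("INSERT INTO orders (user_id, total) VALUES (?, ?)",
                   [(u, 10.0 * n) for u in range(1, 4) for n in range(1, 3)])
    db.commit()
    return db

def orders_n_plus_1(db: sqlite3.Connection) -> dict[str, list[float]]:
    users = db.execute("SELECT id, name FROM users ORDER BY id").fetchall()  # 1 query
    return {
        name: [t for (t,) in db.execute(
            "SELECT total FROM orders WHERE user_id = ? ORDER BY id", (uid,))]  # +1 per user
        for uid, name in users
    }

def orders_batched(db: sqlite3.Connection) -> dict[str, list[float]]:
    users = dict(db.execute("SELECT id, name FROM users ORDER BY id").fetchall())  # 1 query
    result: dict[str, list[float]] = {name: [] for name in users.values()}
    marks = ",".join("?" * len(users))      # "?,?,?" placeholders, values bound safely
    rows = db.execute(
        f"SELECT user_id, total FROM orders WHERE user_id IN ({marks}) ORDER BY id",
        list(users))                        # 1 query for all children
    for uid, total in rows:
        result[users[uid]].append(total)
    return result
```

Both return the same data. Counting statements (e.g. with `Connection.set_trace_callback`) shows 1 + N for the first and 2 for the second, and the gap grows with N.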
App-level
Compress payloads (gzip/br) — repetitive JSON often shrinks several-fold; the ratio depends on the payload.
Close file handles / connections — defer/finally/using/context-manager.
Avoid loading whole files in memory — stream.
Cache expensive computations (memoize). Beware staleness.
Graceful degradation under load — shed non-critical features (recs, analytics).
Concurrency vs parallelism
Concurrency Parallelism
What Multiple tasks interleaved on one core Multiple tasks on multiple cores
Wins on I/O-bound (DB, HTTP, file) CPU-bound (encoding, math)
Primitives async/await, goroutines, threads process pools, worker threads
Python asyncio, aiohttp multiprocessing (the GIL stops threads from running Python code in parallel)
Node event loop (default) worker_threads, cluster
Go goroutines GOMAXPROCS = #CPUs
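The concurrency column in miniature. With `asyncio`, waiting tasks overlap on one thread, so five 50 ms waits take about 50 ms when gathered instead of about 250 ms in sequence. `asyncio.sleep` stands in for real I/O here (an assumption). CPU-bound work would not speed up this way. That is the parallelism column (e.g. `concurrent.futures.ProcessPoolExecutor`).

```python
import asyncio
import time

async def fake_io(i: int, delay: float) -> int:
    await asyncio.sleep(delay)   # stands in for a DB/HTTP call; yields to the event loop
    return i

async def sequential(n: int, delay: float) -> list[int]:
    return [await fake_io(i, delay) for i in range(n)]   # each wait blocks the next

async def concurrent(n: int, delay: float) -> list[int]:
    return list(await asyncio.gather(*(fake_io(i, delay) for i in range(n))))  # waits overlap

def timed(coro) -> tuple[list[int], float]:
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start
```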
Scaling axes
Vertical — bigger box. Cheap until ceiling.
Horizontal — more boxes behind LB. Requires stateless app.
Functional — split monolith into services by domain.
Data — read replicas, sharding, CQRS, event sourcing.
Latency numbers (rough): L1 ref 1 ns · main memory 100 ns · SSD 1 MB sequential read ≈ 1 ms (~1 GB/s) · LAN RTT 0.5 ms · same-region cloud 1 ms · cross-region 50–150 ms · disk seek 10 ms.
MODULE 20 — INTEGRATIONS
Advanced Integrations
Big files, real-time, push patterns.
Object storage (S3 et al.)
Direct upload via pre-signed URLs — client uploads to S3 directly. Server never sees bytes.
Multipart upload for files > 100 MB: split into ≥ 5 MB parts; parallel; resumable.
Streaming — pipe S3 → response without buffering full file in memory.
Lifecycle: transition to IA/Glacier; expire old objects automatically.
Versioning + MFA delete for compliance buckets.
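# pre-signed PUT (boto3)
import uuid
import boto3
s3 = boto3.client("s3")
key = f"users/{uid}/{uuid.uuid4()}.jpg"  # uid: the authenticated user's id
url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "uploads", "Key": key, "ContentType": "image/jpeg"},
    ExpiresIn=900,  # seconds (15 min)
)
# client then (Content-Type must match the signed value):
# curl -X PUT "$url" -H "Content-Type: image/jpeg" --data-binary @file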
Real-time
WebSockets Server-Sent Events (SSE) Long polling
Direction Bidirectional Server → client Client polls; server holds
Transport WS upgrade over TCP HTTP keep-alive + text/event-stream HTTP
Reconnect App-level Built-in (Last-Event-ID) Per-poll
Use Chat, games, collab Notifications, dashboards Legacy/fallback
Pub/Sub architecture
Producers publish events to topics; subscribers consume independently.
Decouples services — publisher doesn't know who listens.
Brokers: Redis Pub/Sub (fire-and-forget), Kafka (replayable), GCP Pub/Sub, NATS.
Webhooks (server-initiated)
Polling Webhook
Initiator Consumer Producer
Latency Interval Near-instant
Cost Wasted polls Pay per event
Reliability Easy (idempotent reads) Hard (retries, DLQ, signatures)
Outbound webhook checklist
HTTPS only. Sign payload (HMAC-SHA256) — receiver verifies.
Include unique event ID + timestamp (replay protection).
Retry with exponential backoff (e.g., 1m, 5m, 30m, 6h, 24h); DLQ after.
Expose dashboard so customers can see deliveries, replay manually.
Test locally with ngrok / cloudflared tunnels.
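import hashlib, hmac, time
# signing: cover timestamp + body, so an old timestamp can't be reused on replay
ts = str(int(time.time()))
sig = hmac.new(secret, ts.encode() + b"." + body, hashlib.sha256).hexdigest()  # secret, body: bytes
headers = {
    "X-Webhook-Signature": f"sha256={sig}",
    "X-Webhook-Timestamp": ts,
    "X-Webhook-Id": event_id,
}
# verifying: recompute over the raw received body; constant-time compare!
ts = headers["X-Webhook-Timestamp"]
expected = "sha256=" + hmac.new(secret, ts.encode() + b"." + body, hashlib.sha256).hexdigest()
valid = (hmac.compare_digest(expected, headers["X-Webhook-Signature"])
         and abs(time.time() - int(ts)) < 300)  # reject deliveries older than 5 min
# then drop duplicates by X-Webhook-Id (retries redeliver the same event)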
MODULE 21 — CONTRACT
OpenAPI Standards
Spec-first APIs: describe → generate clients/servers/docs/tests.
Why API-first
Single source of truth — frontend, backend, partners all consume same spec.
Parallel work — UI mocks against spec while server is built.
Auto-generated clients (TS, Python, Java, Go) — no hand-written HTTP.
Auto-generated server stubs and request validators.
Diffable in code review — breaking changes are visible.
Evolution
Swagger 2.0 (2014) → OpenAPI 3.0 (2017) → 3.1 (2021, aligned with JSON Schema 2020-12).
Tools: Swagger UI, Redoc (docs), Postman/Insomnia (test), oapi-codegen / openapi-typescript / openapi-python-client (codegen).
Document anatomy
openapi: 3.1.0
info:
  title: Orders API
  version: 1.4.0
servers:
  - url: https://api.x.com/v1
paths:
  /orders/{id}:
    get:
      operationId: getOrder
      parameters:
        - name: id
          in: path
          required: true
          schema: { type: string, format: uuid }
      responses:
        '200':
          description: OK
          content:
            application/json:
              schema: { $ref: '#/components/schemas/Order' }
        '404': { $ref: '#/components/responses/NotFound' }
      security:
        - bearerAuth: []
components:
  schemas:
    Order:
      type: object
      required: [id, status, total]
      properties:
        id: { type: string, format: uuid }
        status: { type: string, enum: [pending, paid, shipped] }
        total: { type: number, minimum: 0 }
    Error:
      type: object
      required: [error, message]
      properties:
        error: { type: string }
        message: { type: string }
        requestId: { type: string }
  responses:
    NotFound:
      description: Resource not found
      content:
        application/json:
          schema: { $ref: '#/components/schemas/Error' }
  securitySchemes:
    bearerAuth:
      type: http
      scheme: bearer
      bearerFormat: JWT
Best practices
Keep spec in repo next to code. Lint with Spectral.
CI step: regenerate clients on every change; fail PR if drift.
Use operationId consistently — codegen names methods from it.
Define error schemas once in components/responses; reference from each endpoint.
Version via URL segment (/v1); bump on breaking change.
MODULE 22 — DELIVERY
Testing, Code Quality, DevOps
Ship safely. Verify automatically. Operate continuously.
Test types & pyramid
Type Scope Speed Quantity
Unit Pure function / class ms Many (base of pyramid)
Integration Module + real DB / queue sec Some
Contract Service boundary (consumer-driven) sec Per-pair
End-to-end Full user flow through UI min Few (top of pyramid)
Load / stress System under traffic min–hr Pre-release
UAT Real users on staging days Per release
Security SAST, DAST, dep scan, pentest — Continuous + pre-release
TDD cycle
Red — write failing test for desired behavior.
Green — minimum code to pass.
Refactor — clean up; tests still pass.
CI/CD pipeline shape
push / PR ─► lint ─► unit ─► build ─► integration ─► sec-scan ─► sign image ─► deploy to staging ─► smoke tests ─► (manual approval?) ─► deploy to prod ─► rollout watch (auto-rollback on SLO breach)
Code quality
Lint — eslint, ruff, golangci-lint, rubocop. Run pre-commit + CI.
Format — prettier, black, gofmt. Don't argue style; let tools settle it.
Cyclomatic complexity — keep functions < 10–15; split when over.
Coverage — useful as floor (e.g., 70%), useless as gospel.
Mutation testing — injects small code changes (e.g., flips > to >=) and checks that some test fails for each one.
12-Factor App
Codebase: one app, one repo, many deploys.
Dependencies: explicit, isolated (lockfile).
Config: in environment, never in code.
Backing services: treat as attached resources (DB, queue swap by URL).
Build → Release → Run: strict separation.
Processes: stateless, share-nothing.
Port binding: app is self-contained and serves HTTP by binding to a port itself.
Concurrency: scale via process model.
Disposability: fast startup, graceful shutdown.
Dev/prod parity: same OS, services, data shape.
Logs: stream to stdout; let platform aggregate.
Admin processes: one-off scripts run in same env.
DevOps stack
Layer Tools
IaC Terraform, Pulumi, CDK, CloudFormation
Containers Docker / OCI, BuildKit, buildpacks
Orchestration Kubernetes, ECS, Nomad
CI/CD GitHub Actions, GitLab CI, ArgoCD, Flux
Secrets Vault, AWS SM, sops, sealed-secrets
Observability Prometheus, Grafana, Loki, Jaeger, Datadog
Deployment strategies
Strategy How Rollback Risk
Recreate Stop v1, start v2 Stop v2, start v1 Downtime — small apps only.
Rolling Replace pods batch-by-batch Roll back same way Mixed versions during rollout.
Blue/Green Stand up v2 fully; flip LB Flip LB back to v1 2× capacity briefly.
Canary v2 to 1%/5%/25%/100% Halt rollout, drain canary Need automated metrics gate.
Feature flag Deploy dark; toggle on for cohort Toggle off — no redeploy Flag-debt if not cleaned up.
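The "automated metrics gate" for canaries can be sketched as a comparison of canary against baseline over the same window. All thresholds here are illustrative assumptions: 500 requests minimum, 1.5× error rate, 1.2× p99, and a 0.1% error-rate floor. Tools such as Argo Rollouts or Flagger run this kind of analysis against real metric queries.

```python
from dataclasses import dataclass

@dataclass
class Window:
    requests: int
    errors: int
    p99_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

def canary_verdict(baseline: Window, canary: Window, min_requests: int = 500,
                   max_error_ratio: float = 1.5, max_p99_ratio: float = 1.2) -> str:
    """Return 'wait', 'rollback', or 'promote' for the current rollout step."""
    if canary.requests < min_requests:
        return "wait"                      # not enough traffic to judge yet
    # floor keeps a zero-error baseline from making any single canary error fatal
    if canary.error_rate > max(baseline.error_rate, 0.001) * max_error_ratio:
        return "rollback"
    if canary.p99_ms > baseline.p99_ms * max_p99_ratio:
        return "rollback"
    return "promote"                       # advance to the next traffic percentage
```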
Horizontal vs vertical scaling
Vertical — bigger instance. Fast win, hits ceiling, single point of failure.
Horizontal — more instances + LB. Requires stateless app + shared DB/cache. Near-linear scaling until DB becomes bottleneck.
Interview lens: "How do you deploy safely?" → small batches, automated gates (error rate, p99), canary with auto-rollback on SLO breach, feature flags to decouple deploy from release.