Backend Engineering — First Principles Notebook

MODULE 01 — FOUNDATIONS

Request Flow End-to-End

Browser → DNS → TCP → TLS → HTTP → LB → server → response. Every hop matters.

The full path

browser
  │
  ├─ DNS lookup (recursive resolver → root → TLD → authoritative)
  │
  ├─ TCP 3-way handshake (SYN → SYN-ACK → ACK)          [1 RTT]
  ├─ TLS 1.3 handshake (ClientHello → cert → finished)  [1 RTT]
  │
  ├─ HTTP request bytes ──► public internet ──► ISP ──► transit ──► cloud edge
  │                                                                     │
  │                                                                     ▼
  │                                                    CDN / Cloudflare / AWS edge
  │                                                                     │
  │                                                                     ▼
  │                                                          Load Balancer (L7)
  │                                                                     │
  │                                                                     ▼
  │                                                          Application server
  │                                                          (routing → middleware
  │                                                           → controller → service
  │                                                           → DB / cache / queue)
  │                                                                     │
  ◄────────────────── HTTP response (status, headers, body) ◄───────────┘

Hops & what each does

Hop	Layer	Job
DNS	App	Resolve `api.example.com` → IP. Cached at OS, browser, resolver.
Firewall / NAT	L3/L4	SNAT private → public IP. Drops disallowed traffic.
CDN edge	L7	Serve static/cached. Terminate TLS close to user.
WAF	L7	OWASP rules: block SQLi/XSS patterns, geo blocks.
Load balancer	L4 / L7	Spread across servers. Health-check pools. Sticky or stateless.
API gateway	L7	Auth, rate limit, transform, route to upstream services.
App server	L7	Run your code: middleware chain → handler → response.

The response: structure

HTTP/2 200
content-type: application/json; charset=utf-8
content-length: 47
cache-control: private, max-age=60
x-request-id: 7f3a-1c2b
date: Mon, 11 May 2026 09:14:22 GMT

{"id": 42, "name": "alice", "email": "a@x.com"}

MODULE 02 — PROTOCOL

HTTP Protocol

Message structure, headers, methods, CORS, status codes, caching, versions, TLS.

Raw message format

# request
POST /api/users HTTP/1.1
Host: api.example.com
Content-Type: application/json
Authorization: Bearer eyJ...
Content-Length: 40

{"email":"a@x.com","password":"hunter2"}

# response
HTTP/1.1 201 Created
Content-Type: application/json
Location: /api/users/42

{"id":42,"email":"a@x.com"}

Header families

Family	Examples	Purpose
Request	`Host`, `User-Agent`, `Accept`, `Authorization`	Describe sender + intent
Representational	`Content-Type`, `Content-Encoding`, `Content-Length`, `ETag`	Describe body bytes
General	`Date`, `Connection`, `Cache-Control`, `Via`	Apply both directions
Security	`Strict-Transport-Security`, `X-Frame-Options`, `Content-Security-Policy`, `X-Content-Type-Options`	Browser hardening

Methods & semantics

Method	Safe	Idempotent	Body	Use
`GET`	✓	✓	—	Read resource
`HEAD`	✓	✓	—	Headers only (existence/size check)
`OPTIONS`	✓	✓	—	CORS pre-flight, capability discovery
`POST`	✗	✗	✓	Create / non-idempotent action
`PUT`	✗	✓	✓	Full replace at known URI
`PATCH`	✗	✗*	✓	Partial update (*idempotent w/ JSON Merge Patch)
`DELETE`	✗	✓	—	Remove resource

CORS — Cross-Origin Resource Sharing

Browser enforces same-origin policy. Server opts other origins in via headers.

Simple request

Methods GET/HEAD/POST only.
Only "safelisted" headers: Accept, Accept-Language, Content-Language, and Content-Type limited to text/plain, application/x-www-form-urlencoded, or multipart/form-data.
Browser sends directly with Origin: https://app.x.com. Server returns Access-Control-Allow-Origin.

Pre-flight

# browser sends first:
OPTIONS /api/users HTTP/1.1
Origin: https://app.x.com
Access-Control-Request-Method: PUT
Access-Control-Request-Headers: authorization, content-type

# server replies:
HTTP/1.1 204 No Content
Access-Control-Allow-Origin: https://app.x.com
Access-Control-Allow-Methods: GET, POST, PUT, DELETE
Access-Control-Allow-Headers: authorization, content-type
Access-Control-Allow-Credentials: true
Access-Control-Max-Age: 86400
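
The server side of the pre-flight exchange above can be sketched as a pure function. This is a minimal illustration, not a framework API: the allow-lists and the name `preflight_headers` are assumptions made for the example.

```python
from typing import Dict, Optional

# Assumed allow-lists for illustration; real values come from your config.
ALLOWED_ORIGINS = {"https://app.x.com"}
ALLOWED_METHODS = {"GET", "POST", "PUT", "DELETE"}
ALLOWED_HEADERS = {"authorization", "content-type"}


def preflight_headers(origin: str, method: str, requested_headers: str) -> Optional[Dict[str, str]]:
    """Return CORS headers for an OPTIONS pre-flight, or None to refuse it."""
    wanted = {h.strip().lower() for h in requested_headers.split(",") if h.strip()}
    if origin not in ALLOWED_ORIGINS:
        return None
    if method.upper() not in ALLOWED_METHODS or not wanted <= ALLOWED_HEADERS:
        return None
    return {
        # Echo the specific origin: "*" cannot be combined with credentials.
        "Access-Control-Allow-Origin": origin,
        "Access-Control-Allow-Methods": ", ".join(sorted(ALLOWED_METHODS)),
        "Access-Control-Allow-Headers": ", ".join(sorted(ALLOWED_HEADERS)),
        "Access-Control-Allow-Credentials": "true",
        "Access-Control-Max-Age": "86400",
        # The response differs per origin, so shared caches must key on it.
        "Vary": "Origin",
    }
```

Returning None lets the caller answer without any CORS headers, which is what makes the browser block the cross-origin call.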

Status codes — ones that matter

Range	Code · meaning
2xx	`200` OK · `201` Created · `202` Accepted (async) · `204` No Content · `206` Partial (range)
3xx	`301` Moved Permanent · `302` Found · `304` Not Modified (ETag hit) · `307/308` preserve method on redirect
4xx	`400` Bad Request · `401` Unauthorized (= unauthenticated) · `403` Forbidden · `404` Not Found · `409` Conflict · `422` Unprocessable · `429` Too Many Requests
5xx	`500` Internal Error · `502` Bad Gateway · `503` Service Unavailable · `504` Gateway Timeout

Caching: ETag vs max-age

# first response carries ETag
HTTP/1.1 200 OK
ETag: "v1-7f3a"
Cache-Control: max-age=60, must-revalidate

# subsequent request — conditional
GET /api/users/42
If-None-Match: "v1-7f3a"

# server unchanged → no body
HTTP/1.1 304 Not Modified
ETag: "v1-7f3a"

Validators: ETag (strong = byte-exact; a W/ prefix marks a weak one), Last-Modified (one-second precision, so treated as weak by default).
Freshness: max-age=N = fresh for N sec, no server hit.
private = only end-user cache. public = CDN can cache.
no-store = never persist (sensitive data). no-cache = revalidate every time.
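
The conditional-request flow above, sketched as a function. Assumptions for the example: the ETag is derived from a SHA-256 of the body, and the names `make_etag` and `conditional_get` are made up here.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    """Derive a quoted ETag from the body bytes (assumed scheme: truncated SHA-256)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def conditional_get(body: bytes, if_none_match: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload); 304 with no body when the client copy is current."""
    etag = make_etag(body)
    headers = {"ETag": etag, "Cache-Control": "max-age=60, must-revalidate"}
    if if_none_match:
        candidates = set()
        for tag in if_none_match.split(","):
            tag = tag.strip()
            if tag.startswith("W/"):      # If-None-Match compares weakly
                tag = tag[2:]
            candidates.add(tag)
        if "*" in candidates or etag in candidates:
            return 304, headers, b""
    return 200, headers, body
```

The client saves bandwidth on a 304, but the server still had to produce or look up the current representation to compute the tag.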

HTTP versions

Version	Transport	Multiplexing	Head-of-line	Header compression
HTTP/1.1	TCP, text framing	One request at a time per conn (pipelining unreliable in practice)	App-layer	None
HTTP/2	TCP, binary frames	Streams over 1 conn	TCP-level still blocks	HPACK
HTTP/3	QUIC over UDP	Independent streams	None (per-stream loss)	QPACK

Content negotiation & compression

Accept: application/json;q=0.9, application/xml;q=0.5
Accept-Encoding: gzip, br, zstd
Accept-Language: en-US, en;q=0.8

# server picks best match, replies:
Content-Type: application/json
Content-Encoding: br

TLS / HTTPS

TLS provides confidentiality (encryption), integrity (MAC), authentication (cert chain).
TLS 1.3 handshake: 1 RTT (vs 2 in 1.2). 0-RTT on session resumption, but 0-RTT data is replayable, so restrict it to idempotent requests.
Cert chain: leaf → intermediate → root (trust anchor in OS).
SNI lets one IP host many TLS sites (sends hostname in ClientHello).
HSTS header Strict-Transport-Security: max-age=31536000; includeSubDomains; preload forces HTTPS for one year.

Gotcha: 401 Unauthorized is misnomer — means unauthenticated. Use 403 when caller is authenticated but lacks permission.

MODULE 03 — DISPATCH

Routing

URL → handler. Method-aware. Versioned. Grouped. Fast.

Route components

GET /api/v1/users/:userId/posts?status=published&limit=20
     │   │   │     │           │
     │   │   │     │           └─ query params (filters, paging)
     │   │   │     └─ path param (resource id)
     │   │   └─ resource (collection)
     │   └─ version
     └─ namespace

Route types

Type	Example	Notes
Static	`/health`	O(1) hash lookup possible.
Dynamic	`/users/:id`	Param capture. Most frameworks use radix/trie.
Nested / hierarchical	`/orgs/:org/teams/:team/members`	Authorization often cascades.
Catchall / wildcard	`/files/*path`	Greedy — last priority.
Regex	`/users/{id:\d+}`	Type-narrowed. Powerful, slower.
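
A toy matcher for the static, dynamic, and catchall route types in the table. It is a sketch only: it scans dynamic routes linearly rather than using the radix tree real frameworks use, and the `Router` class is invented for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

Handler = Callable[..., str]


class Router:
    def __init__(self) -> None:
        self.static: Dict[Tuple[str, str], Handler] = {}            # O(1) hash lookup
        self.dynamic: List[Tuple[str, List[str], Handler]] = []     # linear scan

    def add(self, method: str, pattern: str, handler: Handler) -> None:
        if ":" not in pattern and "*" not in pattern:
            self.static[(method, pattern)] = handler
        else:
            self.dynamic.append((method, pattern.strip("/").split("/"), handler))

    def match(self, method: str, path: str) -> Optional[Tuple[Handler, Dict[str, str]]]:
        hit = self.static.get((method, path))
        if hit is not None:
            return hit, {}
        parts = path.strip("/").split("/")
        wildcard_hit = None
        for route_method, segs, handler in self.dynamic:
            if route_method != method:
                continue
            params = self._match_segments(segs, parts)
            if params is None:
                continue
            if segs[-1].startswith("*"):
                wildcard_hit = wildcard_hit or (handler, params)    # catchall: last priority
            else:
                return handler, params
        return wildcard_hit

    @staticmethod
    def _match_segments(segs: List[str], parts: List[str]) -> Optional[Dict[str, str]]:
        params: Dict[str, str] = {}
        for i, seg in enumerate(segs):
            if seg.startswith("*"):                                  # greedy: rest of the path
                params[seg[1:]] = "/".join(parts[i:])
                return params
            if i >= len(parts):
                return None
            if seg.startswith(":"):                                  # param capture
                params[seg[1:]] = parts[i]
            elif seg != parts[i]:
                return None
        return params if len(parts) == len(segs) else None
```

Usage: register `/health`, `/users/:id`, and `/files/*path`, then call `match("GET", "/users/42")` to get the handler plus `{"id": "42"}`.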

API versioning strategies

Strategy	Example	Pros / Cons
URI	`/v1/users`	+ Visible, cache-friendly. − Many URLs.
Header	`API-Version: 2`	+ Clean URL. − Hidden in tooling.
Query	`?v=2`	+ Trivial. − Breaks caching on shared keys.
Media type	`Accept: application/vnd.x.v2+json`	+ RESTful. − Hardest to test.

Deprecation pattern

HTTP/1.1 200 OK
Deprecation: true
Sunset: Fri, 01 Jan 2027 00:00:00 GMT
Link: <https://api.x.com/v2/users>; rel="successor-version"
Warning: 299 - "v1 deprecated; migrate to v2 by 2027-01-01"

Route grouping

# pseudo-framework
group("/api/v1", middleware=[logger, requestId]) {
  group("/auth", middleware=[rateLimit(5, "1m")]) {
    POST("/login",   loginHandler)
    POST("/refresh", refreshHandler)
  }
  group("/admin", middleware=[requireAuth, requireRole("admin")]) {
    GET("/users",         listUsers)
    DELETE("/users/:id",  deleteUser)
  }
}

MODULE 04 — DATA ON THE WIRE

Serialization & Deserialization

Native ↔ wire bytes. Pick format by audience + perf budget.

Text vs binary

	JSON	XML	Protobuf	MessagePack	Avro
Readable	✓	✓	✗	✗	✗
Schema	optional	XSD	required	none	required
Size	baseline	1.5–2×	0.2–0.5×	0.5×	0.3×
Parse speed	baseline	slow	10–20× faster	5×	10×
Use	web APIs	legacy/SOAP	gRPC, internal	cache, RPC	Kafka, big data

JSON deep-dive

{
  "string": "hello",
  "int":    42,
  "float":  3.14,
  "bool":   true,
  "null":   null,
  "array":  [1, 2, 3],
  "nested": { "k": "v" },
  "date":   "2026-05-11T09:14:22Z"   // ISO-8601 with offset
}

Native mapping

JSON	Python	Go	JS/TS
object	`dict`	`struct` / `map[string]any`	`object`
array	`list`	`[]T`	`Array`
number	`int`/`float`	`float64` (or typed)	`number`
null	`None`	`nil` / zero / pointer	`null`

Edge cases

Missing fields — apply defaults; use Optional[T]/pointers to distinguish absent vs null.
Extra fields — strict mode reject, lenient ignore. Default to reject for inbound user data.
Numbers — JS number = float64, loses precision past 2^53. Send big ints as strings.
Dates — ISO-8601 with timezone offset (Z = UTC). Never plain "2026-05-11 09:14:22".
Floats — money in cents (integer) or decimal-string. Never float for currency.
Null vs absent — for PATCH, "absent" = leave alone, "null" = clear field.
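
Three of the edge cases above (absent vs null in PATCH, big integers as strings, money as decimal strings) in a short sketch. The field names and both function names are hypothetical.

```python
import json
from decimal import Decimal

_ABSENT = object()   # sentinel: lets us tell "key missing" apart from "key is null"


def apply_patch(record: dict, patch_json: str) -> dict:
    """Apply a PATCH body: absent fields are left alone, explicit null clears."""
    patch = json.loads(patch_json)
    updated = dict(record)
    for field in ("nickname", "bio"):        # hypothetical patchable fields
        value = patch.get(field, _ABSENT)
        if value is _ABSENT:
            continue                          # absent: leave alone
        updated[field] = value                # None (JSON null): clear the field
    return updated


def encode_order(order_id: int, amount: Decimal) -> str:
    """Send 64-bit ids and money as strings so JS clients don't round them."""
    return json.dumps({"id": str(order_id), "amount": str(amount)})
```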

Schema validation (JSON Schema)

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["email", "age"],
  "additionalProperties": false,
  "properties": {
    "email": { "type": "string", "format": "email", "maxLength": 254 },
    "age":   { "type": "integer", "minimum": 0, "maximum": 150 },
    "tags":  { "type": "array", "items": { "type": "string" }, "maxItems": 20 }
  }
}

Insecure deserialization: language-native binary formats (Python's pickle, Java ObjectInputStream, PHP unserialize) can execute attacker-supplied code on parse. Never deserialize untrusted input with binary native formats. JSON-only at external boundaries.

MODULE 05 — TRUST

Authentication, Authorization, Security

Who you are, what you can do, how attackers try to get past it.

Authentication mechanisms

Mechanism	State	How	Use
Basic auth	stateless	`Authorization: Basic base64(user:pass)`	Internal/dev. HTTPS mandatory.
API key	stateless	Long random string per client	Server-to-server, partner APIs.
Session cookie	stateful (server)	Random ID → server lookup	Classic web apps.
JWT (bearer)	stateless	Signed claims in token	SPAs, mobile, microservices.
OAuth 2.0	delegated	Authorization code → access token	Third-party app access.
OIDC	OAuth + identity	OAuth + `id_token` JWT	SSO / "Login with Google".
MFA	+factor	TOTP, WebAuthn, SMS (weak)	High-value accounts.

JWT anatomy

header.payload.signature

# header
{ "alg": "RS256", "typ": "JWT", "kid": "key-2026-q2" }

# payload (claims)
{
  "sub": "user_42",
  "iss": "https://auth.x.com",
  "aud": "api.x.com",
  "exp": 1715420000,
  "iat": 1715416400,
  "scope": "read:users write:posts"
}

# signature = sign(base64url(header) + "." + base64url(payload), private_key)

JWT pitfalls

Algorithm confusion: validate alg against allow-list. Reject none. Don't trust kid blindly.
No revocation: access tokens stay valid until exp. Use short TTL (5–15 min) + refresh tokens; logout revokes the refresh token server-side.
Payload is readable by anyone: signed, not encrypted. Don't put secrets in claims.
Clock skew: allow ±60s on exp/nbf.
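
The pitfalls above can be sketched as verification checks. To stay stdlib-only, this example signs with HS256 (a shared secret) rather than the RS256 shown in the anatomy; in practice a maintained JWT library is the safer choice. All function names here are made up.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

ALLOWED_ALGS = {"HS256"}     # allow-list; "none" is never accepted
LEEWAY_SECONDS = 60          # tolerated clock skew


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign(claims: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, claims)
    )
    sig = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + _b64url_encode(sig)


def verify(token: str, secret: bytes, audience: str, now: Optional[float] = None) -> dict:
    header_b64, payload_b64, sig_b64 = token.split(".")
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") not in ALLOWED_ALGS:
        raise ValueError("algorithm not allowed")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):   # constant-time compare
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    now = time.time() if now is None else now
    if claims["exp"] + LEEWAY_SECONDS < now:
        raise ValueError("token expired")
    if claims.get("aud") != audience:
        raise ValueError("wrong audience")
    return claims
```

The signature is checked before any claim is trusted; only then are exp and aud evaluated.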

Password storage

# NEVER: plaintext, MD5, SHA-1, SHA-256 plain
# CORRECT: slow KDF with per-user salt
hash = argon2id(password, salt, m=64MB, t=3, p=1)
# alternatives: bcrypt (cost ≥ 12), scrypt, PBKDF2 (≥ 600k iter)
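
A runnable sketch of the "slow KDF with per-user salt" rule, using PBKDF2 from the standard library because argon2id needs a third-party package. The storage format string and the function names are assumptions for the example.

```python
import hashlib
import hmac
import os

ITERATIONS = 600_000   # matches the PBKDF2 floor quoted above


def hash_password(password: str, iterations: int = ITERATIONS) -> str:
    """Return 'scheme$iterations$salt_hex$digest_hex' with a fresh random salt."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return f"pbkdf2_sha256${iterations}${salt.hex()}${digest.hex()}"


def verify_password(password: str, stored: str) -> bool:
    """Recompute with the stored salt and iteration count; compare in constant time."""
    _scheme, iterations, salt_hex, digest_hex = stored.split("$")
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode(), bytes.fromhex(salt_hex), int(iterations)
    )
    return hmac.compare_digest(candidate, bytes.fromhex(digest_hex))
```

Storing the iteration count alongside the hash lets you raise the cost later and rehash on the next successful login.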

Authorization models

Model	Decision input	Example
RBAC role-based	(user, role) → perms	admin, editor, viewer.
ABAC attribute-based	(user.attrs, resource.attrs, env) → allow?	"engineer in same dept can read".
ReBAC relationship-based	graph: user → owns → doc	Google Docs sharing. Zanzibar.

OWASP-style attacks & defenses

Attack	Mechanism	Defense
SQL injection	Untrusted input concatenated into SQL	Parameterized queries / prepared statements. Never string-interpolate.
NoSQL injection	Object-shaped input replaces operators	Schema validate. Reject objects where strings expected.
XSS	Untrusted HTML rendered	Context-aware escaping. CSP header. `HttpOnly` cookies.
CSRF	Browser auto-sends cookies	SameSite=Lax/Strict, CSRF tokens, double-submit, Origin check.
MITM	Network attacker reads traffic	TLS everywhere, HSTS, cert pinning for mobile.
Insecure deserialization	Native binary parsers on untrusted input	JSON only for untrusted; signed payloads for internal.
SSRF	Server fetches attacker URL	Allow-list URLs. Block link-local + metadata IPs (169.254.169.254).
IDOR	Predictable IDs without authz check	Check ownership server-side every request. UUIDs help defense-in-depth.

Secure design principles

Least privilege — give each subject minimum it needs.
Defense in depth — overlapping layers (WAF, app validation, DB constraints).
Fail secure — when in doubt, deny. Don't open-default on errors.
Separation of duties — same human can't approve and execute payouts.
CSP — Content-Security-Policy: default-src 'self'; script-src 'self' 'nonce-xyz'.
SameSite cookies — Set-Cookie: session=...; HttpOnly; Secure; SameSite=Lax.

Attack prevention practices

Audit-log failed logins, privilege escalations, admin actions. Tamper-evident store.
Generic error messages on auth ("invalid credentials" — don't leak which part wrong).
Rate limit per-IP + per-account. Exponential backoff. Lock after N failures.
Constant-time compare for tokens/HMAC (hmac.compare_digest) — avoid timing attacks.

MODULE 06 — INPUT HYGIENE

Validation, Transformation, Normalization

Fail fast on bad input. Normalize before processing. Sanitize before storing.

Three validation types

Type	What it checks	Examples
Type	Right shape	String not array; integer not string.
Syntactic	Right format	Email regex, UUID, ISO date, phone.
Semantic	Right meaning	Age 0–150; `endDate > startDate`; SKU exists in catalog.

Client vs server

Client-side validation = UX (instant feedback). Cannot be trusted.
Server-side validation = security. Always re-validate.
Fail fast: validate at edge, before handlers or downstream services do expensive work.

Transform & normalize

email   = email.strip().lower()
phone   = re.sub(r'\D', '', phone)        # digits only
country = country.upper()                  # "us" → "US"
name    = ' '.join(name.split())           # collapse spaces
slug    = slugify(title)                   # "Hello World!" → "hello-world"

Sanitization (escape, don't trust)

# HTML
clean_html = bleach.clean(user_html, tags=['p','b','i'], strip=True)
# SQL — never string-format
cur.execute("SELECT * FROM users WHERE id = %s", (user_id,))
# Shell — avoid; if must, use shlex.quote

Complex rules

Relationship: password == confirmPassword.
Conditional: if type == "business", then taxId required.
Chained: parse → type-check → range-check → cross-field check.

Error aggregation

HTTP/1.1 422 Unprocessable Entity
Content-Type: application/problem+json

{
  "type": "https://x.com/errors/validation",
  "title": "Validation failed",
  "status": 422,
  "errors": [
    { "field": "email", "code": "invalid_format", "message": "not a valid email" },
    { "field": "age",   "code": "out_of_range",   "message": "must be 0–150" }
  ]
}
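
A sketch of a validator that collects every failure instead of stopping at the first, producing the `errors` array shown above. The email regex is deliberately loose and the function name is invented.

```python
import re
from typing import Any, Dict, List

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")   # loose on purpose; not a full RFC check


def validate_signup(payload: Dict[str, Any]) -> List[Dict[str, str]]:
    """Run type, syntactic, semantic, and conditional checks; return all errors."""
    errors: List[Dict[str, str]] = []

    def fail(field: str, code: str, message: str) -> None:
        errors.append({"field": field, "code": code, "message": message})

    email = payload.get("email")
    if not isinstance(email, str):                       # type check
        fail("email", "invalid_type", "must be a string")
    elif not EMAIL_RE.match(email.strip().lower()):      # syntactic check
        fail("email", "invalid_format", "not a valid email")

    age = payload.get("age")
    if isinstance(age, bool) or not isinstance(age, int):
        fail("age", "invalid_type", "must be an integer")
    elif not 0 <= age <= 150:                            # semantic check
        fail("age", "out_of_range", "must be 0-150")

    # Conditional rule: business accounts need a tax id.
    if payload.get("type") == "business" and not payload.get("taxId"):
        fail("taxId", "required", "required when type is business")

    return errors
```

An empty list means the payload passed; a non-empty list maps directly onto a 422 response body.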

MODULE 07 — PIPELINE

Middleware

Cross-cutting logic in chain. Order matters more than content.

What middleware does

Run code before handler (parse, auth, log start).
Run code after handler (log status, add headers, compress).
Short-circuit — return early (401, 429, 404) without calling next.
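
The three behaviours above (before, after, short-circuit) in a framework-free sketch. Requests and responses are plain dicts and every name is illustrative.

```python
from typing import Any, Callable, Dict, List

Request = Dict[str, Any]
Response = Dict[str, Any]
Handler = Callable[[Request], Response]
Middleware = Callable[[Request, Handler], Response]


def _bind(mw: Middleware, nxt: Handler) -> Handler:
    return lambda req: mw(req, nxt)


def compose(middlewares: List[Middleware], handler: Handler) -> Handler:
    """Wrap handler so that middlewares[0] runs outermost."""
    for mw in reversed(middlewares):
        handler = _bind(mw, handler)
    return handler


def request_id(req: Request, nxt: Handler) -> Response:
    req.setdefault("request_id", "req-1")                 # before the handler
    resp = nxt(req)
    resp.setdefault("headers", {})["x-request-id"] = req["request_id"]   # after
    return resp


def require_auth(req: Request, nxt: Handler) -> Response:
    if "authorization" not in req.get("headers", {}):
        return {"status": 401, "body": "unauthenticated"}  # short-circuit: nxt never runs
    return nxt(req)


def hello(req: Request) -> Response:
    return {"status": 200, "body": "hello"}


app = compose([request_id, require_auth], hello)
```

Because `request_id` sits outside `require_auth`, even the short-circuited 401 carries the request id header, which is one reason ordering matters.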

Canonical ordering

request
  ──► recovery (panic/exception catcher)
  ──► requestId / traceId
  ──► access log start
  ──► CORS
  ──► security headers (HSTS, X-Content-Type-Options, CSP)
  ──► body parser (json, urlencoded, multipart)
  ──► compression negotiation
  ──► rate limiter
  ──► authentication
  ──► authorization
  ──► validation
  ──► route ──► handler
response
  ◄── log finish (status, duration)
  ◄── error handler (if thrown)

Common middlewares

Type	Examples
Security	helmet (sets headers), CSRF, CORS
Parsing	JSON, urlencoded, multipart (file upload)
Auth	JWT verify, session lookup, API-key check
Rate limit	token bucket per IP/user/route
Logging	access log, request-id propagation
Compression	gzip/br based on `Accept-Encoding`
Error	centralized handler — maps exceptions to status codes

Keep middleware lightweight

Every middleware runs on every request. 1 ms each × 10 middlewares = 10 ms baseline.
Heavy work (image processing, external API calls) belongs in handler/job, not middleware.
Cache decisions (e.g., JWKS public keys) — don't refetch per request.

MODULE 08 — STATE

Request Context

Per-request scratch space that flows with the call — without leaking across requests.

What lives in context

Metadata: URL, headers, method, remote IP, start time.
Identity: userId, orgId, scopes after auth middleware.
Tracing: requestId, traceId, spanId for correlation.
Cancellation: timeout / abort signal for downstream calls.
DB conn / transaction: scoped per request so all reads see same snapshot.

Patterns

Language	Pattern
Go	`context.Context` as first arg. `context.WithValue(ctx, k, v)`, `ctx.Done()`, `ctx.Deadline()`.
Node	`AsyncLocalStorage` (avoids passing through every layer).
Python	`contextvars.ContextVar` (async-safe).
Java	`ThreadLocal` (blocking) / `Context` with reactive frameworks.
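
A small demonstration of the Python row: `contextvars.ContextVar` keeps per-request values isolated across concurrent asyncio tasks. The handler names are hypothetical.

```python
import asyncio
import contextvars
import uuid
from typing import List, Optional

request_id_var: contextvars.ContextVar[str] = contextvars.ContextVar("request_id", default="-")


async def load_user() -> str:
    # Deep in the call stack: no request_id parameter was passed down.
    await asyncio.sleep(0)                      # yield so the two requests interleave
    return f"[{request_id_var.get()}] loaded user"


async def handle_request(incoming_id: Optional[str] = None) -> str:
    request_id_var.set(incoming_id or str(uuid.uuid4()))
    return await load_user()


async def main() -> List[str]:
    # gather() runs each coroutine in its own task, and each task gets a copy
    # of the context, so the two set() calls never see each other.
    return await asyncio.gather(handle_request("req-a"), handle_request("req-b"))
```

Running `asyncio.run(main())` returns one line per request, each tagged with its own id, and the value outside the tasks stays at the default.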

Request ID propagation

// inbound middleware
const reqId = req.headers['x-request-id'] ?? randomUUID()
res.setHeader('x-request-id', reqId)
ctx.set('requestId', reqId)
const log = logger.child({ reqId })   // use this child logger for the rest of the request

// outbound HTTP calls
fetch(url, { headers: { 'x-request-id': ctx.get('requestId') }})

Timeouts & cancellation

// Go
ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
defer cancel()
row := db.QueryRowContext(ctx, "SELECT ...")    // aborts on timeout

// Node
const ctrl = new AbortController()
const timer = setTimeout(() => ctrl.abort(), 2000)
try {
  await fetch(url, { signal: ctrl.signal })
} finally {
  clearTimeout(timer)                            // don't leave the timer pending
}

MODULE 09 — STRUCTURE

MVC, Controllers, REST APIs

Separation of concerns inside request path.

Layered responsibility

Layer	Owns	Doesn't touch
Handler / Controller	Parse req, validate, call service, shape response	SQL, business rules
Service (business logic)	Use cases: `placeOrder`, `cancelSubscription`	HTTP, DB driver specifics
Repository / DAO	Persistence, queries, ORM calls	Business rules, HTTP
Model	Entity definition, invariants	I/O

CRUD ↔ HTTP mapping

POST   /users           → create
GET    /users           → list (paginated)
GET    /users/:id       → read
PUT    /users/:id       → full replace
PATCH  /users/:id       → partial update
DELETE /users/:id       → remove
POST   /users/:id/reset → action (non-CRUD verb)

Standard list response

{
  "data": [ { "id": 1, "name": "alice" }, ... ],
  "meta": {
    "page":  2,
    "limit": 20,
    "total": 137,
    "hasMore": true
  },
  "links": {
    "self": "/users?page=2&limit=20",
    "next": "/users?page=3&limit=20",
    "prev": "/users?page=1&limit=20"
  }
}

Pagination styles

Style	Pros	Cons
Offset (`?page=N`)	Simple, jump to page	Slow on big tables; inconsistent with writes
Cursor (`?after=cursor`)	Stable, fast, infinite-scroll friendly	No "jump to page N"
Keyset (`WHERE id > ?`)	Same as cursor; index-friendly	Requires sortable monotonic key
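
Keyset pagination from the table, sketched against an in-memory SQLite database. The `users` table, its rows, and `list_users` are made up for the example.

```python
import sqlite3
from typing import Any, Dict


def list_users(conn: sqlite3.Connection, after_id: int = 0, limit: int = 3) -> Dict[str, Any]:
    """Return one page of users with id > after_id, plus a cursor for the next page."""
    rows = conn.execute(
        "SELECT id, name FROM users WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, limit + 1),          # fetch one extra row to learn whether more exist
    ).fetchall()
    page, has_more = rows[:limit], len(rows) > limit
    next_cursor = page[-1][0] if has_more else None
    return {"data": page, "meta": {"hasMore": has_more, "nextCursor": next_cursor}}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [(f"user{i}",) for i in range(1, 8)])
```

Unlike OFFSET, the `WHERE id > ?` predicate seeks straight into the primary-key index, so page 1000 costs the same as page 1.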

Search / sort / filter

GET /products?q=phone&category=electronics&minPrice=100&sort=-price,name&page=2

# parsed:
{
  q:          "phone",
  filters:    { category: "electronics", price: { gte: 100 } },
  sort:       [{ field: "price", dir: "desc" }, { field: "name", dir: "asc" }],
  page:       2
}

REST principles

Resource-oriented — nouns in URLs, verbs in methods.
Stateless — every request carries enough auth/context.
Cacheable — GETs use ETags / max-age.
Uniform interface — consistent shape across resources.
HATEOAS (optional) — responses embed links to next actions.
Redact sensitive fields — never serialize password_hash, even hashed.
OpenAPI spec — define contract first; generate client/server stubs.

MODULE 10 — PERSISTENCE

Databases

Storage shape, consistency, indexing, query plans, ORMs.

Relational vs non-relational

	Relational (Postgres, MySQL)	Document (Mongo)	Key-value (Redis, DynamoDB)	Wide-column (Cassandra)
Schema	fixed	flexible	none	row-flexible
Joins	strong	weak (lookup/aggregate)	none	none
Txn	full ACID	per-doc, multi-doc limited	per-key	per-row
Scale	vertical + read replicas	shard by key	horizontal	horizontal
Use	transactional, complex queries	nested objects, agile schema	cache, hot keys	time-series, massive write

ACID

Atomicity — all-or-nothing within txn.
Consistency — txn moves DB between valid states (constraints hold).
Isolation — concurrent txns don't see each other mid-flight. Levels: Read Uncommitted → Read Committed → Repeatable Read → Serializable.
Durability — committed data survives crash (fsync to disk).

CAP theorem

Under network Partition, must choose: Consistency (reject or block requests that can't be confirmed up to date) or Availability (serve possibly-stale data). The choice only applies during a partition; partitions are rare, and the rest of the time you get both.

CP: HBase, MongoDB (default), etcd, ZooKeeper.
AP: Cassandra, DynamoDB (tunable), Riak.
PACELC extension: even without partition (E), trade latency (L) vs consistency (C).

Indexing — rules

B-tree indexes power range + equality on leading column(s).
Composite index (a, b, c) serves WHERE a=?, WHERE a=? AND b=?, not WHERE b=? alone.
Covering index: include all SELECT columns → "index-only scan", no heap fetch.
Each index costs writes (update on insert/update/delete). Audit unused.
Hash indexes: equality only. GIN/GiST: full-text, JSONB, arrays. BRIN: huge append-only tables.

Query optimization workflow

EXPLAIN ANALYZE
SELECT u.name, COUNT(o.id)
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.country = 'US' AND o.created_at > now() - interval '30 day'
GROUP BY u.id, u.name;   -- group by id too, so two users with the same name aren't merged

-- look for:
--   Seq Scan on big table         → missing index
--   high "rows removed by filter" → predicate not pushed to index
--   Sort spilled to disk          → work_mem too low
--   Nested Loop on big rowcounts  → expected Hash/Merge join

Connection pooling

Opening Postgres conn ≈ 5–50 ms. Pool to reuse.
Pool size per app instance ≈ min(N_cores * 2 + spindles, db_max_connections / instances), where N_cores and spindles describe the DB server, not the app host.
For serverless / many instances → use PgBouncer in transaction mode.
Always set acquire timeout + max-lifetime to recycle stale conns.
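
The sizing rule of thumb above as arithmetic. The `reserved` parameter (connections held back for admin or migration tools) is an addition for the example, not part of the original formula.

```python
def pool_size(db_cores: int, spindles: int, db_max_connections: int,
              app_instances: int, reserved: int = 0) -> int:
    """Per-instance pool size: the smaller of the hardware bound and the connection budget."""
    by_hardware = db_cores * 2 + spindles
    by_limit = (db_max_connections - reserved) // app_instances
    return max(1, min(by_hardware, by_limit))


# 8-core DB with one SSD, max_connections=100:
#   4 app instances  -> min(17, 25) = 17 (hardware-bound)
#   10 app instances -> min(17, 10) = 10 (budget-bound)
```

Once the budget bound dominates, adding more app instances shrinks each pool, which is the point where a pooler such as PgBouncer starts to pay off.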

Constraints & transactions

BEGIN;
  UPDATE accounts SET balance = balance - 100 WHERE id = $1;
  UPDATE accounts SET balance = balance + 100 WHERE id = $2;
  INSERT INTO transfers (from_id, to_id, amount) VALUES ($1, $2, 100);
COMMIT;

-- table constraints catch invariants:
CHECK (balance >= 0)
UNIQUE (email)
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE
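
A runnable sketch of atomicity plus a CHECK constraint, using in-memory SQLite instead of Postgres. The credit is applied first on purpose, to show that it is rolled back when the debit violates `balance >= 0`.

```python
import sqlite3
from typing import Dict

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 150), (2, 0)])
conn.commit()


def transfer(conn: sqlite3.Connection, from_id: int, to_id: int, amount: int) -> bool:
    """Move amount between accounts; all-or-nothing."""
    try:
        with conn:   # commits on success, rolls back if the block raises
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, to_id))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, from_id))
        return True
    except sqlite3.IntegrityError:   # CHECK (balance >= 0) failed
        return False


def balances(conn: sqlite3.Connection) -> Dict[int, int]:
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

The first transfer of 100 succeeds; a second one would overdraw account 1, so the database rejects it and the earlier credit in the same transaction disappears with the rollback.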

ORMs & migrations

ORMs (Prisma, SQLAlchemy, GORM, Hibernate) trade SQL control for ergonomics.
Watch for N+1: users.all() then user.posts in loop → use eager load / JOIN.
Migrations: forward-only, idempotent, reviewed in code. Tools: flyway, alembic, knex, goose, prisma migrate.
Online schema changes for big tables: gh-ost, pt-osc, or expand-contract pattern.

MODULE 11 — DOMAIN

Business Logic Layer

Where rules live. Independent of HTTP and DB drivers.

Three-layer architecture

┌──────────────────────────────────┐
│  Presentation                    │  routes, controllers, DTOs,
│  (HTTP / gRPC / CLI)             │  validation, serialization
└────────────┬─────────────────────┘
             │ calls
┌────────────▼─────────────────────┐
│  Business Logic                  │  use-case services, domain models,
│  (pure, framework-agnostic)      │  rules, invariants, orchestration
└────────────┬─────────────────────┘
             │ uses ports
┌────────────▼─────────────────────┐
│  Data Access                     │  repositories, ORM, SQL, cache
│  (Postgres / Redis / S3 / 3rd p.)│  adapters, external clients
└──────────────────────────────────┘

Why split

Testability — unit-test business rules without spinning up HTTP/DB.
Reuse — same service from REST, gRPC, CLI, cron.
Swap adapters — Postgres → DynamoDB; HTTP → message queue. Only boundary changes.

SOLID applied

Principle	How it shows up
Single responsibility	One service = one use case. `RegisterUserService.handle()`.
Open/closed	New auth provider = new adapter implementing `AuthPort`; existing code unchanged.
Liskov	Any `UserRepository` impl must honor contract (same shapes/errors).
Interface segregation	`ReadOnlyUserRepo` vs full repo for handlers that only read.
Dependency inversion	Service depends on `EmailSenderPort` (interface), not `SendgridClient` (concrete).

Error propagation pattern

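# BLL throws domain errors
class DomainError(Exception): ...
class NotFound(DomainError): ...
class Forbidden(DomainError): ...
class Conflict(DomainError): ...
class Validation(DomainError): ...

# presentation layer maps to HTTP; most specific classes first
STATUS_BY_ERROR = [
    (NotFound,    404),
    (Forbidden,   403),
    (Conflict,    409),
    (Validation,  422),
    (DomainError, 500),   # fallback for unmapped domain errors
]

def http_status(e: Exception) -> int:
    # isinstance also matches subclasses, which a dict keyed on type(e) would miss
    return next((code for cls, code in STATUS_BY_ERROR if isinstance(e, cls)), 500)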

Pro: domain errors carry semantic codes (USER_NOT_FOUND), not HTTP codes. HTTP mapping happens at edge only.

MODULE 12 — SPEED

Caching

Trade staleness for latency. Choose layer, strategy, eviction with intent.

Caching layers

Layer	Latency	Scope	Example
CPU L1/L2/L3	ns	Per-core	—
App in-memory	µs	Per-process	LRU map, Caffeine
Distributed	0.5–2 ms	Cluster	Redis, Memcached
CDN edge	1–30 ms	Region	Cloudflare, CloudFront
Browser	0 ms	Per-client	HTTP cache

Strategies

Strategy	Read	Write	Use
Cache-aside (lazy)	app checks cache → on miss reads DB → fills cache	app writes DB → invalidates cache	Default. Most common.
Read-through	cache lib reads DB on miss	same	Cleaner code, library-dependent.
Write-through	—	app writes cache → cache writes DB synchronously	Consistent cache, slower writes.
Write-behind	—	app writes cache; DB written async	Fast writes; risk on crash.
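
Cache-aside from the first table row, sketched with an in-process TTL cache standing in for Redis. `TTLCache`, `UserRepo`, and the two helper functions are illustrative names, and the repo counts reads so the effect is visible.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[Any, Tuple[float, Any]] = {}

    def get(self, key: Any) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:       # expired: treat as a miss
            del self._data[key]
            return None
        return value

    def set(self, key: Any, value: Any) -> None:
        self._data[key] = (self._clock() + self._ttl, value)

    def delete(self, key: Any) -> None:
        self._data.pop(key, None)


class UserRepo:
    """Stand-in for the database."""
    def __init__(self) -> None:
        self.rows = {42: {"id": 42, "name": "alice"}}
        self.reads = 0

    def load(self, user_id: int) -> dict:
        self.reads += 1
        return self.rows[user_id]

    def save(self, user_id: int, row: dict) -> None:
        self.rows[user_id] = row


def read_user(repo: UserRepo, cache: TTLCache, user_id: int) -> dict:
    cached = cache.get(user_id)
    if cached is not None:
        return cached                  # hit
    row = repo.load(user_id)           # miss: read DB
    cache.set(user_id, row)            # fill cache
    return row


def write_user(repo: UserRepo, cache: TTLCache, user_id: int, row: dict) -> None:
    repo.save(user_id, row)            # write DB first
    cache.delete(user_id)              # then invalidate, so the next read refills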

Eviction policies

LRU — discard least-recently used. Good general default.
LFU — least-frequently used. Better for skewed access.
TTL — time-based expiry. Combine with LRU.
FIFO — simple ring; ignores access patterns.
Manual — explicit cache.delete(key) on write.
Event-based — pub/sub invalidates across instances.

Invalidation patterns

# by key
cache.delete(f"user:{id}")

# by tag (Redis-stack, Varnish)
cache.delete_by_tag(f"user:{id}")

# fan-out (pub/sub)
pubsub.publish("cache:invalidate", {"keys": [f"user:{id}"]})

# versioned key — never need to delete
cache.set(f"user:{id}:v{version}", data)
# bump version on write → old key TTLs out naturally

Use-case recipes

Hot read DB joins → store materialized join in Redis hash; TTL 60s.
API responses → cache JSON by URL + auth-scope; vary on user.
Session → Redis with TTL = session lifetime; sliding refresh.
Rate limit counters → Redis INCR with EXPIRE.
Idempotency keys → Redis 24h to dedupe POST retries.

Thundering herd: hot key expires → 10k clients hit DB simultaneously. Fix: stale-while-revalidate, request coalescing (singleflight), or jittered TTLs.
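
Request coalescing (the "singleflight" fix above) sketched with threads: concurrent callers for the same key share one load. The `SingleFlight` class is written for this example and is not a library API.

```python
import threading
from typing import Any, Callable, Dict


class SingleFlight:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._calls: Dict[str, Dict[str, Any]] = {}   # key -> in-flight call state

    def do(self, key: str, fn: Callable[[], Any]) -> Any:
        with self._lock:
            call = self._calls.get(key)
            is_leader = call is None
            if is_leader:
                call = {"done": threading.Event(), "value": None, "error": None, "waiters": 0}
                self._calls[key] = call
            else:
                call["waiters"] += 1
        if is_leader:
            try:
                call["value"] = fn()              # only the leader hits the backend
            except Exception as exc:
                call["error"] = exc
            finally:
                with self._lock:
                    del self._calls[key]          # later callers start a fresh load
                call["done"].set()
        else:
            call["done"].wait()                   # followers block until the leader finishes
        if call["error"] is not None:
            raise call["error"]
        return call["value"]
```

Usage: wrap the cache-miss path as `flight.do(f"user:{id}", lambda: load_from_db(id))`, where `load_from_db` is your own loader. This coalesces within one process only; across instances you still want jittered TTLs or stale-while-revalidate.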

MODULE 13 — ASYNC WORK

Queues, Background Jobs, Emails

Don't make user wait. Hand off to workers.

What belongs off request path

Email / SMS / push notifications.
Image / video transcoding, thumbnail generation.
Third-party API calls (especially slow/unreliable ones).
Heavy DB aggregations, report generation.
Webhook delivery to customers (with retries).
Periodic maintenance: backups, cleanups, log rotation.

Architecture

┌─────────┐    enqueue     ┌────────┐    pull     ┌────────┐
│Producer │ ─────────────► │ Broker │ ───────────►│Consumer│
│  (API)  │                │ (Redis,│             │(worker)│
└─────────┘                │  SQS,  │◄─── ack ────└────┬───┘
                           │ Kafka) │                  │
                           └────────┘                  ▼
                                          side effects: DB, email, S3, API

Broker comparison

	Redis (BullMQ, Sidekiq)	RabbitMQ	SQS	Kafka
Model	list/stream	AMQP exchanges	distributed queue	partitioned log
Order	FIFO per queue	FIFO per queue	FIFO queue type	FIFO per partition
Durability	RDB/AOF	persistent queues	multi-AZ	replicated log
Replay	—	—	—	full
Best for	web apps	complex routing	AWS-native, simple	event-sourcing, analytics

Job semantics

At-least-once delivery is realistic default → jobs must be idempotent.
Idempotency key on producer side: stable hash of payload → dedupe at consumer.
Retries with exponential backoff + jitter; cap attempts; route exhausted → DLQ.
Dead Letter Queue: failures land here for human inspection / replay.
Visibility timeout (SQS) / ack-window: if consumer crashes mid-job, message re-appears.
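
Retries with exponential backoff, full jitter, an attempt cap, and a dead-letter list, as described above. This is a sketch: the DLQ is a plain list and `sleep` is injectable so the logic can run without waiting.

```python
import random
import time
from typing import Any, Callable, List, Optional


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return rng() * min(cap, base * 2 ** attempt)


def process_with_retries(job: Any, handler: Callable[[Any], Any], max_attempts: int = 5,
                         sleep: Callable[[float], None] = time.sleep,
                         dead_letters: Optional[List[dict]] = None) -> Any:
    """Run handler(job); retry on failure, then park the job in the DLQ."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return handler(job)
        except Exception as exc:
            last_error = exc
            if attempt + 1 < max_attempts:
                sleep(backoff_delay(attempt))
    if dead_letters is not None:
        # Exhausted: keep the job and the last error for inspection or replay.
        dead_letters.append({"job": job, "error": repr(last_error)})
    return None
```

Because a retried job may already have partly run, the handler itself still has to be idempotent; retries only make at-least-once delivery explicit.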

Chaining & concurrency

// BullMQ-style flow
import { FlowProducer } from 'bullmq'
const flow = new FlowProducer()
await flow.add({
  name: 'order-complete',
  queueName: 'orders',
  children: [
    { name: 'charge-card',     queueName: 'payments' },
    { name: 'send-receipt',    queueName: 'email'    },
    { name: 'update-warehouse',queueName: 'inventory'},
  ],
})
// parent runs only after all children succeed

Transactional email anatomy

Subject:    Your order #4582 is confirmed
Preheader:  Track shipping below • Need help? Reply to this email.
Body:
  Hi Alice, thanks for your order...
  [ Track shipment ]    ← single CTA
Footer:     Unsubscribe • Address (CAN-SPAM)

Templating with merge vars ({{firstName}}); HTML + plaintext multipart.
SPF + DKIM + DMARC on sending domain; without them, mail tends to land in spam or get rejected.
Track bounce/complaint webhooks; suppress repeats.

Scheduling

Cron (system / Kubernetes CronJob) — periodic. Use UTC, deal with DST in app code.
Delayed jobs — schedule for future timestamp; broker handles dispatch.
Distributed lock for "run on exactly one instance": Redis SET key token NX PX ttl (always set an expiry so a crashed holder releases the lock), or an etcd lease.

MODULE 14 — SEARCH

Elasticsearch

Inverted index for full-text + analytics at scale.

Internals

Inverted index: term → list of docs containing it. Built from tokenized + analyzed text.
Segment: immutable Lucene chunk on disk. Writes create new segments; periodic merge.
Shard: a Lucene index. ES index is N shards distributed across nodes.
Replica: shard copy for HA + read scaling.
Term frequency (TF) + IDF + length norm = BM25 relevance score (default).
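
A toy inverted index to make the first bullet concrete. The analyzer here only lowercases and splits on non-alphanumerics, far simpler than a Lucene analyzer, and scoring is left out entirely.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def analyze(text: str) -> List[str]:
    """Tokenize and normalize: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each term to the set of doc ids that contain it (the postings)."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search_all(index: Dict[str, Set[int]], query: str) -> List[int]:
    """AND query: docs containing every query term."""
    postings = [index.get(term, set()) for term in analyze(query)]
    return sorted(set.intersection(*postings)) if postings else []
```

The same analyzer runs at index time and query time, which is why "HEADPHONES" in a query still matches "headphones" in a document.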

Use cases

Type-ahead / autocomplete (edge n-grams or completion suggester).
Full-text product / article search with relevance.
Log analytics — ELK / OpenSearch ingesting JSON logs.
Fuzzy matching (typos, "did you mean").
Aggregations: top-N, time-series buckets, percentiles.

Query patterns

POST /products/_search
{
  "query": {
    "bool": {
      "must":   [{ "match": { "title": "wireless headphones" } }],
      "filter": [
        { "term":  { "category": "audio" } },
        { "range": { "price": { "lte": 200 } } }
      ],
      "should": [
        { "match": { "brand": "sony" } }   // boost
      ]
    }
  },
  "aggs": {
    "by_brand": { "terms": { "field": "brand.keyword" } },
    "price_p":  { "percentiles": { "field": "price" } }
  },
  "size": 20,
  "from": 0
}

Field mapping rules

Need	Mapping
Full-text search	`"type": "text"` with analyzer
Exact match / sort / aggregate	`"type": "keyword"`
Range queries	`integer`, `date`, `double`
Both above (common)	`text` with `fields.keyword` multi-field
Geo	`geo_point` for lat/lon

Tuning

Define explicit mappings up-front — dynamic mapping creates fields that explode index size.
Use filter context (bool.filter) for yes/no — skips scoring, cacheable.
Shard count: hard to change after creation. Aim ~10–50 GB per shard.
Kibana for ad-hoc exploration; not for prod query path.

Gotcha: ES is near-real-time. Default refresh interval = 1s. Bulk-index then ?refresh=wait_for if you must read your write.

MODULE 15 — FAILURES

Error Handling

Errors are first-class output. Plan them.

Error categories

Type	When	Strategy
Syntax	Compile / parse time	Lint + CI catch.
Runtime — transient	Network blip, DB locked	Retry with backoff + circuit breaker.
Runtime — permanent	Bad input, missing record	Fail fast, return 4xx.
Logical / business	Insufficient funds, conflict	Domain error → 4xx with code.
System	Out of memory, disk full	Crash + restart + alert.

Strategies

Fail fast — invalid input rejected at edge; cheaper than half-applied work.
Fail safe — on unknown error, deny access (auth failures); for non-critical features, degrade.
Graceful degradation — recs unavailable? Show empty section, not 500.
Circuit breaker — after N failures to dependency, open circuit for cool-down. Prevents cascading.
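
A minimal circuit breaker matching the last bullet: closed, then open after N consecutive failures, then one trial call after the cool-down. The class and its parameters are illustrative; it is not thread-safe as written.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpen(Exception):
    """Raised instead of calling the dependency while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                raise CircuitOpen("dependency unavailable, failing fast")
            # Cool-down elapsed: half-open, let this one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()    # open (or re-open) the circuit
            raise
        self.failures = 0                         # success closes the circuit
        self.opened_at = None
        return result
```

While open, callers get `CircuitOpen` immediately, which frees threads and gives the struggling dependency room to recover.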

Custom error types

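from typing import Optional

class AppError(Exception):
    code:   str = "APP_ERROR"
    status: int = 500

    def __init__(self, message: str, cause: Optional[Exception] = None):
        super().__init__(message)
        self.message = message   # read by the global handler
        self.cause = cause

class NotFound(AppError):
    status = 404
    code   = "NOT_FOUND"

class RateLimited(AppError):
    status = 429
    code   = "RATE_LIMITED"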

Global handler

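from flask import g, jsonify
from werkzeug.exceptions import HTTPException

@app.errorhandler(Exception)
def handle(e):
    request_id = g.get("request_id")
    if isinstance(e, HTTPException):
        return e                      # keep framework 404/405 etc. instead of masking as 500
    if isinstance(e, AppError):
        log.warning("app error", extra={"code": e.code, "rid": request_id})
        return jsonify(error=e.code, message=e.message), e.status
    log.error("unhandled error", extra={"rid": request_id}, exc_info=True)   # log = stdlib logger (assumed)
    return jsonify(error="INTERNAL", message="something went wrong"), 500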

User-facing error response

{
  "error":     "INSUFFICIENT_FUNDS",
  "message":   "Balance $20.00 below $50.00 needed.",
  "requestId": "7f3a-1c2b",
  "docsUrl":   "https://x.com/docs/errors#insufficient_funds"
}

Monitoring & alerting

Sentry / Bugsnag / Rollbar — exception capture with stack + breadcrumbs.
ELK / Loki / Datadog — log aggregation, search by request-id.
PagerDuty / Opsgenie — paging for SLO violations.
Alert on symptoms (latency, error rate) not causes (CPU). Causes are runbook context.

MODULE 16 — CONFIG

Config Management

Separate config from code. Environment-aware. Secrets isolated.

Config types

Type	Examples	Where
Static	retry counts, page size, timeouts	YAML / JSON in repo
Environment-specific	DB URL, Redis host, log level	env vars / per-env file
Sensitive	API keys, signing secrets, DB creds	secret manager (Vault, AWS SM, GCP SM, sops)
Dynamic	feature flags, kill switches	LaunchDarkly, Unleash, Flagsmith, ConfigCat

Precedence (common convention; 12-factor itself only asks for config in env vars)

defaults (in code)
  ↓ overridden by
config file (config.yaml)
  ↓ overridden by
env vars (DATABASE_URL=...)
  ↓ overridden by
command-line flags (--port 8080)
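The layering above can be sketched with `collections.ChainMap`, where the first mapping that contains a key wins. `resolve_config` and its argument names are hypothetical; the sketch assumes env var names are the upper-cased config keys and that all values are strings.

```python
from collections import ChainMap


def resolve_config(defaults, file_cfg, env, cli):
    """Merge config sources; later layers in the precedence chain override earlier ones."""
    # Only pick up env vars that correspond to known keys (PORT -> port).
    env_cfg = {k: env[k.upper()] for k in defaults if k.upper() in env}
    # Flags the user did not pass arrive as None and must not override anything.
    cli_cfg = {k: v for k, v in cli.items() if v is not None}
    # ChainMap looks up left to right, so list the highest-priority source first.
    return dict(ChainMap(cli_cfg, env_cfg, file_cfg, defaults))


cfg = resolve_config(
    defaults={"port": "8000", "log_level": "info"},
    file_cfg={"log_level": "debug"},
    env={"PORT": "9000", "PATH": "/usr/bin"},
    cli={"port": None, "log_level": None},
)
```

Here the port comes from the environment and the log level from the config file, because no flag was passed for either.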

.env workflow

# .env (NEVER commit)
DATABASE_URL=postgres://app:secret@localhost/app
JWT_SECRET=hunter2

# .env.example (commit this)
DATABASE_URL=
JWT_SECRET=

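# loading (python-dotenv)
import os
from dotenv import load_dotenv

load_dotenv()                          # reads .env into os.environ; already-set vars win
db_url = os.environ["DATABASE_URL"]    # crash fast on missing
log_level = os.environ.get("LOG_LEVEL", "info")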

Feature flags

if flags.enabled("new-checkout", user_id=user.id):
    return new_checkout_flow(order)
return legacy_checkout_flow(order)

# rollout patterns:
#   percentage:    10% of users
#   targeting:     users in cohort "beta"
#   kill switch:   instantly disable broken feature without redeploy
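One way a percentage rollout is commonly implemented is to hash the flag name together with the user ID into a stable bucket. The sketch below assumes that approach; `in_rollout` is a hypothetical helper, not the API of any flag service listed above.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: float) -> bool:
    """Deterministically place a user in one of 10,000 buckets for this flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # percent=10 admits buckets 0..999, i.e. roughly 10% of users.
    return bucket < percent * 100
```

Because the bucket depends only on the flag and the user, a user keeps the same answer across requests, and raising the percentage only adds users without removing anyone already included.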

Secret rotation

Read secret at startup; cache. Reload on SIGHUP or scheduled refresh.
Support two valid versions during rotation (overlap window).
Audit-log every secret access.
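The overlap window can be illustrated with signature verification that accepts any currently valid key version. This is a sketch under the assumption that secrets are HMAC signing keys; `sign` and `verify` are hypothetical helpers.

```python
import hashlib
import hmac


def sign(message: bytes, key: bytes) -> str:
    """Sign with one specific key version (writers always use the newest)."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(message: bytes, signature: str, valid_keys: list[bytes]) -> bool:
    """Accept a signature made with any key that is still inside the overlap window."""
    for key in valid_keys:
        if hmac.compare_digest(sign(message, key), signature):
            return True
    return False
```

During rotation `valid_keys` holds `[new_key, old_key]`; once every signer has switched, the old key is dropped and signatures made with it stop verifying.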

Never: hardcode prod secrets; commit .env; print env in logs; ship secrets in container images.

MODULE 17 — OBSERVABILITY

Logging, Monitoring, Tracing

Three pillars: logs (events), metrics (aggregates), traces (causal chains).

Logs

Levels

Level	Use	Alert?
DEBUG	Dev troubleshooting	No
INFO	Lifecycle: startup, shutdown, important business events	No
WARN	Recoverable, degraded	Trend
ERROR	Failed request, exception	Yes if rate spikes
FATAL	Process can't continue	Page

Structured logging

# DON'T
log.info(f"user {uid} placed order {oid} for ${amt}")

# DO  — JSON keys are queryable
log.info("order_placed", extra={
  "user_id":   uid,
  "order_id":  oid,
  "amount":    amt,
  "request_id": rid,
  "trace_id":  tid,
})
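With the standard `logging` module, keys passed through `extra=` only become queryable JSON if a formatter serializes them. A minimal sketch of such a formatter follows; `JsonFormatter` and the logger name `orders` are illustrative, and production setups often use a ready-made JSON logging library instead.

```python
import json
import logging

# Attributes every LogRecord carries; anything else arrived through `extra=`.
_STANDARD = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "event": record.getMessage()}
        # Copy the caller-supplied fields (user_id, order_id, ...) into the JSON object.
        payload.update({k: v for k, v in record.__dict__.items() if k not in _STANDARD})
        return json.dumps(payload, default=str)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order_placed", extra={"user_id": 7, "order_id": "o-1"})
```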

What NOT to log

Passwords, tokens, full credit card numbers, PII without need.
Full request body on auth routes (passwords in body).
Stack traces in user-facing responses (info leak); keep them in server-side logs only.
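A small redaction pass before logging is one way to enforce the list above. The key names in `SENSITIVE_KEYS` are assumptions for illustration; a real list would come from the application's own schemas.

```python
SENSITIVE_KEYS = {"password", "token", "authorization", "card_number", "secret"}


def redact(value):
    """Return a copy that is safe to log: sensitive keys are masked at any depth."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if str(k).lower() in SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value
```

The function builds new containers rather than mutating its input, so the request object the handler still needs is left untouched.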

Rotation & retention

Rotate by size or daily; compress; ship to central store.
Retention by class: access logs 30d, audit 1–7y depending on compliance.

Metrics

Type	Use	Example
Counter	Monotonic count	`http_requests_total{route="/users",code="200"}`
Gauge	Point-in-time value	`db_pool_in_use`, `queue_depth`
Histogram	Distribution	`http_duration_seconds_bucket`
Summary	Pre-computed quantiles	p50 / p95 / p99 latency
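To make the histogram row concrete, here is a rough sketch of how cumulative buckets work, in the style of the `_bucket` series shown in the table. The `Histogram` class is a teaching aid, not a metrics client.

```python
import bisect


class Histogram:
    """Counts observations per upper bound; exported counts are cumulative."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)                 # upper bounds of each bucket
        self.counts = [0] * (len(self.bounds) + 1)   # last slot is the +Inf bucket
        self.total = 0.0

    def observe(self, value):
        # bisect_left returns the first bound >= value: the smallest bucket that holds it.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self):
        """Each bucket includes everything in the smaller buckets."""
        running, out = 0, []
        for bound, n in zip(self.bounds + [float("inf")], self.counts):
            running += n
            out.append((bound, running))
        return out
```

Because buckets are plain counters, histograms from many instances can be summed and quantiles estimated afterwards, whereas a summary's pre-computed quantiles cannot be meaningfully aggregated across instances.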

RED / USE / Four Golden Signals

RED (request-oriented): Rate, Errors, Duration.
USE (resource-oriented): Utilization, Saturation, Errors.
Four golden signals: latency, traffic, errors, saturation.

Tracing

trace_id: 4f3a... (one per request, propagated across services)
  ├─ span A: api-gateway      (1.2 ms)
  ├─ span B: auth-service     (3.0 ms)
  └─ span C: user-service     (15.4 ms)
       ├─ span D: postgres    (8.1 ms)
       └─ span E: redis       (0.3 ms)

Use OpenTelemetry SDK to emit spans; export to Jaeger, Tempo, Honeycomb.
W3C traceparent header propagates across HTTP / gRPC / queues.
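Sample: tracing 100% of requests is usually too expensive. Head sampling decides when a trace starts; tail sampling decides after it completes, so it can keep the errors and the slow tail.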

MODULE 18 — SHUTDOWN

Graceful Shutdown

Stop without dropping in-flight work.

Signals

Signal	Number	Catchable	Sender
SIGTERM	15	✓	k8s/systemd normal stop
SIGINT	2	✓	Ctrl-C
SIGHUP	1	✓	Terminal hangup; daemons reload config (convention)
SIGKILL	9	✗	Force kill — no chance to clean up

Shutdown sequence

Mark unhealthy — readiness probe returns 503. LB stops sending new traffic.
Drain — keep serving in-flight requests; reject new ones (or 503).
Wait grace period — typically 10–30s.
Close external resources — DB pool drain, flush log buffers, close file handles, ack pending queue messages.
Exit 0.

Pattern (Go)

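// needs: context, errors, log, net/http, os, os/signal, syscall, time
srv := &http.Server{Addr: ":8080", Handler: mux}
go func() {
	// ErrServerClosed is the normal result of Shutdown, not a failure
	if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
		log.Fatalf("listen: %v", err)
	}
}()

sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)
<-sigCh

ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
defer cancel()
if err := srv.Shutdown(ctx); err != nil { // stop accepting, finish in-flight
	log.Printf("shutdown: %v", err)
}
db.Close()    // db: the app's *sql.DB pool
logger.Sync() // logger: a buffered logger such as zap; the std log package has no Sync
// return from main here: os.Exit(0) would skip the deferred cancel()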

Kubernetes specifics

K8s sends SIGTERM, waits terminationGracePeriodSeconds (default 30), then SIGKILL.
Add a preStop hook that sleeps a few seconds; this gives the LB time to remove the pod before SIGTERM arrives. The sleep counts against terminationGracePeriodSeconds.
Pod removal from Service endpoints is eventually consistent — that sleep helps.

MODULE 19 — PERFORMANCE

Scaling, Performance, Concurrency

Find the bottleneck → apply the smallest fix → measure → repeat.

Find bottleneck

Measure first. Profile (flame graph) before optimizing.
Response time breakdown: wait queue + compute + DB + downstream + serialization.
USE method: utilization, saturation, errors per resource (CPU, RAM, disk, net).
Top tools: pprof (Go), cProfile/py-spy (Python), Chrome DevTools (Node), async-profiler (Java).

DB optimization

N+1 — fetching N children individually after listing parents. Fix: eager-load / JOIN / dataloader pattern.
Indexes for read-heavy paths; verify with EXPLAIN ANALYZE on realistic data volumes and distributions, since plans change with table size.
Batching — replace per-item INSERTs with bulk insert; reduce round-trips.
Read replicas — fan reads off primary; mind replication lag.
Sharding / partitioning when single node hits write ceiling.
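The N+1 pattern and its JOIN fix can be shown on an in-memory SQLite database. The schema, data, and the `run` helper that counts round-trips are invented for this sketch.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books   (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO books VALUES (1, 1, 'a1'), (2, 1, 'a2'), (3, 2, 'b1');
""")
stats = {"queries": 0}


def run(sql, params=()):
    """Execute one statement and count it as one round-trip."""
    stats["queries"] += 1
    return db.execute(sql, params).fetchall()


def titles_n_plus_one():
    """1 query for the parents, then 1 more per parent."""
    out = {}
    for author_id, name in run("SELECT id, name FROM authors ORDER BY id"):
        rows = run("SELECT title FROM books WHERE author_id = ? ORDER BY id", (author_id,))
        out[name] = [title for (title,) in rows]
    return out


def titles_joined():
    """Same result in a single round-trip."""
    out = {}
    rows = run(
        "SELECT a.name, b.title FROM authors a "
        "LEFT JOIN books b ON b.author_id = a.id ORDER BY a.id, b.id"
    )
    for name, title in rows:
        out.setdefault(name, [])
        if title is not None:  # LEFT JOIN yields NULL for authors without books
            out[name].append(title)
    return out
```

With two authors the first version issues three queries and the second issues one; the gap grows linearly with the number of parents.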

App-level

Compress payloads (gzip/br) — usually 5–10× smaller on JSON.
Close file handles / connections — defer/finally/using/context-manager.
Avoid loading whole files in memory — stream.
Cache expensive computations (memoize). Beware staleness.
Graceful degradation under load — shed non-critical features (recs, analytics).
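The payload-compression point is easy to check locally. The sketch below gzips a made-up JSON list; the actual ratio depends heavily on how repetitive the payload is, so measure on your own responses.

```python
import gzip
import json

# Hypothetical response body: 1,000 similar records, as list endpoints often return.
rows = [{"id": i, "name": f"user-{i}", "active": i % 2 == 0} for i in range(1000)]
raw = json.dumps(rows).encode("utf-8")
packed = gzip.compress(raw)

ratio = len(raw) / len(packed)
print(f"raw={len(raw)} bytes, gzip={len(packed)} bytes, ratio={ratio:.1f}x")
```

A server would send `packed` with `Content-Encoding: gzip`, and only when the request's `Accept-Encoding` header allows it.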

Concurrency vs parallelism

	Concurrency	Parallelism
What	Multiple tasks in progress at once, interleaved (works on one core)	Multiple tasks executing at the same instant on multiple cores
Wins on	I/O-bound (DB, HTTP, file)	CPU-bound (encoding, math)
Primitives	async/await, goroutines, threads	process pools, worker threads
Python	`asyncio`, `aiohttp`	`multiprocessing` (GIL blocks CPU threads)
Node	event loop (default)	worker_threads, cluster
Go	goroutines	GOMAXPROCS = #CPUs
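The I/O-bound column can be illustrated with `asyncio`: three simulated calls wait at the same time on one thread. `fetch` is a stand-in for a real DB or HTTP call, with `asyncio.sleep` playing the role of network latency.

```python
import asyncio
import time


async def fetch(name, delay):
    """Stand-in for an I/O call; awaiting the sleep yields control to the event loop."""
    await asyncio.sleep(delay)
    return name


async def fetch_all():
    # The three waits overlap, so the total is about the longest wait, not the sum.
    return await asyncio.gather(fetch("a", 0.1), fetch("b", 0.1), fetch("c", 0.1))


start = time.perf_counter()
results = asyncio.run(fetch_all())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

The same structure gives no speed-up for CPU-bound work, because a busy task never yields to the loop; that is the case for process pools.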

Scaling axes

Vertical — bigger box. Cheap until ceiling.
Horizontal — more boxes behind LB. Requires stateless app.
Functional — split monolith into services by domain.
Data — read replicas, sharding, CQRS, event sourcing.

MODULE 20 — INTEGRATIONS

Advanced Integrations

Big files, real-time, push patterns.

Object storage (S3 et al.)

Direct upload via pre-signed URLs — client uploads to S3 directly. Server never sees bytes.
Multipart upload for files > 100 MB: split into parts of ≥ 5 MB (the last part may be smaller); upload parts in parallel; resume by retrying only the failed parts.
Streaming — pipe S3 → response without buffering full file in memory.
Lifecycle: transition to IA/Glacier; expire old objects automatically.
Versioning + MFA delete for compliance buckets.
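Splitting a file into multipart ranges is plain arithmetic. A sketch, assuming an 8 MiB default part size; `part_ranges` is a hypothetical helper, and the actual upload calls are left out.

```python
MIN_PART = 5 * 1024 * 1024  # minimum size for every part except the last


def part_ranges(size, part_size=8 * 1024 * 1024):
    """Return (part_number, start, end_exclusive) tuples covering `size` bytes."""
    if part_size < MIN_PART:
        raise ValueError("part_size is below the 5 MiB minimum")
    return [
        (n + 1, start, min(start + part_size, size))
        for n, start in enumerate(range(0, size, part_size))
    ]
```

Each tuple maps to one independent part upload, which is what makes parallelism and per-part retry possible.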

# pre-signed PUT
import uuid
import boto3

s3 = boto3.client("s3")
url = s3.generate_presigned_url(
  "put_object",
  Params={"Bucket": "uploads", "Key": f"users/{uid}/{uuid.uuid4()}.jpg",
          "ContentType": "image/jpeg"},
  ExpiresIn=900,   # seconds (15 min)
)
# client then: curl -X PUT "$url" -H "Content-Type: image/jpeg" --data-binary @file

Real-time

	WebSockets	Server-Sent Events (SSE)	Long polling
Direction	Bidirectional	Server → client	Client polls; server holds
Transport	WS upgrade over TCP	HTTP keep-alive + `text/event-stream`	HTTP
Reconnect	App-level	Built-in (`Last-Event-ID`)	Per-poll
Use	Chat, games, collab	Notifications, dashboards	Legacy/fallback
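The SSE wire format is simple enough to encode by hand: `field: value` lines, with a blank line ending each event. A sketch of a frame encoder follows; `sse_event` is a hypothetical helper name.

```python
def sse_event(data, event=None, event_id=None):
    """Encode one Server-Sent Event frame; the trailing blank line terminates it."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")  # echoed back by the browser as Last-Event-ID
    if event is not None:
        lines.append(f"event: {event}")
    # Each line of a multi-line payload needs its own "data:" field.
    lines.extend(f"data: {line}" for line in data.split("\n"))
    return "\n".join(lines) + "\n\n"
```

A handler would stream these frames on a response with `Content-Type: text/event-stream`; sending the `id` field is what lets the built-in reconnect resume from the last event seen.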

Pub/Sub architecture

Producers publish events to topics; subscribers consume independently.
Decouples services — publisher doesn't know who listens.
Brokers: Redis Pub/Sub (fire-and-forget), Kafka (replayable), GCP Pub/Sub, NATS.

Webhooks (server-initiated)

	Polling	Webhook
Initiator	Consumer	Producer
Latency	Interval	Near-instant
Cost	Wasted polls	Pay per event
Reliability	Easy (idempotent reads)	Hard (retries, DLQ, signatures)

Outbound webhook checklist

HTTPS only. Sign payload (HMAC-SHA256) — receiver verifies.
Include unique event ID + timestamp (replay protection).
Retry with exponential backoff (e.g., 1m, 5m, 30m, 6h, 24h); DLQ after.
Expose dashboard so customers can see deliveries, replay manually.
Test locally with ngrok / cloudflared tunnels.

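# signing
import hashlib, hmac

# secret: bytes; body: raw request bytes; ts: str (unix seconds)
signed = ts.encode() + b"." + body   # sign the timestamp too, or it gives no replay protection
sig = hmac.new(secret, signed, hashlib.sha256).hexdigest()
headers = {
  "X-Webhook-Signature": f"sha256={sig}",
  "X-Webhook-Timestamp": ts,
  "X-Webhook-Id": event_id,
}

# verifying: constant-time compare!
signed = headers["X-Webhook-Timestamp"].encode() + b"." + body
expected = "sha256=" + hmac.new(secret, signed, hashlib.sha256).hexdigest()
if not hmac.compare_digest(expected, headers["X-Webhook-Signature"]):
    raise ValueError("bad webhook signature")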
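On the receiving side, the signature check is usually combined with a freshness window and de-duplication by event ID. The sketch below assumes the sender signs `timestamp + "." + body`, which is a common convention but an assumption here; the five-minute tolerance and the in-memory `seen_ids` set are also illustrative.

```python
import hashlib
import hmac
import time

TOLERANCE = 300   # seconds; hypothetical replay window
seen_ids = set()  # a shared store with TTL would replace this in a real deployment


def accept_webhook(secret, headers, body, now=None):
    """Return True only for a fresh, correctly signed, not-yet-seen delivery."""
    now = time.time() if now is None else now
    ts = headers["X-Webhook-Timestamp"]
    if abs(now - int(ts)) > TOLERANCE:
        return False  # too old or too far in the future: possible replay
    signed = ts.encode() + b"." + body
    expected = "sha256=" + hmac.new(secret, signed, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, headers["X-Webhook-Signature"]):
        return False  # body or timestamp was altered, or wrong secret
    if headers["X-Webhook-Id"] in seen_ids:
        return False  # duplicate delivery: already processed
    seen_ids.add(headers["X-Webhook-Id"])
    return True
```

Recording the event ID only after the signature passes keeps unauthenticated requests from filling the de-duplication store.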

MODULE 21 — CONTRACT

OpenAPI Standards

Spec-first APIs: describe → generate clients/servers/docs/tests.

Why API-first

Single source of truth — frontend, backend, partners all consume same spec.
Parallel work — UI mocks against spec while server is built.
Auto-generated clients (TS, Python, Java, Go) — no hand-written HTTP.
Auto-generated server stubs and request validators.
Diffable in code review — breaking changes are visible.

Evolution

Swagger 2.0 (2014) → OpenAPI 3.0 (2017) → 3.1 (2021, aligned with JSON Schema 2020-12).
Tools: Swagger UI, Redoc (docs), Postman/Insomnia (test), oapi-codegen / openapi-typescript / openapi-python-client (codegen).

Document anatomy

openapi: 3.1.0
info:
  title:   Orders API
  version: 1.4.0
servers:
  - url: https://api.x.com/v1

paths:
  /orders/{id}:
    get:
      operationId: getOrder
      parameters:
        - name: id
          in: path
          required: true
          schema: { type: string, format: uuid }
      responses:
        '200':
          description: OK
          content:
            application/json:
              schema: { $ref: '#/components/schemas/Order' }
        '404': { $ref: '#/components/responses/NotFound' }
      security:
        - bearerAuth: []

components:
  responses:
    NotFound:
      description: Resource not found
  schemas:
    Order:
      type: object
      required: [id, status, total]
      properties:
        id:     { type: string, format: uuid }
        status: { type: string, enum: [pending, paid, shipped] }
        total:  { type: number, minimum: 0 }
  securitySchemes:
    bearerAuth:
      type: http
      scheme: bearer
      bearerFormat: JWT

Best practices

Keep spec in repo next to code. Lint with Spectral.
CI step: regenerate clients on every change; fail PR if drift.
Use operationId consistently — codegen names methods from it.
Define error schemas once in components/responses; reference from each endpoint.
Version via URL segment (/v1); bump on breaking change.

MODULE 22 — DELIVERY

Testing, Code Quality, DevOps

Ship safely. Verify automatically. Operate continuously.

Test types & pyramid

Type	Scope	Speed	Quantity
Unit	Pure function / class	ms	Many (base of pyramid)
Integration	Module + real DB / queue	sec	Some
Contract	Service boundary (consumer-driven)	sec	Per-pair
End-to-end	Full user flow through UI	min	Few (top of pyramid)
Load / stress	System under traffic	min–hr	Pre-release
UAT	Real users on staging	days	Per release
Security	SAST, DAST, dep scan, pentest	—	Continuous + pre-release

TDD cycle

Red — write failing test for desired behavior.
Green — minimum code to pass.
Refactor — clean up; tests still pass.

CI/CD pipeline shape

push / PR ─► lint ─► unit ─► build ─► integration ─► sec-scan ─► sign image ─► deploy to staging ─► smoke tests ─► (manual approval?) ─► deploy to prod ─► rollout watch (auto-rollback on SLO breach)

Code quality

Lint — eslint, ruff, golangci-lint, rubocop. Run pre-commit + CI.
Format — prettier, black, gofmt. Don't argue style; let tools settle it.
Cyclomatic complexity — keep functions < 10–15; split when over.
Coverage — useful as floor (e.g., 70%), useless as gospel.
Mutation testing — flips operators; checks tests actually catch.

12-Factor App

Codebase: one app, one repo, many deploys.
Dependencies: explicit, isolated (lockfile).
Config: in environment, never in code.
Backing services: treat as attached resources (DB, queue swap by URL).
Build → Release → Run: strict separation.
Processes: stateless, share-nothing.
Port binding: app exports HTTP itself.
Concurrency: scale via process model.
Disposability: fast startup, graceful shutdown.
Dev/prod parity: same OS, services, data shape.
Logs: stream to stdout; let platform aggregate.
Admin processes: one-off scripts run in same env.

DevOps stack

Layer	Tools
IaC	Terraform, Pulumi, CDK, CloudFormation
Containers	Docker / OCI, BuildKit, buildpacks
Orchestration	Kubernetes, ECS, Nomad
CI/CD	GitHub Actions, GitLab CI, ArgoCD, Flux
Secrets	Vault, AWS SM, sops, sealed-secrets
Observability	Prometheus, Grafana, Loki, Jaeger, Datadog

Deployment strategies

Strategy	How	Rollback	Risk
Recreate	Stop v1, start v2	Stop v2, start v1	Downtime — small apps only.
Rolling	Replace pods batch-by-batch	Roll back same way	Mixed versions during rollout.
Blue/Green	Stand up v2 fully; flip LB	Flip LB back to v1	2× capacity briefly.
Canary	v2 to 1%/5%/25%/100%	Halt rollout, drain canary	Need automated metrics gate.
Feature flag	Deploy dark; toggle on for cohort	Toggle off — no redeploy	Flag-debt if not cleaned up.
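The "automated metrics gate" for a canary can be as simple as comparing error rates between versions. A rough sketch follows; the thresholds, the three verdict strings, and `canary_verdict` itself are assumptions for illustration, and real gates typically also look at latency and saturation.

```python
def canary_verdict(baseline_errors, baseline_total, canary_errors, canary_total,
                   max_ratio=2.0, min_requests=100):
    """Return 'wait', 'promote', or 'rollback' from error counts of both versions."""
    if canary_total < min_requests:
        return "wait"  # not enough canary traffic to judge
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total
    # A small absolute floor keeps a near-zero baseline from blocking every rollout.
    allowed = max(baseline_rate * max_ratio, 0.001)
    return "promote" if canary_rate <= allowed else "rollback"
```

A rollout controller would call this at each step (1%, 5%, 25%) and only widen the canary on `promote`.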

Horizontal vs vertical scaling

Vertical — bigger instance. Fast win, hits ceiling, single point of failure.
Horizontal — more instances + LB. Requires stateless app + shared DB/cache. Near-linear scaling until DB becomes bottleneck.