MODULE 1

SRE Principles

Reliability as engineering discipline. Error budgets quantify trade-offs.

Core Tenets

  • Embrace risk — 100% reliability is the wrong target. Set an explicit budget and spend it.
  • Service Level Objectives drive engineering priority. Below SLO → freeze features, fix reliability. Above → ship faster.
  • Eliminate toil — repetitive manual ops work. Cap toil below 50% of each SRE's time.
  • Automate everything — runbooks → code.
  • Blameless postmortems — focus on systemic causes, not individual blame.
  • Shared ownership — devs run what they build; SREs partner, not absorb.

Error Budget Math

SLO = 99.9% availability over 30 days
Budget = (1 - 0.999) × 30d = 0.001 × 43,200 min = 43.2 min/month

Burn rate = (errors observed) / (budget per same interval)
  Burn rate 10× over 1 hr = 10 hours' worth of budget consumed in 1 hr
  Typical fast-burn alert: 14.4× over 1 hr = 2% of budget burned
  Slow burn: 6× over 6 hr = 5% burned
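
A minimal Prometheus sketch of the burn-rate math, assuming a 99.9% SLO and a hypothetical http_requests_total counter with a code label; adapt names to your service.

groups:
  - name: slo-burn
    rules:
      # error ratio over the last hour: bad events / valid events
      - record: service:error_ratio:rate1h
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[1h]))
            /
          sum(rate(http_requests_total[1h]))
      # burn rate = error ratio / budget rate (1 - 0.999 = 0.001)
      - record: service:burn_rate:1h
        expr: service:error_ratio:rate1h / 0.001
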
MODULE 2

SLI / SLO / SLA

Measure → target → contract.

Definitions

Term   Definition                                         Example
SLI    Indicator: measured ratio of good / valid events   p99 latency, success rate, freshness
SLO    Target on an SLI over a window                     p99 < 300 ms over 28 days, 99.9%
SLA    Customer contract; financial penalty if breached   99.9% uptime or 10% credit

Common SLI Types

  • Availability: successful_requests / valid_requests.
  • Latency: requests_under_threshold / valid_requests. Use p50 / p95 / p99, never the average (recording-rule sketch below).
  • Quality: graceful degradation rate.
  • Freshness: data age < X for batch / streaming pipelines.
  • Correctness: consistency check pass rate.
  • Throughput: requests/sec served vs offered.
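
A sketch of a latency SLI as a good/valid ratio, assuming a hypothetical http_request_duration_seconds histogram with a 0.3 s bucket:

groups:
  - name: latency-sli
    rules:
      # fraction of requests completing under 300 ms
      - record: service:latency_sli:rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
            /
          sum(rate(http_request_duration_seconds_count[5m]))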

Window Choice

Rolling 28-day or 30-day windows are standard; calendar-month resets invite end-of-window gaming. Multi-window, multi-burn-rate alerts: page on fast burn (1 hr × 14.4×), ticket on slow burn (6 hr × 6×), sketched below.
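
Both alerts as Prometheus rules, building on the hypothetical burn-rate recording rule from Module 1 plus a 6 h analogue; thresholds follow the numbers above.

groups:
  - name: slo-alerts
    rules:
      - record: service:burn_rate:6h
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[6h]))
              /
            sum(rate(http_requests_total[6h]))
          ) / 0.001
      # fast burn: 2% of the 30-day budget gone in 1 hr → page
      - alert: ErrorBudgetFastBurn
        expr: service:burn_rate:1h > 14.4
        for: 2m
        labels:
          severity: page
      # slow burn: 5% gone in 6 hr → ticket
      - alert: ErrorBudgetSlowBurn
        expr: service:burn_rate:6h > 6
        for: 15m
        labels:
          severity: ticket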

MODULE 3

CI/CD Pipelines

Build → test → package → deploy → verify.

Pipeline Stages

  1. Source — webhook on push/PR. Branch protection: required reviews, signed commits, status checks.
  2. Build — compile, lint, type-check. Hermetic, reproducible. Cache deps.
  3. Test — unit (fast), integration (deps), contract, e2e (slow).
  4. Static analysis — SAST, dep scan (Snyk, Dependabot), license check.
  5. Package — container image, sign (cosign), SBOM.
  6. Deploy staging — smoke tests, integration suite.
  7. Deploy prod — progressive rollout.
  8. Verify — synthetic checks, error rate, latency.
  9. Rollback — automated on SLO breach.

GitHub Actions Pattern

name: ci
on:
  pull_request:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with: { go-version: '1.22', cache: true }
      - run: go vet ./...
      - run: go test -race -cover ./...

  build:
    needs: test
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
      id-token: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3   # type=gha cache requires the buildx driver
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/org/app:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

GitOps

Git as single source of truth for desired infra state. Argo CD / Flux watches repo, reconciles cluster. Benefits: auditable, rollback = revert, no kubectl access for humans in prod.
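
A minimal Argo CD Application sketch; repo URL, path, and names are placeholders.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/deploy-config   # desired state lives here
    targetRevision: main
    path: overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: app
  syncPolicy:
    automated:
      prune: true      # delete resources removed from git
      selfHeal: true   # revert manual drift in the cluster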

MODULE 4

Deployment Patterns

How new code reaches users without taking down service.

Patterns

Pattern         How                                           Risk profile
Recreate        Stop old, start new                           Downtime; OK for dev only
Rolling         Replace pods N at a time                      Mixed versions during rollout
Blue/Green      Two full envs, swap LB                        Instant cutover; 2× capacity cost
Canary          Route 1–5% to new, then increase              Limited blast radius; needs metrics
Shadow/mirror   Replay prod traffic to new without serving    Validate without user impact
Feature flag    Deploy dark, toggle per user/cohort           Best for product changes; flag-debt risk

Canary Recipe

  • Promote on success metric: error rate < baseline + 0.1%, p99 < baseline × 1.1.
  • Steps: 1% → 5% → 25% → 50% → 100%, 10 min between (Rollout sketch below).
  • Auto-rollback on regression. Don't proceed if statistical signal weak (low traffic).
  • Tools: Argo Rollouts, Flagger, LaunchDarkly + ingress weights.
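
The step schedule above as an Argo Rollouts strategy sketch (replicas, selector, and pod template omitted; in practice gate each step with an AnalysisTemplate on the success metrics):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app
spec:
  strategy:
    canary:
      steps:
        - setWeight: 1
        - pause: {duration: 10m}
        - setWeight: 5
        - pause: {duration: 10m}
        - setWeight: 25
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {duration: 10m}
        # weight goes to 100% after the final pause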

Feature Flag Hygiene

  • Every flag has owner + expiry date.
  • Long-lived (kill switches, experiments) tagged separately.
  • Quarterly cleanup sprint to remove dead flags.
  • Avoid nested flag combinations — exponential test matrix.
MODULE 5

Observability

Three pillars: metrics, logs, traces. Plus profiling and events.

Pillars

Pillar     Cardinality               Question answered                Tools
Metrics    Low (aggregated)          Is it broken? How much?          Prometheus, Datadog, CloudWatch
Logs       High (per event)          What happened on this request?   Loki, ES, Splunk, CloudWatch Logs
Traces     Per request (span tree)   Where is the latency?            OpenTelemetry, Jaeger, Tempo, X-Ray
Profiles   Continuous (CPU/heap)     Why is this slow / fat?          Pyroscope, Parca, Datadog

RED & USE

  • RED for services: Rate, Errors, Duration. Per endpoint (recording-rule sketch below).
  • USE for resources: Utilization, Saturation, Errors. Per CPU/disk/net/memory.
  • Four Golden Signals (Google SRE): latency, traffic, errors, saturation.
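
RED per endpoint as Prometheus recording rules, reusing the hypothetical metrics from Module 2 and assuming a handler label:

groups:
  - name: red
    rules:
      - record: handler:request_rate:rate5m    # Rate
        expr: sum by (handler) (rate(http_requests_total[5m]))
      - record: handler:error_rate:rate5m      # Errors
        expr: sum by (handler) (rate(http_requests_total{code=~"5.."}[5m]))
      - record: handler:latency_p99:5m         # Duration
        expr: |
          histogram_quantile(0.99,
            sum by (handler, le) (rate(http_request_duration_seconds_bucket[5m])))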

Cardinality Trap

Every unique label combination creates a new time series. Unbounded values (user_id, request_id, raw URL paths) as metric labels blow up storage and query cost. Keep those in logs and traces; keep metric labels bounded (endpoint, status class, region).

Distributed Tracing

  • Trace = tree of spans across services. Span = single operation with start, end, tags.
  • Context propagation via W3C traceparent header.
  • Head-based sampling (decide at ingress) vs tail-based (decide after seeing full trace; better for errors but stateful).
  • OpenTelemetry = vendor-neutral SDK + protocol (OTLP).
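
Tail-based sampling sketched as an OpenTelemetry Collector (contrib) tail_sampling processor; policy names and thresholds are illustrative.

processors:
  tail_sampling:
    decision_wait: 10s              # hold spans until the trace is complete
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 5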

Logging Discipline

  • Structured (JSON / logfmt), not free text.
  • Levels: ERROR (page-worthy), WARN (anomaly), INFO (lifecycle), DEBUG (dev).
  • Include trace_id, request_id, user_id, tenant_id.
  • Sample DEBUG/INFO at high RPS; never sample ERROR.
  • Never log secrets, PII unless redacted.
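
One illustrative log line; field names follow the list above, values are placeholders.

{"level": "error", "ts": "2026-05-08T12:00:00Z", "msg": "checkout failed", "trace_id": "4bf92f3577b34da6", "request_id": "req-123", "tenant_id": "t-42", "error": "upstream timeout"}
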
MODULE 6

Incident Response

Detect → mitigate → resolve → learn.

Severity Levels

Sev     Definition                                Response
SEV-1   Major outage; revenue / user impact       All-hands, exec page, war room
SEV-2   Significant degradation; partial impact   On-call IC + responders
SEV-3   Minor; no user impact yet                 Investigate during business hours
SEV-4   Cosmetic / future risk                    Backlog ticket

ICS Roles

  • Incident Commander (IC) — coordinates; doesn't fix.
  • Operations Lead — drives mitigation.
  • Communications Lead — internal + customer comms.
  • Scribe — timeline, decisions log.
  • Subject-matter experts — pulled in as needed.

Blameless Postmortem

  • Timeline (UTC, exact times, who did what).
  • Impact: users, requests, $$, duration.
  • Root cause + contributing factors (5 Whys).
  • What went well + what went poorly.
  • Action items: owner + due date. Track to completion.
  • Detection delay analysis — why didn't we catch it sooner?

On-Call Health

  • Page volume target: < 2 per shift. More = alert tuning needed.
  • Compensate on-call (time off, stipend).
  • Rotate week-on / N-off. Primary + secondary.
  • Runbook per alert. Link in alert payload.
MODULE 7

Capacity & Load Testing

Provision for peak + headroom; verify with traffic.

Capacity Planning

  • Baseline = current peak. Plan = baseline × growth × seasonality × headroom (typically 30–50%). Example: 10k RPS × 1.4 growth × 1.3 seasonal × 1.4 headroom ≈ 25k RPS.
  • Failure scenarios: lose 1 AZ → remaining must absorb traffic. Plan N+1 / N+2.
  • Track per resource: CPU, memory, network, disk IOPS, connection count, queue depth.

Load Test Types

Type     Goal
Smoke    System works at all under load
Load     Performance at expected peak
Stress   Find breaking point
Spike    Sudden traffic surge
Soak     Stability over hours/days; memory leaks

Tools: k6, Locust, Vegeta, JMeter, Gatling.

Little's Law

L = λ × W
  L = average concurrent requests in system
  λ = arrival rate (req/s)
  W = average response time (s)

example: 1000 req/s × 0.2 s = 200 concurrent requests → need ≥200 worker slots
  with 30% headroom → 260 workers minimum
MODULE 8

Chaos Engineering

Inject failure into prod-like environments to find weaknesses before users do.

Principles

  1. Define steady-state (the metric that says "system is fine").
  2. Hypothesize steady-state holds in both control + experiment.
  3. Inject real-world events (instance kill, network latency, disk full).
  4. Try to disprove the hypothesis. Evidence beats opinion.
  5. Minimize blast radius. Have an abort switch.

Common Experiments

  • Terminate random pod / VM / AZ (Chaos Monkey, AWS FIS, Litmus, Gremlin).
  • Inject latency / packet loss between services.
  • Drop dependency (database, Redis, downstream service).
  • Fill disk, exhaust file descriptors.
  • Clock skew, DNS failure.
  • GameDay: scheduled, supervised, organization-wide drill.
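
One concrete shape: a pod-kill experiment as a Chaos Mesh CRD (a tool in the same family as those listed); selector values are placeholders.

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-one-checkout-pod
spec:
  action: pod-kill
  mode: one                  # single pod → minimal blast radius
  selector:
    namespaces: [staging]    # prod-like first, prod once trusted
    labelSelectors:
      app: checkout
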
MODULE 9

Build & Release Engineering

Reproducibility, supply chain, artifact lifecycle.

Supply Chain

  • SBOM (software bill of materials) — CycloneDX, SPDX. Generate per build (CI sketch below).
  • Sign artifacts (cosign, Sigstore). Verify at admission.
  • SLSA levels (1–4) describe build integrity. Aim for L3.
  • Pin transitive deps. Lockfiles checked in. Renovate / Dependabot.
  • Provenance attestations linking artifact → build → source commit.
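
Signing and SBOM generation sketched as extra steps on the Module 3 build job (keyless cosign uses the id-token: write permission already granted there); action versions and the image name are assumptions.

      - uses: sigstore/cosign-installer@v3
      - run: cosign sign --yes ghcr.io/org/app:${{ github.sha }}   # keyless, OIDC-backed
      - uses: anchore/sbom-action@v0
        with:
          image: ghcr.io/org/app:${{ github.sha }}
          format: cyclonedx-json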

Versioning

  • SemVer (MAJOR.MINOR.PATCH) for libraries.
  • CalVer (2026.05.08) or commit-SHA for services.
  • Immutable artifacts — never re-tag.
  • Promote artifact across envs, don't rebuild.
MODULE 10

Cheat Sheet

Reliability checklist for design reviews.

SLO Setup

  • 1–3 SLIs per service
  • 28-day rolling window
  • Multi-burn-rate alerts (1h / 6h)
  • Customer-aligned: latency from edge, not internal hop
  • Error budget tracked weekly

Resilience Patterns

  • Timeouts everywhere, no infinite waits (config sketch below)
  • Retry with jitter + backoff
  • Circuit breaker on dependencies
  • Bulkhead — separate thread pools / pods per dep
  • Graceful degradation paths
  • Idempotency keys on writes
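
Timeouts and retries as config, for example an Istio VirtualService (names and values illustrative; Envoy adds jitter to retry backoff by default):

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments
spec:
  hosts: [payments]
  http:
    - route:
        - destination: {host: payments}
      timeout: 2s              # no infinite waits
      retries:
        attempts: 3
        perTryTimeout: 500ms
        retryOn: 5xx,reset,connect-failure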

Pre-Prod Gates

  • Unit + integration tests pass
  • Coverage ≥ team threshold
  • SAST + dep scan clean
  • Image signed + SBOM
  • Migration tested forward + back
  • Runbook updated

Alert Quality

  • Alert on symptoms (user impact), not causes
  • Every alert has runbook link
  • Alert is actionable (not "FYI")
  • Test fire alerts in non-prod monthly
  • Audit pages: false / actionable / wake-worthy?

Postmortem Template

  • Title + date + sev
  • Summary (3 lines)
  • Impact (users, $$, time)
  • Timeline (UTC)
  • Root cause + contributing factors
  • What went well / poorly
  • Action items: owner, due, tracking

Numbers

  • 99.9% = 43.2 min/mo budget
  • Headroom 30% baseline
  • Page volume < 2 / shift
  • Toil < 50% / SRE
  • Postmortem within 5 days
  • Action items closed < 30 days