MODULE 1

SRE Principles

Reliability as engineering discipline. Error budgets quantify trade-offs.

Core Tenets

  • Embrace risk — 100% reliability is the wrong target. Set an explicit budget and spend it.
  • Service Level Objectives drive engineering priority. Below SLO → freeze features, fix reliability. Above → ship faster.
  • Eliminate toil — repetitive manual ops work. Cap toil below 50% of each SRE's time.
  • Automate everything — runbooks → code.
  • Blameless postmortems — focus on systemic causes, not individual blame.
  • Shared ownership — devs run what they build; SREs partner, not absorb.

Error Budget Math

SLO = 99.9% availability over 30 days
Budget = (1 - 0.999) × 30d = 0.001 × 43,200 min = 43.2 min/month

Burn rate = (errors observed) / (budget per same interval)
  Burn rate 10× over 1 hr = 10 hours' worth of budget consumed in 1 hr
  Typical fast-burn alert: 14.4× over 1 hr = 2% of budget burned
  Slow burn: 6× over 6 hr = 5% burned
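
A minimal Prometheus sketch of the burn-rate math, assuming a 99.9% SLO and a hypothetical http_requests_total counter with a code label; adapt names to your service.

groups:
  - name: slo-burn
    rules:
      # error ratio over the last hour: bad events / valid events
      - record: service:error_ratio:rate1h
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[1h]))
            /
          sum(rate(http_requests_total[1h]))
      # burn rate = error ratio / budget rate (1 - 0.999 = 0.001)
      - record: service:burn_rate:1h
        expr: service:error_ratio:rate1h / 0.001
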
MODULE 2

SLI / SLO / SLA

Measure → target → contract.

Definitions

Term   Definition                                         Example
SLI    Indicator: measured ratio of good / valid events   p99 latency, success rate, freshness
SLO    Target on an SLI over a window                     p99 < 300 ms over 28 days, 99.9%
SLA    Customer contract; financial penalty if breached   99.9% uptime or 10% credit

Common SLI Types

  • Availability: successful_requests / valid_requests.
  • Latency: requests_under_threshold / valid_requests. Use p50 / p95 / p99, never the average (recording-rule sketch below).
  • Quality: graceful degradation rate.
  • Freshness: data age < X for batch / streaming pipelines.
  • Correctness: consistency check pass rate.
  • Throughput: requests/sec served vs offered.
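
A sketch of a latency SLI as a good/valid ratio, assuming a hypothetical http_request_duration_seconds histogram with a 0.3 s bucket:

groups:
  - name: latency-sli
    rules:
      # fraction of requests completing under 300 ms
      - record: service:latency_sli:rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
            /
          sum(rate(http_request_duration_seconds_count[5m]))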

Window Choice

Rolling 28-day or 30-day windows are standard; calendar-month resets invite end-of-window gaming. Multi-window, multi-burn-rate alerts: page on fast burn (1 hr × 14.4×), ticket on slow burn (6 hr × 6×), sketched below.
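
Both alerts as Prometheus rules, building on the hypothetical burn-rate recording rule from Module 1 plus a 6 h analogue; thresholds follow the numbers above.

groups:
  - name: slo-alerts
    rules:
      - record: service:burn_rate:6h
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[6h]))
              /
            sum(rate(http_requests_total[6h]))
          ) / 0.001
      # fast burn: 2% of the 30-day budget gone in 1 hr → page
      - alert: ErrorBudgetFastBurn
        expr: service:burn_rate:1h > 14.4
        for: 2m
        labels:
          severity: page
      # slow burn: 5% gone in 6 hr → ticket
      - alert: ErrorBudgetSlowBurn
        expr: service:burn_rate:6h > 6
        for: 15m
        labels:
          severity: ticket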

MODULE 3

CI/CD Pipelines

Build → test → package → deploy → verify.

Pipeline Stages

  1. Source — webhook on push/PR. Branch protection: required reviews, signed commits, status checks.
  2. Build — compile, lint, type-check. Hermetic, reproducible. Cache deps.
  3. Test — unit (fast), integration (deps), contract, e2e (slow).
  4. Static analysis — SAST, dep scan (Snyk, Dependabot), license check.
  5. Package — container image, sign (cosign), SBOM.
  6. Deploy staging — smoke tests, integration suite.
  7. Deploy prod — progressive rollout.
  8. Verify — synthetic checks, error rate, latency.
  9. Rollback — automated on SLO breach.

GitHub Actions Pattern

name: ci
on:
  pull_request:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with: { go-version: '1.22', cache: true }
      - run: go vet ./...
      - run: go test -race -cover ./...

  build:
    needs: test
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
      id-token: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3   # type=gha cache requires the buildx driver
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/org/app:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

GitOps

Git as single source of truth for desired infra state. Argo CD / Flux watches repo, reconciles cluster. Benefits: auditable, rollback = revert, no kubectl access for humans in prod.
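
A minimal Argo CD Application sketch; repo URL, path, and names are placeholders.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/deploy-config   # desired state lives here
    targetRevision: main
    path: overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: app
  syncPolicy:
    automated:
      prune: true      # delete resources removed from git
      selfHeal: true   # revert manual drift in the cluster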

MODULE 4

Deployment Patterns

How new code reaches users without taking down service.

Patterns

Pattern         How                                           Risk profile
Recreate        Stop old, start new                           Downtime; OK for dev only
Rolling         Replace pods N at a time                      Mixed versions during rollout
Blue/Green      Two full envs, swap LB                        Instant cutover; 2× capacity cost
Canary          Route 1–5% to new, then increase              Limited blast radius; needs metrics
Shadow/mirror   Replay prod traffic to new without serving    Validate without user impact
Feature flag    Deploy dark, toggle per user/cohort           Best for product changes; flag-debt risk

Canary Recipe

  • Promote on success metric: error rate < baseline + 0.1%, p99 < baseline × 1.1.
  • Steps: 1% → 5% → 25% → 50% → 100%, 10 min between (Rollout sketch below).
  • Auto-rollback on regression. Don't proceed if statistical signal weak (low traffic).
  • Tools: Argo Rollouts, Flagger, LaunchDarkly + ingress weights.
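
The step schedule above as an Argo Rollouts strategy sketch (replicas, selector, and pod template omitted; in practice gate each step with an AnalysisTemplate on the success metrics):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app
spec:
  strategy:
    canary:
      steps:
        - setWeight: 1
        - pause: {duration: 10m}
        - setWeight: 5
        - pause: {duration: 10m}
        - setWeight: 25
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {duration: 10m}
        # weight goes to 100% after the final pause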

Feature Flag Hygiene

  • Every flag has owner + expiry date.
  • Long-lived (kill switches, experiments) tagged separately.
  • Quarterly cleanup sprint to remove dead flags.
  • Avoid nested flag combinations — exponential test matrix.
MODULE 5

Observability

Three pillars: metrics, logs, traces. Plus profiling and events.

Pillars

Pillar     Cardinality               Question answered                Tools
Metrics    Low (aggregated)          Is it broken? How much?          Prometheus, Datadog, CloudWatch
Logs       High (per event)          What happened on this request?   Loki, ES, Splunk, CloudWatch Logs
Traces     Per request (span tree)   Where is the latency?            OpenTelemetry, Jaeger, Tempo, X-Ray
Profiles   Continuous (CPU/heap)     Why is this slow / fat?          Pyroscope, Parca, Datadog

RED & USE

  • RED for services: Rate, Errors, Duration. Per endpoint (recording-rule sketch below).
  • USE for resources: Utilization, Saturation, Errors. Per CPU/disk/net/memory.
  • Four Golden Signals (Google SRE): latency, traffic, errors, saturation.
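
RED per endpoint as Prometheus recording rules, reusing the hypothetical metrics from Module 2 and assuming a handler label:

groups:
  - name: red
    rules:
      - record: handler:request_rate:rate5m    # Rate
        expr: sum by (handler) (rate(http_requests_total[5m]))
      - record: handler:error_rate:rate5m      # Errors
        expr: sum by (handler) (rate(http_requests_total{code=~"5.."}[5m]))
      - record: handler:latency_p99:5m         # Duration
        expr: |
          histogram_quantile(0.99,
            sum by (handler, le) (rate(http_request_duration_seconds_bucket[5m])))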

Cardinality Trap

Every unique label combination creates a new time series. Unbounded values (user_id, request_id, raw URL paths) as metric labels blow up storage and query cost. Keep those in logs and traces; keep metric labels bounded (endpoint, status class, region).

Distributed Tracing

  • Trace = tree of spans across services. Span = single operation with start, end, tags.
  • Context propagation via W3C traceparent header.
  • Head-based sampling (decide at ingress) vs tail-based (decide after seeing full trace; better for errors but stateful).
  • OpenTelemetry = vendor-neutral SDK + protocol (OTLP).
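
Tail-based sampling sketched as an OpenTelemetry Collector (contrib) tail_sampling processor; policy names and thresholds are illustrative.

processors:
  tail_sampling:
    decision_wait: 10s              # hold spans until the trace is complete
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 5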

Logging Discipline

  • Structured (JSON / logfmt), not free text.
  • Levels: ERROR (page-worthy), WARN (anomaly), INFO (lifecycle), DEBUG (dev).
  • Include trace_id, request_id, user_id, tenant_id.
  • Sample DEBUG/INFO at high RPS; never sample ERROR.
  • Never log secrets, PII unless redacted.
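
One illustrative log line; field names follow the list above, values are placeholders.

{"level": "error", "ts": "2026-05-08T12:00:00Z", "msg": "checkout failed", "trace_id": "4bf92f3577b34da6", "request_id": "req-123", "tenant_id": "t-42", "error": "upstream timeout"}
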
MODULE 6

Incident Response

Detect → mitigate → resolve → learn.

Severity Levels

Sev     Definition                                Response
SEV-1   Major outage; revenue / user impact       All-hands, exec page, war room
SEV-2   Significant degradation; partial impact   On-call IC + responders
SEV-3   Minor; no user impact yet                 Investigate during business hours
SEV-4   Cosmetic / future risk                    Backlog ticket

ICS Roles

  • Incident Commander (IC) — coordinates; doesn't fix.
  • Operations Lead — drives mitigation.
  • Communications Lead — internal + customer comms.
  • Scribe — timeline, decisions log.
  • Subject-matter experts — pulled in as needed.

Blameless Postmortem

  • Timeline (UTC, exact times, who did what).
  • Impact: users, requests, $$, duration.
  • Root cause + contributing factors (5 Whys).
  • What went well + what went poorly.
  • Action items: owner + due date. Track to completion.
  • Detection delay analysis — why didn't we catch it sooner?

On-Call Health

  • Page volume target: < 2 per shift. More = alert tuning needed.
  • Compensate on-call (time off, stipend).
  • Rotate week-on / N-off. Primary + secondary.
  • Runbook per alert. Link in alert payload.
MODULE 7

Capacity & Load Testing

Provision for peak + headroom; verify with traffic.

Capacity Planning

  • Baseline = current peak. Plan = baseline × growth × seasonality × headroom (typically 30–50%). Example: 10k RPS × 1.4 growth × 1.3 seasonal × 1.4 headroom ≈ 25k RPS.
  • Failure scenarios: lose 1 AZ → remaining must absorb traffic. Plan N+1 / N+2.
  • Track per resource: CPU, memory, network, disk IOPS, connection count, queue depth.

Load Test Types

Type     Goal
Smoke    System works at all under load
Load     Performance at expected peak
Stress   Find breaking point
Spike    Sudden traffic surge
Soak     Stability over hours/days; memory leaks

Tools: k6, Locust, Vegeta, JMeter, Gatling.

Little's Law

L = λ × W
  L = average concurrent requests in system
  λ = arrival rate (req/s)
  W = average response time (s)

example: 1000 req/s × 0.2 s = 200 concurrent requests → need ≥200 worker slots
  with 30% headroom → 260 workers minimum
MODULE 8

Chaos Engineering

Inject failure into prod-like environments to find weaknesses before users do.

Principles

  1. Define steady-state (the metric that says "system is fine").
  2. Hypothesize steady-state holds in both control + experiment.
  3. Inject real-world events (instance kill, network latency, disk full).
  4. Try to disprove the hypothesis. Evidence beats opinion.
  5. Minimize blast radius. Have an abort switch.

Common Experiments

  • Terminate random pod / VM / AZ (Chaos Monkey, AWS FIS, Litmus, Gremlin).
  • Inject latency / packet loss between services.
  • Drop dependency (database, Redis, downstream service).
  • Fill disk, exhaust file descriptors.
  • Clock skew, DNS failure.
  • GameDay: scheduled, supervised, organization-wide drill.
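
One concrete shape: a pod-kill experiment as a Chaos Mesh CRD (a tool in the same family as those listed); selector values are placeholders.

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-one-checkout-pod
spec:
  action: pod-kill
  mode: one                  # single pod → minimal blast radius
  selector:
    namespaces: [staging]    # prod-like first, prod once trusted
    labelSelectors:
      app: checkout
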
MODULE 9

Build & Release Engineering

Reproducibility, supply chain, artifact lifecycle.

Supply Chain

  • SBOM (software bill of materials) — CycloneDX, SPDX. Generate per build (CI sketch below).
  • Sign artifacts (cosign, Sigstore). Verify at admission.
  • SLSA levels (1–4) describe build integrity. Aim for L3.
  • Pin transitive deps. Lockfiles checked in. Renovate / Dependabot.
  • Provenance attestations linking artifact → build → source commit.
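
Signing and SBOM generation sketched as extra steps on the Module 3 build job (keyless cosign uses the id-token: write permission already granted there); action versions and the image name are assumptions.

      - uses: sigstore/cosign-installer@v3
      - run: cosign sign --yes ghcr.io/org/app:${{ github.sha }}   # keyless, OIDC-backed
      - uses: anchore/sbom-action@v0
        with:
          image: ghcr.io/org/app:${{ github.sha }}
          format: cyclonedx-json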

Versioning

  • SemVer (MAJOR.MINOR.PATCH) for libraries.
  • CalVer (2026.05.08) or commit-SHA for services.
  • Immutable artifacts — never re-tag.
  • Promote artifact across envs, don't rebuild.
MODULE 10

Cheat Sheet

Reliability checklist for design reviews.

SLO Setup

  • 1–3 SLIs per service
  • 28-day rolling window
  • Multi-burn-rate alerts (1h / 6h)
  • Customer-aligned: latency from edge, not internal hop
  • Error budget tracked weekly

Resilience Patterns

  • Timeouts everywhere, no infinite waits (config sketch below)
  • Retry with jitter + backoff
  • Circuit breaker on dependencies
  • Bulkhead — separate thread pools / pods per dep
  • Graceful degradation paths
  • Idempotency keys on writes
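
Timeouts and retries as config, for example an Istio VirtualService (names and values illustrative; Envoy adds jitter to retry backoff by default):

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments
spec:
  hosts: [payments]
  http:
    - route:
        - destination: {host: payments}
      timeout: 2s              # no infinite waits
      retries:
        attempts: 3
        perTryTimeout: 500ms
        retryOn: 5xx,reset,connect-failure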

Pre-Prod Gates

  • Unit + integration tests pass
  • Coverage ≥ team threshold
  • SAST + dep scan clean
  • Image signed + SBOM
  • Migration tested forward + back
  • Runbook updated

Alert Quality

  • Alert on symptoms (user impact), not causes
  • Every alert has runbook link
  • Alert is actionable (not "FYI")
  • Test fire alerts in non-prod monthly
  • Audit pages: false / actionable / wake-worthy?

Postmortem Template

  • Title + date + sev
  • Summary (3 lines)
  • Impact (users, $$, time)
  • Timeline (UTC)
  • Root cause + contributing factors
  • What went well / poorly
  • Action items: owner, due, tracking

Numbers

  • 99.9% = 43.2 min/mo budget
  • Headroom 30% baseline
  • Page volume < 2 / shift
  • Toil < 50% / SRE
  • Postmortem within 5 days
  • Action items closed < 30 days