SRE Principles
Reliability as engineering discipline. Error budgets quantify trade-offs.
Core Tenets
- Embrace risk — 100% reliability is wrong target. Set explicit budget and spend it.
- Service Level Objectives (SLOs) drive engineering priority. Below SLO (budget exhausted) → freeze features, fix reliability. Above SLO (budget remaining) → ship faster.
- Eliminate toil — repetitive manual ops work. Cap toil at <50% per SRE.
- Automate everything — runbooks → code.
- Blameless postmortems — focus on systemic causes, not individual blame.
- Shared ownership — devs run what they build; SREs partner, not absorb.
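The SLO-driven priority rule from the tenets above can be sketched as a small decision function. This is a minimal illustration, not a standard API: `SloStatus`, `release_policy`, and the returned strings are hypothetical names chosen for this example, and a real policy would usually look at remaining error budget rather than a single availability number.

```python
from dataclasses import dataclass


@dataclass
class SloStatus:
    """Hypothetical snapshot of one service over its SLO window."""
    slo_target: float  # e.g. 0.999 for a 99.9% availability SLO
    measured: float    # availability actually measured over the same window


def release_policy(status: SloStatus) -> str:
    """Return 'freeze' when the service is below its SLO, otherwise 'ship'."""
    if status.measured < status.slo_target:
        # Below SLO: stop feature launches and spend effort on reliability.
        return "freeze"
    # At or above SLO: budget remains, so feature work can proceed.
    return "ship"


print(release_policy(SloStatus(slo_target=0.999, measured=0.9985)))  # freeze
print(release_policy(SloStatus(slo_target=0.999, measured=0.9995)))  # ship
```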
Error Budget Math
SLO = 99.9% availability over 30 days
Budget = (1 - 0.999) × 30d = 0.001 × 43,200 min = 43.2 min/month
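Burn rate = (errors observed in an interval) / (budget allotted to that same interval); 1× spends the budget exactly over the SLO window
Burn rate 10× over 1 hr = 10 hr worth of budget consumed in 1 hr
Typical fast-burn alert threshold: 14.4× over 1 hr = 2% of the 30-day budget burned (14.4 × 1 hr / 720 hr)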
Slow burn: 6× over 6 hr = 5% of the 30-day budget burned (6 × 6 hr / 720 hr)
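The arithmetic above can be checked with a short script. This is a sketch under stated assumptions: the function names are made up for illustration, the SLO window is 30 days, and burn rate is computed from an error ratio (failed requests / total requests), which is one common way to measure it.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes over the SLO window."""
    return (1.0 - slo) * window_days * MINUTES_PER_DAY


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How fast budget is spent relative to the allowed error ratio (1 - SLO)."""
    return observed_error_ratio / (1.0 - slo)


def budget_consumed_fraction(rate: float, alert_window_hours: float,
                             window_days: int = 30) -> float:
    """Fraction of the whole window's budget burned at `rate` for the given hours."""
    return rate * alert_window_hours / (window_days * 24)


print(round(error_budget_minutes(0.999), 1))           # 43.2
print(round(burn_rate(0.0144, 0.999), 1))              # 14.4
print(round(budget_consumed_fraction(14.4, 1), 4))     # 0.02
print(round(budget_consumed_fraction(6, 6), 4))        # 0.05
```

At a burn rate of 1×, a 6-hour window consumes only 6 / 720 of a 30-day budget, about 0.8%, so a 5% burn in 6 hours corresponds to a rate of 6×.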