SLO

A Service Level Objective (SLO) is an internal target for a service's reliability, expressed as a percentage over a window (for example, 99.9% of requests succeed within 300 ms over a rolling 28-day window). SLOs anchor the practice of Site Reliability Engineering by making reliability concrete, measurable, and tradable against feature velocity.

  • SLI (Service Level Indicator). The actual measurement, for example the ratio of successful requests to total requests over a window.
  • SLO (Service Level Objective). The target the SLI must hit, set by the team responsible for the service.
  • SLA (Service Level Agreement). A contractual commitment to customers, usually looser than the internal SLO so that the team has a safety margin before a breach has contractual consequences.
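
To make the SLI/SLO distinction concrete, here is a minimal sketch in Python that computes a request-based SLI and checks it against an SLO target. The `Request` type, the function names, and the choice to treat an empty window as fully compliant are illustrative assumptions, not part of any particular tool; the 300 ms threshold and 99.9% target come from the example above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one request observed in the window."""
    ok: bool            # True if the request succeeded
    latency_ms: float   # end-to-end latency in milliseconds


def compute_sli(requests: Sequence[Request], latency_threshold_ms: float = 300.0) -> float:
    """SLI: fraction of requests that succeeded within the latency threshold."""
    if not requests:
        # Assumption: a window with no traffic is treated as fully compliant.
        return 1.0
    good = sum(1 for r in requests if r.ok and r.latency_ms <= latency_threshold_ms)
    return good / len(requests)


def meets_slo(sli: float, slo_target: float = 0.999) -> bool:
    """SLO check: the measured SLI must be at or above the target."""
    return sli >= slo_target


if __name__ == "__main__":
    # 998 fast successes, one slow success, one failure.
    window = [Request(ok=True, latency_ms=120.0)] * 998
    window += [Request(ok=True, latency_ms=450.0), Request(ok=False, latency_ms=80.0)]
    sli = compute_sli(window)
    print(f"SLI = {sli:.4f}, meets 99.9% SLO: {meets_slo(sli)}")
```

Note that the slow success counts against the SLI just like the outright failure: the indicator measures "good" requests as the objective defines them, not raw HTTP status.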

Error budgets

The complement of an SLO is the error budget: the amount of failure the SLO permits. A 99.9% availability SLO leaves a 0.1% error budget, roughly 43 minutes of downtime in a 30-day month (0.1% of 43,200 minutes). The error budget works as a quota: while budget remains, the team can ship faster and take more risk; once it is depleted, the team prioritises reliability work over features. This frames reliability as a continuous business decision rather than a binary up/down state.
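
The arithmetic behind an error budget is simple enough to sketch. The Python below is an illustration only: the function names are hypothetical, and the request-based `budget_remaining` is one common way to track consumption, assumed here rather than taken from any specific tool.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Time-based budget: minutes of full downtime the SLO allows in the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Request-based budget: fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means exactly exhausted, and a negative
    value means the SLO has been missed for this window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 1.0  # Assumption: no traffic means no budget has been spent.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # 99.9% over 30 days: about 43.2 minutes; over 28 days: about 40.3 minutes.
    print(round(error_budget_minutes(0.999, 30), 1))
    print(round(error_budget_minutes(0.999, 28), 1))
    # 250 failures out of 1,000,000 requests spends about a quarter of the budget.
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

A team might gate risky launches on `budget_remaining` staying above some threshold it has agreed on; where that line sits is a policy choice, not something the arithmetic decides.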

Common tooling

  • Nobl9, Cortex SLO, Sloth (open source), Datadog SLOs, New Relic SLM, Grafana SLO