SLO
A Service Level Objective (SLO) is an internal target for a service's reliability, expressed as a percentage over a window (for example, 99.9% of requests succeed within 300 ms over a rolling 28-day window). SLOs anchor the practice of Site Reliability Engineering by making reliability concrete, measurable, and tradable against feature velocity.
Related terms in the SLI/SLO/SLA family
- SLI (Service Level Indicator). The actual measurement, for example the ratio of successful requests to total requests over a window.
- SLO (Service Level Objective). The target the SLI must hit, set by the team responsible for the service.
- SLA (Service Level Agreement). A contractual commitment to customers, usually set looser than the internal SLO so that there is a safety margin before the contract is breached.
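The relationship between an SLI and an SLO can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `WindowCounts`, `sli`, and `meets_slo` are hypothetical, and treating a window with no traffic as compliant is an assumption that a real service might handle differently.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts over one SLO window (hypothetical structure)."""
    good: int   # requests that met the success criterion
    total: int  # all valid requests in the window


def sli(counts: WindowCounts) -> float:
    """Return the SLI as the ratio of good requests to total requests."""
    if counts.total == 0:
        # Assumption: a window with no traffic counts as fully compliant.
        return 1.0
    return counts.good / counts.total


def meets_slo(counts: WindowCounts, slo_target: float) -> bool:
    """Return True when the measured SLI is at or above the SLO target."""
    return sli(counts) >= slo_target
```

For example, 999,500 good requests out of 1,000,000 give an SLI of 99.95%, which meets a 99.9% SLO; 998,000 good requests out of the same total would not.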
Error budgets
The complement of an SLO is the error budget: the amount of failure permitted by the SLO. A 99.9% availability SLO leaves a 0.1% error budget, roughly 43 minutes of full downtime in a 30-day month (about 40 minutes over a 28-day window). The error budget is a quota: while it has room, the team can ship faster and take more risk; once it is depleted, the team prioritises reliability work over features. This frames reliability as a continuous business decision rather than a binary up/down state.
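The error budget arithmetic can be checked with a short Python sketch. The function names `error_budget_minutes` and `budget_remaining` are hypothetical, and the second function assumes a request-based SLO in which every failed request consumes the same amount of budget.

```python
def error_budget_minutes(slo_target: float, window_days: float) -> float:
    """Allowed full-downtime minutes for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent for a request-based SLO.

    1.0 means the budget is untouched; 0.0 or less means it is depleted.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        # Assumption: no traffic means no budget has been consumed.
        return 1.0
    allowed_bad = (1.0 - slo_target) * total  # failures the SLO permits
    actual_bad = total - good                 # failures actually observed
    return 1.0 - actual_bad / allowed_bad
```

With these definitions, `error_budget_minutes(0.999, 30)` comes to about 43.2 minutes, in line with the roughly 43 minutes quoted above, and the same target over a 28-day window comes to about 40.3 minutes. A service that has failed 500 of 1,000,000 requests against a 99.9% target has about half of its budget left.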
Common tooling
- Nobl9
- Cortex SLO
- Sloth (open source)
- Datadog SLOs
- New Relic SLM
- Grafana SLO