SLI

A Service Level Indicator (SLI) is a measurement of how well a service is performing on a given dimension, such as availability, latency, throughput, or correctness. SLIs are the raw observables; Service Level Objectives (SLOs) are the targets set on top of them.

Common SLI types

  • Availability: ratio of successful requests to total requests over a window.
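  • Latency: proportion of requests served faster than a threshold (for example, the share of requests answered in under 200 ms). A target such as "p95 under 200 ms" is the same as requiring this proportion to be at least 95%.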
  • Throughput: events processed per second; often used for batch and stream pipelines.
  • Quality / correctness: proportion of responses that meet a freshness, completeness, or accuracy check.
  • Durability: for storage systems, the probability that data is not lost.
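
As a rough illustration of the availability and latency definitions above, here is a minimal Python sketch. The Request record, the function names, the choice to count any non-5xx status as "successful", and the 200 ms threshold are all assumptions made for this example; a real service would pull these counts from its metrics or logging system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative only."""
    status: int        # HTTP status code returned to the caller
    latency_ms: float  # time taken to serve the request, in milliseconds


def availability_sli(requests: List[Request]) -> Optional[float]:
    """Successful requests divided by total requests in the window.

    Assumption: any status below 500 counts as successful.
    Returns None when the window is empty (the ratio is undefined).
    """
    if not requests:
        return None
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: List[Request], threshold_ms: float = 200.0) -> Optional[float]:
    """Proportion of requests served faster than threshold_ms."""
    if not requests:
        return None
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


# A tiny, made-up measurement window.
window = [
    Request(status=200, latency_ms=120.0),
    Request(status=200, latency_ms=250.0),
    Request(status=503, latency_ms=90.0),
    Request(status=200, latency_ms=180.0),
]

availability = availability_sli(window)  # 3 of 4 requests are non-5xx
latency = latency_sli(window)            # 3 of 4 requests finish under 200 ms
print(f"availability={availability:.2%} latency={latency:.2%}")
```

Both indicators come out as a fraction between 0 and 1, which makes them directly comparable to a percentage target.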

Designing a good SLI

  • Measured at a point that reflects the user experience, typically the request boundary closest to the user.
  • Defined as a ratio with a clear numerator and denominator, so the SLO becomes a single percentage.
  • Resilient to harmless fluctuations: a single 5xx among 1 million requests should not invalidate the indicator.
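
The ratio form and the tolerance for isolated failures can be sketched in a few lines of Python. The function names and the 99.9% target below are hypothetical, chosen only to show how a good/total ratio is compared with an SLO.

```python
from typing import Optional


def ratio_sli(good_events: int, total_events: int) -> Optional[float]:
    """SLI as good / total; None when there is no traffic to measure."""
    if total_events == 0:
        return None
    return good_events / total_events


def meets_slo(sli: Optional[float], target: float) -> bool:
    """True when the measured ratio is at or above the target."""
    return sli is not None and sli >= target


# One failed request among 1 million (the scenario from the list above).
total = 1_000_000
good = total - 1
sli = ratio_sli(good, total)

# Assumed target for illustration: 99.9% of requests succeed.
ok = meets_slo(sli, target=0.999)
print(f"SLI={sli:.6f} meets SLO: {ok}")
```

Because the indicator is a ratio over a window, a single failure barely moves it, whereas 10 failures in 1,000 requests (99.0%) would fall below the same target.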