Observability

Observability is the property of a system whose internal state can be inferred from its external outputs. In practice, it refers to the discipline of instrumenting applications and infrastructure so operators can answer questions about system behaviour, including questions not anticipated when the instrumentation was added.

How it works

Observability is typically built on three signal types, often called the three pillars:

  • Logs. Structured, timestamped records of discrete events. Used for forensic detail.
  • Metrics. Numerical measurements aggregated over time. Used for trends, alerting, and dashboards.
  • Traces. Records of a request's path through multiple services. Used for understanding latency and dependency.
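
As a rough illustration of how the three signals differ, the sketch below emits a log, a metric, and a trace from one request using only the Python standard library. Everything named here (log_event, increment, span, handle_checkout, and the in-memory LOGS, METRICS, and SPANS sinks) is hypothetical and stands in for a real instrumentation library and backend; it is not the API of any particular tool.

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager

# Hypothetical in-memory sinks; a real system would export these to a backend.
LOGS = []            # structured log records (JSON strings)
METRICS = Counter()  # counter name -> cumulative value
SPANS = []           # finished trace spans


def log_event(message, **fields):
    """Log: a structured, timestamped record of one discrete event."""
    record = {"ts": time.time(), "message": message, **fields}
    LOGS.append(json.dumps(record))


def increment(name, value=1):
    """Metric: a numeric measurement aggregated over time."""
    METRICS[name] += value


@contextmanager
def span(name, trace_id, parent_id=None):
    """Trace: one timed step of a request, linked to its parent by ID."""
    span_id = uuid.uuid4().hex[:16]
    start = time.perf_counter()
    try:
        yield span_id
    finally:
        # A span is recorded when it finishes, so children appear before parents.
        SPANS.append({
            "trace_id": trace_id,
            "span_id": span_id,
            "parent_id": parent_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


def handle_checkout(order_id):
    """Hypothetical request handler that emits all three signals."""
    trace_id = uuid.uuid4().hex
    with span("checkout", trace_id) as root_id:
        increment("checkout_requests_total")
        with span("charge_card", trace_id, parent_id=root_id):
            # Putting the trace ID in the log lets the two signals be joined later.
            log_event("card charged", order_id=order_id, trace_id=trace_id)
    return trace_id
```

Calling handle_checkout("A1") once leaves one log record, a counter value of 1, and two spans that share a trace ID. The shared trace_id field is the design choice worth noting: it is what allows a log line to be correlated with the trace it belongs to.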

Modern practice adds events and continuous profiling as additional signal types. OpenTelemetry is a vendor-neutral, widely adopted standard that provides instrumentation APIs and SDKs covering all three pillars.

Common tools

  • Open source: Prometheus, Grafana, Loki, Tempo, Jaeger, OpenTelemetry Collector
  • Commercial: Datadog, New Relic, Honeycomb, Splunk, Dynatrace
  • Cloud provider: Amazon CloudWatch, Google Cloud Observability (formerly Google Cloud Operations), Azure Monitor
