Observability

Observability is the ability to understand a system's internal state from its outputs - logs, metrics, and traces - without shipping new code.

Observability is the property of a system that lets you ask arbitrary diagnostic questions from the outside - without adding new instrumentation after the fact. If you can determine why a system is misbehaving solely from the data it already emits, the system is observable.

The term originates from control theory. In distributed systems engineering, it is operationalized through three signal types: logs (timestamped event records), metrics (numeric measurements over time), and traces (the path of a single request through multiple services).

Why it matters in Engineering: Monitoring tells you something is wrong. Observability tells you why. As systems grow more distributed, debugging becomes harder - a slow API call could be caused by a database query, a downstream service, a memory leak, or a bad deploy. Without traces and structured logs, you are guessing. Good observability can shorten incident resolution from hours to minutes. Investing in it before a production incident is usually far cheaper than retrofitting it after one.

The Three Pillars

Logs
Timestamped records of discrete events. Structured logs (JSON) are far more useful than plain text because they support querying, filtering, and aggregation. Avoid log verbosity that drowns signal in noise.
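To make the querying point concrete, here is a minimal sketch of structured logging using only Python's standard library. The `JsonFormatter` class, the `fields` key passed through `extra=`, and the logger name `checkout` are illustrative assumptions, not part of any particular logging stack.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line (hypothetical helper)."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} is attached to the record.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def make_logger(stream, name="checkout"):
    """Build a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = make_logger(buffer)
log.info("payment failed", extra={"fields": {"order_id": "o-123", "latency_ms": 840}})
log.info("payment ok", extra={"fields": {"order_id": "o-124", "latency_ms": 95}})

# Because every line is JSON, filtering is a data query rather than a regex hunt.
events = [json.loads(line) for line in buffer.getvalue().splitlines()]
slow = [e["order_id"] for e in events if e["latency_ms"] > 500]
print(slow)
```

The same filter against plain-text lines would need a brittle pattern match; with structured fields it is a one-line comprehension.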

Metrics
Numeric measurements sampled over time: request rate, error rate, latency percentiles, memory usage. Prometheus is a widely used open-source tool for collecting and storing them. Compared with logs and traces, metrics are cheap to store and fast to query.
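As a rough illustration of what those measurements mean, the sketch below computes latency percentiles and an error rate from raw samples in plain Python. The sample values and the helper names `percentile` and `error_rate` are made up for the example; a real metrics system would typically aggregate into histograms rather than keep every sample.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def error_rate(status_codes):
    """Fraction of responses that were server errors (5xx)."""
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes)


# Hypothetical samples from one scrape window.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 900]
statuses = [200, 200, 200, 500, 200, 200, 200, 200, 503, 200]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
rate = error_rate(statuses)
print(p50, p99, rate)
```

The median looks healthy while the 99th percentile exposes the slow tail, which is why percentiles are preferred over averages for latency.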

Traces
A trace follows one request as it moves through multiple services, recording timing and metadata at each hop. OpenTelemetry is the vendor-neutral, widely adopted standard for instrumentation, with SDKs for most major languages. Traces are essential for debugging latency in microservices architectures.
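The data model behind a trace can be sketched without any tracing library. The `Span` dataclass below is a simplified stand-in, not the OpenTelemetry API, and the service names and timings are invented. It assumes a span's direct children run one after another, so their durations can simply be subtracted from the parent's.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One hop of a request (simplified; real spans carry more metadata)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_time_ms(span, spans):
    """Time spent in this span that is not covered by its direct children.

    Assumes children are sequential and do not overlap.
    """
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


# Hypothetical trace: gateway -> orders -> (postgres, inventory)
trace = [
    Span("t1", "a", None, "api-gateway", 0, 480),
    Span("t1", "b", "a", "orders", 10, 470),
    Span("t1", "c", "b", "postgres", 20, 420),
    Span("t1", "d", "b", "inventory", 425, 460),
]

slowest = max(trace, key=lambda s: self_time_ms(s, trace))
print(slowest.service, self_time_ms(slowest, trace))
```

The gateway span has the longest total duration, but ranking by self time points at the database call, which is the kind of answer a trace gives that a single latency metric cannot.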

SLO / SLI
Service Level Indicators measure what "working" looks like (e.g., 99th percentile latency). Service Level Objectives define the target for an indicator (for example, a maximum acceptable value over a given time window). Observability tooling is what tells you whether you are meeting them.
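A small sketch of the SLI/SLO relationship, assuming a latency-based indicator defined as the fraction of requests that finish within a threshold. The 300 ms threshold, the targets, and the sample window are all hypothetical values chosen for illustration.

```python
def latency_sli(latencies_ms, threshold_ms):
    """SLI: fraction of requests that completed within the threshold."""
    fast = sum(1 for value in latencies_ms if value <= threshold_ms)
    return fast / len(latencies_ms)


def meets_slo(sli, target):
    """SLO check: is the measured indicator at or above the target?"""
    return sli >= target


# Hypothetical window: 995 fast requests and 5 slow ones.
window = [120] * 995 + [450] * 5

sli = latency_sli(window, threshold_ms=300)
print(sli)
print(meets_slo(sli, target=0.99))   # looser objective
print(meets_slo(sli, target=0.999))  # stricter objective
```

The indicator is a measurement; the objective is a decision about how good that measurement needs to be. The same data passes one target and fails the other.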

Relationship to CI/CD
Good observability makes deployments safer. Automated canary analysis and rollback decisions rely on the same metrics and traces produced by your observability stack.
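One way such a rollback decision might look, as a deliberately simple sketch: compare the canary's error rate against the baseline's and refuse to decide on too little traffic. The function name `canary_decision`, the 2x ratio, the 0.1% floor, and the 100-request minimum are assumptions for illustration; production canary analysis usually applies statistical tests across several metrics.

```python
def canary_decision(baseline_errors, baseline_total,
                    canary_errors, canary_total,
                    max_ratio=2.0, min_requests=100, floor=0.001):
    """Return 'wait', 'promote', or 'rollback' for a canary deployment.

    All thresholds are illustrative defaults, not recommended values.
    """
    if canary_total < min_requests:
        return "wait"  # not enough traffic to judge
    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    # The floor keeps a near-zero baseline from making any error a rollback.
    allowed = max(baseline_rate * max_ratio, floor)
    return "rollback" if canary_rate > allowed else "promote"


healthy = canary_decision(50, 10_000, 3, 500)
broken = canary_decision(50, 10_000, 40, 500)
too_early = canary_decision(50, 10_000, 1, 20)
print(healthy, broken, too_early)
```

The inputs here are exactly the error counts an observability stack already collects, which is why deployment safety depends on that data being trustworthy.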
