By Sahil Kapoor - 30 Aug 2025

Distributed Tracing

Distributed tracing captures the path of a single request as it travels through multiple services, recording the timing, parent-child relationships, and metadata of each step. Rendered as a waterfall or flame graph, a trace shows operators where latency comes from and how services depend on each other.

Core concepts

Trace. The full record of one request across all services it touches, identified by a trace_id.
Span. One unit of work within a trace (an HTTP call, a database query, a function execution). Each span has a start time, end time, name, and attributes.
Parent and child spans. Every span except the root records the ID of the span that invoked it, so the spans of a trace form a tree showing which call invoked which.
Context propagation. Trace and span IDs are passed across service boundaries in HTTP headers (W3C traceparent) or message metadata, so spans from different services join the same trace.
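
The propagation step can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a tracing SDK: it assumes version 00 of the W3C traceparent header (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags, joined by hyphens), and the names SpanContext, child_of, to_traceparent, and from_traceparent are hypothetical helpers invented for this example.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Optional

# Version "00" layout only: 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


@dataclass(frozen=True)
class SpanContext:
    """Hypothetical holder for the IDs that cross a service boundary."""
    trace_id: str          # shared by every span in the trace
    span_id: str           # unique to this span
    sampled: bool = True   # mirrors the "sampled" bit in the flags field


def new_root_context() -> SpanContext:
    """Start a new trace at the entry point of the system."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8))


def child_of(parent: SpanContext) -> SpanContext:
    """Create a child span: same trace ID, fresh span ID."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def to_traceparent(ctx: SpanContext) -> str:
    """Serialize the context for an outgoing request header."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def from_traceparent(header: str) -> Optional[SpanContext]:
    """Parse an incoming header; return None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are not valid identifiers.
    if trace_id.strip("0") == "" or span_id.strip("0") == "":
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


if __name__ == "__main__":
    # Service A starts a trace and calls service B.
    outgoing = to_traceparent(new_root_context())
    # Service B joins the same trace by parsing the header it received.
    incoming = from_traceparent(outgoing)
    if incoming is not None:
        print(to_traceparent(child_of(incoming)))
```

The receiving service keeps the trace_id it was given and generates only a new span_id, which is what lets a backend stitch spans from different services into one tree.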

Sampling

Capturing every trace at scale is expensive. Common sampling strategies:

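Head-based sampling. Decide at the entry point, before any spans are recorded, whether to keep the trace, and propagate that decision downstream. This is cheap, but the decision is made before the outcome is known, so it cannot favor errors or slow requests.
Tail-based sampling. Buffer spans and decide after the trace completes, for example keeping all errors, all slow traces, and a fraction of fast traces. This keeps the interesting traces at the cost of buffering memory and extra infrastructure.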
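
The two strategies can be contrasted with a short Python sketch. It is illustrative only: head_sample, tail_sample, and FinishedSpan are hypothetical names, the 500 ms threshold and 5% baseline are arbitrary defaults, and the longest span is used as a stand-in for total trace duration on the assumption that the root span covers the whole request.

```python
from dataclasses import dataclass
from typing import List


def head_sample(trace_id: str, ratio: float) -> bool:
    """Head-based: decide from the trace ID alone, before any work is done.

    Deriving the decision from the ID (rather than a random draw) means
    every service that sees the same trace ID reaches the same answer.
    """
    # Treat the low 64 bits of the hex trace ID as a number in [0, 2**64).
    bucket = int(trace_id[-16:], 16)
    return bucket < ratio * 2**64


@dataclass
class FinishedSpan:
    """Hypothetical record of a completed span held in the buffer."""
    name: str
    duration_ms: float
    is_error: bool = False


def tail_sample(
    spans: List[FinishedSpan],
    trace_id: str,
    slow_ms: float = 500.0,        # assumed latency threshold
    baseline_ratio: float = 0.05,  # assumed share of ordinary traces to keep
) -> bool:
    """Tail-based: decide once all spans of the trace have been buffered."""
    if any(span.is_error for span in spans):
        return True  # keep every trace that contains an error
    longest = max((span.duration_ms for span in spans), default=0.0)
    if longest >= slow_ms:
        return True  # keep every slow trace
    # Otherwise keep only a fraction of fast, successful traces.
    return head_sample(trace_id, baseline_ratio)
```

The head-based function needs nothing but the trace ID, while the tail-based one needs the finished spans, which is why tail-based sampling requires a component that can hold a whole trace in memory until it completes.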

Common tools

Open source: Jaeger, Grafana Tempo, Zipkin, OpenTelemetry Collector (a collection and export pipeline rather than a storage backend)
Commercial: Datadog APM, Honeycomb, New Relic, Lightstep (ServiceNow), Splunk APM

Related Terms
Observability, OpenTelemetry, Metrics, Logging

Further Reading
What is API Observability? Logs, Metrics, Traces Explained
