Distributed Tracing
Distributed tracing captures the path of a single request as it travels through multiple services, recording the timing, parent-child relationships, and metadata of each step. A trace gives operators a flame-graph view of where latency comes from and how services depend on each other.
Core concepts
- Trace. The full record of one request across all services it touches, identified by a trace_id.
- Span. One unit of work within a trace (an HTTP call, a database query, a function execution). Each span has a start time, end time, name, and attributes.
- Parent and child spans. Spans form a tree, recording which call invoked which.
- Context propagation. Trace and span IDs are passed across service boundaries in HTTP headers (W3C traceparent) or message metadata, so spans from different services join the same trace.
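To make context propagation concrete, here is a minimal sketch of building and parsing a W3C traceparent header, whose version 00 layout is version-traceid-parentid-flags in lowercase hex. The SpanContext class and the helper function names are hypothetical, chosen for illustration; a real service would normally rely on a tracing library such as OpenTelemetry rather than hand-rolling this. The sketch handles only the version 00 layout.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Optional

# Version 00 layout: 2 hex (version) - 32 hex (trace ID) - 16 hex (parent span ID) - 2 hex (flags).
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


@dataclass
class SpanContext:
    """Hypothetical container for the IDs that cross a service boundary."""
    trace_id: str         # shared by every span in the trace
    span_id: str          # unique to one span
    sampled: bool = True  # mirrors the "sampled" bit in the flags field


def new_root_context() -> SpanContext:
    """Start a new trace at the entry point."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8))


def child_context(parent: SpanContext) -> SpanContext:
    """A child span keeps the trace_id and gets a fresh span_id."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def to_traceparent(ctx: SpanContext) -> str:
    """Serialize the context for an outgoing request header."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def from_traceparent(header: str) -> Optional[SpanContext]:
    """Parse an incoming header; return None if it is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # This sketch accepts version 00 only; all-zero IDs are invalid per the spec.
    if version != "00" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, sampled=bool(int(flags, 16) & 0x01))
```

A calling service would send to_traceparent(ctx) as the traceparent header, and the receiving service would call from_traceparent on it, then child_context on the result, so that its own spans join the same trace with the caller's span as parent.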
Sampling
Capturing every trace at scale is expensive. Common sampling strategies:
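- Head-based sampling. Decide at the entry point, before any spans are recorded, whether to record this trace. The decision travels with the propagated context so that downstream services follow it.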
- Tail-based sampling. Buffer spans and decide after the trace completes (keep all errors, slow traces, and a fraction of fast traces).
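The two strategies can be sketched as a pair of decision functions. This is an illustrative sketch, not the algorithm of any particular tool: the FinishedSpan type, the function names, and the slow_ms and keep_fast parameters are all assumptions made for the example. The head-based function derives its decision from the trace ID, which is one common way to keep the decision consistent across services that share the same rate.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class FinishedSpan:
    """Hypothetical record of a completed span."""
    name: str
    duration_ms: float
    is_error: bool = False


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide at the entry point, using only the trace ID.

    The last 8 hex digits are mapped to a value in [0, 1) and compared
    with the sampling rate, so the same trace ID always gets the same answer.
    """
    bucket = int(trace_id[-8:], 16) / 0x1_0000_0000
    return bucket < rate


def tail_sample(spans: List[FinishedSpan], slow_ms: float,
                keep_fast: float, rng: random.Random) -> bool:
    """Tail-based: decide after the trace has completed and been buffered."""
    if any(span.is_error for span in spans):
        return True  # keep every trace that contains an error
    if max((span.duration_ms for span in spans), default=0.0) >= slow_ms:
        return True  # keep every slow trace
    return rng.random() < keep_fast  # keep only a fraction of fast traces
```

The trade-off shows in the signatures: head_sample needs nothing but the trace ID, so it is cheap but cannot know whether the trace will turn out to be interesting, while tail_sample needs every span of the trace, which means buffering them until the trace completes.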
Common tools
- Open source: Jaeger, Tempo, Zipkin, OpenTelemetry Collector
- Commercial: Datadog APM, Honeycomb, New Relic, Lightstep (ServiceNow), Splunk APM
Further Reading
What is API Observability? Logs, Metrics, Traces Explained