Prometheus

Prometheus is an open-source monitoring system built around a time-series database. It scrapes metrics from instrumented applications over HTTP, stores them locally, and exposes a query language, PromQL, that drives dashboards and alerts. It is widely regarded as the de facto standard for cloud-native infrastructure monitoring.

How it works

Applications expose a /metrics endpoint in the Prometheus exposition format. A Prometheus server is configured with scrape targets; on each scrape interval (typically 15 to 60 seconds), it pulls the metrics, applies relabelling, and stores them in its local TSDB. Service discovery (Kubernetes, Consul, EC2, file_sd) keeps the target list current.
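
To make the scrape side concrete, here is a minimal sketch of an application serving /metrics in the text exposition format, using only the Python standard library. The REQUESTS table, the label names (method, status), and the port are hypothetical, and label values are not escaped. A real service would normally use an official client library rather than hand-rendering the format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter state: (method, status) -> request count.
REQUESTS = {("GET", "200"): 1027, ("POST", "500"): 3}


def render_metrics(counters):
    """Render one counter family in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counters.items()):
        # One sample per label combination: name{labels} value
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        # Content type used for the text exposition format.
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Assumed address; a Prometheus scrape target would point at this host and port.
    HTTPServer(("127.0.0.1", 8000), MetricsHandler).serve_forever()
```

Each scrape is an HTTP GET against this endpoint. The server records whatever values it sees at that moment, so the application only needs to keep its counters current.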

PromQL

PromQL is the query language for selecting time series, applying functions, and computing aggregations. Common patterns:

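  • rate(http_requests_total[5m]) - per-second request rate, averaged over the last 5 minutes
  • histogram_quantile(0.99, rate(latency_bucket[5m])) - p99 latency per series, estimated from histogram buckets (wrap the rate in sum by (le) to aggregate across instances)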
  • sum by (route) (rate(http_requests_total[1m])) - per-route request rate
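
The arithmetic behind the first two patterns can be sketched in a few lines of Python. This is a simplified illustration, not how Prometheus implements them: the real rate() also extrapolates to the window boundaries, and the function names and input shapes here are assumptions made for the example.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a window.

    samples: list of (timestamp_seconds, value) pairs, oldest first.
    Simplified: no extrapolation to the window boundaries.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter reset (e.g. a restart), so count from zero.
        increase += value if value < prev else value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


def simple_histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_value) sorted by bound,
    ending with an upper bound of float("inf"). Assumes positive bounds.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return None
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return None
```

For instance, a counter that climbs from 0 to 600 over 300 seconds gives a simple_rate of 2.0 requests per second. The quantile estimate is only as precise as the bucket boundaries, which is why bucket layout matters when instrumenting latency.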

Ecosystem

  • Alertmanager. Receives alerts from Prometheus, deduplicates and groups them, and routes them to receivers such as Slack, PagerDuty, or email.
  • Grafana. The standard visualisation layer on top of Prometheus.
  • Exporters. Adapters that expose metrics from third-party systems (node_exporter, blackbox_exporter, mysqld_exporter).
  • Long-term storage. Thanos, Cortex, Mimir, and VictoriaMetrics add durable retention, horizontal scaling, and querying across multiple clusters.
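
As a rough illustration of the deduplicate-and-group step described for Alertmanager, the sketch below treats each alert as a dictionary of labels, drops exact duplicates, and buckets the rest by a chosen set of labels. The function name and data shapes are hypothetical. The real Alertmanager also handles timing, silences, inhibition, and routing trees, none of which is modelled here.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Drop alerts with identical label sets, then group by the given labels.

    alerts: list of dicts mapping label name -> label value.
    group_by: list of label names that define a notification group.
    """
    seen = set()
    groups = defaultdict(list)
    for labels in alerts:
        # An alert's identity is taken to be its full label set.
        fingerprint = frozenset(labels.items())
        if fingerprint in seen:
            continue  # identical label set: treat as a duplicate
        seen.add(fingerprint)
        # Missing labels become an empty string in the group key.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)
```

Grouping by labels such as alertname and cluster means that many instances failing for the same reason produce one notification rather than one per instance.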
