115 · METRICS · PROMETHEUS · GRAFANA

Metrics

Measure system behavior over time with counters, gauges, and histograms.

If you are new here: Metrics are numerical measurements of your system's behavior, collected at regular intervals and stored as time-series data. Where logs record individual events ("this payment failed at 09:00:01"), metrics record aggregated state ("the error rate for payments is 2.3% right now"). Metrics are the foundation of alerting and dashboards — they let you answer "is the system healthy right now?" in milliseconds. The standard open-source stack is Prometheus (collects and stores metrics) + Grafana (visualizes and alerts). Cloud platforms offer equivalents: CloudWatch (AWS), Cloud Monitoring (GCP), Azure Monitor. Understanding three metric types — counters, gauges, and histograms — and four golden signals tells you almost everything you need to know about instrumenting a production service.

Term          Plain meaning
Metric        A named, numerical measurement with attached labels — http_requests_total{service="payment", status="500"}
Time-series   A sequence of (timestamp, value) pairs for one metric + label combination
Counter       A metric that only increases — total requests served, total errors
Gauge         A metric that can go up or down — current memory use, queue depth
Histogram     Records the distribution of values across buckets — enables latency percentiles
Label         A key-value dimension on a metric — {service="payment", region="eu-west"}
Cardinality   The number of unique label value combinations — high cardinality destroys performance
Prometheus    The de-facto open-source metrics system; uses a pull ("scrape") model
PromQL        Prometheus Query Language — rate(http_errors_total[5m])
Alert         A rule that fires a notification when a metric crosses a threshold

The Problem

Logs tell you what happened event by event, but they're terrible for answering "is the system healthy right now?" Searching logs to determine the current error rate means scanning millions of recent lines on every dashboard refresh — too slow and too expensive for real-time monitoring.

Metrics solve this by pre-aggregating: instead of storing every individual event, you store a running count (1,204,300 requests total) and let the metrics system compute rates and percentiles on query. This makes dashboard refreshes take milliseconds and alert evaluations take microseconds.

In plain terms: metrics are the vital signs monitor in a hospital room — constantly updating numbers (heart rate, blood pressure, oxygen saturation) that tell you the patient's status at a glance. Logs are the detailed medical notes — essential for understanding what happened, but you don't read the notes to check if the patient is alive right now.

The Four Golden Signals

Google's Site Reliability Engineering book identifies four metrics that, together, characterize any service's health. If these four are healthy, your service is almost certainly healthy. If any one is degraded, something is wrong.

Latency — how long does it take to serve a request? Always measure as percentiles, not averages. Average latency can be 50ms while p99 latency is 3 seconds (1% of your users wait 3 seconds — terrible). Measure p50 (median), p95, and p99. Alert on p99.

Traffic — how much demand is the system serving? Requests per second, transactions per minute, bytes per second. Traffic spikes may predict future latency/saturation problems before they materialize.

Errors — what fraction of requests are failing? Both explicit failures (HTTP 5xx, exception thrown) and silent failures (HTTP 200 but wrong data). Alert when error rate exceeds your SLO error budget.

Saturation — how full is the system? CPU utilization, memory usage, disk I/O, connection pool utilization, queue depth. Saturation is often a leading indicator — CPU at 80% today predicts the failures you will see tomorrow when it hits 95%.

In plain terms: latency = speed, traffic = volume, errors = quality, saturation = capacity. A healthy service is fast, handles the load, rarely fails, and isn't about to run out of resources.

Tiny example: your payment service handles 500 req/s with p99 latency of 180ms, error rate 0.1%, CPU at 45%. All four golden signals are green. Then latency jumps to 1.2s at p99 — the first signal to fire. You look at saturation: DB connection pool is at 98% utilization (saturation signal). Root cause: a slow query is holding connections longer, exhausting the pool. Fix the query → latency returns to 180ms → pool utilization drops to 40%. All four signals green again.
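
As a sketch, the four signals translate into PromQL queries like the following, reusing the metric names from the sections below (rate and histogram_quantile are explained there); the db_pool_connections_active and db_pool_connections_max gauges are hypothetical names for the connection-pool metrics:

# Traffic: requests per second
sum(rate(http_requests_total{service="payment"}[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{service="payment", status=~"5.."}[5m]))
  / sum(rate(http_requests_total{service="payment"}[5m]))

# Latency: p99 over the last 5 minutes
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="payment"}[5m])) by (le))

# Saturation: connection pool utilization (hypothetical gauge names)
db_pool_connections_active{service="payment"}
  / db_pool_connections_max{service="payment"}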

Counter: Track Totals and Rates

A counter only goes up (or resets to zero if the process restarts). It records a running total of how many times something has happened:

http_requests_total{service="payment", method="POST", status="200"} 4812300
http_requests_total{service="payment", method="POST", status="500"} 4821

You never alert on the raw counter value — 4,812,300 requests is meaningless without knowing the time window. You compute the rate of change:

# Requests per second over the last 5 minutes
rate(http_requests_total{service="payment"}[5m])
→ 1204.3 req/s
 
# Error rate (fraction of 5xx over total)
rate(http_requests_total{status=~"5.."}[5m])
/
rate(http_requests_total[5m])
→ 0.0023  (0.23% error rate)

Labels add dimensions to counters. {service, method, status} lets you filter by just POST requests, just 500s, just the payment service. Warning: don't use user IDs, order IDs, or other high-cardinality values as labels. Each unique label combination creates a separate time-series in Prometheus — adding user_id to a counter would create millions of time-series (one per user), consuming gigabytes of RAM and crashing Prometheus.

In plain terms: a counter is like an odometer. The raw reading (142,000 miles) tells you about the car's history. The rate (you drove 60 mph for the last hour) tells you what's happening now.
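
A minimal sketch of counter instrumentation with the Python prometheus-client library (mentioned under "Why this matters for you"); the handler and the process() call are hypothetical stand-ins for your application code:

from prometheus_client import Counter

# The client library appends the _total suffix when the metric is exposed.
REQUESTS = Counter(
    "http_requests",                      # exposed as http_requests_total
    "Total HTTP requests",
    ["service", "method", "status"],
)

def handle_payment(request):              # hypothetical request handler
    status = process(request)             # hypothetical business logic
    REQUESTS.labels(service="payment", method="POST", status=str(status)).inc()
    return status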

Gauge: Track Current State

A gauge represents the current value of something that can increase or decrease — it's a snapshot of state at the moment of measurement:

memory_used_bytes{service="payment"} 1287654321       # 1.2 GB — healthy
active_connections{service="payment"} 42              # connection pool usage
queue_depth{queue="order-processing"} 150             # messages waiting
cpu_usage_percent{host="prod-1"} 78                   # getting high

Gauges are most useful for capacity and saturation monitoring. Alert examples:

  • memory_used_bytes > 3_000_000_000 → 3 GB threshold exceeded → scale up or investigate memory leak
  • queue_depth > 10_000 for 5 minutes → consumer is falling behind → investigate consumer or scale it
  • active_connections / max_connections > 0.9 → connection pool at 90% → likely to cause errors soon

In plain terms: a gauge is like a car's fuel gauge — it tells you the current level, not how much you've used in total. Essential for knowing if you're about to run out.

Concrete sketch: Kubernetes horizontal pod autoscaling (HPA) uses gauge metrics. When cpu_usage_percent stays above 70% for 2 minutes (gauge metric from the metrics-server), HPA automatically scales the deployment from 3 to 6 pods. When CPU drops below 40%, it scales back down. The entire autoscaling loop runs on gauge metrics.
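
A gauge sketch in the same prometheus-client style; the message callbacks and the pool object with its active_count() method are hypothetical:

from prometheus_client import Gauge

QUEUE_DEPTH = Gauge("queue_depth", "Messages waiting", ["queue"])
ACTIVE_CONNECTIONS = Gauge("active_connections", "Open connections", ["service"])

def on_message_enqueued():
    QUEUE_DEPTH.labels(queue="order-processing").inc()    # gauges can go up...

def on_message_processed():
    QUEUE_DEPTH.labels(queue="order-processing").dec()    # ...and back down

def report_pool_usage(pool):                              # hypothetical pool object
    ACTIVE_CONNECTIONS.labels(service="payment").set(pool.active_count())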

Histogram: Measure Distributions

A histogram is the most powerful and most misunderstood metric type. It records how many observations fell into each predefined bucket:

# Latency histogram for the payment service
http_request_duration_seconds_bucket{le="0.01"}   980   # 980 requests ≤ 10ms
http_request_duration_seconds_bucket{le="0.1"}   1160   # 1160 requests ≤ 100ms
http_request_duration_seconds_bucket{le="0.5"}   1195   # 1195 requests ≤ 500ms
http_request_duration_seconds_bucket{le="1.0"}   1199   # 1199 requests ≤ 1s
http_request_duration_seconds_bucket{le="+Inf"} 1200   # all 1200 requests
http_request_duration_seconds_sum            42.3   # total time
http_request_duration_seconds_count         1200   # total requests

From these buckets, you can compute percentiles:

histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
→ 0.42  (p99 latency = 420ms)
 
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
→ 0.009  (p50 = 9ms — most requests are very fast)

Why percentiles matter: the average of [9ms, 9ms, 9ms, 9ms, 420ms] is 91ms — suggesting a slow service. The p99 is 420ms — one in a hundred users waits 420ms. The p50 is 9ms — most users experience 9ms. Average hides the long tail; percentiles expose it.

Choosing buckets: buckets must be pre-defined and should cover your expected range. For an API: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5] seconds. Too few buckets → imprecise percentiles. Too many → higher memory and storage cost.
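
A sketch of how those buckets might be declared with prometheus-client; the bucket list mirrors the API example above, and the handler and process() call are hypothetical:

import time
from prometheus_client import Histogram

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency",
    ["service"],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
)

def handle_request(request):                       # hypothetical handler
    start = time.monotonic()
    try:
        return process(request)                    # hypothetical business logic
    finally:
        REQUEST_LATENCY.labels(service="payment").observe(time.monotonic() - start)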

In plain terms: a histogram is like a speed camera report. Instead of telling you "the average speed was 55mph" it tells you "90% of cars were under 60mph, 99% were under 75mph, and 1% were over 75mph." The distribution tells the real story.

Prometheus: The Scrape Model

Prometheus uses a pull model — instead of each service pushing metrics to Prometheus, Prometheus periodically (commonly every 15 seconds; the scrape interval is configurable) makes an HTTP GET to each service's /metrics endpoint:

GET http://payment-service:8080/metrics

Response:
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="POST",status="200"} 4812300
http_requests_total{method="POST",status="500"} 4821
# HELP http_request_duration_seconds Request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 1160
...

Prometheus stores these time-series in its TSDB (Time-Series Database), optimized for range queries and aggregations. Its query language, PromQL, is powerful but takes practice.
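
On the service side, exposing /metrics can be as simple as starting the client library's built-in HTTP server; a minimal sketch with prometheus-client, assuming the port from the example above:

from prometheus_client import start_http_server

# Serves the text exposition format shown above at http://<host>:8080/metrics.
# In a web framework you would typically mount the metrics endpoint on the
# existing server instead of opening a second port.
start_http_server(8080)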

Service discovery: Prometheus finds services to scrape via Kubernetes API (annotate your pods with prometheus.io/scrape: "true"), Consul, or static config. New pods are auto-discovered; terminated pods are auto-removed from the scrape list.
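
For example, a static scrape configuration might look like the sketch below (the Kubernetes variant uses kubernetes_sd_configs plus relabeling instead of a fixed target list; the job name and target are hypothetical):

scrape_configs:
  - job_name: payment-service
    scrape_interval: 15s
    static_configs:
      - targets: ["payment-service:8080"]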

Alerting: Prometheus evaluates alert rules at a configurable interval (commonly every 15 seconds to one minute). When a condition holds true for a configured duration (e.g., error_rate > 0.05 for 5 minutes), Prometheus fires an alert to Alertmanager, which routes it to PagerDuty, Slack, email, etc.
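
A sketch of what such a rule could look like in a Prometheus rule file, reusing the error-rate query from the counter section; the group name, severity label, and 5% threshold are hypothetical choices:

groups:
  - name: payment-alerts
    rules:
      - alert: PaymentHighErrorRate
        # Error rate above 5% of requests, sustained for 5 minutes
        expr: |
          sum(rate(http_requests_total{service="payment", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="payment"}[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Payment service error rate above 5% for 5 minutes"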

Grafana reads from Prometheus via PromQL and renders time-series charts. Standard practice: import community dashboards (e.g., the Kubernetes cluster overview dashboard from grafana.com) and customize for your services.

The Trade-offs

Property          Metrics                                Logs
Storage cost      Very low (aggregated)                  High (every event)
Query speed       Milliseconds                           Seconds to minutes
Detail level      Aggregated — no event context          Full event detail
Best for          Alerts, dashboards, trends             Debugging specific incidents
Cardinality risk  High — must manage labels carefully    Low — any field, any value

Why this matters for you

Instrument every service you build with the four golden signals from day one. The minimum viable metrics setup: a counter for total requests (labeled by status code), a histogram for request duration, and a gauge for active connections or queue depth. For Prometheus, use client libraries (prometheus-client for Python, micrometer for Java, prom-client for Node). Alert on p99 latency (not average), error rate (not error count), and saturation metrics (not raw CPU %). The most common mistake: using high-cardinality labels like user IDs or request IDs — these cause Prometheus to run out of memory and restart.

Next: Distributed Tracing — following a request through multiple services to find where latency lives.

Diagram: Four golden signals (Google SRE) — Latency, Traffic, Errors, Saturation. If all four are healthy, your service is healthy.