Logs, metrics, and traces — the three pillars of understanding distributed systems.
If you are new here: Observability is the property of a system that tells you how well you can understand its internal state from its external outputs. A system is highly observable if, when something goes wrong, you can quickly figure out what broke, where it broke, and why — without logging into servers and guessing. The concept comes from control theory, but for software it boils down to three complementary data types: Logs (discrete timestamped events — "this thing happened"), Metrics (aggregated numbers over time — "error rate is 2% right now"), and Traces (cross-service request journeys — "this API call touched 6 services and spent 80ms in the database"). Together, these three pillars give you the full picture of a distributed system's health. Understanding observability is foundational for any engineer building or operating production systems — debugging without it is like trying to fix a car engine in the dark.
| Term | Plain meaning |
|---|---|
| Observability | The ability to understand a system's internal state from its external signals |
| Telemetry | The data your system emits — logs, metrics, traces collectively |
| Three pillars | Logs, Metrics, Traces — the three primary signal types |
| Logs | Discrete, timestamped records of events — "payment failed for user 42" |
| Metrics | Numerical measurements aggregated over time — error rate, p99 latency, CPU usage |
| Traces | Records of a single request's journey through multiple services |
| Cardinality | How many unique values a label can take — high cardinality (user IDs) is expensive in metrics |
| Instrumentation | Adding code to emit telemetry — logging statements, metric counters, trace spans |
| OpenTelemetry | The CNCF standard for telemetry — one SDK to emit all three signal types |
The black-box problem: without observability, your system is opaque. Imagine your on-call pager fires at 2am: "checkout error rate elevated." You check your dashboard — something is wrong, but you have no way to know where or why. You're flying blind.
In a monolith on one server, you might SSH in, tail the logs, and find the answer in 5 minutes. In a distributed system with 20 microservices, each running 10 instances across 3 regions, you need instrumentation that was built in before the incident — you can't add it retroactively at 2am.
In plain terms: observability is not about having dashboards. It's about building your system so that when it misbehaves, you can answer "where and why?" in minutes instead of hours — by looking at data, not by guessing.
Analogy: a modern aircraft cockpit vs flying by feel. An instrument rating lets a pilot fly through clouds using instruments alone. Observability is your instrument panel — airspeed (throughput), altitude (error rate), fuel (resource saturation), turn coordinator (latency). Without instruments, you can only fly in perfect conditions, and any turbulence becomes dangerous.
Logs are the oldest and most intuitive observability signal. Every time something notable happens in your code — a request arrives, a payment fails, a user signs up, an exception is thrown — your code writes a record to a log.
A structured log entry looks like:
{
"timestamp": "2024-01-01T09:00:01.342Z",
"level": "ERROR",
"message": "Payment charge failed",
"service": "payment-service",
"user_id": "u-42",
"amount_cents": 9900,
"error_code": "card_declined",
"trace_id": "abc-123-xyz"
}

Structured vs unstructured: old-style logs are plain text ("ERROR: payment failed for user 42 at 09:00:01"). Modern practice is structured logging — JSON objects with defined fields. Structured logs can be queried: "show me all payment failures where amount_cents > 5000 in the last hour." Unstructured logs require regex parsing, which is fragile and slow.
In plain terms: logs answer "what happened, exactly, and to whom?" They're your audit trail. When a customer calls support saying their order failed at 9:00am Tuesday, logs let you find that exact event and see every detail.
Logs are high-volume: a busy service might emit millions of log lines per hour. This creates cost and query-speed challenges. Standard practice: log at INFO level for significant events, DEBUG for fine-grained details (often disabled in production), WARN for recoverable issues, ERROR for failures that need attention.
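To make the mechanics concrete, here is a minimal sketch of structured logging with Python's standard logging module. The JsonFormatter class and the field names (service, user_id, and so on) are illustrative, chosen to mirror the example entry above; in practice you would more likely reach for a library such as structlog or your platform's logging SDK.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": "payment-service",  # illustrative service name
        }
        # Fields passed via `extra=` become attributes on the record,
        # and from there become queryable fields in your log store.
        for key in ("user_id", "amount_cents", "error_code", "trace_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)  # DEBUG is typically disabled in production

logger.error(
    "Payment charge failed",
    extra={"user_id": "u-42", "amount_cents": 9900, "error_code": "card_declined"},
)
```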
Tiny example: Elasticsearch + Logstash + Kibana (the "ELK stack") and Grafana Loki are two of the most common log storage and query platforms. You ship logs from your app to a central store, then search: service=payment-service AND level=ERROR AND timestamp:[2024-01-01 TO now]. Results come back in seconds across billions of log lines.
Metrics are aggregated numerical measurements taken at regular intervals. Where logs tell you what happened, metrics tell you how much and how often. They're designed for alerting and dashboards.
The four golden signals (from Google's SRE book):
- Latency: how long requests take to serve
- Traffic: how much demand the system is handling (e.g. requests per second)
- Errors: the rate of requests that fail
- Saturation: how close the system's resources (CPU, memory, queue depth) are to their limits
Metric types:
- Counter: a value that only ever increases (total requests served, total errors)
- Gauge: a point-in-time value that can go up or down (current memory usage, open connections)
- Histogram: observations bucketed by value, e.g. request durations recorded as [0–10ms: 500, 10–100ms: 200, 100–1000ms: 30]. Lets you compute p95, p99 percentiles.

In plain terms: if logs are the flight data recorder (every event in detail), metrics are the cockpit instruments (real-time summary numbers). You glance at instruments constantly while flying; you check the recorder only after an incident.
Prometheus + Grafana is the standard open-source stack. Your app exposes a /metrics HTTP endpoint returning all metric values in Prometheus format. Prometheus scrapes that endpoint every 15 seconds and stores the time-series data. Grafana queries Prometheus and renders dashboards and alerts. A rule like "alert if error_rate > 0.05 for 5 minutes" pages the on-call engineer.
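As a rough sketch of what this looks like in application code, here is a counter, gauge, and histogram exposed on a /metrics endpoint using the prometheus_client Python library; the metric names, labels, bucket boundaries, and port are illustrative.

```python
# Minimal sketch using prometheus_client (pip install prometheus-client).
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["route", "status"])
IN_FLIGHT = Gauge("http_requests_in_flight", "Requests currently being handled")
LATENCY = Histogram(
    "http_request_duration_seconds", "Request duration",
    buckets=(0.01, 0.1, 1.0),  # 0-10ms, 10-100ms, 100ms-1s, +Inf
)

def handle_checkout():
    IN_FLIGHT.inc()
    start = time.time()
    try:
        time.sleep(random.uniform(0.01, 0.2))  # simulated work
        REQUESTS.labels(route="/checkout", status="200").inc()
    finally:
        LATENCY.observe(time.time() - start)
        IN_FLIGHT.dec()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:              # keep the process alive for the demo
        handle_checkout()
```

Prometheus scrapes these raw series and derives rates and percentiles at query time, for example applying histogram_quantile(0.99, ...) to the duration buckets to drive a p99 dashboard panel or alert rule.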
Distributed traces answer a question logs and metrics can't: "this request took 2 seconds — which service was responsible?"
A trace represents the full lifecycle of one request across all services it touches. It's made of spans — individual operations within the trace:
Trace: checkout request (total: 1.2s)
├─ API Gateway span: 5ms
├─ Order Service span: 200ms
│ ├─ DB query: 150ms ← this is the bottleneck
│ └─ Cache check: 2ms
├─ Payment Service span: 800ms
│ ├─ Stripe API call: 750ms ← also slow
│ └─ DB write: 15ms
└─ Notification Service span: 50ms (async)
Each span records: service name, operation name, start time, duration, status (OK/error), and key-value attributes (user ID, order ID, HTTP status code).
Trace propagation: for traces to work across services, a trace context (trace ID + span ID) must be passed in every request header. The W3C Trace Context standard defines traceparent: 00-<trace-id>-<span-id>-01. Each service reads this header, starts a child span, and passes the context to the next service.
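The sketch below shows those mechanics by hand, with hypothetical helper functions (extract_trace_context, outgoing_headers) just for illustration; in a real service an OpenTelemetry propagator handles this for you.

```python
import secrets

def extract_trace_context(headers: dict) -> tuple[str, str | None]:
    """Read the W3C traceparent header: version-traceid-parentspanid-flags."""
    traceparent = headers.get("traceparent")
    if traceparent:
        _version, trace_id, parent_span_id, _flags = traceparent.split("-")
        return trace_id, parent_span_id
    # No incoming context: this service is the root of a brand-new trace.
    return secrets.token_hex(16), None

def outgoing_headers(trace_id: str) -> dict:
    """Start a child span and inject its context into the downstream request."""
    child_span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    return {"traceparent": f"00-{trace_id}-{child_span_id}-01"}

# A request arrives carrying a trace context from the caller...
incoming = {"traceparent": "00-" + "ab" * 16 + "-" + "cd" * 8 + "-01"}
trace_id, parent_span_id = extract_trace_context(incoming)
# ...and the same trace_id travels onward with a fresh span id for the next hop.
print(outgoing_headers(trace_id))
```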
In plain terms: traces are like a FedEx tracking page but for a request through your microservices. You can see exactly which "depot" (service) the request visited, how long it waited at each stop, and where it got stuck.
Jaeger and Zipkin are the common open-source trace backends. Tempo (Grafana Labs) stores traces and integrates with Grafana dashboards. You click a trace ID in a log, jump to the trace view, and see the waterfall — immediately identifying that the database query at hop 3 took 10× longer than usual.
Each pillar answers a different class of question. They're most powerful when connected:
Workflow example: Grafana fires an alert: "p99 latency for checkout is 3s (threshold: 500ms) — started 8 minutes ago."
1. Metrics: the Grafana dashboard narrows the latency spike down to the checkout service.
2. Logs: query the logs for checkout: level=WARN AND duration_ms > 2000 → all the slow requests have db_query=true.
3. Traces: open a trace for one of the slow requests → the time is spent in payment-service → DB query taking 2.8s. Click through to the DB span attributes: query="SELECT * FROM orders WHERE user_id = ?" → missing index on user_id.

Each pillar gave you a different piece of the puzzle. Using only one: guessing. Using all three: diagnosed in under 5 minutes.
OpenTelemetry (OTel) is the emerging standard for unified instrumentation. One SDK emits all three signal types. Vendor-agnostic: send to Jaeger, Prometheus, Datadog, or any backend. Define once, change backends without rewriting instrumentation.
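A minimal sketch of what instrumentation with the OpenTelemetry Python SDK looks like, assuming the opentelemetry-sdk package and a console exporter for demonstration:

```python
# Minimal sketch with the OpenTelemetry Python SDK (opentelemetry-sdk package).
# The console exporter is for demonstration; production setups typically
# use an OTLP exporter pointed at a collector or backend.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "payment-service"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("payment-service")

with tracer.start_as_current_span("charge_card") as span:
    span.set_attribute("user.id", "u-42")
    span.set_attribute("amount.cents", 9900)
    # Nested operations become child spans automatically.
    with tracer.start_as_current_span("stripe_api_call"):
        pass  # call out to the payment provider here
```

Swapping ConsoleSpanExporter for an OTLP exporter sends the same spans to Jaeger, Tempo, or a commercial backend without touching the instrumentation code.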
Observability has real costs — the telemetry doesn't collect itself:
| Signal | Cost | Query speed | Best for |
|---|---|---|---|
| Logs | High storage volume | Slow on large datasets | Debugging specific events, audit trails |
| Metrics | Low (aggregated) | Very fast | Alerting, dashboards, trend analysis |
| Traces | Medium (sampled) | Medium | Latency attribution, service dependency mapping |
You can't add observability after the fact — it must be designed in from the start. The minimum viable observability stack for any production service: (1) structured logs shipped to a central store (Loki, CloudWatch, Datadog); (2) metrics exposed on /metrics scraped by Prometheus, with alerts on the four golden signals; (3) distributed tracing with OpenTelemetry, even if only sampling 1% of traces initially. This combination gives you the tools to debug any production incident in minutes rather than hours.
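For the 1% sampling in point (3), one possible configuration with the OpenTelemetry Python SDK is a parent-based ratio sampler; the 0.01 ratio here is simply the starting point suggested above.

```python
# One way to sample roughly 1% of traces with the OpenTelemetry Python SDK.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 1% of new traces, but always follow the parent's sampling decision
# so a trace is never half-recorded across services.
sampler = ParentBased(root=TraceIdRatioBased(0.01))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```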
Next: Logging — structured logging, log levels, and building a log pipeline that's queryable at scale.