Logs, metrics, and traces — the three pillars of understanding distributed systems.
If you are new here: Observability is the property of a system that tells you how well you can understand its internal state from its external outputs. A system is highly observable if, when something goes wrong, you can quickly figure out what broke, where it broke, and why — without logging into servers and guessing. The concept comes from control theory, but for software it boils down to three complementary data types: Logs (discrete timestamped events — "this thing happened"), Metrics (aggregated numbers over time — "error rate is 2% right now"), and Traces (cross-service request journeys — "this API call touched 6 services and spent 80ms in the database"). Together, these three pillars give you the full picture of a distributed system's health. Understanding observability is foundational for any engineer building or operating production systems — debugging without it is like trying to fix a car engine in the dark.
| Term | Plain meaning |
|---|---|
| Observability | The ability to understand a system's internal state from its external signals |
| Telemetry | The data your system emits — logs, metrics, traces collectively |
| Three pillars | Logs, Metrics, Traces — the three primary signal types |
| Logs | Discrete, timestamped records of events — "payment failed for user 42" |
| Metrics | Numerical measurements aggregated over time — error rate, p99 latency, CPU usage |
| Traces | Records of a single request's journey through multiple services |
| Cardinality | How many unique values a label can take — high cardinality (user IDs) is expensive in metrics |
| Instrumentation | Adding code to emit telemetry — logging statements, metric counters, trace spans |
| OpenTelemetry | The CNCF standard for telemetry — one SDK to emit all three signal types |
The black-box problem: without observability, your system is opaque. Imagine your on-call pager fires at 2am: "checkout error rate elevated." You check your dashboard — something is wrong, but you have no way to know where or why. You're flying blind.
In a monolith on one server, you might SSH in, tail the logs, and find the answer in 5 minutes. In a distributed system with 20 microservices, each running 10 instances across 3 regions, you need instrumentation that was built in before the incident — you can't add it retroactively at 2am.
In plain terms: observability is not about having dashboards. It's about building your system so that when it misbehaves, you can answer "where and why?" in minutes instead of hours — by looking at data, not by guessing.
Analogy: a modern aircraft cockpit vs flying by feel. An instrument rating lets a pilot fly through clouds using instruments alone. Observability is your instrument panel — airspeed (throughput), altitude (error rate), fuel (resource saturation), turn coordinator (latency). Without instruments, you can only fly in perfect conditions, and any turbulence becomes dangerous.
Logs are the oldest and most intuitive observability signal. Every time something notable happens in your code — a request arrives, a payment fails, a user signs up, an exception is thrown — your code writes a record to a log.
A structured log entry looks like:
{
"timestamp": "2024-01-01T09:00:01.342Z",
"level": "ERROR",
"message": "Payment charge failed",
"service": "payment-service",
"user_id": "u-42",
"amount_cents": 9900,
"error_code": "card_declined",
"trace_id": "abc-123-xyz"
}

Structured vs unstructured: old-style logs are plain text ("ERROR: payment failed for user 42 at 09:00:01"). Modern practice is structured logging — JSON objects with defined fields. Structured logs can be queried: "show me all payment failures where amount_cents > 5000 in the last hour." Unstructured logs require regex parsing, which is fragile and slow.
In plain terms: logs answer "what happened, exactly, and to whom?" They're your audit trail. When a customer calls support saying their order failed at 9:00am Tuesday, logs let you find that exact event and see every detail.
Logs are high-volume: a busy service might emit millions of log lines per hour. This creates cost and query-speed challenges. Standard practice: log at INFO level for significant events, DEBUG for fine-grained details (often disabled in production), WARN for recoverable issues, ERROR for failures that need attention.
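To make the mechanics concrete, here is a minimal sketch of structured logging with Python's standard logging module. The JsonFormatter class and the field names (service, user_id, and so on) are illustrative, chosen to mirror the example entry above; in practice you would more likely reach for a library such as structlog or your platform's logging SDK.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": "payment-service",  # illustrative service name
        }
        # Fields passed via `extra=` become attributes on the record,
        # and from there become queryable fields in your log store.
        for key in ("user_id", "amount_cents", "error_code", "trace_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)  # DEBUG is typically disabled in production

logger.error(
    "Payment charge failed",
    extra={"user_id": "u-42", "amount_cents": 9900, "error_code": "card_declined"},
)
```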
Tiny example: Elasticsearch + Logstash + Kibana (the "ELK stack") and Grafana Loki are two of the most common log storage and query platforms. You ship logs from your app to a central store, then search: service=payment-service AND level=ERROR AND timestamp:[2024-01-01 TO now]. Results come back in seconds across billions of log lines.
Metrics are aggregated numerical measurements taken at regular intervals. Where logs tell you what happened, metrics tell you how much and how often. They're designed for alerting and dashboards.
The four golden signals (from Google's SRE book):
- Latency: how long requests take to serve
- Traffic: how much demand the system is handling (e.g. requests per second)
- Errors: the rate of requests that fail
- Saturation: how close the system's resources (CPU, memory, queue depth) are to their limits
Metric types:
- Counter: a value that only ever increases (total requests served, total errors)
- Gauge: a point-in-time value that can go up or down (current memory usage, open connections)
- Histogram: observations bucketed by value, e.g. request durations recorded as [0–10ms: 500, 10–100ms: 200, 100–1000ms: 30]. Lets you compute p95, p99 percentiles.

In plain terms: if logs are the flight data recorder (every event in detail), metrics are the cockpit instruments (real-time summary numbers). You glance at instruments constantly while flying; you check the recorder only after an incident.
Prometheus + Grafana is the standard open-source stack. Your app exposes a /metrics HTTP endpoint returning all metric values in Prometheus format. Prometheus scrapes that endpoint every 15 seconds and stores the time-series data. Grafana queries Prometheus and renders dashboards and alerts. A rule like "alert if error_rate > 0.05 for 5 minutes" pages the on-call engineer.
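As a rough sketch of what this looks like in application code, here is a counter, gauge, and histogram exposed on a /metrics endpoint using the prometheus_client Python library; the metric names, labels, bucket boundaries, and port are illustrative.

```python
# Minimal sketch using prometheus_client (pip install prometheus-client).
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["route", "status"])
IN_FLIGHT = Gauge("http_requests_in_flight", "Requests currently being handled")
LATENCY = Histogram(
    "http_request_duration_seconds", "Request duration",
    buckets=(0.01, 0.1, 1.0),  # 0-10ms, 10-100ms, 100ms-1s, +Inf
)

def handle_checkout():
    IN_FLIGHT.inc()
    start = time.time()
    try:
        time.sleep(random.uniform(0.01, 0.2))  # simulated work
        REQUESTS.labels(route="/checkout", status="200").inc()
    finally:
        LATENCY.observe(time.time() - start)
        IN_FLIGHT.dec()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:              # keep the process alive for the demo
        handle_checkout()
```

Prometheus scrapes these raw series and derives rates and percentiles at query time, for example applying histogram_quantile(0.99, ...) to the duration buckets to drive a p99 dashboard panel or alert rule.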
Distributed traces answer a question logs and metrics can't: "this request took 2 seconds — which service was responsible?"
A trace represents the full lifecycle of one request across all services it touches. It's made of spans — individual operations within the trace:
Trace: checkout request (total: 1.2s)
├─ API Gateway span: 5ms
├─ Order Service span: 200ms
│ ├─ DB query: 150ms ← this is the bottleneck
│ └─ Cache check: 2ms
├─ Payment Service span: 800ms
│ ├─ Stripe API call: 750ms ← also slow
│ └─ DB write: 15ms
└─ Notification Service span: 50ms (async)
Each span records: service name, operation name, start time, duration, status (OK/error), and key-value attributes (user ID, order ID, HTTP status code).
Trace propagation: for traces to work across services, a trace context (trace ID + span ID) must be passed in every request header. The W3C Trace Context standard defines traceparent: 00-<trace-id>-<span-id>-01. Each service reads this header, starts a child span, and passes the context to the next service.
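The sketch below shows those mechanics by hand, with hypothetical helper functions (extract_trace_context, outgoing_headers) just for illustration; in a real service an OpenTelemetry propagator handles this for you.

```python
import secrets

def extract_trace_context(headers: dict) -> tuple[str, str | None]:
    """Read the W3C traceparent header: version-traceid-parentspanid-flags."""
    traceparent = headers.get("traceparent")
    if traceparent:
        _version, trace_id, parent_span_id, _flags = traceparent.split("-")
        return trace_id, parent_span_id
    # No incoming context: this service is the root of a brand-new trace.
    return secrets.token_hex(16), None

def outgoing_headers(trace_id: str) -> dict:
    """Start a child span and inject its context into the downstream request."""
    child_span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    return {"traceparent": f"00-{trace_id}-{child_span_id}-01"}

# A request arrives carrying a trace context from the caller...
incoming = {"traceparent": "00-" + "ab" * 16 + "-" + "cd" * 8 + "-01"}
trace_id, parent_span_id = extract_trace_context(incoming)
# ...and the same trace_id travels onward with a fresh span id for the next hop.
print(outgoing_headers(trace_id))
```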
In plain terms: traces are like a FedEx tracking page but for a request through your microservices. You can see exactly which "depot" (service) the request visited, how long it waited at each stop, and where it got stuck.
Jaeger and Zipkin are the common open-source trace backends. Tempo (Grafana Labs) stores traces and integrates with Grafana dashboards. You click a trace ID in a log, jump to the trace view, and see the waterfall — immediately identifying that the database query at hop 3 took 10× longer than usual.
Each pillar answers a different class of question. They're most powerful when connected:
Workflow example: Grafana fires an alert: "p99 latency for checkout is 3s (threshold: 500ms) — started 8 minutes ago."
1. Metrics: the Grafana dashboard narrows the latency spike down to the checkout service.
2. Logs: query the logs for checkout: level=WARN AND duration_ms > 2000 → all the slow requests have db_query=true.
3. Traces: open a trace for one of the slow requests → the time is spent in payment-service → DB query taking 2.8s. Click through to the DB span attributes: query="SELECT * FROM orders WHERE user_id = ?" → missing index on user_id.

Each pillar gave you a different piece of the puzzle. Using only one: guessing. Using all three: diagnosed in under 5 minutes.
OpenTelemetry (OTel) is the emerging standard for unified instrumentation. One SDK emits all three signal types. Vendor-agnostic: send to Jaeger, Prometheus, Datadog, or any backend. Define once, change backends without rewriting instrumentation.
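A minimal sketch of what instrumentation with the OpenTelemetry Python SDK looks like, assuming the opentelemetry-sdk package and a console exporter for demonstration:

```python
# Minimal sketch with the OpenTelemetry Python SDK (opentelemetry-sdk package).
# The console exporter is for demonstration; production setups typically
# use an OTLP exporter pointed at a collector or backend.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "payment-service"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("payment-service")

with tracer.start_as_current_span("charge_card") as span:
    span.set_attribute("user.id", "u-42")
    span.set_attribute("amount.cents", 9900)
    # Nested operations become child spans automatically.
    with tracer.start_as_current_span("stripe_api_call"):
        pass  # call out to the payment provider here
```

Swapping ConsoleSpanExporter for an OTLP exporter sends the same spans to Jaeger, Tempo, or a commercial backend without touching the instrumentation code.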
Observability has real costs — the telemetry doesn't collect itself:
| Signal | Cost | Query speed | Best for |
|---|---|---|---|
| Logs | High storage volume | Slow on large datasets | Debugging specific events, audit trails |
| Metrics | Low (aggregated) | Very fast | Alerting, dashboards, trend analysis |
| Traces | Medium (sampled) | Medium | Latency attribution, service dependency mapping |
You can't add observability after the fact — it must be designed in from the start. The minimum viable observability stack for any production service: (1) structured logs shipped to a central store (Loki, CloudWatch, Datadog); (2) metrics exposed on /metrics scraped by Prometheus, with alerts on the four golden signals; (3) distributed tracing with OpenTelemetry, even if only sampling 1% of traces initially. This combination gives you the tools to debug any production incident in minutes rather than hours.
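For the 1% sampling in point (3), one possible configuration with the OpenTelemetry Python SDK is a parent-based ratio sampler; the 0.01 ratio here is simply the starting point suggested above.

```python
# One way to sample roughly 1% of traces with the OpenTelemetry Python SDK.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 1% of new traces, but always follow the parent's sampling decision
# so a trace is never half-recorded across services.
sampler = ParentBased(root=TraceIdRatioBased(0.01))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```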
Next: Logging — structured logging, log levels, and building a log pipeline that's queryable at scale.