Build systems that behave correctly under load and partial failure.
If you are new here: Reliability is correctness under real-world mess: slow disks, flaky networks, bad deploys, and retries. A system can be available (returns HTTP 200) and still unreliable (wrong numbers, lost writes).
| Term | Plain meaning |
|---|---|
| Durability | After you “committed,” data survives crashes and restarts |
| Idempotency | Doing the same operation twice is safe (no double charge) |
| Fault tolerance | Design so one failure does not corrupt data or take down everything |
| Reconciliation | Batch or streaming checks that compare two sources of truth |
Your dashboard is green. Latency looks fine. Uptime is 99.99% on the chart. Then finance runs a reconciliation job and discovers that thousands of balances were wrong for six hours — not because the database was down, but because a replica served stale reads and nobody noticed.
The API never stopped returning 200 OK. Users did not see a spinner or a 500 page — so monitoring said you were “up.” The failure was silent until someone reconciled the books.
That tension is what reliability is about: not only staying online, but behaving correctly when you are.
Reliability means the system does the right thing, consistently: correct data, valid invariants, durable writes — not just HTTP 200.
Analogy: Think of a bank teller who is always at the window (available) but sometimes hands you another customer’s slip. The line moved; your request “completed.” The outcome is still unacceptable. Reliability is about matching reality — ledger balance, inventory count, permission check — not just returning bytes on time.
Under the hood, reliable systems prove correctness with transactions, constraints, replication protocols, and tests that fail loudly instead of whispering wrong numbers into the UI.
The cost: strong guarantees usually add latency, coordination, and operational rigor. You earn trust in exchange for complexity.
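A minimal sketch of "fail loudly" with a database constraint, using SQLite (the schema and names are illustrative, not from the original):

```python
import sqlite3

# An in-memory ledger whose invariant the database itself enforces.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        id      TEXT PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance >= 0)  -- invariant: no negative balances
    )
""")
conn.execute("INSERT INTO accounts VALUES ('alice', 1000)")

try:
    with conn:  # transaction scope: commit on success, roll back on exception
        conn.execute("UPDATE accounts SET balance = balance - 1500 WHERE id = 'alice'")
except sqlite3.IntegrityError as exc:
    print(f"rejected: {exc}")  # the constraint fails loudly; no wrong number reaches the UI
```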
Durability is reliability’s sibling: the property you need when someone asks “did we lose the money?” A response can be available (you got 200 OK) and still not durable if the write only lived in RAM and the host died before disk.
In plain terms: durability means committed data survives process crashes, power loss, and restarts — usually via write-ahead logs, fsync boundaries, and replication after the storage layer acknowledges the write.
Analogy: Think of mailing a cheque: “I clicked send” is not the same as “the bank cleared it.” Durability is the clearing step — without it, reliability is theater.
The cost: fsync and replication add latency on the write path. High-throughput systems sometimes flirt with async replication; that is a conscious trade of durability for speed, not a free lunch.
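What the fsync boundary looks like in code, as a minimal append-only sketch (the file name and record format are illustrative):

```python
import os

def durable_append(path: str, record: bytes) -> None:
    """Append a record and return only once it has reached stable storage."""
    # O_APPEND keeps concurrent appends from interleaving mid-record.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, record + b"\n")
        os.fsync(fd)  # the durability boundary: before this, a crash can lose the write
    finally:
        os.close(fd)

durable_append("ledger.wal", b'{"op": "credit", "amount_cents": 1999}')
```

Real write-ahead logs also fsync the directory when creating the file and batch records to amortize the fsync cost; the sketch keeps only the boundary itself.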
Not every failure is a server catching fire. Some are electrons, some are logic, some are humans on call at 2am.
Hardware faults are physics: memory bit flips, bad disk sectors, power blips, NICs that drop packets. Software faults are mistakes in code and concurrency: race conditions, bad assumptions, dependency upgrades that change behavior. Human faults are process: wrong config, a skipped step in a runbook, `terraform apply` against prod.
In plain terms: reliability engineering starts by sorting what can break — because you cannot test or monitor “everything” as one blob.
Analogy: Think of three doors into the same house: you lock the front door, the windows, and the garage differently. Checksums and ECC address hardware. Code review and fuzzing address software. Automation, guardrails, and blast-radius limits address humans.
The cost: defense in depth means multiple layers — each with its own tools and on-call playbooks.
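To make one of those layers concrete, here is a checksum check that fails closed, as a minimal sketch (the function name and digest choice are illustrative):

```python
import hashlib

def verify(payload: bytes, expected_sha256: str) -> bytes:
    """Fail closed: refuse to hand back bytes whose digest does not match."""
    digest = hashlib.sha256(payload).hexdigest()
    if digest != expected_sha256:
        raise ValueError(f"corruption detected: got {digest}, expected {expected_sha256}")
    return payload  # safe to use; retrying or alerting is the caller's job
```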
You will never eliminate faults. Fault tolerance means designing so a fault does not become a user-visible lie or a cascading collapse.
Checksums and hashes detect corruption on the wire or on disk — fail closed or retry instead of trusting garbage. Retries with backoff cover transient failures, but only work safely with idempotency keys so the fifth retry does not charge a card five times. Circuit breakers stop one sick service from holding threads open until the whole fleet suffocates.
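A minimal retry-with-backoff sketch (the exception type, attempt budget, and delays are illustrative):

```python
import random
import time

class TransientError(Exception):
    """Stand-in for timeouts and connection resets worth retrying."""

def retry_with_backoff(call, attempts: int = 5, base_delay: float = 0.1):
    """Retry a flaky call with exponential backoff and full jitter.

    For non-idempotent operations this is only safe when paired with an
    idempotency key, as in the HTTP example below.
    """
    for attempt in range(attempts):
        try:
            return call()
        except TransientError:
            if attempt == attempts - 1:
                raise  # budget exhausted: fail loudly instead of retrying forever
            # Jitter spreads retries so a recovering dependency is not stampeded.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```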
Idempotency example: A payment API might accept a header or body field the client generates once per user action:
```http
POST /v1/charges HTTP/1.1
Host: payments.example.com
Idempotency-Key: 7b291f44-8c2e-4b1a-9d0e-3f8a1c2b4e5f
Content-Type: application/json

{"amount_cents": 1999, "customer": "cus_abc"}
```

If the client times out and retries with the same key, the server returns the same charge result instead of creating a duplicate.
In plain terms: fault tolerance is mechanical empathy — assume components will fail, and give the system a way to recover without lying.
The cost: retries add latency; idempotency needs storage; circuit breakers need tuning. Cheap to sketch in a diagram, expensive to get right in production.
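A minimal circuit-breaker sketch (the thresholds are illustrative, and production breakers add a half-open probe state rather than resetting the counter outright):

```python
import time

class CircuitBreaker:
    """Open after repeated failures; reject fast while open instead of holding threads."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.failures = 0  # cool-down elapsed: let a trial request through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit again
        return result
```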
When the UI lies quietly, metrics will not save you — they aggregate success codes, not business truth. That is why mature systems run reconciliation jobs: compare API-reported totals to ledger rows, inventory snapshots to warehouse scans, ticket sales to payment provider settlements.
In plain terms: reconciliation is how you turn “silent wrong” into actionable signal — a diff, an alert, a ticket with a dollar amount attached.
Analogy: Think of nightly bank statements: the app can feel fine all day, but the statement is where fiction dies.
The cost: batch jobs, data pipelines, and someone owning the “source of truth” spreadsheet — operational overhead that pays for itself the first time it catches a seven-figure drift.
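A minimal reconciliation sketch, diffing two sources of truth (the data shapes are illustrative):

```python
def reconcile(api_totals: dict[str, int], ledger_totals: dict[str, int]) -> dict:
    """Return every account where the two sources of truth disagree."""
    drift = {}
    for account in api_totals.keys() | ledger_totals.keys():
        reported = api_totals.get(account, 0)
        actual = ledger_totals.get(account, 0)
        if reported != actual:
            # An alertable diff with an amount attached, not a silent wrong number.
            drift[account] = {"api": reported, "ledger": actual, "delta": reported - actual}
    return drift

# Example: a stale replica reported 1999 cents; the ledger says 0.
print(reconcile({"cus_abc": 1999}, {"cus_abc": 0}))
```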
You will hear both terms in the same meeting. They are related, not interchangeable.
| | Availability | Reliability |
|---|---|---|
| Question | Is there a response? | Is the response correct and durable? |
| Bad day | Timeouts, 503s | Wrong balance, lost writes, silent corruption |
| Typical lever | Redundancy, failover, capacity | Transactions, consensus, testing, audits |
A read-heavy social feed can often favor availability — eventual consistency, stale tiles, background repair. A ledger or inventory system usually favors reliability — refuse to answer rather than guess.
When you design an endpoint, ask: if this returned a wrong value for an hour, would we fire someone? If yes, invest in correctness paths, strong read models, durable commit boundaries, reconciliation, and failure modes that fail closed. If no — maybe optimize for staying up and fix data in the background — but make that a deliberate product decision, not an accident your metrics hide.
The service never goes down — it cheerfully returns 200 with the wrong number. The outage is invisible until someone reconciles the books.