002 · UPTIME · SLA · 99.9%

Availability

Keep systems operational even when components fail.

If you are new here, availability answers one question: “Can users still use the product when something breaks?” It is not about perfect code; it is about redundant paths, failover, and honest degraded behavior when a dependency is down.

Term | Plain meaning
Uptime / availability | Share of time the system successfully serves users (often as “nines”)
Redundancy | More than one way to complete the same job (extra servers, AZs, regions)
Failover | Traffic moves to healthy capacity when a piece fails
Degraded mode | Core flows still work; some features are off or slower

The Problem

Black Friday traffic is holding steady. Your API and database live in one AZ on a single t3.large, CPU looks fine, and nobody is talking about redundancy because the whole stack has not failed yet.

Then the hypervisor glitches, the instance disappears, and your load test from last week does not matter anymore. Every user, every mobile client, and every partner API that depended on that one host now gets connection errors at the same time — while RDS sits there with nothing left to talk to.

That moment is what Availability is about: when something breaks, does your system still answer requests? Not "was the code perfect" — can users still complete a checkout while you fix the broken piece?

When the Only Box Dies

PagerDuty is already open, the status page is red, and every request still targets the same dead host — so for your users, availability just hit zero, even if the rest of the region is fine.

Availability means successful responses from the system as a whole, within the window you promised — not immortality for every machine.
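
To make that definition concrete, here is a minimal sketch in Python; the request log and the success rule (non-5xx counts as success) are illustrative assumptions, not a real metrics pipeline:

    from datetime import datetime, timedelta

    # Hypothetical request log: (timestamp, HTTP status) pairs. In practice
    # this comes from load balancer logs or a metrics store.
    request_log = [
        (datetime(2024, 11, 29, 10, 0), 200),
        (datetime(2024, 11, 29, 10, 1), 200),
        (datetime(2024, 11, 29, 10, 2), 503),  # the single box is down
        (datetime(2024, 11, 29, 10, 3), 503),
        (datetime(2024, 11, 29, 10, 4), 200),
    ]

    def availability(log, start, end):
        """Share of requests in [start, end] that succeeded (non-5xx)."""
        in_window = [s for ts, s in log if start <= ts <= end]
        if not in_window:
            return None  # no traffic: undefined, not automatically 100%
        return sum(1 for s in in_window if s < 500) / len(in_window)

    start = datetime(2024, 11, 29, 10, 0)
    print(availability(request_log, start, start + timedelta(hours=1)))  # 0.6

The window parameter is the point: the same log can show 60% availability over an hour and 99.9% over a month, which is why the promise always names a window.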

Analogy: Think of a single-lane bridge. Traffic flows until a truck clips a support. The bridge does not degrade gracefully into "half a bridge." It is closed, and every driver is stuck. A single app server is the same story: healthy feels great until it is not, and then availability is zero for everyone routed there.

Under the hood, a lone instance has no backup path. DNS still points at it, the connection pool still targets it, and retries just hammer a dead endpoint faster. Until you add another healthy place for traffic to land, "down" is the whole story.
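
A sketch of that dead-endpoint loop, assuming a hypothetical single-host client (the address is a reserved TEST-NET IP, so connecting will always fail):

    import socket

    DEAD_HOST = ("203.0.113.10", 443)  # hypothetical lone app server

    def request_with_retries(addr, attempts=3, timeout=0.5):
        """Naive retries: with one host there is no alternate path, so
        every attempt hammers the same dead endpoint."""
        for attempt in range(1, attempts + 1):
            try:
                with socket.create_connection(addr, timeout=timeout):
                    return "ok"
            except OSError as exc:
                print(f"attempt {attempt} failed: {exc}")
        return "unavailable"  # what every user sees, simultaneously

    print(request_with_retries(DEAD_HOST))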

The cost: you learn the hard way that mean time between failures does not help users during mean time to repair.

Redundancy: A Second Path

The load balancer marks the target in us-east-1a unhealthy, sends every new connection to the peer in us-east-1b, and Aurora keeps accepting queries from the survivor — often before you have opened an SSH session.

In plain terms: availability comes from having more than one way to succeed. If one path is on fire, another path picks up the work.

Analogy: Think of airport security: when one lane closes, the line does not end — people shift to the lanes that are still open. The total throughput changes a little, but travelers still get through. Redundancy is what makes that rerouting possible.

Under the hood, you spread instances across AZs, wire health checks so bad targets drain, and keep app servers stateless so any healthy peer can serve the request. The load balancer only forwards to targets that pass those checks, so a crashed VM stops receiving traffic while the survivors keep answering. You still have to think about sessions, caches, and deploys — but users stop seeing total darkness when one box fails.
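
A toy model of that routing behavior, not any real ALB API; the instance IDs and health map are made up:

    import itertools

    # Two stateless, interchangeable app servers in different AZs.
    targets = [
        {"id": "i-0a", "az": "us-east-1a", "healthy": True},
        {"id": "i-0b", "az": "us-east-1b", "healthy": True},
    ]
    rotation = itertools.cycle(range(len(targets)))

    def health_check(target):
        """Stand-in for an HTTP probe against /healthz."""
        return target["healthy"]

    def pick_target():
        """Round-robin over targets passing the health check; a crashed
        VM simply stops receiving new connections."""
        for _ in range(len(targets)):
            t = targets[next(rotation)]
            if health_check(t):
                return t
        raise RuntimeError("no healthy targets left: the real outage")

    targets[0]["healthy"] = False   # the us-east-1a box dies
    print(pick_target()["id"])      # every new request lands on i-0b

Statelessness is what makes the last line safe: if sessions lived on i-0a, the failover would "work" and still log users out.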

The cost: you pay for extra capacity you hope sits idle, and you own more moving parts — ALB rules, target groups, deploy coordination, and the discipline to keep instances truly interchangeable.

Degraded Mode — Not Every Outage Is Total

Sometimes the edge and the product catalog are fine while one dependency — payments, fraud scoring, identity — is red. Users browse, add to cart, and only discover the pain at checkout.

In plain terms: availability is not always binary at the whole-company level. You can be “up” for read-heavy paths and “down” for money-moving paths. Status pages and dashboards need to reflect that partial story, or you will gaslight your own team.

Analogy: Think of an airport with one runway closed: planes still land elsewhere, but international connections might be cancelled. “The airport is open” is technically true; your trip can still be ruined.

The cost: degraded mode requires feature flags, graceful error copy, queueing retries, and honest health APIs — more product and backend work than a single global “up/down” boolean.
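
As a sketch of one of those flags, with hypothetical names throughout: the read path stays up while the money-moving path returns an honest degraded response instead of a stack trace.

    # Hypothetical dependency health map; in practice this is fed by
    # probes or a circuit breaker, not hardcoded.
    dependency_up = {"catalog": True, "payments": False}

    def browse():
        # Read-heavy path: unaffected by the payments outage.
        return {"status": 200, "body": "product list"}

    def checkout(cart):
        if not dependency_up["payments"]:
            # Honest degraded behavior: clear copy, saved state,
            # and a status page telling the same partial story.
            return {"status": 503,
                    "body": "Payments are temporarily down. Your cart is saved."}
        return {"status": 200, "body": "order placed"}

    print(browse())             # the site is "up"
    print(checkout(["sku-1"]))  # checkout is honestly "down"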

The Language of Nines

Sales decks love phrases like "four nines," but the number only matters once you translate it into a staircase of allowed outage per year — time people can actually feel.

99% uptime sounds respectable until you do the math: that is more than three and a half days of allowed downtime per year. 99.9% shrinks that to about nine hours. 99.99% is roughly 52 minutes, and 99.999% is about five minutes for the entire year.

In plain terms: each nine you add is an order of magnitude less failure budget. It is the difference between "we can reboot during lunch" and "we can barely sneeze."

Analogy: Think of it like compound interest in reverse: small percentage points erase huge chunks of calendar time. Teams use these targets to write SLAs, size on-call rotations, and decide whether that extra replica is insurance or vanity.

The cost: higher nines require tighter automation, better testing, faster failover, and often more geographic redundancy. You are not buying a label — you are buying the engineering work to stay inside a smaller box.

SLA vs SLO (briefly): vendors publish SLAs (contractual, credit-bearing). Your team sets SLOs (internal targets, usually stricter than the SLA so you have buffer). The “nines” conversation should name which one you are talking about.

Worked example (orders of magnitude): For a 365-day year, 99.9% allows roughly 8h 46m total outage budget; 99.99% allows about 52m; 99.999% about 5m. Product and finance care about minutes per year, not the percentage in isolation.
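
The arithmetic is simple enough to sketch, and the same function works whether the target is a contractual SLA or a stricter internal SLO:

    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 for a 365-day year

    def downtime_budget_minutes(availability_pct):
        """Allowed outage minutes per year for a given availability target."""
        return MINUTES_PER_YEAR * (1 - availability_pct / 100)

    for pct in (99.0, 99.9, 99.99, 99.999):
        m = downtime_budget_minutes(pct)
        print(f"{pct:7}% -> {m / 60:6.1f} h ({m:7.1f} min) per year")

    # 99.0%   ->  87.6 h (5256.0 min) per year
    # 99.9%   ->   8.8 h ( 525.6 min) per year  ~ 8h 46m
    # 99.99%  ->   0.9 h (  52.6 min) per year
    # 99.999% ->   0.1 h (   5.3 min) per year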

The Trade-offs

There is no free lunch. Every hour of redundancy is paid for in instances, load balancers, and engineering time. Every hour of outage is paid for in revenue, refunds, and reputation.

If you… | You gain… | You pay…
Run N+1 app servers | Room for one failure without user-visible downtime | Monthly compute + more deploy complexity
Skip redundancy "for now" | Lower bill today | A single incident can dominate the savings
Contract a strict SLA | Clear expectations with customers | Credits and legal exposure when you miss it

The real question is not "should we be highly available?" It is which failures are acceptable for this product, and what are we willing to spend to survive the ones that are not?

Why this matters for you

When you sketch your next service, decide early: is an hour of downtime a bad quarter, or a firing offense? If it is the latter, design for redundant paths, health-checked traffic, degraded UX when a dependency fails, and runbooks before you need them. If you are prototyping, maybe a single box is fine — but call that a conscious trade, not an accident you discover during someone else's launch day.

Diagram (frame 1 of 6): Traffic rides one public path: users hit a single app in us-east-1a, then one RDS — the whole customer experience shares one failure domain.