060 · SPOF · REDUNDANCY · HA

Single Point of Failure

Identify and eliminate components whose failure takes down the entire system.

If you are new here: A single point of failure (SPOF) is any one component in your system where, if it breaks, the whole service goes down — not just a feature, but everything. One server, one load balancer, one database, one network cable, one person who holds all the passwords. High availability starts by naming these points and either removing them or adding a backup that can take over automatically.

Term           Plain meaning
SPOF           One component whose failure kills the product entirely
Redundancy     A standby copy or path that kicks in when the primary fails
Failover       The automatic (or manual) switch to the healthy backup
Blast radius   How much of the system breaks when one piece goes down
RTO            Recovery time objective — how many minutes until service is restored

The Problem

It's 2pm on a Tuesday, and your deployment is healthy: one load balancer, one application server, one database — all green in the monitoring dashboard. Then the database host's SSD fails. Every API call returns 500. Users can't log in. Orders can't be placed. The root cause isn't complicated — there simply is no second database to promote.

That is what a SPOF looks like in practice. Your architecture diagram was a straight line from users to data, and someone pulled out one link. There was nothing to fall back to.

The pattern is the same across every layer. One load balancer VM stops responding to health checks at 3am — the service is unreachable for every user until an engineer wakes up. One NAT gateway for all outbound traffic hits its connection limit — every service calling external APIs quietly starts timing out. One cloud region goes down — the whole product goes dark if you never planned for a second region.

In plain terms: a SPOF is any box in your diagram where you can write "if this dies, we're down" — and you don't have a second box waiting to take over.

Analogy: Think of a single-lane bridge into a city. It doesn't matter how many taxis and buses are ready on both sides. If the bridge is out, nobody crosses. The redundant capacity on the shore means nothing without a redundant crossing.

Finding the SPOFs

The way to find SPOFs is to walk your architecture diagram — request path first, then data path — and ask "what happens if this exact box disappears right now?" for every component you draw.
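
A minimal sketch of that walk in Python, assuming a made-up inventory where each component on the request path lists the standbys that could take over for it — the component names and topology are illustrative, not a real system:

# Hypothetical inventory of one request path: each component lists the standbys
# that could take over if it failed. Names and topology are illustrative only.
request_path = {
    "load-balancer": [],               # single HAProxy VM, no standby
    "app-server-1":  ["app-server-2"],
    "app-server-2":  ["app-server-1"],
    "db-primary":    [],               # no replica to promote
}

def find_spofs(path):
    """A SPOF is any component on the path with nothing to fail over to."""
    return [name for name, standbys in path.items() if not standbys]

print(find_spofs(request_path))        # ['load-balancer', 'db-primary']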

Common SPOFs engineers miss: A pair of app servers behind a single load balancer VM is not redundant — the LB is the SPOF. A database with a manual backup that takes three hours to restore is not the same as a standby replica that promotes in 30 seconds. A DNS record pointing at one IP, a TLS certificate stored only on one server, a single on-call engineer who is the only person with production credentials — all of these are SPOFs.

Tiny example: You run two EC2 instances for your API. They share one RDS primary with no read replicas and no Multi-AZ setup. If the primary fails, both instances are instantly broken. The EC2 redundancy is real; the database redundancy is missing. Your blast radius for a DB failure is 100%.
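
If your stack runs on AWS, one quick way to catch that particular gap is to list your RDS instances and flag any with neither Multi-AZ nor a read replica. A sketch using boto3, assuming credentials and a default region are already configured:

import boto3

# Assumes AWS credentials and a default region are already configured.
rds = boto3.client("rds")

for db in rds.describe_db_instances()["DBInstances"]:
    multi_az = db["MultiAZ"]
    replicas = db.get("ReadReplicaDBInstanceIdentifiers", [])
    if not multi_az and not replicas:
        # No standby to promote and no replica to fall back on: the data tier is a SPOF.
        print(f"SPOF: {db['DBInstanceIdentifier']} has no Multi-AZ standby and no read replicas")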

Redundant Load Balancing

Cloud-managed load balancers (AWS ALB, GCP GLB) are already distributed across multiple zones by design — you get LB redundancy for free by using the managed service. The ones to watch out for are self-hosted software load balancers: HAProxy or Nginx running on a single VM is itself a SPOF.

The solution is active/passive pairs sharing a virtual IP (VIP). When the active LB stops responding, a keepalived or similar process reassigns the VIP to the passive standby. Users' DNS resolves to the same address; traffic resumes without a DNS change, usually within a few seconds.
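
For illustration, here is roughly the decision keepalived makes on your behalf, written as a Python sketch — the health-check URL, the VIP, and assign_vip() are placeholders, and in a real deployment VRRP handles this, not a script you run yourself:

import time
import urllib.request

# Probe the active LB; after a few consecutive failures, move the VIP to the standby.
# The URL, VIP, and node names below are illustrative placeholders.
ACTIVE_HEALTH_URL = "http://10.0.0.11:8080/healthz"
FAILURE_THRESHOLD = 3

def healthy(url, timeout=2):
    try:
        return urllib.request.urlopen(url, timeout=timeout).status == 200
    except OSError:
        return False

def assign_vip(node):
    # In reality keepalived reconfigures the interface and sends gratuitous ARP.
    print(f"reassigning VIP 10.0.0.10 to {node}")

failures = 0
while True:
    failures = 0 if healthy(ACTIVE_HEALTH_URL) else failures + 1
    if failures >= FAILURE_THRESHOLD:
        assign_vip("lb-standby")
        break
    time.sleep(1)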

In plain terms: the load balancer is the first thing traffic hits. If it has no standby, all your redundant app servers are unreachable the moment it dies.

Redundant Data Tier

The database is usually the hardest SPOF to eliminate because state lives there. The minimum viable approach is synchronous (or near-synchronous) replication with automatic failover: a primary accepts writes and streams every transaction to a standby replica. If the primary dies, the replica promotes itself and the application reconnects.

AWS RDS Multi-AZ does this for you: primary and standby in separate availability zones, synchronous replication, and automatic failover that typically completes in 60–120 seconds. You still see errors during that window, which is why RTO matters. But you are no longer down indefinitely waiting for someone to manually restore a backup.
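
On the application side, the main job during that window is to reconnect and retry instead of surfacing errors. A sketch with psycopg2, assuming a PostgreSQL RDS endpoint that gets repointed to the promoted standby — the DSN below is a placeholder:

import time
import psycopg2

# Placeholder DSN: the RDS endpoint is repointed to the promoted standby during failover.
DSN = "host=mydb.example.rds.amazonaws.com dbname=app user=app password=..."

def run_with_failover_retry(sql, params=(), attempts=30, delay=5):
    """Retry a read query while the standby promotes; reconnect on each attempt."""
    for _ in range(attempts):
        try:
            conn = psycopg2.connect(DSN, connect_timeout=3)
            try:
                with conn.cursor() as cur:
                    cur.execute(sql, params)
                    return cur.fetchall()
            finally:
                conn.close()
        except psycopg2.OperationalError:
            # Primary is gone or the endpoint hasn't flipped yet; wait and try again.
            time.sleep(delay)
    raise RuntimeError("database still unavailable after the failover window")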

In plain terms: a backup you restore from is not the same as a replica that promotes. One is an hour of downtime; the other is a minute.

Spreading Across Availability Zones

An availability zone (AZ) is a physically separate data center within a region — separate power, separate cooling, separate network paths. Spreading your instances, databases, and caches across at least two (ideally three) AZs means a power outage or networking failure in one AZ does not take you down entirely. You lose some capacity, but the service survives.

Three AZs with roughly equal capacity is the right shape: lose one AZ and you still run at around two-thirds capacity. That's usually enough to survive a real incident while you investigate and recover.
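
A back-of-the-envelope check of that claim: spread instances round-robin across three zones and see what fraction of capacity survives losing any one of them. Zone names and counts below are illustrative:

# Spread instances evenly across zones, then compute surviving capacity per zone loss.
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]   # illustrative zone names
INSTANCES = 9

placement = {z: 0 for z in ZONES}
for i in range(INSTANCES):
    placement[ZONES[i % len(ZONES)]] += 1

for lost_zone in ZONES:
    surviving = INSTANCES - placement[lost_zone]
    print(f"lose {lost_zone}: {surviving}/{INSTANCES} instances left "
          f"({surviving / INSTANCES:.0%} capacity)")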

Analogy: Three offices in three buildings across a city. If one building loses power, the other two keep running. If all three were in the same building, one power cut shuts down the company.

In plain terms: AZ redundancy isn't extra; it's the baseline for anything with an SLA.

The Trade-offs

Redundancy is not free. Every standby replica, every second AZ deployment, every active/passive LB pair adds to your monthly infrastructure bill and your operational complexity. Running two databases instead of one means two sets of backups, two upgrade schedules, twice the monitoring surface.

The calculation is: what does an hour of downtime cost, in lost revenue, customer trust, and engineer time? Compare that to the monthly cost of the redundancy. For most products, the math favors redundancy strongly. For some internal tools, accepted downtime during off-hours is genuinely fine.
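
A worked version of that comparison, with made-up numbers — the point is the shape of the math, not the figures:

# Illustrative numbers only; plug in your own revenue and incident history.
revenue_lost_per_hour = 20_000        # dollars of lost orders during a full outage
expected_outage_hours_per_year = 6    # without redundancy, e.g. two 3-hour incidents
redundancy_cost_per_month = 1_500     # standby replica + second-AZ capacity

annual_downtime_cost = revenue_lost_per_hour * expected_outage_hours_per_year
annual_redundancy_cost = redundancy_cost_per_month * 12

print(f"expected downtime cost: ${annual_downtime_cost:,}/yr")
print(f"redundancy cost:        ${annual_redundancy_cost:,}/yr")
print("redundancy pays for itself" if annual_downtime_cost > annual_redundancy_cost
      else "accepted downtime may be cheaper")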

More redundancy
Benefit: survives component failures with minimal user impact
Cost: higher monthly bill, more moving parts to operate and test

Less redundancy
Benefit: simpler, cheaper to run
Cost: larger blast radius; outages last longer when they happen

Document the SPOFs you have chosen to accept as explicit decisions with owners and remediation plans — not as gaps you forgot to think about.
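
One lightweight way to do that is a SPOF register kept alongside your runbooks. A sketch with illustrative field names and an example entry:

from dataclasses import dataclass

# One way to make an accepted SPOF an explicit, owned decision rather than a gap.
# Field names and the example entry are illustrative.
@dataclass
class AcceptedSpof:
    component: str
    blast_radius: str
    rto_minutes: int
    owner: str
    remediation_plan: str

register = [
    AcceptedSpof(
        component="internal-reporting-db",
        blast_radius="internal dashboards only",
        rto_minutes=240,
        owner="data-platform team",
        remediation_plan="move to Multi-AZ next quarter if usage grows",
    ),
]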

Why this matters for you

Before your next incident review, walk the critical user journey — login, checkout, or whatever your product's money flow is — and write down every SPOF in that path. Fix the highest-risk ones first. For the ones you accept, write down what your RTO is and who is responsible for recovery. Knowing your SPOFs in advance is not pessimism; it's the difference between a 10-minute recovery and a 4-hour war room.

Next: High Availability vs Fault Tolerance — what the difference between "recovers quickly" and "never goes down" actually means in practice.

DIAGRAM: One load balancer, one app, one database in a single AZ — any single piece can take down every user path.