Stop calling a failing service to give it time to recover.
If you are new here: In a distributed system, your service usually calls other services: a payments API, a recommendation engine, a notification service. When one of those dependencies starts failing slowly, threads in your service pile up waiting for responses that arrive late or never. Eventually your service runs out of threads and becomes unresponsive too. The circuit breaker pattern puts a watchdog between your service and the dependency: when failure rates spike, it "opens" the circuit and stops sending requests, letting the downstream service recover without being hammered. Like a real electrical breaker, it has three states: Closed, Open, and Half-Open.
| State | What happens | Why |
|---|---|---|
| Closed | Calls flow through normally; errors are counted | Normal operation |
| Open | Calls fail immediately without trying | Dependency is sick; protect it and yourself |
| Half-Open | One test request is sent | Check if the dependency has recovered |
It's Monday afternoon. Your checkout service calls an inventory API on every purchase. The inventory service's database starts running slow — not down, just slow. Every call to the inventory API now takes 15 seconds instead of 50 milliseconds.
Your checkout service has 20 worker threads. Each incoming purchase request grabs a thread, calls inventory, and blocks for 15 seconds waiting for a response. Within two minutes, all 20 threads are stuck waiting on inventory. New checkout requests have no thread to run on. The checkout service appears completely down, even though nothing is wrong with its own code.
This is a cascading failure: a problem in service B propagates upstream to kill service A, even though A itself is perfectly healthy. The inventory service's slowness didn't just hurt inventory — it poisoned the thread pool of every service that called it.
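To make the thread math concrete, here is a minimal Python sketch of the pile-up. Everything in it is a stand-in: `slow_inventory_call` plays the degraded dependency, and the 15-second sleep mirrors the scenario above.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def slow_inventory_call() -> str:
    time.sleep(15)                      # the "slow, not down" dependency
    return "in stock"

def handle_checkout() -> str:
    return slow_inventory_call()        # blocks a worker for the full 15 s

pool = ThreadPoolExecutor(max_workers=20)   # the checkout service's 20 workers

# 100 purchases arrive. The first 20 occupy every worker for 15 seconds;
# the other 80 just sit in the queue. From the outside, checkout looks down.
futures = [pool.submit(handle_checkout) for _ in range(100)]
```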
In plain terms: if you keep knocking on a door that won't open, you exhaust yourself. Eventually you can't answer your own door. A circuit breaker is the logic that says "stop knocking after a while."
Analogy: Think of a home circuit breaker. When a washing machine starts drawing too much current, the breaker trips before the wiring overheats. The rest of the house keeps its lights on. You don't fix the washing machine by leaving the circuit live — you cut it, investigate, then restore.
In the Closed state, the circuit breaker sits transparently in the call path — requests flow through to the dependency exactly as they would without it. The breaker is keeping a running count of successes and failures over a rolling time window (usually something like "the last 60 seconds" or "the last 100 requests").
When things are healthy, the breaker does nothing visible. But it's building the data it needs to act: failure rate, latency percentiles, consecutive error counts. Libraries like Resilience4j (Java), Polly (.NET), and Hystrix (legacy Java) implement this out of the box. The Envoy proxy, and service meshes built on it like Istio, implement outlier detection, which is the same idea at the infrastructure layer.
In plain terms: everything looks normal, but the watchdog is watching.
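As a sketch of that bookkeeping, here is one way to track outcomes over a rolling time window in Python. The names (`RollingWindow`, `record`, `failure_rate`) are illustrative, not from any particular library; later snippets in this post build on this one.

```python
import time
from collections import deque

class RollingWindow:
    def __init__(self, seconds: float = 60.0):
        self.seconds = seconds
        self.outcomes = deque()              # (timestamp, succeeded) pairs

    def record(self, succeeded: bool) -> None:
        self.outcomes.append((time.monotonic(), succeeded))

    def failure_rate(self) -> float:
        cutoff = time.monotonic() - self.seconds
        while self.outcomes and self.outcomes[0][0] < cutoff:
            self.outcomes.popleft()          # drop results older than the window
        if not self.outcomes:
            return 0.0
        failures = sum(1 for _, ok in self.outcomes if not ok)
        return failures / len(self.outcomes)
```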
When enough failures accumulate, the breaker trips from Closed to Open. What "enough" means is configurable: maybe it's more than 50% error rate over the last 60 seconds, or 10 consecutive failures, or p99 latency above 5 seconds.
The threshold matters enormously. Set it too tight and you trip the breaker on normal, temporary noise, causing more errors than you prevent. Set it too loose and the breaker trips only after the cascade is already underway. The right value comes from knowing your dependency's baseline behavior: what does healthy look like, and what does degraded-but-not-failing look like?
Tiny example: If your payments API normally returns in 200ms and succeeds 99.5% of the time, a threshold of "more than 50% errors over 60 seconds" is reasonable. If the dependency is known to be noisy and sometimes spikes to 5% errors for 10 seconds before recovering on its own, use a wider window so you don't trip on that noise.
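Continuing the sketch, the trip decision is just a predicate over that window. The numbers below mirror the example above; they are starting points to calibrate, not recommendations.

```python
FAILURE_RATE_THRESHOLD = 0.5    # trip above 50% errors in the window
MIN_CALLS = 10                  # don't trip on a tiny sample

def should_trip(window: RollingWindow) -> bool:
    # Require a minimum number of recent calls so a lone failure in a
    # quiet minute doesn't open the circuit.
    rate = window.failure_rate()             # also prunes stale outcomes
    return len(window.outcomes) >= MIN_CALLS and rate > FAILURE_RATE_THRESHOLD
```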
In the Open state, the circuit breaker immediately rejects calls without touching the dependency at all. Your service gets back an error in microseconds rather than waiting 15 seconds for a timeout. Threads are freed instantly. Your service stays responsive to users.
The downstream dependency gets breathing room: no new requests are hammering it while it's already struggling. A database recovery that might take 2 minutes once traffic eases can stretch to 30 minutes if every service in the cluster keeps piling on requests.
This is the core value of the pattern: it protects both sides. The upstream service (yours) stays alive. The downstream service gets relief. A timeout alone doesn't achieve this — you still occupy a thread for the full timeout duration.
In plain terms: if the inventory service is on fire, there's no point sending more requests. Fail fast, keep your own threads free, and let inventory recover in peace.
The circuit doesn't stay open forever. After a configurable timeout — say 30 seconds — it transitions to Half-Open. In this state, it allows exactly one request through to the dependency. If that request succeeds, the breaker closes again and traffic resumes normally. If it fails, the breaker flips back to Open for another timeout period.
Half-Open is important because it makes recovery automatic. Without it, you'd need a human to manually re-enable traffic after every incident. With it, the system heals itself as soon as the dependency recovers. The one test request is a probe: cheap to send, and it gives you real signal about whether the downstream is truly healthy again.
In plain terms: after waiting a while, you knock once to see if anyone's home. If they answer, you go in. If they don't, you wait a while longer and try again.
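Putting the three states together, here is a minimal single-threaded sketch of the whole state machine, reusing `RollingWindow` and `should_trip` from above. A production breaker would also need thread safety and a guarantee of exactly one in-flight half-open probe; this sketch gets that for free only because it is single-threaded.

```python
import time

class CircuitOpenError(Exception):
    """Raised without calling the dependency while the circuit is open."""

class CircuitBreaker:
    def __init__(self, open_seconds: float = 30.0):
        self.state = "closed"                # closed -> open -> half_open -> ...
        self.window = RollingWindow(seconds=60.0)
        self.open_seconds = open_seconds
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.open_seconds:
                self.state = "half_open"     # time to let one probe through
            else:
                raise CircuitOpenError("failing fast: circuit is open")
        try:
            result = fn()
        except Exception:
            self.window.record(succeeded=False)
            if self.state == "half_open" or should_trip(self.window):
                self.state = "open"          # probe failed, or threshold crossed
                self.opened_at = time.monotonic()
            raise
        self.window.record(succeeded=True)
        if self.state == "half_open":
            self.state = "closed"            # probe succeeded: resume traffic
        return result
```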
The breaker being Open doesn't mean you have to return an error to users. The best implementations pair a circuit breaker with a fallback strategy: a cached previous result, a default value, a simplified version of the feature, or a queue for later processing.
For a recommendation engine that's down, serve the ten most popular items instead. For a notification service that's slow, queue the notification and send it in the next job run. For a pricing service that's flapping, serve the last known price and flag it as potentially stale. Users get a degraded but functional experience rather than an error page.
In plain terms: the backup plan matters as much as the breaker itself. "Fail fast" is not the same as "show an error." Often it means "serve something sensible from what we already know."
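As one possible shape for the stale-price fallback, here is a sketch that wraps the breaker from above. `fetch_live_price` and the cache are hypothetical stand-ins for a real pricing client and a real cache.

```python
def fetch_live_price(sku: str) -> float:
    """Hypothetical remote call; stands in for a real pricing API client."""
    raise TimeoutError("pricing service is flapping")

last_known_prices = {"sku-123": 9.99}        # pre-seeded cache of last-seen prices

def get_price(breaker: CircuitBreaker, sku: str) -> tuple[float, bool]:
    """Return (price, is_stale)."""
    try:
        price = breaker.call(lambda: fetch_live_price(sku))
        last_known_prices[sku] = price       # refresh the cache on success
        return price, False
    except (CircuitOpenError, TimeoutError):
        # Degraded but functional: last known price, flagged as stale.
        return last_known_prices[sku], True
```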
Circuit breakers add complexity: they are state machines with several knobs to configure (thresholds, timeout windows, half-open probe counts); they require monitoring (alert when a breaker opens, because an open breaker means a dependency is struggling); and fallbacks require product decisions about what degraded operation looks like.
The alternative — no circuit breaker — means cascade failures are your failure mode under load. In a microservices architecture with multiple dependencies, that's not an acceptable bet.
| | Aggressive trip threshold | Lenient trip threshold |
|---|---|---|
| Benefit | Protects upstream services quickly | Fewer unnecessary errors on transient noise |
| Cost | May cause errors when dependency is only briefly struggling | Cascade risk if you wait too long |
The right threshold is not "default settings from the library." It's calibrated from knowing your dependency's failure modes. Instrument and observe first; configure second.
Add circuit breakers to every synchronous call in a user-facing request path — especially calls to third-party APIs and shared internal services. Configure thresholds based on observed baselines, not guesses. Wire up alerts for when a breaker opens. And before you ship, decide what the fallback looks like: empty list, cached result, or graceful feature disablement.
Next: Bulkhead Pattern — isolate thread pools so one dependency can't starve out all your other calls.
In one line: A calls a sick B; threads block waiting; A exhausts its thread pool; the failure spreads upstream.