085 · SAGA · CHOREOGRAPHY · ORCHESTRATION

SAGA Pattern

Long-running distributed transactions using compensating actions instead of locks.

If you are new here: The SAGA pattern manages a distributed transaction without holding locks across services. Instead of locking all resources upfront and then committing, a SAGA executes a sequence of smaller local transactions. Each step commits locally as it succeeds. If a later step fails, the SAGA runs compensating transactions (undo operations for each prior step) in reverse order. The result: no blocking cross-service locks, no commit protocol that stalls every participant when its coordinator fails, and transactions that can span minutes or hours rather than milliseconds. The trade-off: intermediate states are visible, and you must implement every compensating transaction yourself.

Term | Plain meaning
SAGA | A sequence of local transactions, each with a compensating transaction to undo it
Compensating transaction | An application-level "undo" for a step that already committed
Choreography | A SAGA style where services react to events independently, with no central coordinator
Orchestration | A SAGA style where a central orchestrator calls each step and handles failures
Pivotal transaction | The step after which compensation becomes impossible or impractical
Intermediate state | The partially-complete state visible between SAGA steps; not an error, but it requires UX handling
Idempotency | The property that running a step or compensation multiple times gives the same result

The Problem

Order fulfillment takes 4 steps: create order, reserve inventory, charge payment, schedule shipping. Each step touches a different service. The whole thing might take 2–10 seconds from start to finish.

Solving this with 2PC means holding locks on order records, inventory rows, and payment tables for the full 2–10 seconds. Under high traffic, thousands of concurrent orders would create a massive lock contention problem — transactions queuing up behind each other, timeouts cascading.
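A rough back-of-envelope calculation makes the contention concrete. The arrival rate of 500 orders per second below is an assumed figure for illustration only; the 2–10 second hold times come from the scenario above. The estimate uses Little's law (average in-flight work = arrival rate × time in system).

```python
def concurrent_lock_holders(orders_per_second: float, hold_seconds: float) -> float:
    """Little's law: average in-flight transactions = arrival rate * time in system."""
    return orders_per_second * hold_seconds


ASSUMED_RATE = 500  # orders per second; a hypothetical traffic level

for hold_seconds in (2, 10):
    holders = concurrent_lock_holders(ASSUMED_RATE, hold_seconds)
    print(f"{hold_seconds}s lock hold -> about {holders:.0f} orders holding locks at once")
```

Under that assumed rate, between one and five thousand orders would hold locks at any moment, which is the queueing pressure the paragraph above describes.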

Even setting aside performance: some steps involve external APIs. You can't hold a database lock while waiting for a payment provider to respond. The payment provider doesn't participate in your database's lock protocol. 2PC simply doesn't work across external service boundaries.

SAGA solves this by removing cross-service locks entirely. Each step commits locally and immediately when it succeeds, so any lock it takes lives only for that one local transaction. If a later step fails, you don't "roll back"; you run a separate compensating transaction that semantically undoes the prior committed step.

In plain terms: instead of locking everything first and asking permission, SAGA does each step as a committed action, and has a specific "undo" procedure ready for each one.

Analogy: A multi-stop international flight booking. You separately book the outbound flight (step 1: committed, seat assigned), book a connecting flight (step 2: committed), then try to book the return leg, and it's sold out (step 3: fails). You don't "lock" both flights during the process. Instead, you cancel the bookings that already succeeded, most recent first: the connecting flight (compensation for step 2), then the outbound flight (compensation for step 1). Each cancellation is a real action with a real fee or process, not a magical database rollback.

Happy Path: Sequential Steps

In the successful case, a SAGA executes steps one at a time. Each step:

  1. Calls a service (or performs a local DB operation)
  2. Waits for confirmation
  3. Records that the step succeeded (in the orchestrator's state or in an event log)
  4. Proceeds to the next step

When the final step succeeds, the SAGA is complete. The "transaction" is done — not via a 2PC commit, but simply because all steps committed locally and we know they all succeeded.

Tiny example: E-commerce checkout SAGA:

  • Step 1: POST /orders → Order Service creates order #8812, status "pending". Returns 200.
  • Step 2: POST /inventory/reserve → Inventory Service reserves 1× item #5512 for order #8812. Returns 200.
  • Step 3: POST /payments/charge → Payments Service charges $129 to card ending 4242. Returns 200.
  • Step 4: POST /fulfillment/ship → Shipping Service schedules pickup for order #8812. Returns 200.

SAGA completes. Order #8812 status changes to "confirmed."

No cross-service lock was held at any point. Each service processed its step independently.
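The happy path above can be sketched as a small loop. This is a minimal illustration, not a production runner: the `Step` type, `run_happy_path`, and the four in-process functions are hypothetical stand-ins for the HTTP calls in the example, and each function is assumed to commit its own local change before returning.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]  # performs the step and commits locally


def run_happy_path(steps: list[Step], ctx: dict) -> list[str]:
    """Run steps in order and return the names of the steps that succeeded."""
    completed: list[str] = []
    for step in steps:
        step.action(ctx)              # call the service and wait for confirmation
        completed.append(step.name)   # record that the step succeeded
    return completed


# Hypothetical in-process stand-ins for the four service calls.
def create_order(ctx: dict) -> None:
    ctx["order_status"] = "pending"

def reserve_inventory(ctx: dict) -> None:
    ctx["reserved_item"] = 5512

def charge_payment(ctx: dict) -> None:
    ctx["charged_usd"] = 129

def schedule_shipping(ctx: dict) -> None:
    ctx["pickup_scheduled"] = True


CHECKOUT_STEPS = [
    Step("create_order", create_order),
    Step("reserve_inventory", reserve_inventory),
    Step("charge_payment", charge_payment),
    Step("schedule_shipping", schedule_shipping),
]

ctx = {"order_id": 8812}
completed = run_happy_path(CHECKOUT_STEPS, ctx)
if completed == [step.name for step in CHECKOUT_STEPS]:
    ctx["order_status"] = "confirmed"  # all steps committed, so the SAGA is done
print(completed, ctx["order_status"])
```

Nothing here spans two services: each function touches only its own state, and the list of completed step names is the only shared record.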

When a Step Fails: Compensating Transactions

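If Step 3 (payment) fails (card declined, payment service timeout, fraud check), the SAGA must undo what has already succeeded. A timeout needs extra care: the charge may have gone through even though no response arrived, so confirm the outcome (for example, by retrying with an idempotency key) before treating the step as failed.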

The SAGA runs compensating transactions in reverse:

  • Compensation for Step 2: DELETE /inventory/reserve/{order_id} → releases the reserved inventory
  • Compensation for Step 1: PATCH /orders/{id} with status "cancelled" → marks the order cancelled

These are real API calls, real database operations. They're not database rollbacks — those steps already committed. Compensation is a new forward-moving action that semantically reverses the effect.

In plain terms: compensation is a "please undo that" message, not a database undo. Each step must have a corresponding "please undo that" operation defined up front.
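A minimal sketch of this failure path, assuming each step is paired with its compensation up front. `SagaStep`, `StepFailed`, and `run_saga` are hypothetical names, the service functions are in-process stand-ins, and a failed step is assumed to have committed nothing, so only earlier steps are compensated.

```python
from typing import Callable, NamedTuple


class StepFailed(Exception):
    """Raised by a step that fails without committing anything."""


class SagaStep(NamedTuple):
    name: str
    action: Callable[[dict], None]
    compensate: Callable[[dict], None]


def run_saga(steps: list[SagaStep], ctx: dict) -> bool:
    """Run steps in order; on failure, compensate committed steps in reverse."""
    committed: list[SagaStep] = []
    for step in steps:
        try:
            step.action(ctx)
        except StepFailed:
            for prior in reversed(committed):
                prior.compensate(ctx)  # a new forward action, not a rollback
            return False
        committed.append(step)
    return True


# Hypothetical stand-ins; "undone" records the order compensations ran in.
def create_order(ctx: dict) -> None:
    ctx["order_status"] = "pending"

def cancel_order(ctx: dict) -> None:
    ctx["order_status"] = "cancelled"
    ctx["undone"].append("order")

def reserve_inventory(ctx: dict) -> None:
    ctx["reserved"] = True

def release_inventory(ctx: dict) -> None:
    ctx["reserved"] = False
    ctx["undone"].append("inventory")

def charge_payment(ctx: dict) -> None:
    raise StepFailed("card declined")

def refund_payment(ctx: dict) -> None:
    ctx["undone"].append("payment")


ctx = {"order_id": 8812, "undone": []}
ok = run_saga(
    [
        SagaStep("create_order", create_order, cancel_order),
        SagaStep("reserve_inventory", reserve_inventory, release_inventory),
        SagaStep("charge_payment", charge_payment, refund_payment),
    ],
    ctx,
)
print(ok, ctx["order_status"], ctx["undone"])
```

The refund never runs because the charge never committed; inventory is released before the order is cancelled, mirroring the reverse order described above.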

What makes a good compensating transaction?

  • Idempotent: running it twice produces the same result (critical, because you might retry on transient failure)
  • Business-meaningful: it actually reverses the semantic effect, not just a database delete
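  • Eventually succeeds: it may fail transiently and be retried, but it must never be rejected for a business reason. If compensation can fail permanently, you have a much harder problem (stuck order, reserved inventory nobody can claim)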
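To illustrate the idempotency property, here is a small sketch of an inventory release that is safe to call more than once. `InventoryService` is a hypothetical in-memory stand-in, not a real service API; the idea is that the reservation record itself tells the compensation whether there is anything left to undo.

```python
class InventoryService:
    """Hypothetical in-memory stand-in for the Inventory Service."""

    def __init__(self, stock: int) -> None:
        self.stock = stock
        self.reservations: dict[int, int] = {}  # order_id -> quantity held

    def reserve(self, order_id: int, qty: int) -> None:
        if order_id in self.reservations:
            return  # retry of the same reservation: nothing more to do
        if qty > self.stock:
            raise ValueError("insufficient stock")
        self.stock -= qty
        self.reservations[order_id] = qty

    def release(self, order_id: int) -> None:
        # pop() with a default makes a repeated release a no-op:
        # stock is returned exactly once, however often this is retried.
        qty = self.reservations.pop(order_id, 0)
        self.stock += qty


inventory = InventoryService(stock=5)
inventory.reserve(order_id=8812, qty=1)
inventory.release(order_id=8812)
inventory.release(order_id=8812)  # retried compensation, same end state
print(inventory.stock)
```

A naive release that always added the quantity back would inflate stock on every retry, which is why the check against existing state matters.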

Concrete sketch: Stripe's refund API is a compensating transaction for a charge. stripe.refunds.create({charge: 'ch_xxx'}) semantically reverses the charge. It's a new operation, not a rollback — the charge history shows both the charge and the refund. This is correct SAGA compensation behavior.

Choreography: Event-Driven SAGAs

In a choreography-based SAGA, there is no central coordinator. Each service responds to events from an event bus. Service A publishes "Order Created." Service B listens, does its work, publishes "Inventory Reserved." Service C listens to that, does payment, publishes "Payment Charged." And so on.
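The event chain can be sketched with an in-process bus. This is only an illustration under stated assumptions: `EventBus` is a toy stand-in for a real message broker, and the `PaymentFailed`, `InventoryReleased`, and `ShippingScheduled` events are hypothetical names added to show how a failure travels back through the same mechanism.

```python
from collections import defaultdict, deque
from typing import Callable


class EventBus:
    """Toy in-process stand-in for a message broker."""

    def __init__(self) -> None:
        self.handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self.queue: deque[tuple[str, dict]] = deque()
        self.history: list[str] = []

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self.queue.append((event_type, payload))

    def drain(self) -> None:
        """Deliver queued events until no handler publishes anything new."""
        while self.queue:
            event_type, payload = self.queue.popleft()
            self.history.append(event_type)
            for handler in self.handlers[event_type]:
                handler(payload)


def wire_services(bus: EventBus, card_ok: bool) -> None:
    # Each lambda stands for one service; none of them calls another directly.
    bus.subscribe("OrderCreated", lambda o: bus.publish("InventoryReserved", o))
    bus.subscribe(
        "InventoryReserved",
        lambda o: bus.publish("PaymentCharged" if card_ok else "PaymentFailed", o),
    )
    bus.subscribe("PaymentCharged", lambda o: bus.publish("ShippingScheduled", o))
    bus.subscribe("ShippingScheduled", lambda o: o.update(status="confirmed"))
    # Compensation is event-driven too: a failure event triggers the undo chain.
    bus.subscribe("PaymentFailed", lambda o: bus.publish("InventoryReleased", o))
    bus.subscribe("InventoryReleased", lambda o: o.update(status="cancelled"))


def run_checkout(card_ok: bool) -> tuple[str, list[str]]:
    bus = EventBus()
    wire_services(bus, card_ok)
    order = {"id": 8812, "status": "pending"}
    bus.publish("OrderCreated", order)
    bus.drain()
    return order["status"], bus.history


print(run_checkout(card_ok=True))
print(run_checkout(card_ok=False))
```

Note that the overall flow exists only as the sum of the subscriptions; no single piece of code states the whole sequence.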

Advantages of choreography:

  • No single point of failure (no coordinator to crash)
  • Services are truly decoupled — they don't know about each other, just about events
  • Easy to add new steps (new service subscribes to existing events)

Disadvantages of choreography:

  • Hard to understand the overall flow — it's distributed across many services
  • Hard to debug failures — which service failed? What was the state at failure?
  • Compensating transactions must also be event-driven, which adds significant complexity
  • Difficult to ensure SAGA-level ordering and consistency across many independent event handlers

Analogy: A jazz improvisation. Each musician listens to the others and responds organically, no conductor needed. Beautiful when it works. Hard to debug when something goes wrong — who played the wrong note first?

Choreography works well for simple, well-understood flows with 3–4 steps. It gets hard to maintain as complexity grows.

Orchestration: Centralized Control

In an orchestration-based SAGA, a dedicated Saga Orchestrator service calls each step explicitly and manages the state machine of the SAGA. It knows: "I'm in step 2, waiting for inventory confirmation. If I get a success, proceed to step 3. If I get a failure, run compensation for step 1."
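A minimal sketch of such a state machine, assuming the orchestrator saves its position after every transition. The `Orchestrator` class is hypothetical, the `store` dict stands in for a durable database, and each action is assumed to return `True` on success and `False` on failure.

```python
import json
from typing import Callable

STEPS = ["create_order", "reserve_inventory", "charge_payment", "schedule_shipping"]


class Orchestrator:
    def __init__(
        self,
        store: dict[str, str],
        actions: dict[str, Callable[[str], bool]],
        compensations: dict[str, Callable[[str], None]],
    ) -> None:
        self.store = store  # stand-in for a durable table: saga_id -> JSON state
        self.actions = actions
        self.compensations = compensations

    def _load(self, saga_id: str) -> dict:
        raw = self.store.get(saga_id)
        return json.loads(raw) if raw else {"next": 0, "status": "running"}

    def _save(self, saga_id: str, state: dict) -> None:
        self.store[saga_id] = json.dumps(state)

    def advance(self, saga_id: str) -> str:
        """Continue the saga from its persisted position until it finishes."""
        state = self._load(saga_id)
        while state["status"] in ("running", "compensating"):
            if state["status"] == "running":
                if state["next"] == len(STEPS):
                    state["status"] = "completed"
                elif self.actions[STEPS[state["next"]]](saga_id):
                    state["next"] += 1
                else:
                    state["status"] = "compensating"
            elif state["next"] == 0:
                state["status"] = "compensated"
            else:
                # Undo the most recently committed step, then move back one.
                self.compensations[STEPS[state["next"] - 1]](saga_id)
                state["next"] -= 1
            # A crash before this save repeats the last call on restart,
            # so actions and compensations must be idempotent.
            self._save(saga_id, state)
        return state["status"]
```

Because the position is reloaded from `store`, a fresh `Orchestrator` instance pointed at the same store resumes a half-finished saga instead of starting over.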

Advantages of orchestration:

  • The SAGA's logic is in one place — easy to read, debug, and modify
  • The orchestrator can be stateful — it tracks exactly where in the flow it is
  • Timeouts and retry policies are centralized
  • Compensation logic is clear and co-located with the forward logic

Disadvantages of orchestration:

  • The orchestrator becomes a central dependency
  • It needs to be highly available and fault-tolerant (use durable state like a database or workflow engine)
  • Risk of the orchestrator becoming a God object that knows too much about all services

Concrete sketch: Temporal.io is a popular workflow engine used for orchestration-based SAGAs. Your checkout SAGA is a Temporal Workflow: a durable function that executes steps, handles retries, and resumes after a crash. Temporal persists the workflow's execution history, so if the worker process running the workflow crashes, another worker replays that history and continues from where the first one left off.

The Visible Intermediate State Problem

This is SAGA's most important operational challenge. Between step 1 (order created) and step 3 (payment charged), the system is in a partially complete state. The order exists in the Orders database, but payment hasn't been confirmed yet.

During this window:

  • A customer support agent looking at the order sees it as "pending" with no payment
  • Analytics dashboards might count it as a new order before payment is confirmed
  • Another process might try to act on the order before the SAGA completes
  • If the SAGA fails at step 3, compensation must clean up steps 1 and 2 — during which time the partially-cancelled state is briefly visible

In plain terms: SAGA trades "invisible intermediate state behind a lock" for "visible intermediate state that requires careful UI and query design."

Managing this requires:

  • Status fields that reflect SAGA progress: "pending_payment," "confirmed," "payment_failed," "cancelled"
  • Query guards that filter out partially-complete SAGAs from user-facing views
  • Idempotency keys so compensating transactions can be safely retried
  • Dead-letter handling for compensations that fail — you need an alert and manual process for stuck SAGAs
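Two of these measures can be sketched briefly. Everything here is a hypothetical illustration: which statuses count as user-visible is a product decision (this sketch assumes only finished SAGAs are shown), and the `dead_letters` list stands in for a real dead-letter queue with alerting attached.

```python
from enum import Enum
from typing import Callable


class OrderStatus(Enum):
    PENDING_PAYMENT = "pending_payment"
    CONFIRMED = "confirmed"
    PAYMENT_FAILED = "payment_failed"
    CANCELLED = "cancelled"


# Assumption: only terminal states appear in user-facing views.
USER_VISIBLE = {OrderStatus.CONFIRMED, OrderStatus.CANCELLED}


def user_facing_orders(orders: list[dict]) -> list[dict]:
    """Query guard: hide orders whose SAGA is still in flight or compensating."""
    return [order for order in orders if order["status"] in USER_VISIBLE]


def compensate_with_retry(
    compensate: Callable[[str], None],
    saga_id: str,
    dead_letters: list[dict],
    max_attempts: int = 3,
) -> bool:
    """Retry a compensation; park it for manual handling if it keeps failing."""
    last_error = ""
    for _ in range(max_attempts):
        try:
            compensate(saga_id)
            return True
        except Exception as exc:  # a real system would retry only transient errors
            last_error = repr(exc)
    dead_letters.append({"saga_id": saga_id, "error": last_error})
    return False
```

A production version would add a delay between attempts and raise an alert when an entry lands in the dead-letter store; both are omitted to keep the sketch short.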

The Trade-offs

SAGA and 2PC solve the same problem — distributed atomicity — in fundamentally different ways:

Property | SAGA | 2PC
Lock holding | No locks across services | Locks held during protocol
Operation duration | Minutes or hours | Must complete in seconds
External services | Works (no lock protocol needed) | Doesn't work (external APIs can't participate)
Intermediate visibility | Yes; you must handle this in UX | No; hidden behind locks
Failure recovery | Application-written compensations | Protocol-managed rollback
Throughput | High; parallel SAGAs don't contend | Lower; lock contention limits parallelism
Code complexity | High; every step needs a compensation | Lower; protocol handles rollback

Why this matters for you

SAGA is a widely used pattern for distributed transactions in microservice architectures. Order fulfillment, subscription management, loan applications, multi-step onboarding: any workflow that touches multiple services and runs too long to hold locks is a candidate for a SAGA. Before implementing, answer three questions for every step: "What is this step's compensating transaction? What happens if the compensation fails? How does the UI/API behave during intermediate states?" If you can't answer all three, you're not ready to implement the SAGA yet.

Next: Outbox Pattern — how to reliably publish events as part of a transaction without a distributed coordinator.

Diagram (frame 1 of 6): Happy path. SAGA executes steps sequentially; each step commits locally when it succeeds.