086 · OUTBOX · CDC · AT-LEAST-ONCE

Outbox Pattern

Guarantee message delivery by writing events to an outbox table in the same transaction.

If you are new here: The Outbox Pattern solves a deceptively common bug: you save a record to your database and try to publish an event to a message queue — but the publish fails silently. The record was saved; the event was never delivered. Downstream services never knew the record existed. The Outbox Pattern fixes this by writing the event into a special outbox table in your database as part of the same transaction that writes the business record. Then a separate relay process reads from the outbox and publishes to the message bus. If the relay fails, it can retry — the event is still in the outbox. If the database transaction fails, neither the record nor the event exists.

Outbox table: A database table that stores events to be published, written in the same transaction as business data
Relay process: A background service that reads from the outbox and publishes events to a message bus
Dual-write: Attempting to write to two separate systems (DB and message bus) without atomicity — the anti-pattern this solves
At-least-once delivery: A guarantee that every event will be published — but possibly more than once; consumers must be idempotent
CDC (Change Data Capture): Reading the database's write-ahead log to capture changes, rather than polling a table
Debezium: A popular open-source CDC tool that tails database WALs and publishes to Kafka
Idempotent consumer: A consumer that produces the same result regardless of how many times it processes the same event

The Problem

Your Order Service creates a new order and needs to notify the Inventory Service so it can reserve stock. The naive implementation:

await db.query("INSERT INTO orders ...");   // Step 1
await messageQueue.publish("OrderCreated", order);  // Step 2

This looks reasonable. But consider what happens if step 2 fails:

  • Network blip to the message bus
  • Message bus is briefly down for a restart
  • The process crashes between step 1 and step 2

The order is in your database. The "OrderCreated" event was never published. The Inventory Service has no idea the order exists. Stock is never reserved. The customer sees a confirmed order, but fulfillment is stuck.

Now consider the reverse: what if step 1 fails?

  • The event was published, but the order was never saved
  • Inventory reserves stock for a nonexistent order
  • Payment processes against an order that doesn't exist in Orders DB

These are real production bugs. Every distributed system that tries to write to a database and publish a message atomically faces this problem — and the naive solution (just do both) fails in all the ways above.

In plain terms: you can't make a database write and a message queue publish atomic without a distributed transaction coordinator — but a distributed transaction coordinator is expensive and complex. The outbox pattern is the simpler, battle-tested alternative.
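The failure mode above can be reproduced in a few lines. A minimal sketch using an in-memory database and a message bus stub that fails (both are hypothetical stand-ins, not real client libraries):

```javascript
// In-memory stand-ins for a database and a message bus (hypothetical).
const db = { orders: [] };
const bus = {
  publish() { throw new Error("connection reset"); }, // simulate a network blip
};

async function createOrderNaive(order) {
  db.orders.push(order);                    // Step 1: DB write succeeds
  await bus.publish("OrderCreated", order); // Step 2: publish fails
}

createOrderNaive({ id: "8812" }).catch(() => {
  // The order is saved, but no event was ever published — silent divergence.
  console.log(db.orders.length); // 1 — the order exists with no event
});
```

Nothing in this code reports the inconsistency: the caller can even swallow the error, and the database looks perfectly healthy.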

Analogy: Think of postal mail. Instead of handing your letter directly to the mail carrier and hoping they don't drop it, you put it in your own outbox first. The letter is safe in your possession. When the mail carrier comes, they take it from your outbox and deliver it. If the carrier doesn't show up today, the letter is still in your outbox for tomorrow. You've guaranteed the letter will eventually be delivered without needing to hand it to the carrier atomically.

Writing to the Outbox

The Outbox Pattern works by treating the event as just another database row, written in the same transaction as your business data.

Your database transaction now does two things atomically:

  1. Insert the business record (the order, payment, user, etc.)
  2. Insert a row into the outbox table describing the event to be published
BEGIN;
  INSERT INTO orders (id, customer_id, status, ...)
    VALUES ($1, $2, 'pending', ...);
  
  INSERT INTO outbox (event_type, payload, created_at, published)
    VALUES ('OrderCreated', '{"orderId": "8812", ...}', NOW(), false);
COMMIT;

If the transaction commits, both the order and the outbox row exist. If the transaction rolls back (for any reason), neither exists. The atomicity is guaranteed by the database — no distributed transaction needed.

In plain terms: by writing the event to the same database transaction as the business record, you delegate the atomicity problem to your database, which is already very good at it.

Tiny example: Order #8812 is created. In the same transaction:

  • orders table: new row with id=8812, status=pending
  • outbox table: new row with event_type=OrderCreated, payload={"orderId":"8812","customerId":"42","total":129.99}, published=false

Transaction commits. Both rows are permanent. The order exists. The event is ready to be published.
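The all-or-nothing behavior can be sketched with a toy transaction that buffers writes and applies them only on commit (an in-memory stand-in for a real database transaction, not an actual driver API):

```javascript
// Toy transaction: buffer writes, then apply them all on commit or none on rollback.
function makeDb() {
  const tables = { orders: [], outbox: [] };
  return {
    tables,
    transaction(work) {
      const pending = [];
      const tx = { insert: (table, row) => pending.push([table, row]) };
      try {
        work(tx);
        for (const [table, row] of pending) tables[table].push(row); // COMMIT
      } catch (e) {
        // ROLLBACK: pending writes are discarded; tables are untouched
      }
    },
  };
}

const db = makeDb();
db.transaction((tx) => {
  tx.insert("orders", { id: "8812", status: "pending" });
  tx.insert("outbox", {
    eventType: "OrderCreated",
    payload: { orderId: "8812", customerId: "42", total: 129.99 },
    published: false,
  });
});
// Both rows committed together — the outbox row rides the same commit as the order.
```

If the work function throws before commit, neither row appears; that is exactly the guarantee the real database provides for the BEGIN/COMMIT block above.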

The Relay Process

A separate background process — the relay — periodically queries the outbox table for unpublished rows and publishes each one to the message bus.

// Run every few seconds
const events = await db.query(
  "SELECT * FROM outbox WHERE published = false ORDER BY created_at LIMIT 100"
);
for (const event of events) {
  await messageBus.publish(event.event_type, event.payload);
  await db.query("UPDATE outbox SET published = true WHERE id = $1", [event.id]);
}

The relay might run as a dedicated microservice, a cron job, or a thread in the main service. The key properties it needs:

Resilience: if the relay crashes after publishing the event but before marking it as published, it will publish the event again on restart. This is "at-least-once" delivery — consumers must be idempotent (running them twice gives the same result as running them once).

Ordering: if ordering matters, the relay should publish events in created_at order and never process the next event until the previous one is confirmed.

Latency: polling introduces lag. If you poll every 5 seconds, events can be up to 5 seconds late. For most workflows this is acceptable. For near-real-time requirements, use CDC instead of polling.

In plain terms: the relay is the mail carrier that empties your outbox every few seconds. It can be slow, it can retry, it can crash — the letter is always safe in the outbox until it's confirmed delivered.
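A complete, runnable pass of the relay loop sketched above, with in-memory stand-ins for the outbox table and the bus (hypothetical; a real relay would issue SQL and use a broker client):

```javascript
// In-memory outbox rows and a bus that records what it receives (stand-ins).
const outbox = [
  { id: 1, eventType: "OrderCreated", payload: { orderId: "8812" }, published: false },
  { id: 2, eventType: "OrderCreated", payload: { orderId: "8813" }, published: false },
];
const bus = { sent: [], publish: (type, payload) => bus.sent.push({ type, payload }) };

function relayPass() {
  // Unpublished rows only, oldest first — mirrors the polling SELECT.
  for (const row of outbox.filter((r) => !r.published)) {
    bus.publish(row.eventType, row.payload); // 1. publish first
    row.published = true;                    // 2. then mark — never the other way round
  }
}

relayPass();
relayPass(); // a second pass finds nothing unpublished and is a no-op
```

Because rows are only marked after a successful publish, a crash mid-loop leaves the remaining rows unpublished, and the next pass picks them up.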

Marking as Processed

After successfully publishing an event to the message bus, the relay marks it as published (or deletes the row). This prevents re-publishing and keeps the outbox table from growing forever.

The order of operations is critical:

  1. Publish to message bus (if this fails, retry with the row still unpublished)
  2. Mark as published in DB (if this fails after a successful publish, the event will be re-published on next relay run — acceptable, consumers must be idempotent)

This sequence creates at-least-once delivery: you guarantee the event is published at least once, but potentially more. The alternative — marking published before publishing to the bus — creates at-most-once delivery: if the publish fails after the mark, the event is lost.

In plain terms: publish first, then mark done. A duplicate event is recoverable (idempotent consumer handles it). A lost event is a silent inconsistency.

Concrete sketch: The relay publishes "OrderCreated" to Kafka successfully. Before it can mark the row as published, the relay process crashes. On restart, the relay sees the row is still marked published=false and publishes it again. Kafka delivers it twice to the Inventory Service. The Inventory Service, being idempotent, checks: "Did I already process an OrderCreated for order #8812?" — yes, its reservation already exists. It does nothing. No duplicate reservation. No broken state.
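That crash-and-replay scenario can be simulated end to end. An idempotent consumer keyed on the order id absorbs the duplicate (a hypothetical sketch, not real Kafka client code):

```javascript
const outbox = [
  { id: 1, eventType: "OrderCreated", payload: { orderId: "8812" }, published: false },
];
const delivered = []; // everything the bus delivers, duplicates included

// Idempotent consumer: reservations are keyed by orderId, so replays are no-ops.
const reservations = new Map();
function onOrderCreated({ orderId }) {
  if (reservations.has(orderId)) return; // already processed — do nothing
  reservations.set(orderId, { reserved: true });
}

// First relay run: publish succeeds...
delivered.push(outbox[0].payload);
// ...then the relay "crashes" here, before marking the row published.

// Restart: the row still shows published=false, so it is published again, then marked.
delivered.push(outbox[0].payload);
outbox[0].published = true;

delivered.forEach(onOrderCreated); // the consumer sees the event twice, reserves once
```

The duplicate is harmless precisely because the consumer's check ("do I already have a reservation for this order?") is the idempotency key in action.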

CDC Variant: Tailing the Write-Ahead Log

Polling the outbox table works well, but it has overhead: a SELECT query every few seconds plus an UPDATE on each row. For high-throughput systems, this can add noticeable database load.

Change Data Capture (CDC) eliminates polling. Instead of querying the outbox table, a CDC tool like Debezium tails the database's write-ahead log (WAL) — the same binary log your replication replicas use. Every INSERT into the outbox table appears in the WAL, and Debezium reads it and publishes to Kafka in near real-time.

Benefits of CDC:

  • Zero polling overhead on the database
  • Near-zero latency (sub-100ms for most setups)
  • The outbox row can be deleted immediately after the publish, keeping the table clean
  • Ordering is guaranteed by the WAL's sequential structure

In plain terms: instead of asking "any new rows?" every 5 seconds, Debezium reads the database's own internal change log — it knows about new rows the moment they're committed.

Concrete sketch: Debezium connects to your PostgreSQL instance, reads the WAL using the pgoutput plugin, and streams every INSERT/UPDATE/DELETE to a Kafka topic. Your outbox INSERT appears in the WAL the moment the transaction commits, and Debezium publishes it to the outbox.events Kafka topic within 50–100ms. Downstream consumers process it. Debezium tracks the WAL position (LSN), so on restart it picks up where it left off — no events are missed, though an unclean restart can replay a few, so this is still at-least-once delivery and consumers still need idempotency.
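For reference, registering a Debezium Postgres connector for the outbox table might look roughly like this (a sketch only; exact property names and values vary by Debezium version and environment, so check the connector documentation before using them):

```json
{
  "name": "outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "orders-db",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "********",
    "database.dbname": "orders",
    "table.include.list": "public.outbox",
    "topic.prefix": "outbox"
  }
}
```

The table.include.list filter keeps Debezium from streaming every table in the database — only outbox inserts become events.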

The Trade-offs

The Outbox Pattern is one of the most broadly useful patterns in distributed systems. Its main cost is operational complexity:

Aspect             | Outbox + CDC                 | Outbox + Polling        | Dual write (no outbox)
Delivery guarantee | At-least-once                | At-least-once           | None (silent losses possible)
Publish latency    | 50–100ms                     | Up to poll interval     | Immediate (when it works)
DB overhead        | Minimal (WAL already exists) | Polling SELECT + UPDATE | None
Infrastructure     | Debezium + Kafka needed      | Just the relay process  | Nothing extra
Simplicity         | Complex setup                | Medium                  | Simple (until it breaks)

The polling variant is often the right starting point. It requires no CDC infrastructure, just a background job and an extra table. Add CDC later if you need lower latency or higher throughput.

When to use the Outbox Pattern:

  • Any time you write to a database and publish an event as part of the same logical operation
  • SAGA choreography: each service publishes to the event bus from the outbox, not directly
  • Notification systems: user signup creates a row and a "SendWelcomeEmail" event atomically
  • Integration with external systems: write the order and queue the third-party webhook call through the outbox atomically

When not to bother:

  • Fire-and-forget notifications where occasional loss is acceptable
  • Operations that are pure database reads (no writes = no outbox needed)

Why this matters for you

The dual-write bug is one of the most common data consistency issues in production microservices. It's easy to miss in code review ("we write to the DB and then publish — what could go wrong?") and shows up as subtle, intermittent data divergence between services. The Outbox Pattern is the standard mitigation. Debezium + Kafka is the production-grade implementation you'll see at large scale. A simple polling relay is the practical starting point for most teams. Either way, the underlying principle — write the event atomically with the business data, then deliver it separately — is what you want to internalize.

Next section: MESSAGING & EVENTS — Message Queues, Pub/Sub, and Event-Driven Architecture.

Diagram (frame 1 of 6). Dual-write problem: writing to the DB and publishing an event are two separate operations — one can fail silently.