078 · RAFT · LOG · LEADER

Raft Algorithm

A consensus algorithm designed to be understandable — used in etcd and CockroachDB.

If you are new here: Raft is a consensus algorithm invented in 2013 by Diego Ongaro and John Ousterhout with one explicit design goal: "understandability." It solves the same problem as Paxos — getting a cluster of servers to agree on a sequence of values even when nodes crash — but decomposes the problem into three clearly separated sub-problems: leader election, log replication, and safety. Today, Raft is the algorithm inside etcd (the key-value store that Kubernetes depends on), CockroachDB, TiKV, and many other production systems.

Key terms, in plain language:

  • Replicated log: An identical ordered list of commands on every server — the shared source of truth
  • Leader: The one server that handles all writes and sends log entries to followers
  • Follower: A server that copies the leader's log and votes in elections
  • Candidate: A follower running for election after a leader timeout
  • Term: A monotonically increasing generation number — incremented every election
  • AppendEntries: The RPC a leader sends to replicate log entries (also used as heartbeats)
  • Commit index: The index up to which entries are durably agreed upon by a majority
  • Snapshot: A compacted image of the current state that replaces the log entries before it

The Problem

Your application has three database replicas. A user submits a write. You want all three replicas to apply this write — and every other write — in exactly the same order. If they apply writes in different orders, they'll diverge: replica 1 might have account balance $900, replica 2 might have $800, and you have a correctness disaster.

This is the replicated state machine problem. If you can guarantee that every replica starts in the same state and applies the same sequence of commands in the same order, all replicas will always be in the same state. Raft is the protocol that provides that guarantee under arbitrary crash failures.

Without a protocol like Raft, you get split-brain (two nodes simultaneously accepting writes) or data loss (promoting a follower that missed some writes). With Raft, you get automatic failover that's provably safe — the new leader will always have all committed entries, and followers that are behind will always catch up to exactly the same state.

In plain terms: Raft is a rulebook for three replicas to behave as one reliable database, even when individual nodes crash and recover.
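
A minimal sketch of the replicated-state-machine idea, using hypothetical types rather than any real library: two replicas that apply the same command log, in the same order, cannot diverge.

    // Sketch only: a replica is a state machine plus an ordered command log.
    package main

    import "fmt"

    // Command is a hypothetical state-machine command, e.g. "SET balance=900".
    type Command struct {
        Key   string
        Value int
    }

    // Replica applies commands strictly in log order.
    type Replica struct {
        state map[string]int
    }

    func (r *Replica) Apply(log []Command) {
        for _, cmd := range log {
            r.state[cmd.Key] = cmd.Value
        }
    }

    func main() {
        log := []Command{{"balance", 1000}, {"balance", 900}, {"balance", 800}}
        a := &Replica{state: map[string]int{}}
        b := &Replica{state: map[string]int{}}
        a.Apply(log)
        b.Apply(log)
        fmt.Println(a.state["balance"], b.state["balance"]) // prints 800 800: identical state
    }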

Analogy: Think of a judge reading out court decisions to a panel of three clerks who each maintain the official record. Only the judge (leader) announces decisions. All three clerks write down exactly what the judge says, in the exact order said. If the judge falls ill and a new judge takes over, the incoming judge first reviews the last few decisions recorded across all three clerks' notes to make sure they're identical before issuing new rulings. Any clerk who missed a decision gets it read to them before the new judge continues.

Leader Appends Entries

All client writes go to the leader. The leader appends the new log entry to its own log and then simultaneously sends AppendEntries(entry) RPCs to all followers. At this point, the entry is in the leader's log but not yet committed — followers have it tentatively, not durably.

The AppendEntries RPC carries more than just the new entry. It also includes:

  • The leader's current term (to detect stale leaders)
  • The index and term of the log entry immediately preceding the new one (a consistency check)
  • The leader's current commit index (so followers know what they can apply to their state machine)

In plain terms: every write is a two-step process — first spread it (AppendEntries), then commit it (once you've heard back from enough followers).

Tiny example: Three servers. Client sends SET balance=900. Leader appends it as entry #47, sends AppendEntries(#47) to Follower 1 and Follower 2. All three now have [..., #47: SET balance=900] in their logs, marked as uncommitted. A read of balance still returns the old value.
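
As a rough sketch, the request described above can be modeled with the structs below; the field names loosely follow the Raft paper and are illustrative, not etcd's actual wire format.

    // Illustrative shapes only (names follow the Raft paper, not any real codebase).
    type LogEntry struct {
        Term    int    // term in which the leader created this entry
        Command []byte // opaque state-machine command, e.g. "SET balance=900"
    }

    type AppendEntriesRequest struct {
        Term         int        // leader's current term, so followers can detect stale leaders
        LeaderID     int        // lets followers redirect clients to the leader
        PrevLogIndex int        // index of the entry immediately preceding Entries
        PrevLogTerm  int        // term of that preceding entry (the consistency check)
        Entries      []LogEntry // empty for a pure heartbeat
        LeaderCommit int        // leader's commit index, so followers know what to apply
    }

    type AppendEntriesResponse struct {
        Term    int  // follower's current term, so a stale leader can step down
        Success bool // false if the PrevLogIndex/PrevLogTerm check failed
    }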

Committing on Majority Acknowledgment

Once the leader receives acknowledgments from a majority (⌊N/2⌋ + 1) of the cluster for the new log entry, it marks that entry as committed and applies it to its state machine. It then notifies followers of the new commit index in the next heartbeat or AppendEntries message, and followers apply the entry to their state machines too.

This is the core safety guarantee: an entry is committed — and therefore durable — only when a majority of servers have it in their logs. Any two majorities share at least one server, and Raft only grants votes to candidates whose logs are at least as up-to-date as the voter's, so any future leader elected from this cluster will have this entry, ensuring committed entries are never lost.

In plain terms: the majority acknowledgment rule means "this write is safe even if any single server crashes right now." It's the mathematical minimum needed to make that claim.

Concrete sketch: 5-node cluster. Leader sends AppendEntries for entry #47. Followers 1 and 2 reply with success. That's 3 out of 5 (leader + 2 followers) — a majority. Leader commits #47, responds to the client: "Write succeeded." Even if Followers 3 and 4 crash before receiving the entry, #47 is safe: every surviving server has it, so any leader elected from them will have it too.

What if followers are slow? The leader doesn't wait for all followers — just a majority. Slow followers will receive the committed entries in future AppendEntries messages and catch up. They just can't participate in commits until they're caught up. The system degrades gracefully rather than blocking.
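
A small sketch of the majority rule, assuming the leader tracks matchIndex, the highest log index known to be replicated on each server (itself included), as the Raft paper describes:

    package raft // sketch only

    import "sort"

    // majorityCommitIndex returns the highest log index present on a majority of
    // servers, given one matchIndex entry per server (leader included).
    func majorityCommitIndex(matchIndex []int) int {
        sorted := append([]int(nil), matchIndex...)
        sort.Ints(sorted)
        // The element at this position is replicated on at least floor(N/2)+1 servers.
        // (The real algorithm additionally requires the entry at that index to carry
        // the leader's current term before committing it.)
        return sorted[(len(sorted)-1)/2]
    }

    // Example: 5 servers, matchIndex = [47, 47, 47, 46, 30].
    // Sorted: [30, 46, 47, 47, 47]; position (5-1)/2 = 2 holds 47.
    // Entry #47 is on 3 of 5 servers, so it can be committed.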

Log Catch-Up After Leader Change

When a new leader is elected, it may find followers with different log states. Some might be far behind (they were partitioned or just rebooted). Some might even have extra uncommitted entries from a previous leader that crashed before committing them.

Raft handles this through the nextIndex mechanism. The new leader initializes nextIndex[follower] to its own last log index + 1 for each follower. When a follower rejects an AppendEntries message (because the consistency check fails — the preceding entry doesn't match), the leader decrements nextIndex and retries. Eventually, the leader finds the point where the follower's log matches its own, and sends all missing entries from that point forward.

Uncommitted entries on a stale follower are simply overwritten. Since they were never committed (never acknowledged by a majority), this is safe — no client was told these writes succeeded.
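
A sketch of that retry loop, assuming simple per-follower bookkeeping. The LogEntry type matches the earlier sketch, and sendAppendEntries is a hypothetical helper that performs the RPC and reports whether the consistency check passed.

    // Sketch only: catch one follower up after a leader change.
    type Leader struct {
        log        []LogEntry // index 0 holds a sentinel entry with Term 0
        nextIndex  []int      // per follower: next log index to send
        matchIndex []int      // per follower: highest index known to be replicated
    }

    func (l *Leader) replicateTo(follower int) {
        for {
            next := l.nextIndex[follower]
            prevIndex := next - 1
            prevTerm := l.log[prevIndex].Term
            entries := l.log[next:] // everything the follower might be missing

            // sendAppendEntries is hypothetical: it returns false if the follower
            // rejected the request because its entry at prevIndex didn't match prevTerm.
            if l.sendAppendEntries(follower, prevIndex, prevTerm, entries) {
                l.nextIndex[follower] = len(l.log)
                l.matchIndex[follower] = len(l.log) - 1
                return
            }
            // Consistency check failed: back up one entry and try again.
            // (Real implementations often skip back faster using a conflict hint.)
            l.nextIndex[follower]--
        }
    }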

In plain terms: the new leader patiently replays its log to every follower from the last point of agreement. Stale or wrong entries are overwritten. Committed entries are never overwritten.

Analogy: A new school principal inherits filing cabinets from three vice principals. Each VP kept notes slightly differently during the previous principal's absence. The new principal compares all three sets of notes, finds the last entry they all agree on, and then distributes the official record from that point forward. Anything that wasn't officially approved gets replaced.

Snapshots and Log Compaction

Raft's replicated log grows forever unless you do something about it. A service that processes a thousand writes per second will accumulate 86 million log entries per day. On restart, replaying the entire log from the beginning would take hours.

Raft addresses this with snapshots: periodically, a node takes a complete snapshot of its current state machine state, writes it to stable storage, and discards all log entries before the snapshot's log index. On restart, it loads the snapshot and only replays entries after it. When a new node joins or a stale follower is very far behind, the leader sends the snapshot directly (via InstallSnapshot RPC) rather than replaying thousands of individual entries.

In plain terms: snapshots are like a save game — you capture the complete current state so you don't have to replay every move from the beginning.

Tiny example: A Kubernetes cluster's etcd has processed 500,000 write operations since startup. Without compaction, replaying them all on restart takes minutes. With compaction, etcd periodically takes a snapshot at the latest commit index and discards older entries. Restart time drops from minutes to seconds.

In etcd, both mechanisms are configurable: the Raft log is snapshotted after a set number of new entries (the --snapshot-count flag), and key-value history is compacted separately (Kubernetes' API server, for example, requests a compaction every 5 minutes by default).
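
A sketch of what compaction does to the in-memory log, again with hypothetical types and the LogEntry shape from earlier; real implementations also persist the snapshot and the remaining entries to stable storage.

    // Snapshot captures the full state machine as of LastIncludedIndex.
    type Snapshot struct {
        LastIncludedIndex int            // last log index the snapshot covers
        LastIncludedTerm  int            // term of that entry, kept for consistency checks
        State             map[string]int // the complete state-machine image
    }

    // compact keeps only the entries after the snapshot's last included index.
    // In this sketch, the slice position equals the Raft log index.
    func compact(log []LogEntry, snap Snapshot) []LogEntry {
        if snap.LastIncludedIndex+1 >= len(log) {
            return nil // the snapshot covers the whole log
        }
        return append([]LogEntry(nil), log[snap.LastIncludedIndex+1:]...)
    }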

The Trade-offs

Raft was designed to win on understandability, not raw performance. The strong-leader model simplifies reasoning but creates a bottleneck: all writes flow through one leader, all reads (for linearizability) also need the leader.

Raft vs. Multi-Paxos, aspect by aspect:

  • Understandability: Raft is excellent (election, replication, and safety are separate sub-problems); Multi-Paxos is harder because its phases blur together.
  • Write throughput: Raft is bounded by one leader; Multi-Paxos can pipeline across multiple leaders (at the cost of more complexity).
  • Read linearizability: Raft uses a leader lease or a ReadIndex check; Multi-Paxos has similar approaches available.
  • Practical implementations: Raft powers etcd, CockroachDB, TiKV, and Consul; Multi-Paxos-style systems include Google Chubby and Zab (the ZooKeeper variant).
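
The ReadIndex check mentioned above works roughly like the sketch below, following the Raft dissertation's description; every field and helper here is hypothetical, not a real etcd API.

    // Sketch only: serving a linearizable read from the leader without appending to the log.
    // Assumes the leader has already committed at least one entry in its current term.
    func (l *Leader) linearizableRead(key string) (int, error) {
        readIndex := l.commitIndex          // 1. remember the commit index at the moment of the read
        if !l.confirmedLeaderByMajority() { // 2. one heartbeat round: am I still the leader?
            return 0, errNotLeader
        }
        l.waitUntilApplied(readIndex)       // 3. wait until the state machine has applied up to readIndex
        return l.stateMachine[key], nil     // 4. answer from the local state machine
    }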

For most teams, Raft is the right choice because off-the-shelf Raft implementations (etcd, Consul) are mature, well-tested, and operationally familiar. The throughput limits only become relevant at very high write volumes — orders of magnitude higher than most services ever reach on their coordination plane.

Why this matters for you

Every Kubernetes cluster depends on a Raft consensus group (etcd) for all cluster state. If you've ever seen "etcd leader election" in your Kubernetes events, you've witnessed Raft at work. When you're designing a distributed system that needs coordination — feature flags, distributed locks, leader election, configuration management — you're choosing between building on top of Raft (via etcd or Consul) or inventing something weaker. In almost every case, building on Raft is the right choice. Understand what Raft guarantees (linearizability, safe failover) and what it doesn't (availability under majority loss), and you'll make much better architectural decisions.

Next: Gossip Protocol — a different paradigm for spreading information across large clusters without any central coordinator.

Diagram (interactive, 6 frames). Goal: every node applies the same log entries in the same order — identical state machines.