Raft (algorithm)
from Wikipedia
[Figure: The Raft consensus algorithm mascot.]
Class: Consensus algorithm

Raft is a consensus algorithm designed as an alternative to the Paxos family of algorithms. It was intended to be more understandable than Paxos through a separation of concerns, and it is also formally proven safe and offers some additional features.[1] Raft offers a generic way to distribute a state machine across a cluster of computing systems, ensuring that each node in the cluster agrees upon the same series of state transitions. It has a number of open-source reference implementations, with full-specification implementations in Go, C++, Java, JavaScript, and Scala.[2] It is named after Reliable, Replicated, Redundant, And Fault-Tolerant.[3]

Raft is not a Byzantine fault tolerant (BFT) algorithm; the nodes trust the elected leader.[1]

Basics

Raft achieves consensus via an elected leader. A server in a Raft cluster is either a leader or a follower, and can be a candidate during an election (when the leader is unavailable). The leader is responsible for log replication to the followers. It regularly informs the followers of its existence by sending a heartbeat message. Each follower has a timeout (typically between 150 and 300 ms) within which it expects a heartbeat from the leader. The timeout is reset on receiving the heartbeat. If no heartbeat is received, the follower changes its status to candidate and starts a leader election.[1][4]
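
The follower side of this loop can be sketched in Go. This is a minimal illustration rather than code from any particular implementation; Role, electionTimeout, and the heartbeat channel are all assumed names:

```go
package raft

import (
	"math/rand"
	"time"
)

// Role models the three states a Raft server moves between.
type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

// electionTimeout draws a fresh randomized timeout in the
// 150-300 ms range mentioned above.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

// runFollower waits for leader heartbeats; if none arrives before the
// timeout fires, the server becomes a candidate and starts an election.
func runFollower(heartbeat <-chan struct{}) Role {
	timer := time.NewTimer(electionTimeout())
	defer timer.Stop()
	for {
		select {
		case <-heartbeat: // leader is alive: reset the timeout
			timer.Reset(electionTimeout())
		case <-timer.C: // no heartbeat in time: stand for election
			return Candidate
		}
	}
}
```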

Approach to the consensus problem in Raft

Raft implements consensus using a leader-based approach. The cluster has one and only one elected leader, which is fully responsible for managing log replication on the other servers of the cluster. This means that the leader can decide on the placement of new entries and on the flow of data between it and the other servers without consulting them. A leader leads until it fails or disconnects, in which case the surviving servers elect a new leader.

The consensus problem is decomposed in Raft into two relatively independent subproblems, described below.

Leader election

When the existing leader fails or when the algorithm initializes, a new leader needs to be elected.

In this case, a new term starts in the cluster. A term is an arbitrary period of time, identified by a consecutive number, that begins with a leader election. If the election completes successfully (i.e. a single leader is elected), the term continues with normal operations orchestrated by the new leader. If the election fails, a new term starts, with a new election.

A leader election is started by a candidate server. A server becomes a candidate if it receives no communication from the leader over a period called the election timeout, in which case it assumes there is no acting leader anymore. It starts the election by incrementing the term counter, voting for itself as the new leader, and sending a message to all other servers requesting their vote. A server will vote only once per term, on a first-come-first-served basis. If a candidate receives a message from another server with a term number larger than the candidate's current term, then the candidate's election is defeated: the candidate becomes a follower and recognizes the new leader as legitimate. If a candidate receives a majority of votes, it becomes the new leader. If neither happens, e.g. because of a split vote, a new term starts and a new election begins.[1]
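
The candidate's side of this procedure might look as follows in Go. This is a minimal sketch: candidate, voteReply, and askPeer are illustrative stand-ins for real state and RPC plumbing:

```go
package raft

// voteReply is what a peer answers to a RequestVote message.
type voteReply struct {
	term        int
	voteGranted bool
}

// candidate holds the minimal state this sketch needs; the field names
// follow the paper's terminology, but the type itself is illustrative.
type candidate struct {
	id          int
	currentTerm int
	votedFor    int
}

// startElection sketches the sequence described above: increment the
// term, vote for yourself, request votes from every peer, and either
// win on a strict majority, step down on seeing a higher term, or fail
// (split vote) and wait for the next timeout.
func (c *candidate) startElection(peers []int, askPeer func(peer, term int) voteReply) string {
	c.currentTerm++   // a new election starts a new term
	c.votedFor = c.id // the candidate votes for itself
	votes := 1
	for _, p := range peers {
		r := askPeer(p, c.currentTerm)
		if r.term > c.currentTerm {
			return "follower" // a newer term defeats this candidacy
		}
		if r.voteGranted {
			votes++
		}
	}
	if votes > (len(peers)+1)/2 { // strict majority of the whole cluster
		return "leader"
	}
	return "candidate" // split vote: retry after a fresh random timeout
}
```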

Raft uses a randomized election timeout to ensure that split vote problems are resolved quickly. This should reduce the chance of a split vote because servers won't become candidates at the same time: a single server will time out, win the election, then become leader and send heartbeat messages to other servers before any of the followers can become candidates.[1]

Log replication

The leader is responsible for log replication. It accepts client requests, each of which consists of a command to be executed by the replicated state machines in the cluster. After a request is appended to the leader's log as a new entry, it is forwarded to the followers in AppendEntries messages. If followers are unavailable, the leader retries AppendEntries messages indefinitely, until the log entry is eventually stored by all of the followers.

Once the leader receives confirmation that the entry has been replicated by a majority of the servers in the cluster (counting itself), the leader applies the entry to its local state machine, and the request is considered committed.[1][4] This event also commits all previous entries in the leader's log. Once a follower learns that a log entry is committed, it applies the entry to its local state machine. This ensures consistency of the logs across all servers in the cluster, respecting the log matching safety rule.

In the case of a leader crash, the logs can be left inconsistent, with some entries from the old leader not fully replicated through the cluster. The new leader then handles inconsistency by forcing the followers to duplicate its own log: for each of its followers, the leader compares its log with the follower's log, finds the last entry on which they agree, deletes all the entries in the follower's log that come after this point, and replaces them with its own log entries. This mechanism restores log consistency in a cluster subject to failures.
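
A minimal Go sketch of this repair loop, assuming a tryAppend callback that stands in for the AppendEntries RPC:

```go
package raft

// Entry is a single log record: the term it was created in plus a command.
type Entry struct {
	Term    int
	Command string
}

// repairFollower sketches the backtracking described above. tryAppend
// receives the index and term of the entry preceding the batch plus the
// batch itself, and reports whether the follower's log matched at that
// point. All names are illustrative.
func repairFollower(leaderLog []Entry, tryAppend func(prevIndex, prevTerm int, entries []Entry) bool) {
	next := len(leaderLog) // optimistically assume the follower is current
	for next >= 0 {
		prevIndex, prevTerm := next-1, -1
		if prevIndex >= 0 {
			prevTerm = leaderLog[prevIndex].Term
		}
		if tryAppend(prevIndex, prevTerm, leaderLog[next:]) {
			return // the follower now mirrors the leader from here on
		}
		next-- // logs disagreed at prevIndex: back up one entry and retry
	}
}
```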

Safety

Safety rules in Raft

Raft guarantees each of these safety properties:

  • Election safety: at most one leader can be elected in a given term.
  • Leader append-only: a leader can only append new entries to its logs (it can neither overwrite nor delete entries).
  • Log matching: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index.
  • Leader completeness: if a log entry is committed in a given term, then it will be present in the logs of the leaders of all subsequent terms.
  • State machine safety: if a server has applied a particular log entry to its state machine, then no other server may apply a different command for the same log index.

The first four properties are guaranteed by the details of the algorithm described in the previous sections. State machine safety is guaranteed by a restriction on the election process.

State machine safety

This rule is ensured by a simple restriction: a candidate can't win an election unless its log contains all committed entries. In order to be elected, a candidate has to contact a majority of the cluster, and given the rules for logs to be committed, this means that every committed entry is going to be present on at least one of the servers the candidate contacts.

Raft determines which of two logs (carried by two distinct servers) is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.

In Raft, the request from a candidate to a voter includes information about the candidate's log. If its own log is more up-to-date than the candidate's log, the voter denies its vote to the candidate. This implementation ensures the State Machine Safety rule.
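
This comparison is small enough to write out directly; a minimal sketch in Go, with illustrative parameter names:

```go
package raft

// moreUpToDate implements the comparison described above: a later term
// in the last entry wins, and with equal terms the longer log wins.
func moreUpToDate(lastTermA, lastIndexA, lastTermB, lastIndexB int) bool {
	if lastTermA != lastTermB {
		return lastTermA > lastTermB
	}
	return lastIndexA > lastIndexB
}
```

A voter would deny its vote exactly when moreUpToDate reports the voter's own log as ahead of the candidate's.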

Follower crashes

If a follower crashes, AppendEntries and vote requests sent by other servers will fail. Such failures are handled by the servers trying indefinitely to reach the downed follower. If the follower restarts, the pending requests will complete. If the request has already been taken into account before the failure, the restarted follower will just ignore it.

Timing and availability

Timing is critical in Raft for electing and maintaining a steady leader over time, in order to maintain the availability of the cluster. Stability is ensured by respecting the timing requirement of the algorithm:

broadcastTime << electionTimeout << MTBF

  • broadcastTime is the average time it takes a server to send a request to every server in the cluster and receive responses. It is relative to the infrastructure used.
  • MTBF (Mean Time Between Failures) is the average time between failures for a server. It is also relative to the infrastructure.
  • electionTimeout is the same as described in the Leader Election section. It is something the programmer must choose.

Typical numbers for these values can be 0.5 ms to 20 ms for broadcastTime, which implies that the programmer sets the electionTimeout somewhere between 10 ms and 500 ms. It can take several weeks or months between single server failures, which means the values are sufficient for a stable cluster.
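
One way to honor the inequality in code is to derive the election timeout from a measured broadcast time and add random jitter. A minimal sketch; the 10x spread factor below is an illustrative choice, not a value prescribed by the algorithm:

```go
package raft

import (
	"math/rand"
	"time"
)

// randomElectionTimeout picks a timeout comfortably above the measured
// broadcast time and randomizes it to break split votes, honoring
// broadcastTime << electionTimeout.
func randomElectionTimeout(broadcastTime time.Duration) time.Duration {
	base := 10 * broadcastTime // an order of magnitude above broadcastTime
	jitter := time.Duration(rand.Int63n(int64(base)))
	return base + jitter // uniformly spread over [base, 2*base)
}
```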

Cluster Membership Changes

To address cluster membership change issues that can arise in Raft, the algorithm introduces joint consensus, a transitional configuration phase. Joint consensus works as follows:[5]

Given server configuration Cold, the old server configuration, and Cnew, the new configuration:[5]

  • Log entries are replicated to all servers in both Cold and Cnew.[5]
  • Any server from either Cold or Cnew can be leader.[5]
  • Agreement for elections and log entry commits requires majorities from both Cold and Cnew.[5]

Once the joint consensus is done (that is, the special Cnew configuration entry is replicated to a majority of Cnew servers' logs), the system fully transitions to the new configuration.[5]
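
During the joint phase, agreement therefore reduces to requiring two simultaneous majorities, as in this hypothetical Go sketch (votes, cOld, and cNew are illustrative names):

```go
package raft

// jointQuorum captures the rule above: while C_old,new is in effect,
// agreement requires separate majorities in both the old and the new
// configuration. votes maps server IDs to whether they agreed.
func jointQuorum(votes map[string]bool, cOld, cNew []string) bool {
	return majority(votes, cOld) && majority(votes, cNew)
}

func majority(votes map[string]bool, members []string) bool {
	granted := 0
	for _, m := range members {
		if votes[m] {
			granted++
		}
	}
	return granted > len(members)/2
}
```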

However, three issues arise with this new configuration, which Raft addresses as follows:[6]

  1. New servers with no log entries. Raft introduces a phase before the configuration change where servers with no log entries are not considered part of the majority in elections, but they will have entries replicated to them. This happens until the server is fully caught up with entries.[6]
  2. Cluster leader isn't in Cnew. The leader steps down and returns to the follower state once Cnew is committed. In the meantime, it continues to replicate log entries but does not count itself in majorities.[6]
  3. Disruptions from servers in Cold. If a server believes a current leader exists, it disregards RequestVote RPCs (RPCs used to gather votes for a leader election).[6]

Log Compaction

An extension to Raft is log compaction, in which each server takes a snapshot of its committed entries and saves it to stable storage, along with the index and term of the last entry included in the snapshot. The leader occasionally sends its snapshot to servers that are lagging behind in their logs. When a server receives such a snapshot, it either discards its entire log, if it is superseded by the snapshot, or only the entries covered by the snapshot.[7]
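
A compaction step along these lines could be sketched as follows, reusing the Entry type from the log-replication sketch above; Snapshot and serializeState are hypothetical names:

```go
package raft

// Snapshot pairs the serialized state machine with the metadata Raft
// needs to splice it into a log: the last included index and term.
type Snapshot struct {
	LastIncludedIndex int
	LastIncludedTerm  int
	State             []byte
}

// compact sketches the step described above: snapshot everything up to
// the last committed entry, then drop the log prefix the snapshot now
// covers. serializeState is an application-provided hook.
func compact(log []Entry, commitIndex int, serializeState func() []byte) (Snapshot, []Entry) {
	snap := Snapshot{
		LastIncludedIndex: commitIndex,
		LastIncludedTerm:  log[commitIndex].Term,
		State:             serializeState(),
	}
	// Keep only the uncommitted tail; the snapshot stands in for the rest.
	return snap, append([]Entry(nil), log[commitIndex+1:]...)
}
```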

Issues

While Raft aims to be a more understandable alternative to Paxos, several issues arise.

Leader Bottleneck

Raft uses a single-leader model, in which client requests, both reads and writes, and log replication all go through a single leader. This creates a single point of failure and a performance bottleneck, and it does not scale with increasing server workload.[8]

Reconfiguration

Raft's membership change system has not been formally verified to be correct, which makes implementing it risky, as there can be many potential bugs and errors. Diego Ongaro, one of the co-authors, attempted to create a formal safety proof, but there are no plans to continue developing it due to its complexity.[9] In 2014, a safety bug was found relating to single-server membership changes.[9]

Byzantine Faults

Raft, like other non-Byzantine consensus algorithms, guarantees that it never returns an incorrect result only under non-Byzantine conditions;[10] it is not a Byzantine fault tolerant algorithm. A 2023 study found that blockchain systems based on Raft are vulnerable to Byzantine attacks because of the lack of authentication on the client side.[11]

Extensions

The dissertation “Consensus: Bridging Theory and Practice” by one of the co-authors of the original paper describes extensions to the original algorithm:[12]

  • Pre-Vote: when a member rejoins the cluster, it can, depending on timing, trigger an election even though there is already a leader. To avoid this, a pre-vote phase first checks with the other members before starting a real election. Avoiding the unnecessary election improves the availability of the cluster, so this extension is usually present in production implementations.
  • Leadership transfer: a leader that is shutting down in an orderly fashion can explicitly transfer leadership to another member. This can be faster than waiting for a timeout. A leader can also step down when another member would be a better leader, for example when that member runs on a faster machine.

Production use of Raft

  • CockroachDB uses Raft in its replication layer.[13]
  • etcd uses Raft to manage a highly-available replicated log.[14]
  • Hazelcast uses Raft to provide its CP Subsystem, a strongly consistent layer for distributed data structures.[15]
  • IBM MQ uses Raft to manage a highly-available replicated log.[16]
  • MongoDB uses a variant of Raft in its replica sets.
  • Neo4j uses Raft to ensure consistency and safety.[17]
  • RabbitMQ uses Raft to implement durable, replicated FIFO queues.[18]
  • ScyllaDB uses Raft for metadata (schema and topology changes).[19]
  • Splunk Enterprise uses Raft in a Search Head Cluster (SHC).[20]
  • TiDB uses Raft with the storage engine TiKV.[21]
  • YugabyteDB uses Raft in DocDB replication.[22]
  • ClickHouse uses Raft for its in-house implementation of a ZooKeeper-like service.[23]
  • Redpanda uses the Raft consensus algorithm for data replication.[24]
  • Apache Kafka Raft (KRaft) uses Raft for metadata management.[25]
  • NATS Messaging uses the Raft consensus algorithm for JetStream cluster management and data replication.[26]
  • Camunda uses the Raft consensus algorithm for data replication.[27]

References

from Grokipedia
Raft is a consensus algorithm for managing a replicated log in distributed systems, designed to ensure that multiple servers agree on the committed state of the log despite failures such as network partitions, server crashes, or message losses. Developed by Diego Ongaro and John Ousterhout at Stanford University, Raft was introduced in 2014 as an alternative to the more complex Paxos algorithm, prioritizing understandability while maintaining equivalent fault-tolerance and performance guarantees. It achieves consensus through a strong leader model, where a designated leader handles all client requests and replicates log entries to followers, ensuring consistency and safety in asynchronous environments.

Raft decomposes the consensus problem into three key subproblems: leader election, log replication, and safety, which simplifies implementation and reasoning compared to monolithic approaches like Paxos. Leader election uses randomized timers to select a leader quickly and avoid prolonged splits, while log replication ensures followers apply entries in order only after acknowledgment from a majority of the cluster. Safety mechanisms, such as log matching and commit rules, prevent inconsistencies in the event of crashes affecting a minority of the servers.

This modular design has made Raft widely adopted in production systems, including etcd for Kubernetes, Consul by HashiCorp, and TiKV. The algorithm's emphasis on clarity was validated through a user study involving 43 students, where participants learned and implemented Raft more accurately and with fewer errors than Paxos. Extensions in Ongaro's dissertation address cluster membership changes via a joint consensus method and incorporate features like log compaction for efficiency. Overall, Raft serves as a foundational protocol for building fault-tolerant, scalable distributed applications, bridging theoretical consensus with practical engineering.

Introduction

Overview of Raft

Raft is a consensus algorithm designed for managing replicated logs in distributed systems, enabling the reliable coordination of replicated state machines across multiple nodes to ensure consistent operation despite node failures or network partitions. It achieves consensus by replicating client-submitted commands in a log that all nodes apply in the same order, producing results equivalent to those of the Paxos algorithm while maintaining comparable efficiency. The primary design goals of Raft emphasize understandability and simplicity, particularly in contrast to the more complex Paxos protocol, to facilitate easier implementation and reasoning in practical systems. To this end, Raft decomposes the consensus problem into three main subproblems—leader election, log replication, and safety—allowing developers to address each independently while ensuring overall correctness. This modular structure reduces the state space and avoids subtle edge cases, making the algorithm more accessible for building reliable distributed applications.

Raft operates on a cluster of nodes, where each node assumes one of three roles: a leader, which handles all client interactions and log replication; a follower, which passively responds to leader requests; or a candidate, which temporarily seeks election as leader. Time in the system is structured into discrete terms, representing election cycles that begin with a potential leader election and may result in a new term if no leader is chosen. Nodes communicate primarily through two remote procedure calls (RPCs): RequestVote, used during elections to solicit votes, and AppendEntries, employed by the leader for replicating log entries and sending periodic heartbeats to maintain follower liveness.

In its high-level workflow, Raft first elects a unique leader via a randomized election timeout to avoid prolonged contention, ensuring that only one node coordinates the cluster at a time. Clients submit commands to this leader, which appends them as log entries and replicates the log to followers for agreement. Upon acknowledgment from a majority of nodes, entries are committed and applied to the state machines, guaranteeing linearizable consistency across the system.

Historical Development and Motivation

Raft was developed by Diego Ongaro and John Ousterhout at Stanford University during Ongaro's PhD research, with the algorithm first published in a paper at the 2014 USENIX Annual Technical Conference (ATC), where it received the Best Paper Award. The work aimed to address longstanding challenges in distributed consensus protocols by prioritizing understandability without sacrificing correctness or efficiency.

The motivation for Raft stemmed from the complexities of prior consensus algorithms, particularly Paxos, which, while mathematically sound, proved difficult to implement correctly and teach effectively due to its interleaved phases and abstract terminology. Ongaro and Ousterhout sought to create an alternative that separated key concerns—such as leader election from log replication—into modular components, fostering a clearer mental model for developers and students building practical systems like replicated databases. This design philosophy was informed by observations that Paxos's complexity often led to errors in real-world applications, prompting a focus on educational value alongside operational simplicity.

Among Raft's key innovations are randomized timeouts for leader elections, which resolve conflicts probabilistically to ensure quick convergence, and heartbeat-based leader maintenance, which simplifies failure detection over more intricate voting schemes. These elements, combined with a strong leader model where replication flows unidirectionally from leader to followers, emphasize reliability in fault-tolerant environments.

Initially, Raft gained traction through a controlled user study involving 43 students, where 33 participants scored higher on a quiz about Raft than on one about Paxos, and participants made fewer errors when implementing Raft, highlighting its pedagogical advantages. Its intuitive structure led to rapid adoption in academia for teaching distributed systems and in industry, with over 100 open-source implementations powering tools like etcd, HashiCorp Consul, and TiKV.

Core Algorithm

Leader Election Process

In Raft, servers operate in one of three states: follower, candidate, or leader. Followers remain passive, responding to requests from candidates and leaders but initiating no actions themselves. Candidates actively seek election, while leaders handle all client interactions and replicate log entries to followers. State transitions are primarily triggered by timeouts and remote procedure calls (RPCs). A follower converts to a candidate upon expiration of its election timeout, a candidate becomes a leader upon securing a majority of votes, and both candidates and leaders revert to followers if they receive an RPC from a server in a higher term.

The election process begins when a follower detects no valid communication—neither an AppendEntries RPC from a leader nor a granted vote—from another server within its randomized election timeout period, typically between 150 and 300 milliseconds. Upon timeout, the follower immediately transitions to the candidate state, increments its current term, casts a vote for itself, and issues RequestVote RPCs to all other servers in the cluster. The RequestVote RPC includes the candidate's term, ID, last log index, and last log term to allow recipients to verify that the candidate's log is at least as up-to-date as their own. A receiving server grants its vote if the RPC's term is at least as high as its own, it has not yet voted in the current term, and its log is not more up-to-date than the candidate's.

A candidate wins the election and assumes leadership if it obtains votes from a strict majority of servers in the cluster during its term. This rule ensures at most one leader per term, as votes are granted only once per term and ties are resolved by the higher term number. If no candidate achieves a majority—due to split votes or network partitions—the election fails, and followers will time out again, starting a new term with fresh candidates.

Once elected, the leader maintains its authority by periodically sending heartbeat messages to followers, implemented as AppendEntries RPCs containing no log entries. These heartbeats, sent at intervals shorter than the election timeout (e.g., every 10–50 milliseconds), reset followers' election timers and prevent them from initiating new elections. The leader steps down to the follower state if it receives any RPC (RequestVote or AppendEntries) bearing a term higher than its own, allowing a new leader to emerge.

To minimize the risk of split votes, where multiple candidates receive partial support and no majority forms, Raft employs randomized election timeouts. The uniform random distribution of these timeouts (150–300 ms) ensures that, in a healthy cluster, servers' timeouts are unlikely to align, allowing one candidate to typically gather a majority before others start elections. In cases of split votes, the short timeout range enables rapid resolution through subsequent elections, usually within one to two additional timeouts.
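
The voter's side of these rules can be condensed into a short Go sketch; voterState and grantVote are illustrative names, not taken from any specific library:

```go
package raft

// voterState is the minimal per-server state the voting rules touch.
type voterState struct {
	currentTerm  int
	votedFor     int // -1 when no vote has been cast this term
	lastLogIndex int
	lastLogTerm  int
}

// grantVote applies the rules described above: reject stale terms, vote
// at most once per term, and refuse candidates whose log is less
// up-to-date than our own.
func (v *voterState) grantVote(term, candidateID, lastLogIndex, lastLogTerm int) bool {
	if term < v.currentTerm {
		return false // stale candidate
	}
	if term > v.currentTerm {
		v.currentTerm, v.votedFor = term, -1 // newer term: reset our vote
	}
	if v.votedFor != -1 && v.votedFor != candidateID {
		return false // already voted for someone else this term
	}
	// Up-to-date check: later last term wins; ties go to the longer log.
	if lastLogTerm < v.lastLogTerm ||
		(lastLogTerm == v.lastLogTerm && lastLogIndex < v.lastLogIndex) {
		return false
	}
	v.votedFor = candidateID
	return true
}
```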

Log Replication Mechanism

In Raft, the replicated log is a central data structure that stores a sequence of state machine commands, ensuring that all servers in the cluster eventually apply the same commands in the same order. Each log entry contains three fields: an index denoting its position in the log, a term number indicating the leader's term when the entry was created, and the actual client command to be executed. The term field in each entry serves as a critical consistency check, allowing servers to detect and resolve discrepancies between logs without relying on physical clocks.

When a client submits a command, it is first received by the current leader, which appends the command as a new entry to the end of its log without immediately applying it to its state machine. The leader then issues AppendEntries remote procedure calls (RPCs) in parallel to all follower servers to replicate the new entry. Each AppendEntries RPC includes the new log entry, the index and term of the log entry immediately preceding the new one (prevLogIndex and prevLogTerm), the leader's commit index, and the leader's current term. The prevLogIndex and prevLogTerm allow followers to verify that their logs are consistent with the leader's up to the point of replication.

Upon receiving an AppendEntries RPC, a follower first checks whether its log's entry at prevLogIndex matches the provided prevLogTerm; if it does not, or if prevLogIndex exceeds the follower's log length, the follower rejects the RPC and responds with a failure, prompting the leader to decrement its nextIndex for that follower and retry with an earlier log position. If the consistency check passes, the follower appends any new entries from the RPC that do not already exist in its log, overwriting any conflicting entries beyond prevLogIndex that share an index but carry a different term. The follower then responds with success if the append succeeds, including its new matchIndex (the highest log index it has replicated from the leader), or failure otherwise. This process ensures that logs remain consistent through the log matching property, where all committed entries share the same index and term across servers.

An entry becomes committed once it is replicated to a majority of servers in the cluster, at which point the leader can safely apply it to its local state machine. The leader maintains a commitIndex, which it advances to the highest log index N such that the entry at N is from the current term and a majority of followers have a matchIndex of at least N (specifically, if there exists an N > commitIndex where log[N].term == currentTerm and a majority of matchIndex values are >= N, then set commitIndex = N). Upon updating commitIndex, the leader applies all entries up to this index to its state machine in sequence and notifies the client of the commitment. Followers similarly apply committed entries to their state machines once they learn of the leader's commitIndex through subsequent AppendEntries RPCs, ensuring ordered execution without gaps.

To maintain log integrity, leaders adhere to an append-only rule: they never overwrite or delete entries in their own logs, only appending new ones, which preserves the consistency of previously committed entries even across leader changes. The leader tracks a nextIndex value for each follower, initialized to the follower's log length plus one, and decrements it upon receiving a rejection to resend earlier entries until consistency is restored. This mechanism allows efficient replication while handling network partitions or failures, with the leader periodically sending heartbeat AppendEntries RPCs (carrying no entries) to maintain its authority.
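
The follower-side handling described above might be sketched as follows in Go. This is a simplification that assumes Go 1.21+ for the built-in min; all types and names are illustrative:

```go
package raft

// Entry is one log record: the term that created it plus a command.
type Entry struct {
	Term    int
	Command string
}

// handleAppendEntries sketches the follower-side checks: verify
// prevLogIndex/prevLogTerm, overwrite any conflicting suffix, append
// the new entries, and advance the local commit index.
func handleAppendEntries(log []Entry, prevLogIndex, prevLogTerm int,
	entries []Entry, leaderCommit, commitIndex int) ([]Entry, int, bool) {
	// Consistency check: our entry at prevLogIndex must match the leader's.
	if prevLogIndex >= 0 && (prevLogIndex >= len(log) || log[prevLogIndex].Term != prevLogTerm) {
		return log, commitIndex, false // leader backs up nextIndex and retries
	}
	// Drop any conflicting suffix and adopt the leader's entries.
	log = append(log[:prevLogIndex+1], entries...)
	// Learn commitment from the leader, bounded by what we actually hold.
	if leaderCommit > commitIndex {
		commitIndex = min(leaderCommit, len(log)-1)
	}
	return log, commitIndex, true
}
```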

Client Interaction and Command Processing

In Raft, clients interact with the distributed system by submitting commands to any server in the cluster, without needing prior knowledge of the current leader. If a client sends a command to a non-leader server, such as a follower, the server responds with a redirect message containing the network address of the known leader, obtained from recent AppendEntries RPCs. This redirection mechanism ensures efficient routing while minimizing client complexity.

Upon receiving a command from a client, the leader first validates the request and appends it as a new entry to its replicated log. The leader then replicates this entry to its followers using the AppendEntries RPC, as described in the log replication process. Once the entry is safely replicated and committed—meaning it has been appended to a majority of servers' logs—the leader applies it to its state machine and responds to the client with the result. To support idempotent execution and prevent duplicate processing, clients typically include unique serial numbers or identifiers with their commands, which the leader's state machine uses to track and discard duplicates.

If the leader fails during command processing, the client may experience a timeout and will retry the command by sending it to a randomly selected server in the cluster. Once a new leader is elected, the client's retries ensure the command is reprocessed correctly, leveraging the idempotency mechanisms to avoid inconsistencies. This retry approach handles leader changes transparently from the client's perspective.

For read-only operations, the leader can provide immediate responses to ensure linearizability by first checking that its term remains current (i.e., no higher term has been observed) and, if necessary, committing a no-op entry to the log to confirm authority through heartbeats with a majority of followers. Alternatively, to achieve linearizable reads without log writes, the leader may exchange heartbeats with a majority of the cluster to confirm its status before responding. Clients encounter error responses primarily through timeouts when the leader is unavailable or unresponsive, prompting retries, or via explicit redirects from non-leaders when contacting the wrong server. These mechanisms maintain availability while guiding clients to the appropriate processing node.
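
The duplicate-suppression idea can be sketched as a state machine that caches the last result per client. A minimal illustration; the types and the set-style command are assumptions, not a prescribed interface:

```go
package raft

// stateMachine remembers the highest serial number applied per client
// and the result of that command, so retries are answered from cache.
type stateMachine struct {
	data       map[string]string
	lastSerial map[int]int    // clientID -> highest serial applied
	lastResult map[int]string // clientID -> result of that command
}

func newStateMachine() *stateMachine {
	return &stateMachine{
		data:       map[string]string{},
		lastSerial: map[int]int{},
		lastResult: map[int]string{},
	}
}

// applyCommand executes an illustrative "set key = value" command at
// most once per (clientID, serial) pair.
func (sm *stateMachine) applyCommand(clientID, serial int, key, value string) string {
	if serial <= sm.lastSerial[clientID] {
		return sm.lastResult[clientID] // duplicate retry: do not re-execute
	}
	sm.data[key] = value
	sm.lastSerial[clientID] = serial
	sm.lastResult[clientID] = "OK"
	return "OK"
}
```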

Safety Properties

Leader Election Safety

Raft ensures that at most one leader is elected per term through its voting mechanism, which requires a candidate to secure a majority of votes from the cluster's servers. This majority quorum property guarantees the uniqueness of the leader because any two quorums in a cluster of N servers (where N is typically odd and greater than or equal to 3) must intersect in at least one server. If two candidates were to both claim leadership in the same term, that overlapping server would have cast votes for both, which is impossible under Raft's rules: each server votes for at most one candidate per term and increments the term number upon receiving a higher-term request, invalidating prior votes.

A proof sketch of this invariant proceeds by contradiction. Suppose two leaders, L1 and L2, exist in term T. Each must have received votes from a quorum, Q1 and Q2 respectively. Since |Q1 ∩ Q2| ≥ 1, there exists a server S in both quorums. Server S would have granted a vote to both L1 and L2 in term T, but Raft prohibits revoting within the same term and requires votes only for candidates with logs at least as current as the voter's. Thus, no such dual leadership is possible, preserving system consistency.

Furthermore, Raft enforces that an elected leader's log contains all committed entries from prior terms. During the RequestVote RPC, a voting server compares its log with the candidate's: it grants a vote only if the candidate's log is at least as up-to-date as its own, meaning the candidate's last log term is greater than the voter's, or the terms are equal and the candidate's last log index is at least as large as the voter's. This check, combined with the majority vote, ensures the new leader has a complete and valid history, preventing incomplete leaders from taking over and maintaining causal consistency across terms.

Elections may result in brief periods without a leader, during which the cluster cannot process new commands, but these intervals are minimized by randomized election timeouts. Typically ranging from 150 to 300 milliseconds, these timeouts stagger candidate starts, reducing the likelihood of prolonged splits and ensuring rapid convergence to a single leader under normal network conditions with no partitions. The cluster only advances its state machine when a leader is active, underscoring the safety of these transient gaps.
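
The pigeonhole step of this argument can be written out explicitly for a cluster of N servers with majority quorums:

```latex
|Q_1| \ge \left\lceil \tfrac{N+1}{2} \right\rceil, \qquad
|Q_2| \ge \left\lceil \tfrac{N+1}{2} \right\rceil
\;\Longrightarrow\;
|Q_1 \cap Q_2| = |Q_1| + |Q_2| - |Q_1 \cup Q_2| \ge (N+1) - N = 1.
```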

Log Replication Safety

Raft's log replication safety guarantees that all servers' logs remain consistent, ensuring that state machines across the cluster apply the same commands in the same order, even in the presence of failures. This is achieved through a set of invariants that prevent log divergences and enforce ordered application of committed entries. The core mechanisms include the log matching property, leader completeness, and specific commitment rules, all supported by append-only log structures and term-based consistency checks.

The log matching property states that if two logs contain an entry with the same index and term, then they are identical in all entries up to that index. This property is enforced during log replication via the AppendEntries RPC, which includes checks for the previous log index and term; if these do not match, the follower rejects the entry, preventing inconsistencies from propagating. As a result, logs cannot diverge in prior entries once they share a common point, maintaining a linear history of commands.

Leader completeness ensures that any leader elected after a given term will include all committed entries from previous terms in its log. This is indirectly supported by the leader election process, which favors candidates with more up-to-date logs based on their last log index and term. Consequently, no committed entry is ever lost, as future leaders inherit the full committed prefix of the log.

The commitment rule specifies that a log entry is committed once it has been replicated to a majority of servers by the leader that created it. For safety, leaders only commit entries from their own term (advancing the commit index only when a majority acknowledges an entry in the current term), while prior committed entries are carried forward via the log matching property. This prevents committing entries from prior terms that might not have been majority-replicated, avoiding scenarios where uncommitted entries could be applied.

State machine safety follows from these properties: once an entry is committed, it will never be overwritten or applied out of order on any server. Servers apply log entries to their state machines strictly in log order and only after commitment, ensuring that all state machines reflect the same sequence of commands. This invariant guarantees a consistent outcome for committed operations, as no server can apply a different entry at the same index, due to the log matching and leader completeness properties.

These safety guarantees rely on fundamental proof elements, including the append-only nature of logs, where leaders never overwrite or delete existing entries, and term checks in the AppendEntries RPC that reject any attempt to append inconsistent entries. Together, these mechanisms ensure that log replication preserves a consistent, ordered history across the cluster without requiring complex reconciliation.
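
The current-term commitment restriction translates into a short rule on the leader, sketched here with illustrative names (logTerms[i] is the term of the leader's entry at index i; matchIndex[f] is the highest index known to be replicated on follower f):

```go
package raft

// advanceCommitIndex looks for the highest index replicated on a
// majority whose entry was created in the leader's own term, per the
// commitment rule described above.
func advanceCommitIndex(logTerms []int, matchIndex []int, commitIndex, currentTerm int) int {
	for n := len(logTerms) - 1; n > commitIndex; n-- {
		if logTerms[n] != currentTerm {
			continue // only current-term entries may be committed directly
		}
		replicas := 1 // the leader itself stores entry n
		for _, m := range matchIndex {
			if m >= n {
				replicas++
			}
		}
		if replicas > (len(matchIndex)+1)/2 {
			return n // committing n implicitly commits every earlier entry
		}
	}
	return commitIndex
}
```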

Failure Handling and Availability

Raft is designed to tolerate a variety of failures, including node crashes and network partitions, while ensuring availability through its quorum-based mechanisms. The algorithm assumes a non-Byzantine environment where nodes may crash and recover but do not behave maliciously, and it relies on eventual message delivery rather than strict synchrony. This approach allows Raft to maintain progress as long as a majority of nodes in the cluster are operational and connected, tolerating up to f failures in a cluster of 2f+1 nodes.

When a follower crashes, the leader detects the failure through missed heartbeat responses, as followers are expected to acknowledge periodic AppendEntries RPCs. The leader continues replicating logs to the remaining followers that form a majority, ensuring that the system remains available without interruption as long as the quorum is intact. Upon recovery, the crashed follower contacts the leader and catches up by requesting log entries starting from its nextIndex value, which the leader maintains to track the follower's replication progress; this incremental catch-up process minimizes disruption and restores full participation without violating log consistency.

Leader crashes introduce a brief period of unavailability, during which followers detect the absence of heartbeats and initiate a new election by incrementing the term and sending RequestVote RPCs. The election process is expedited by random election timeouts (typically 150-300 ms) to reduce the likelihood of multiple simultaneous candidates, allowing a new leader to be elected quickly once a majority responds with votes. During this transition, no new entries are committed, but Raft's safety properties preserve the correctness of previously committed logs, ensuring that the system resumes operation from a consistent state.

Network partitions are handled by Raft's quorum requirements, which prevent progress in minority partitions. If a partition splits the cluster such that no majority subgroup can form, no leader can be elected in the minority side, as candidates require votes from a majority to win. When the network heals and the majority partition reconnects with previously isolated nodes, the elected leader in the majority resumes sending heartbeats and replicating logs to the rejoined followers, restoring full cluster availability without needing reconfiguration.

Raft operates under partially synchronous timing assumptions, in which messages may be delayed arbitrarily but are eventually delivered; it does not require synchronized clocks or precise timing bounds for correctness, though election timeouts must be longer than typical message delays to avoid unnecessary elections. This non-Byzantine model keeps the system available despite temporary asynchrony, without halting progress indefinitely.

For liveness, Raft guarantees eventual leader election under a stable network where a majority of nodes can communicate reliably, as the increasing term numbers prevent infinite loops and the randomized timeouts ensure that ties are broken efficiently. This mechanism avoids leader starvation, as each term provides a fresh opportunity for election, and once a stable leader is chosen, the cluster sustains continuous operation barring further failures.

Advanced Topics

Cluster Membership Changes

Changing the membership of a cluster, such as adding or removing servers, poses significant challenges to maintaining consistency and availability, as abrupt changes to the cluster size could allow non-overlapping majorities in old and new configurations to make conflicting decisions, potentially violating safety properties. To address this, Raft employs a joint consensus method that ensures overlapping majorities during transitions, preventing any unilateral actions by either configuration.

The process unfolds in two phases, treating configuration changes as special entries replicated through the log. First, the leader proposes a transition to a joint configuration (Cold,new), which combines the old configuration (Cold) and the new one (Cnew). This entry must be committed by a majority in both Cold and Cnew, ensuring that the old configuration acknowledges the change while the new servers have caught up sufficiently. Once committed, the cluster operates under joint consensus, where entries require approval from majorities in both configurations to prevent splits. In the second phase, the leader proposes the final Cnew entry, which commits only after receiving a majority from the new configuration, fully transitioning the cluster.

For single-node changes, adding a server begins by replicating existing logs to it as a non-voting follower until it is up to date (see the sketch after this section), after which the leader initiates the joint consensus reconfiguration to promote it to a full voter. Removing a server involves proposing a new configuration that excludes it; once committed, the removed server is ignored in future quorums, and if it attempts to lead, the Leader Completeness Property ensures it only does so with a committed configuration matching the cluster's. This stepwise approach allows complex changes by composing multiple single-node operations.

Raft's design guarantees safety by ensuring there is no point at which Cold and Cnew can both make independent decisions, as the joint phase enforces overlapping majorities, and uncommitted configurations cannot lead or commit entries. This mechanism avoids availability gaps during reconfiguration while preserving the algorithm's consistency invariants, even under concurrent failures, provided the new configuration maintains a quorum.
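
The catch-up phase for a new server is commonly implemented as rounds of replication, promoting the server once a round completes quickly. A hypothetical Go sketch; replicateRound and the round budget are assumptions, not a prescribed interface:

```go
package raft

import "time"

// promoteWhenCaughtUp replicates the leader's log suffix to the new
// server in rounds and reports it ready for promotion once a full
// round finishes within an election timeout, i.e. the learner is close
// enough to the leader's log to join as a voter safely.
func promoteWhenCaughtUp(replicateRound func(), electionTimeout time.Duration, maxRounds int) bool {
	for i := 0; i < maxRounds; i++ {
		start := time.Now()
		replicateRound() // ship the current log suffix and wait for the ack
		if time.Since(start) < electionTimeout {
			return true // caught up: safe to run the reconfiguration
		}
	}
	return false // still lagging after maxRounds: abort the change
}
```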

Log Compaction and Cleanup

In distributed systems using Raft, replicated logs can grow indefinitely as client commands are appended and committed, leading to unbounded storage consumption and potential performance degradation during server restarts or recovery. To address this, Raft employs log compaction through a mechanism called snapshotting, where servers periodically capture the current state of the replicated state machine along with committed log entries to stable storage, allowing the discard of obsolete log prefixes. This process ensures efficient storage management while preserving the ability to reconstruct the state machine from snapshots rather than replaying entire logs.

Snapshotting in Raft is performed independently by each server when its log reaches a configurable size threshold, focusing solely on committed entries to maintain safety. The snapshot encapsulates the state machine's output after applying all entries up to the last applied index, along with essential metadata such as the last included index and term from the log. This metadata enables consistency checks during subsequent log replication via AppendEntries RPCs. For full-state snapshots—the basic approach in the core algorithm—the entire state machine is serialized; incremental snapshots, which update only changes since the previous snapshot, are possible in extensions but not part of the fundamental protocol. Once created, the server can safely delete log entries and any prior snapshots up to the last included index, effectively cleaning up storage.

To propagate snapshots to followers, particularly those lagging behind due to failures or network partitions, the leader initiates the InstallSnapshot RPC. This RPC transmits the snapshot data in configurable chunks, including the metadata, to the follower, which then replaces the prefix of its log up to the last included index with the snapshot and truncates any conflicting or unnecessary preceding entries. Upon receiving all chunks, the follower resets its state machine by replaying the snapshot and resumes normal log replication from the new log base. This installation process ensures followers can catch up without replaying the full history, adhering to Raft's safety guarantees as snapshots are only created from committed logs.

The primary benefits of log compaction and cleanup in Raft include bounded disk usage, which prevents storage exhaustion in long-running clusters, and accelerated recovery times during server restarts or state transfers, as the state machine can be restored directly from the most recent snapshot rather than processing potentially millions of log entries. By integrating snapshotting seamlessly with log replication, Raft maintains high availability without compromising consistency, making it suitable for practical deployments where log retention is a concern.
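
The follower's installation logic reduces to a choice between keeping a suffix and discarding the whole log, as in this hypothetical sketch (Entry and Snapshot follow the metadata named in the text; restore is an assumed application hook):

```go
package raft

// Entry is one log record; Snapshot carries the last included
// index/term plus the serialized state machine.
type Entry struct {
	Term    int
	Command string
}

type Snapshot struct {
	LastIncludedIndex int
	LastIncludedTerm  int
	State             []byte
}

// installSnapshot applies a received snapshot: if the local log extends
// past the snapshot and agrees with it, only the covered prefix is
// dropped; otherwise the snapshot supersedes the entire log.
func installSnapshot(log []Entry, snap Snapshot, restore func(state []byte)) []Entry {
	last := snap.LastIncludedIndex
	if last < len(log) && log[last].Term == snap.LastIncludedTerm {
		// Keep only the entries that follow the snapshot.
		return append([]Entry(nil), log[last+1:]...)
	}
	// Conflicting or shorter log: rebuild the state machine from the
	// snapshot and discard everything.
	restore(snap.State)
	return nil
}
```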

Implementations and Applications

Open-Source Implementations

etcd, first released in June 2013, adopted Raft in version 3.0 (June 2016) for leader election and log replication, and supports cluster membership changes through dynamic reconfiguration. It is a distributed key-value store implemented in Go that serves as the core component for Kubernetes, utilizing its own consensus library to ensure data replication and consistency across nodes. It includes built-in log compaction mechanisms to manage storage efficiently by generating snapshots and purging old log entries, preventing unbounded growth in the replicated log. As of May 2025, etcd v3.6.0 introduces enhancements for better stability and scalability in Kubernetes environments.

Consul, developed by HashiCorp in Go, is a service discovery and configuration tool that employs a dedicated Raft implementation for its key-value store to maintain replicated state across servers. This setup enables Consul to handle service registration, health checks, and configuration propagation reliably in distributed environments. A key feature is its support for multi-datacenter deployments, where Raft ensures intra-datacenter consensus while WAN gossip protocols facilitate federation across datacenters for global service discovery.

TiKV, the distributed transactional key-value storage engine for the TiDB database, is implemented in Rust and leverages Raft to replicate data across regions for high availability and consistency. Designed for large-scale deployments handling petabyte-level data, TiKV optimizes for performance through techniques like multi-Raft groups, where each shard (region) operates as an independent replica set to enable horizontal scaling. It integrates closely with the Placement Driver (PD) component, which uses Raft itself to manage metadata and direct region placement, scheduling, and load balancing across the cluster.

Dragonboat is an embeddable, high-performance multi-group Raft consensus library written in Go, emphasizing throughput and low latency for applications requiring replicated state machines. It supports multiple concurrent Raft groups on the same nodes, allowing efficient resource sharing while isolating consensus operations. Benchmarks demonstrate its capability, achieving up to 9 million writes per second for 16-byte payloads on a three-node cluster, with optimizations like pipelined log replication and custom transport layers to maximize I/O throughput.

Many open-source Raft implementations share common configurable parameters, such as election and heartbeat timeouts, to adapt to varying network conditions and cluster sizes. They often expose metrics for monitoring leader stability, replication lag, and throughput via integrations like Prometheus. Snapshot formats vary, with etcd using a custom binary format for efficient persistence, while others like Dragonboat support pluggable snapshotters for flexibility in state persistence.

Production Deployments and Case Studies

Raft has been widely adopted in production environments for ensuring fault tolerance and consistency in distributed systems. In Kubernetes, etcd serves as the primary data store for the API server's state, managing cluster configuration, node registrations, and pod scheduling across potentially thousands of nodes. Raft's leader election and log replication mechanisms in etcd provide fault tolerance, allowing the cluster to remain operational as long as a majority of etcd nodes are available, even during node failures or network partitions. This setup has enabled Kubernetes to scale reliably in large-scale deployments, such as those running in cloud providers supporting multi-region operations.

CockroachDB, a distributed SQL database, employs Raft independently for each key range to replicate data and maintain transactions across geographically distributed nodes. By applying Raft per range, CockroachDB achieves geo-replication, where data is synchronously or asynchronously mirrored across regions to minimize latency while preserving consistency. In production, this approach supports resilient workloads, such as applications requiring low-latency queries over global datasets, with Raft handling range splits and merges to balance load.

InfluxDB, a time-series database, utilized Raft in its Enterprise v1.x clustering (as of 2021) for coordinating meta nodes that manage shared cluster information, including databases and retention policies. This enabled high availability for write-heavy workloads, where Raft ensured that metadata updates were consistently replicated across an odd number of meta nodes (typically three) to form a quorum. In InfluxDB 3.0 and later, coordination in multi-node clusters is achieved through shared object storage (e.g., AWS S3) without Raft. In production setups, such as monitoring systems processing millions of metrics, the earlier Raft-based coordination sustained high write throughput without compromising consistency.

Deploying Raft in production introduces challenges, particularly in tuning parameters for varying network conditions. Election timeouts must be carefully adjusted—often to 150-300 milliseconds in low-latency environments but extended for high-latency geo-distributed setups—to prevent unnecessary leader elections that could degrade availability. Monitoring for split-brain risks is essential, as network partitions can lead to repeated leader elections; Raft mitigates this through randomized timeouts and majority quorums, but operators must implement alerts for prolonged partitions to avoid availability issues.

Performance in production Raft deployments varies by hardware, network, and workload but typically achieves 10,000 to 100,000 operations per second in sustained benchmarks. For instance, etcd in Kubernetes clusters delivers around 10,000 writes per second on standard hardware, while optimized multi-node setups can exceed 1 million operations per second. Case studies demonstrate high reliability, with systems like etcd maintaining 99.99% uptime through Raft's fault-tolerant design, even under partial failures.
