How Raft Works
Raft is a simple consensus algorithm that is designed by these principles:
- Dividing a problem into separate pieces that could be solved easily.
- Majority acceptance to avoid inconsistency.
- Simple logic controls safety.
The most valuable thing in Raft is not only the algorithm itself, but the method of dividing problems. In this article, I will try to be an imposter to imitate this technology.
In a distributed system, the simplest way for keeping data consistent is to use a strong leader and duplicate every writing operation to other servers. The operation can be ordered by time as log, and appending each log entry forward. The leader has permission for the order of appending, the followers copy the same order of log entry in its own. For further simplicity, after the log entries have been committed by majority servers, it is never overwrote. We just need to consider two problems:
Who can become the leader?
When the leader crashed, how to handle inconsistent states?
Raft election is like the real election, it elected a leader through the majority of acceptance and defined three roles for each server:
Follower
Candidate
Leader
Each role are exclusion, the transference is arranged for serving election well:
- Every server initialized to Follower.
- Follower can be transferred into Candidate.
- Candidate can be elected Leader, or translate into Follower when election fails.
- Leader can only become Follower when it crashes.
When a follower receives no communication with leader over a period of time, the follower becomes a candidate and starts an election to request the voter. If the candidate gets the majority of voters, it will become the new leader. If one election term has elected two leaders, reopen a new election again. For preventing this situation, the re-election timeout has been set randomly.
The data writing is separated into two phrases. The one is sync the log entry, the other is the state written. When the leader receives a request from a client, it first sends the Appendentries to followers. After most majority servers have received the entries, the leader then notifies followers to apply the log entry. And then the majority of servers have applied the log entry and produced the same result, which means the log has been committed safely. Finally, the leader responded to the client for telling the writing operation has been committed successfully.
Majority acceptance is the key in Raft system. The 2PC also has the same majority acceptance:
- Ensuring the majority of servers has received the log entries.
- Ensuring the majority of servers has committed the state written.
If a leader crashes, a new election will start. In this timing, a follower become a candidate has a key restriction:
It must has the same committed log entries as the crashed leader. Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries.
If a follower crashes, duplicating the leader's log for keeping consistent.
A follower may receive the appendentries RPC from crashed leader and then vote for another candidate. This situation is a trouble, Raft use logic proved this situation does not happen, because it has two contradictions:
The candidate's log is shorter than follower.
The candidate's log is longer than follower.
These two situations nerver exist. A follower can be transferred to candidate based on its committed log must equal as leaders, and it never be shorter or longer than leader's.
Raft use these restrictions to avoid accidents:
The leader determined the order of appending.
Log entry just has the appending operation only.
The committed log cannot be overwrote.
Election timeout is randomized, to prevent two leaders winning the election.
Follower -> Candidate -> Leader -> Follower. These transfers let the leader election become easy.
Majarity acceptance(win election, commit log, become a candidate)
Follower's log can be forced to duplicate from leader when conflict occurred.
2PC committing and each of them must be accepted by majority of servers.
A candidate must have the committed log equalled as leader's.