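Implementing Linearizable Reads and Priority-Based Election

The concept of linearizable reads and the two ways to implement them in Raft

Two ways to implement linearizable reads in Raft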

ReadIndex Read
Lease Read

ReadIndex Read

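The first is ReadIndex Read. When the Leader needs to handle a read request, it exchanges heartbeats with a majority of the nodes to confirm that it is still the Leader, after which it can serve a linearizable read:

The Leader records the current commitIndex of its log in a local variable, ReadIndex;

It then sends a round of heartbeats to the Followers. If more than half of the nodes return the corresponding Heartbeat Response, the Leader knows that it is still the Leader;

The Leader waits for its state machine to apply entries at least up to the log position recorded in ReadIndex, that is, until applyIndex reaches or passes ReadIndex. From that point it can safely serve a linearizable read, regardless of whether leadership has moved elsewhere by the time the read executes;

The Leader executes the read request and returns the result to the client.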
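The four steps above can be sketched as a small single-threaded model. This is only an illustration under stated assumptions: the class ReadIndexLeaderSketch, its fields, and the idea of modeling each Follower as a BooleanSupplier are hypothetical and are not taken from any Raft library.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

/** Hypothetical, single-threaded model of a Leader serving one ReadIndex read. */
class ReadIndexLeaderSketch {
    long commitIndex; // highest log index known to be committed
    long applyIndex;  // highest log index applied to the state machine
    // Each supplier models one heartbeat round trip to a Follower; true means it acknowledged.
    final List<BooleanSupplier> followers;

    ReadIndexLeaderSketch(long commitIndex, long applyIndex, List<BooleanSupplier> followers) {
        this.commitIndex = commitIndex;
        this.applyIndex = applyIndex;
        this.followers = followers;
    }

    /** Returns the query result, or empty if leadership could not be confirmed. */
    <T> Optional<T> linearizableRead(Supplier<T> query) {
        // Step 1: record the current commitIndex as the ReadIndex.
        final long readIndex = commitIndex;

        // Step 2: one heartbeat round; the Leader counts itself toward the majority.
        int acks = 1;
        for (BooleanSupplier follower : followers) {
            if (follower.getAsBoolean()) {
                acks++;
            }
        }
        final int majority = (followers.size() + 1) / 2 + 1;
        if (acks < majority) {
            return Optional.empty(); // cannot prove leadership, so refuse the read
        }

        // Step 3: wait until the state machine has applied everything up to ReadIndex.
        while (applyIndex < readIndex) {
            applyNextEntry();
        }

        // Step 4: execute the read against the caught-up state machine.
        return Optional.of(query.get());
    }

    /** Stand-in for the state machine applying one committed log entry. */
    void applyNextEntry() {
        applyIndex++;
    }
}
```

In a three-node cluster, one Follower acknowledgement plus the Leader itself is already a majority, so the read proceeds; with no acknowledgements the sketch refuses the read rather than risk returning stale data.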

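ReadIndex Read also makes Follower Read straightforward, so linearizable reads can easily be served from Follower nodes. After a Follower receives a read request:

The Follower asks the Leader for the latest ReadIndex;

The Leader runs the same procedure as before, executing the first three steps above (to confirm that it really is the Leader), and returns the ReadIndex to the Follower;

The Follower waits until the applyIndex of its own state machine reaches or passes ReadIndex;

The Follower executes the read request and returns the result to the client.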
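The Follower side of this flow can be sketched in the same style. The class FollowerReadSketch and the askLeaderForReadIndex supplier, which stands in for the RPC to the Leader, are hypothetical names introduced only for this illustration.

```java
import java.util.Optional;
import java.util.OptionalLong;
import java.util.function.Supplier;

/** Hypothetical model of a Follower serving a read with a ReadIndex obtained from the Leader. */
class FollowerReadSketch {
    long applyIndex; // highest log index applied to this Follower's state machine
    // Models the RPC to the Leader; empty means the Leader could not confirm its leadership.
    final Supplier<OptionalLong> askLeaderForReadIndex;

    FollowerReadSketch(long applyIndex, Supplier<OptionalLong> askLeaderForReadIndex) {
        this.applyIndex = applyIndex;
        this.askLeaderForReadIndex = askLeaderForReadIndex;
    }

    /** Returns the local query result, or empty if no ReadIndex could be obtained. */
    <T> Optional<T> read(Supplier<T> localQuery) {
        // Steps 1 and 2: ask the Leader, which confirms leadership before answering.
        final OptionalLong readIndex = askLeaderForReadIndex.get();
        if (!readIndex.isPresent()) {
            return Optional.empty();
        }

        // Step 3: wait until the local state machine has applied up to ReadIndex.
        while (applyIndex < readIndex.getAsLong()) {
            applyNextReplicatedEntry();
        }

        // Step 4: serve the read from the local state machine.
        return Optional.of(localQuery.get());
    }

    /** Stand-in for applying one entry that has already been replicated to this Follower. */
    void applyNextReplicatedEntry() {
        applyIndex++;
    }
}
```

The only work that stays on the Leader is the leadership check; the query itself runs on the Follower, which is what lets read load spread across the cluster.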

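Unlike a read that goes through the Raft Log, ReadIndex Read uses heartbeats to let the Leader confirm its leadership and skips the Raft Log path entirely. Compared with the Raft Log approach, ReadIndex Read avoids the disk cost of appending a log entry and can substantially improve throughput. It still incurs network overhead, but heartbeat messages are small, so performance remains very good.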

Lease Read

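Although ReadIndex Read is much faster than a Raft Log read, it still pays the network cost of a heartbeat round, so a further optimization is worth considering. The Raft paper mentions a Lease Read optimization based on clocks plus heartbeats: when the Leader sends a heartbeat, it first records a timestamp Start. Once a majority of the nodes have replied with a Heartbeat Response, Raft's election mechanism guarantees that a Follower waits for an Election Timeout before starting a new election, so the next Leader cannot be elected before Start + Election Timeout / Clock Drift Bound. The Leader's lease can therefore be considered valid until Start + Election Timeout / Clock Drift Bound. Lease Read is similar to ReadIndex but goes one step further: it skips not only the log but also the network round trip, which greatly improves read throughput and noticeably reduces latency.

The basic idea of Lease Read is that the Leader takes a lease shorter than the Election Timeout (ideally an order of magnitude shorter). No election can happen within the lease, so the Leader cannot change, and step 2 of ReadIndex can be skipped, which lowers latency. The correctness of Lease Read is therefore tied to time and depends on the accuracy of local clocks. Lease Read is very efficient, but it carries a risk: it assumes that every server's clock is accurate, or that any error stays within a very small bound. If clock drift is severe and the clocks on different servers tick at different rates, the lease mechanism can break.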

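A Lease Read implementation consists of:

Periodic heartbeats that receive responses from a majority, confirming that the Leader is still valid;

Within the lease, the current Leader can be considered the only valid Leader in the Raft Group, so the heartbeat confirmation in step (2) of ReadIndex can be skipped;

The Leader waits for its state machine to catch up until applyIndex reaches or passes ReadIndex, at which point it can safely serve a linearizable read.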
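The lease bound Start + Election Timeout / Clock Drift Bound can be sketched as follows. This is a minimal model, not the logic of any particular library: the class LeaderLeaseSketch, its method names, and the injected millisecond clock are all assumptions made for illustration, and clockDriftBound is assumed to be a factor of at least 1.0.

```java
import java.util.function.LongSupplier;

/** Hypothetical lease tracker for a Leader, driven by an injected clock. */
class LeaderLeaseSketch {
    private final long electionTimeoutMs;
    private final double clockDriftBound; // assumed >= 1.0; larger values shorten the lease
    private final LongSupplier clockMs;   // injected clock in milliseconds, ideally monotonic
    private boolean hasLease;
    private long leaseStartMs;

    LeaderLeaseSketch(long electionTimeoutMs, double clockDriftBound, LongSupplier clockMs) {
        this.electionTimeoutMs = electionTimeoutMs;
        this.clockDriftBound = clockDriftBound;
        this.clockMs = clockMs;
    }

    /** Call just before sending a heartbeat round; returns the Start timestamp. */
    long beginHeartbeatRound() {
        return clockMs.getAsLong();
    }

    /** Call once a majority has acknowledged the round that began at startMs. */
    void onMajorityAcked(long startMs) {
        // Only move the lease forward; a late ack for an older round must not shorten it.
        if (!hasLease || startMs > leaseStartMs) {
            leaseStartMs = startMs;
            hasLease = true;
        }
    }

    /** Start + Election Timeout / Clock Drift Bound. */
    long leaseExpiryMs() {
        return leaseStartMs + (long) (electionTimeoutMs / clockDriftBound);
    }

    /** While this returns true, the heartbeat confirmation step can be skipped. */
    boolean isLeaseValid() {
        return hasLease && clockMs.getAsLong() < leaseExpiryMs();
    }
}
```

Note that the lease is measured from the moment the heartbeat round was sent, not from when the acknowledgements arrived; measuring from the later moment would overestimate how long the Followers are guaranteed to stay quiet.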

The two sections above are quoted from https://developer.aliyun.com/article/707092

How sofa-jraft implements linearizable reads

sofa-jraft supports both ReadIndex Read and Lease Read. Which one is used depends on configuration; the default is ReadIndex Read.

The entry point is com.alipay.sofa.jraft.core.NodeImpl.handleReadIndexRequest(ReadIndexRequest, RpcResponseClosure)

Key code in readLeader:

// com.alipay.sofa.jraft.core.NodeImpl.readLeader(ReadIndexRequest, Builder, RpcResponseClosure<ReadIndexResponse>)
ReadOnlyOption readOnlyOpt = this.raftOptions.getReadOnlyOptions();
// isLeaderLeaseValid() checks whether the leader's lease is still valid; if it is not, fall back to ReadOnlySafe.
if (readOnlyOpt == ReadOnlyOption.ReadOnlyLeaseBased && !isLeaderLeaseValid()) {
  // If leader lease timeout, we must change option to ReadOnlySafe
  readOnlyOpt = ReadOnlyOption.ReadOnlySafe;
}

switch (readOnlyOpt) { // branch on the effective read-only option
  case ReadOnlySafe:
    final List<PeerId> peers = this.conf.getConf().getPeers();
    Requires.requireTrue(peers != null && !peers.isEmpty(), "Empty peers");
    // quorum is defined earlier in readLeader, outside this excerpt
    final ReadIndexHeartbeatResponseClosure heartbeatDone = new ReadIndexHeartbeatResponseClosure(closure,
        respBuilder, quorum, peers.size());
    // Send heartbeat requests to followers
    for (final PeerId peer : peers) {
      if (peer.equals(this.serverId)) {
        continue;
      }
      // In ReadOnlySafe mode, heartbeats are sent to the Followers to confirm that the Leader is still valid.
      this.replicatorGroup.sendHeartbeat(peer, heartbeatDone);
    }
    break;
  case ReadOnlyLeaseBased: // ReadOnlyLeaseBased mode responds immediately, without a heartbeat round
    // Responses to followers and local node.
    respBuilder.setSuccess(true);
    closure.setResponse(respBuilder.build());
    closure.run(Status.OK());
    break;
}
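The excerpt passes quorum and peers.size() to the heartbeat closure, which suggests that the closure counts responses until the outcome is decided. The sketch below shows one way such counting could work. It is a hypothetical stand-in, not the source of ReadIndexHeartbeatResponseClosure: the class HeartbeatQuorumSketch, its callback, and the failure threshold are assumptions.

```java
import java.util.function.Consumer;

/** Hypothetical quorum counter for one round of ReadIndex heartbeats. */
class HeartbeatQuorumSketch {
    private final int quorum;        // majority of the whole group, Leader included
    private final int failThreshold; // Follower failures after which quorum is unreachable
    private final Consumer<Boolean> onDone; // receives true if leadership was confirmed
    private int ackSuccess;
    private int ackFailures;
    private boolean done;

    HeartbeatQuorumSketch(int peersCount, Consumer<Boolean> onDone) {
        this.quorum = peersCount / 2 + 1;
        // With this many failed Followers, the rest plus the Leader cannot form a majority.
        this.failThreshold = peersCount - quorum + 1;
        this.onDone = onDone;
    }

    /** Invoked once per Follower response; safe to call from several RPC threads. */
    synchronized void onHeartbeatResponse(boolean success) {
        if (done) {
            return; // outcome already decided; ignore late responses
        }
        if (success) {
            ackSuccess++;
        } else {
            ackFailures++;
        }
        // The Leader counts as one implicit success.
        if (ackSuccess + 1 >= quorum) {
            done = true;
            onDone.accept(true);
        } else if (ackFailures >= failThreshold) {
            done = true;
            onDone.accept(false);
        }
    }
}
```

Deciding as soon as the result is known, in either direction, means the read is not held up by the slowest Follower.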

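Priority-Based Election

Suppose the servers in a Raft cluster have different hardware specifications. Business users expect the Leader to always run on the most powerful server, so that clients get better read and write performance, but Raft's standard "randomized election timeout" mechanism cannot satisfy this requirement;

Because split votes can occur, the Candidates in the cluster start a new election in the next round. During this short window the cluster has no Leader and cannot serve reads or writes to clients, so business users need some other way to avoid the impact of this brief unavailability;

SOFAJRaft provides a priority-based, semi-deterministic election mechanism. The implementation is not very complicated and lives mainly in the following two methods.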

com.alipay.sofa.jraft.core.NodeImpl.allowLaunchElection()

com.alipay.sofa.jraft.core.NodeImpl.decayTargetPriority()
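The general idea behind a priority-gated election can be sketched as follows. This is a deliberately simplified illustration and should not be read as the body of the two SOFAJRaft methods named above: only the method names are borrowed, while the class PriorityElectionSketch, its fields, the fixed decay gap, and the rule of decaying on every refused timeout are assumptions.

```java
/** Hypothetical, simplified model of priority-gated election on a single node. */
class PriorityElectionSketch {
    final int priority; // this node's configured priority; higher means preferred as Leader
    int targetPriority; // a node may campaign only if its priority reaches this value
    final int decayGap; // assumed fixed amount by which the target is lowered

    PriorityElectionSketch(int priority, int targetPriority, int decayGap) {
        this.priority = priority;
        this.targetPriority = targetPriority;
        this.decayGap = decayGap;
    }

    /** Called when the election timer fires; returns whether this node may start an election. */
    boolean allowLaunchElection() {
        if (priority >= targetPriority) {
            return true;
        }
        // Not eligible yet: lower the bar so the group cannot stay leaderless forever
        // when the preferred node is down.
        decayTargetPriority();
        return false;
    }

    /** Lowers the target priority, never below 1. */
    void decayTargetPriority() {
        targetPriority = Math.max(1, targetPriority - decayGap);
    }
}
```

Under these assumptions the highest-priority node campaigns at its first timeout, while lower-priority nodes sit out a few rounds and become eligible only if no Leader has appeared in the meantime. That is what makes the outcome mostly, though not strictly, deterministic.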