Step 1

题目摘要引言

Title

There Is More Consensus in Egalitarian Parliaments

平等议会产生更多共识，题目设计的很有意思。

相较于 Paxos 其他变体的特征

availability without interruption as long as a simple majority of replicas are reachable---its availability is not interrupted when replicas crash or fail to respond;
uniform load balancing across all replicas---no replicas experience higher load because they have special roles; and （负载在全局环境下是均衡的，不会有机器因特殊身份而负载过量）
optimal commit latency in the wide-area when tolerating one and two failures, under realistic conditions.

Egalitarian Paxos is to our knowledge the first distributed consensus protocol to achieve all of these goals efficiently: requiring only a simple majority of replicas to be non-faulty, using a number of messages linear in the number of replicas to choose a command, and committing commands after just one communication round (one round trip) in the common case or after at most two rounds in any case.

Introduction

Distributed computing 对 Replication protocol 的需求

集群内拷贝的高吞吐量 high throughput for replication in-side a computing cluster;
跨数据中心的低延时 low latency for repli-cation across data centers

在无故障操作的情况下，会出现大量 clients 同单一 leader 通信的情况。"Multi-Paxos"的优化方案，是实际系统常用的优化手段，但是更换 leader 会带来额外的共识机制，大大降低了性能。

分析这种机制的 limitation

leader 的额外通信信息可能成为 bottleneck 从而限制了系统的 scalability
geo-replication 情况下，client 和远处的 leader 通信需要额外的开销
single-master，在前一个 master 失效的情况下，必须要等后一个 master 被选举，否则功能就是失效的。换言之，必须经过 master election 这个过程

Egalitarian Paxos (EPaxos) 没有指定的 leader 选举的过程，相对而言 client 可以任意的选择一个 replica 去传递信息，EPaxos 这种设计使得负载完全平摊到所有的机器中，也能更好地处理暂时、永久的缓慢节点，以及由于 geo-replication 引起的延迟 gap。

另外 high availability 体现在没有 leader（leader 选举过程）带来的服务中断。只要有多数的机器保持服务就行。

Section 2 和 Section 3 是进行传统 Paxos 及其变体的分析。EPaxos 的比较对象

Multi-Paxos
Mencius
Generalized Paxos

Section 7 是集中比较，比较说明 Mencius 在节点同构时是高效的，而 EPaxos 在更复杂的现实环境下，比如 wide-area replication、 failures 以及不同节点性能差异较大时，体现出整体更好的吞吐量。

基本理论概况

结论部分

给出了 EPaxos 是的设计与实现，一种新的状态机复制协议。证明其 decentralized 非中心化以及 uncoordinated 去协调设计对于设计实现 geo-replicated（wide area replication）且要求保证高可用、高性能的应用有理论与实践价值。

回答基本问题

类别

原型系统描述
内容

与 Raft、Paxos 关系更近，暂时没有理清和 Explicit Consistency、RedBlue Consistency 等的关系。

知乎上阿里技术的简要介绍
正确性

需要跑代码测试
创新点

从 leader/master 特殊身份的传统思维跳出
清晰度

清晰

阅读选择

继续阅读

Step 2

细读笔记

Paxos Background

状态机复制（state machine replication）的任务：Make a set of possibly faulty distributed processors execute the same commands in the same order.

\(N\) is the total number of replicas, at least \(\lfloor N/2 \rfloor+1\) must be non-faulty.

All Paxos variants handle non-Byzantine failures. Paxos 讨论的情形都是 fail-stop 型的故障，而非 Byzantine 型的故障。

状态机的执行是一系列 pre-ordered instatnce 序列。相似的，Raft 的叫 log？

在收到 client 的命令请求后，一个 replica 会通过向至少大多数 replica 发送Prepare消息，来尝试成为尚未使用的 instance 的领导者。关于Prepare消息的应答，可能是已经对该 instance 进行了选举（这种情形下不采用新提出的命令）。如果一个潜在的 leader 得到了二分之一多数的 Accept，那么它将根据这些接收消息，将提议发送这些二分之一多数结点。如果这些消息也被大多数 replica 接受了，那么 leader 回在本地却让人，并异步的通知其他 replica 和 client。

2 round trips 问题：传统的方式，都需要经过一轮对 leader 的 accept+一轮对 instance/log 的 commit？

Section 3 继续探讨其他 Paoxs 的变种对原有 Paxos 的改进

Egalitarian Paxos Overview

再度陈述，EPaxos 主要的优化目标：

optimal commit latency in the wide area (due to geo-replication)
optimal load balancing across all replicas, to achieve high throughput(remove the single leader canonical mode), and
graceful performance degradation when some replicas become slow or crash(keep fault tolerance)

A novel way to order commands:

先前的两种思路

order commands by having a single stable leader choose the order (as in Multi-Paxos and Generalized Paxos) 通过单一稳定的领导者来决定顺序
order commands by assignning them to instances(command slots) in a pre-ordered instance space (as in canonical Paxos and Mencius) 将命令分配到预先定好顺序的 slot 中

EPaxos 的方式： dynamic and decentralized

In contrast,EPaxos orders the instances dynamically and in a decentralized fashion: in the process of choosing (i.e., voting) acommand in an instance, each participant attaches order-ing constraints to that command 在实例中选择（即投票）命令的过程中，每个参与者将排序约束附加到该命令

这样看就是从一排全局 slot，像是变成了每个 replicate 一排自己的 slot + constraints，如何加入 constraints 很重要，这个需要关注一下代码实现。

order 的方式（很重要，据文章所讲这是 EPaxos 优势的源头）：

首先，commit command 取决于任何可能的多副本的输入，（不需要像 Multi-Paxos 一样要求 stable leader 必须包含在内，也不需要像 Mencius 需求来自全部 replica 的消息）。如此改善了大范围跨度的提交延迟、可用性以及性能的鲁棒性，因为这样把高最快的 replica 与最慢的 replica 分离开。
其次，propose command，可以从任何一个副本发出，而不是一个特别的 proposer 或者 leader，这有益于负载均衡，并提升吞吐量。

In taking this ordering approach, EPaxos must maintain safety and provide a linearizable ordering of commands. EPaxos 必须保证安全性，以及这些指令的可线性。

一些保证正确性与优化手段。

当 commands interfere 时，会接受彼此的“依赖”，用于产出正确的执行序列。

q Figure 1 展示了不同 Replica 之间，是否 interface 的不同情况。注意在 \(R3\) 和 \(R2/R4\) 相比较，只有 \(R3\) 先接受了 \(C4\)，后接受了 \(C3\) 所以返回了依赖（\(C3 \rightarrow C4\)要求）。

Multi-Paxos：由于建立了单一的 stable leader，每一项 command 的消息交互量，leader 是 non-leader 的 N 倍，从而容易变为影响系统可用性的 bottleneck，一些 Paxos 的应用实例（比如 Chubby）也证明类似设计当 leader 故障时，系统会暂时处于不可用状态。

This problem is not easily solved:aggressive leader re-election can cause stalls if multiplereplicas believe they are the leader. Chubby [4] and Box-wood [22] use Multi-Paxos, while ZooKeeper [12] relieson a stable leader protocol similar to Multi-Paxos.
Generalized Paxos：允许在不 interface 的情况下，乱序提交 commits。仍要求有一个 stable leader 来处理更多的信息。EPaxos 和 Generalized Paxos 相比优势，fast-path 需要沟通的 replica -1，slow path 只需要额外一轮（而不是两轮）

Detail

Preliminaries

预先的定义：

在 clients 和 replicas 间传递消息是异步的。
故障是 non-Byzantine 的，\(N = 2 F+1\)， \(F\) 表示最大容错数量

系统的整体状态包含每个 replica 拥有的序列 instances，可看作 N 行（对应 N 个 replica）的二维表格。与 Multi-Paxos 不同，instance 的序并不是预先定义的（pre-determined）是根据协议动态确定的。

需要区分指令的 commit 和 execute。要修改状态，client 向选定的 replica 发送 \(Request(command)\)。\(RequestReply\) 会通知 replica 指令的 commit 的情况，但是并没有是否执行的信息。仅在有客户机要读取先前提交的更新命令的时候，系统才有必要执行这些指令。要读取（部分）状态，client 发送\(Read(objectIDs)\)的消息并等待回复，是个 no-op 操作，相当于\(RequestAndRead(no-op,objectIDs)\)。

需要定义 interference，command interference：存在一个指令序列\(\Sigma\)使得\(\Sigma, \delta,\gamma\)执行结果（状态机状态不同，或返回不同内容）与\(\Sigma,\gamma,\delta\)

Protocol Guarantees

形式化保证，与其他 Paxos 变体类似

Nontriviality（非平凡）： replica commit 的命令必然由 client 提交
Stability（稳定性）：已提交的命令在任何时候都是之后时间提交命令的子集。更进一步，如果在时间点\(t_1\)，一个 replica \(R\) 具有命令\(\gamma\)提交在实例\(Q\)。则\(R\) 在任意 \(t_2 > t_1\)都有命令\(\gamma\)提交在实例\(Q\)。
Consistency：两个 replica 永远不会在相同 instance 有不同的提交命令。
Execution consistency: 如果两个 interfer 指令 \(\gamma\) 和 \(\delta\) 都被成功提交（被任意 replicas）（注意我们刚刚区分了 commit 不同于 execute），则可以保证在每一 replica 中，两个指令都以相同序被执行。
Execution linearizability：如果两个 interfer 指令 \(\gamma\) 和 \(\delta\) 被 replica 串行化/序列化过，那么保证在每个 replica，一个命令都会在另一个命令之前执行。
Liveness (w/ high probability)：指令最终会被每个正常的 replica 提交，只要过半数地 replicas 是正常的，并且信息在 time out 都传递到

需要区分一下 Execution Consistency 和 Execution linearizability

一个是如果都被 commit，则保证在任一 replica 上执行序是确定的。另一是如果被 client 序列化，则在任一 replica 上执行序也是固定的

Basic Commit Protocol

先介绍基础的，下一小节再介绍其升级版。

基础版本：

quorum: We usequorumto denote both a set of replicas with a particularcardinality, and the cardinality of that set. 使用 quorum 来表示一组具有特定基数的副本，以及该组的基数

basic EPaxos: fast quorum \(2F\) （out of \(2F+1\))
optimized EPaxos: fast quorum only \(F+ \lfloor \frac{F+1}{2} \rfloor\)

slow-path quorum size is always \(F+1\).

The Commit Protocol

正如前述，EPaxos 中命令的提交和执行是分离的。

EPaxos 包含两部分

用于确定（选择） commit commands 以及附加的 order constraint(attributes) 协议
基于这些 order 要求，确定实际 execute order 的算法

针对 1，commit protocol 被分割成了许多 phases，但并不是所有指令都会执行全部 phases

fast-path: 如果一个在经过 phase 1 就已经提交，则称为 fast-path slow-path: 涉及附加得 phase 2 过程（Paxos accept phase）

问题记录

未读（且值得读）文献记录

Step 3

思路复现

证明与推理复现

Design

实验验证复现

Original Presentation

State Machine Replication

Why Paxos popular?

fast protocol
No external failure detector required
Fast fail-over (high availability)

Overview of Paxos

Cononical:

2 RTTs per slot

Take ownership of a slot (who leader's the command)
Propose command

Multi-Paxos

1 RTT(omit the leader election)

EPaxos intuition

linear command slot space -> subspace with ownership to different replicas

attach ordering constraints

线性化的保证

两个性质

图、强连通分量

Q: sorting the dependencies seem quite expensive, so how do you keep this precedure small and how does this impact the request latency.

A: not expensive. It's just an algorithm of finding strong-connective component that's linear in the number of edges. In practice the number of edges similar to the number of nodes (not too many). Depends on the interface rate, but says even the interface rate is small, these graphs never gets too large.

Notes

code: 关注到 UpdateAttributes

test

Research Paper

三步法 Paper

本博客所有文章除特别声明外，均采用 CC BY-SA 4.0 协议，转载请注明出处！

如何更好的参与or完成一场ReadingGroup 上一篇

OSDI12——Making-Geo-Replicated-Systems-Fast-as-Possible 下一篇

SOSP13——There-Is-More-Consensus-in-Egalitarian-Parliamens