From commits-return-7928-archive-asf-public=cust-asf.ponee.io@zookeeper.apache.org Wed Aug 28 21:03:24 2019 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [207.244.88.153]) by mx-eu-01.ponee.io (Postfix) with SMTP id B6FB4180181 for ; Wed, 28 Aug 2019 23:03:23 +0200 (CEST) Received: (qmail 59582 invoked by uid 500); 28 Aug 2019 21:03:23 -0000 Mailing-List: contact commits-help@zookeeper.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@zookeeper.apache.org Delivered-To: mailing list commits@zookeeper.apache.org Received: (qmail 59570 invoked by uid 99); 28 Aug 2019 21:03:23 -0000 Received: from ec2-52-202-80-70.compute-1.amazonaws.com (HELO gitbox.apache.org) (52.202.80.70) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 28 Aug 2019 21:03:23 +0000 Received: by gitbox.apache.org (ASF Mail Server at gitbox.apache.org, from userid 33) id EF7B386003; Wed, 28 Aug 2019 21:03:22 +0000 (UTC) Date: Wed, 28 Aug 2019 21:03:22 +0000 To: "commits@zookeeper.apache.org" Subject: [zookeeper] branch master updated: ZOOKEEPER-3522: Consistency guarantees discussion. MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Message-ID: <156702620284.26880.1061882977274613445@gitbox.apache.org> From: hanm@apache.org X-Git-Host: gitbox.apache.org X-Git-Repo: zookeeper X-Git-Refname: refs/heads/master X-Git-Reftype: branch X-Git-Oldrev: 151003722e4f1a30aed0b686890fb35adcaf1f2d X-Git-Newrev: 650aea33eddef9054c90ca47183cde0e58cabcd9 X-Git-Rev: 650aea33eddef9054c90ca47183cde0e58cabcd9 X-Git-NotificationType: ref_changed_plus_diff X-Git-Multimail-Version: 1.5.dev Auto-Submitted: auto-generated This is an automated email from the ASF dual-hosted git repository. hanm pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/zookeeper.git The following commit(s) were added to refs/heads/master by this push: new 650aea3 ZOOKEEPER-3522: Consistency guarantees discussion. 650aea3 is described below commit 650aea33eddef9054c90ca47183cde0e58cabcd9 Author: Karolos Antoniadis AuthorDate: Wed Aug 28 14:03:12 2019 -0700 ZOOKEEPER-3522: Consistency guarantees discussion. There seems to be some confusion regarding the exact consistency guarantees that ZooKeeper provides. For example, does it provide linearizable reads? After the recent discussion in the [dev mailing list](https://mail-archives.apache.org/mod_mbox/zookeeper-dev/201908.mbox/), I tried to clarify this in the ZooKeeper documentation. Author: Karolos Antoniadis Reviewers: Alexander Shraer , maoling , Michael Han Closes #1063 from insumity/ZOOKEEPER-3522 --- .../main/resources/markdown/zookeeperInternals.md | 86 +++++++++++++--------- 1 file changed, 52 insertions(+), 34 deletions(-) diff --git a/zookeeper-docs/src/main/resources/markdown/zookeeperInternals.md b/zookeeper-docs/src/main/resources/markdown/zookeeperInternals.md index a3a5673..25d9be8 100644 --- a/zookeeper-docs/src/main/resources/markdown/zookeeperInternals.md +++ b/zookeeper-docs/src/main/resources/markdown/zookeeperInternals.md @@ -23,6 +23,7 @@ limitations under the License. * [Active Messaging](#sc_activeMessaging) * [Summary](#sc_summary) * [Comparisons](#sc_comparisons) +* [Consistency Guarantees](#sc_consistency) * [Quorums](#sc_quorum) * [Logging](#sc_logging) * [Developer Guidelines](#sc_developerGuidelines) @@ -34,9 +35,11 @@ limitations under the License. ## Introduction This document contains information on the inner workings of ZooKeeper. -So far, it discusses these topics: +It discusses the following topics: * [Atomic Broadcast](#sc_atomicBroadcast) +* [Consistency Guarantees](#sc_consistency) +* [Quorums](#sc_quorum) * [Logging](#sc_logging) @@ -52,18 +55,17 @@ At the heart of ZooKeeper is an atomic messaging system that keeps all of the se The specific guarantees provided by the messaging system used by ZooKeeper are the following: * *_Reliable delivery_* : - If a message, m, is delivered - by one server, it will be eventually delivered by all servers. + If a message `m`, is delivered + by one server, message `m` will be eventually delivered by all servers. * *_Total order_* : - If a message is - delivered before message b by one server, a will be delivered before b by all - servers. If a and b are delivered messages, either a will be delivered before b - or b will be delivered before a. + If a message `a` is + delivered before message `b` by one server, message `a` will be delivered before `b` by all + servers. * *_Causal order_* : - If a message b is sent after a message a has been delivered by the sender of b, - a must be ordered before b. If a sender sends c after sending b, c must be ordered after b. + If a message `b` is sent after a message `a` has been delivered by the sender of `b`, + message `a` must be ordered before `b`. If a sender sends `c` after sending `b`, `c` must be ordered after `b`. The ZooKeeper messaging system also needs to be efficient, reliable, and easy to implement and maintain. We make heavy use of messaging, so we need the system to @@ -80,29 +82,29 @@ lose or reorder messages, our assumption of FIFO channels is very practical given that we use TCP for communication. Specifically we rely on the following property of TCP: * *_Ordered delivery_* : - Data is delivered in the same order it is sent and a message m is - delivered only after all messages sent before m have been delivered. - (The corollary to this is that if message m is lost all messages after m will be lost.) + Data is delivered in the same order it is sent and a message `m` is + delivered only after all messages sent before `m` have been delivered. + (The corollary to this is that if message `m` is lost all messages after `m` will be lost.) * *_No message after close_* : Once a FIFO channel is closed, no messages will be received from it. FLP proved that consensus cannot be achieved in asynchronous distributed systems -if failures are possible. To ensure we achieve consensus in the presence of failures -we use timeouts. However, we rely on times for liveness not for correctness. So, -if timeouts stop working (clocks malfunction for example) the messaging system may +if failures are possible. To ensure that we achieve consensus in the presence of failures +we use timeouts. However, we rely on time for liveness not for correctness. So, +if timeouts stop working (e.g., skewed clocks) the messaging system may hang, but it will not violate its guarantees. When describing the ZooKeeper messaging protocol we will talk of packets, proposals, and messages: * *_Packet_* : - a sequence of bytes sent through a FIFO channel + a sequence of bytes sent through a FIFO channel. * *_Proposal_* : a unit of agreement. Proposals are agreed upon by exchanging packets with a quorum of ZooKeeper servers. Most proposals contain messages, however the - NEW_LEADER proposal is an example of a proposal that does not correspond to a message. + NEW_LEADER proposal is an example of a proposal that does not contain to a message. * *_Message_* : a sequence of bytes to be atomically broadcast to all ZooKeeper @@ -121,7 +123,7 @@ n is the number of servers that make up a ZooKeeper service. The zxid has two parts: the epoch and a counter. In our implementation the zxid is a 64-bit number. We use the high order 32-bits for the epoch and the low order -32-bits for the counter. Because it has two parts represent the zxid both as a +32-bits for the counter. Because zxid consists of two parts, zxid can be represented both as a number and as a pair of integers, (_epoch, count_). The epoch number represents a change in leadership. Each time a new leader comes into power it will have its own epoch number. We have a simple algorithm to assign a unique zxid to a proposal: @@ -146,18 +148,15 @@ up with the leader, they have the same state. This state consists of all of the proposals that the leader believes have been committed and the proposal to follow the leader, the NEW_LEADER proposal. (Hopefully you are thinking to yourself, _Does the set of proposals that the leader believes has been committed -included all the proposals that really have been committed?_ The answer is _yes_. +include all the proposals that really have been committed?_ The answer is _yes_. Below, we make clear why.) ### Leader Activation -Leader activation includes leader election. We currently have two leader election -algorithms in ZooKeeper: LeaderElection and FastLeaderElection (AuthFastLeaderElection -is a variant of FastLeaderElection that uses UDP and allows servers to perform a simple -form of authentication to avoid IP spoofing). ZooKeeper messaging doesn't care about the -exact method of electing a leader as long as the following holds: +Leader activation includes leader election (`FastLeaderElection`). +ZooKeeper messaging doesn't care about the exact method of electing a leader as long as the following holds: * The leader has seen the highest zxid of all the followers. * A quorum of servers have committed to following the leader. @@ -170,18 +169,18 @@ we will recover by abandoning leader activation and running another election. After leader election a single server will be designated as a leader and start waiting for followers to connect. The rest of the servers will try to connect to -the leader. The leader will sync up with followers by sending any proposals they +the leader. The leader will sync up with the followers by sending any proposals they are missing, or if a follower is missing too many proposals, it will send a full snapshot of the state to the follower. -There is a corner case in which a follower that has proposals, U, not seen -by a leader arrives. Proposals are seen in order, so the proposals of U will have a zxids +There is a corner case in which a follower that has proposals, `U`, not seen +by a leader arrives. Proposals are seen in order, so the proposals of `U` will have a zxids higher than zxids seen by the leader. The follower must have arrived after the leader election, otherwise the follower would have been elected leader given that it has seen a higher zxid. Since committed proposals must be seen by a quorum of -servers, and a quorum of servers that elected the leader did not see U, the proposals -of you have not been committed, so they can be discarded. When the follower connects -to the leader, the leader will tell the follower to discard U. +servers, and a quorum of servers that elected the leader did not see `U`, the proposals +of `U` have not been committed, so they can be discarded. When the follower connects +to the leader, the leader will tell the follower to discard `U`. A new leader establishes a zxid to start using for new proposals by getting the epoch, e, of the highest zxid it has seen and setting the next zxid to use to be @@ -226,9 +225,9 @@ the following operating constraints are observed: received. Because we use FIFO channels this means that followers also receive proposals in order. * Followers process messages in the order they are received. This means that messages will be ACKed in order and the leader will receive ACKs from - followers in order, due to the FIFO channels. It also means that if message $m$ + followers in order, due to the FIFO channels. It also means that if message `m` has been written to non-volatile storage, all messages that were proposed before - $m$ have been written to non-volatile storage. + `m` have been written to non-volatile storage. * The leader will issue a COMMIT to all followers as soon as a quorum of followers have ACKed a message. Since messages are ACKed in order, COMMITs will be sent by the leader as received by the followers in order. @@ -267,6 +266,26 @@ all packets, it all falls apart. Also, our leader activation phase is different both of them. In particular, our use of epochs allows us to skip blocks of uncommitted proposals and to not worry about duplicate proposals for a given zxid. + + + +## Consistency Guarantees + +ZooKeeper [consistency](https://jepsen.io/consistency) guarantees lie between sequential consistency and linearizabiliy. Here, we explain the exact consistency guarantees that ZooKepeer provides. + +Write operations in ZooKeeper are linearizabile. In other words, each write appears to take effect atomically at some point between its invocation and its response. This means that the writes performed by all the clients in ZooKeeper can be totally ordered in such a way that respects the real-time ordering of these writes. However, note that just stating that writes are linearizable is meaningless unless we also talk about read operations. + +Read operations in ZooKeeper are not linearizable since they can return potentially stale data. This occurs since a read in ZooKeeper is not a quorum operation and a server responds immediately to a client that is performing a read. +Nevertheless, ZooKeeper makes this choice because it chooses performance in the trade-off between performance and consistency. ZooKeeper read operations are sequentially-consistent, since read operations appear to take effect in some sequential order that furthermore respects the order of each client's operations. +If a client wants to read the freshest data, it is generally assumed that the client should first perform a sync operation, and then a read. +However, even with a sync before a read operation, a client might retrieve stale data. +This can occur because `sync` is [not a quorum operation](https://issues.apache.org/jira/browse/ZOOKEEPER-1675). Such a scenario might appear if two servers think that they are the leaders at the same time, which may occur if the time it takes for a TCP connection to drop is smaller than `syncLimit * tickTime`, something that is [unlikely](https://www.amazon.com/ZooKeeper-Distributed-Coordination-Flavio-Junqueira/dp/1449361307) to occur in practice. + + +This raises the question on what are the exact consistency guarantees of ZooKeeper? +Formally, the ZooKeeper consistency guarantees are captured by the notion of [ordered sequential consistency](http://webee.technion.ac.il/people/idish/ftp/OSC-IPL17.pdf) or `OSC(U)` to be exact, that lies between sequential consistency and linearizability. +Finally, note that the current version of ZooKeeper can provide linearizability for both reads and writes, if every read is preceded by a write to some dummy znode. + ## Quorums @@ -313,8 +332,7 @@ of the [ZooKeeper Administrator's Guide.](zookeeperAdmin.html) ### Developer Guidelines Please follow the [slf4j manual](http://www.slf4j.org/manual.html) when creating log statements within code. -Also read the[FAQ on performance](http://www.slf4j.org/faq.html#logging\_performance) -, when creating log statements. Patch reviewers will look for the following: +Also read the [FAQ on performance](http://www.slf4j.org/faq.html#logging\_performance), when creating log statements. Patch reviewers will look for the following: