zookeeper-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From h...@apache.org
Subject [zookeeper] branch master updated: ZOOKEEPER-3522: Consistency guarantees discussion.
Date Wed, 28 Aug 2019 21:03:22 GMT
This is an automated email from the ASF dual-hosted git repository.

hanm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/zookeeper.git


The following commit(s) were added to refs/heads/master by this push:
     new 650aea3  ZOOKEEPER-3522: Consistency guarantees discussion.
650aea3 is described below

commit 650aea33eddef9054c90ca47183cde0e58cabcd9
Author: Karolos Antoniadis <karolos@gmail.com>
AuthorDate: Wed Aug 28 14:03:12 2019 -0700

    ZOOKEEPER-3522: Consistency guarantees discussion.
    
    There seems to be some confusion regarding the exact consistency guarantees that ZooKeeper
provides. For example, does it provide linearizable reads?
    
    After the recent discussion in the [dev mailing list](https://mail-archives.apache.org/mod_mbox/zookeeper-dev/201908.mbox/<CAO%3DK-y1r4FYEtDsQyeVc5poPw6EnzVbMDzTgMv9tcrggMX8AbQ%40mail.gmail.com>),
I tried to clarify this in the ZooKeeper documentation.
    
    Author: Karolos Antoniadis <karolos@gmail.com>
    
    Reviewers: Alexander Shraer <shralex@gmail.com>, maoling <maoling199210191@sina.com>,
Michael Han <hanm@apache.org>
    
    Closes #1063 from insumity/ZOOKEEPER-3522
---
 .../main/resources/markdown/zookeeperInternals.md  | 86 +++++++++++++---------
 1 file changed, 52 insertions(+), 34 deletions(-)

diff --git a/zookeeper-docs/src/main/resources/markdown/zookeeperInternals.md b/zookeeper-docs/src/main/resources/markdown/zookeeperInternals.md
index a3a5673..25d9be8 100644
--- a/zookeeper-docs/src/main/resources/markdown/zookeeperInternals.md
+++ b/zookeeper-docs/src/main/resources/markdown/zookeeperInternals.md
@@ -23,6 +23,7 @@ limitations under the License.
     * [Active Messaging](#sc_activeMessaging)
     * [Summary](#sc_summary)
     * [Comparisons](#sc_comparisons)
+* [Consistency Guarantees](#sc_consistency)
 * [Quorums](#sc_quorum)
 * [Logging](#sc_logging)
     * [Developer Guidelines](#sc_developerGuidelines)
@@ -34,9 +35,11 @@ limitations under the License.
 ## Introduction
 
 This document contains information on the inner workings of ZooKeeper.
-So far, it discusses these topics:
+It discusses the following topics:
 
 * [Atomic Broadcast](#sc_atomicBroadcast)
+* [Consistency Guarantees](#sc_consistency)
+* [Quorums](#sc_quorum)
 * [Logging](#sc_logging)
 
 <a name="sc_atomicBroadcast"></a>
@@ -52,18 +55,17 @@ At the heart of ZooKeeper is an atomic messaging system that keeps all
of the se
 The specific guarantees provided by the messaging system used by ZooKeeper are the following:
 
 * *_Reliable delivery_* :
-    If a message, m, is delivered
-    by one server, it will be eventually delivered by all servers.
+    If a message `m`, is delivered
+    by one server, message `m` will be eventually delivered by all servers.
 
 * *_Total order_* :
-    If a message is
-    delivered before message b by one server, a will be delivered before b by all
-    servers. If a and b are delivered messages, either a will be delivered before b
-    or b will be delivered before a.
+    If a message `a` is
+    delivered before message `b` by one server, message `a` will be delivered before `b`
by all
+    servers.
 
 * *_Causal order_* :
-    If a message b is sent after a message a has been delivered by the sender of b,
-    a must be ordered before b. If a sender sends c after sending b, c must be ordered after
b.
+    If a message `b` is sent after a message `a` has been delivered by the sender of `b`,
+    message `a` must be ordered before `b`. If a sender sends `c` after sending `b`, `c`
must be ordered after `b`.
 
 The ZooKeeper messaging system also needs to be efficient, reliable, and easy to
 implement and maintain. We make heavy use of messaging, so we need the system to
@@ -80,29 +82,29 @@ lose or reorder messages, our assumption of FIFO channels is very practical
 given that we use TCP for communication. Specifically we rely on the following property of
TCP:
 
 * *_Ordered delivery_* :
-    Data is delivered in the same order it is sent and a message m is
-    delivered only after all messages sent before m have been delivered.
-    (The corollary to this is that if message m is lost all messages after m will be lost.)
+    Data is delivered in the same order it is sent and a message `m` is
+    delivered only after all messages sent before `m` have been delivered.
+    (The corollary to this is that if message `m` is lost all messages after `m` will be
lost.)
 
 * *_No message after close_* :
     Once a FIFO channel is closed, no messages will be received from it.
 
 FLP proved that consensus cannot be achieved in asynchronous distributed systems
-if failures are possible. To ensure we achieve consensus in the presence of failures
-we use timeouts. However, we rely on times for liveness not for correctness. So,
-if timeouts stop working (clocks malfunction for example) the messaging system may
+if failures are possible. To ensure that we achieve consensus in the presence of failures
+we use timeouts. However, we rely on time for liveness not for correctness. So,
+if timeouts stop working (e.g., skewed clocks) the messaging system may
 hang, but it will not violate its guarantees.
 
 When describing the ZooKeeper messaging protocol we will talk of packets,
 proposals, and messages:
 
 * *_Packet_* :
-    a sequence of bytes sent through a FIFO channel
+    a sequence of bytes sent through a FIFO channel.
 
 * *_Proposal_* :
     a unit of agreement. Proposals are agreed upon by exchanging packets
     with a quorum of ZooKeeper servers. Most proposals contain messages, however the
-    NEW_LEADER proposal is an example of a proposal that does not correspond to a message.
+    NEW_LEADER proposal is an example of a proposal that does not contain to a message.
 
 * *_Message_* :
     a sequence of bytes to be atomically broadcast to all ZooKeeper
@@ -121,7 +123,7 @@ n is the number of servers that make up a ZooKeeper service.
 
 The zxid has two parts: the epoch and a counter. In our implementation the zxid
 is a 64-bit number. We use the high order 32-bits for the epoch and the low order
-32-bits for the counter. Because it has two parts represent the zxid both as a
+32-bits for the counter. Because zxid consists of two parts, zxid can be represented both
as a
 number and as a pair of integers, (_epoch, count_). The epoch number represents a
 change in leadership. Each time a new leader comes into power it will have its
 own epoch number. We have a simple algorithm to assign a unique zxid to a proposal:
@@ -146,18 +148,15 @@ up with the leader, they have the same state. This state consists of
all of the
 proposals that the leader believes have been committed and the proposal to follow
 the leader, the NEW_LEADER proposal. (Hopefully you are thinking to
 yourself, _Does the set of proposals that the leader believes has been committed
-included all the proposals that really have been committed?_ The answer is _yes_.
+include all the proposals that really have been committed?_ The answer is _yes_.
 Below, we make clear why.)
 
 <a name="sc_leaderElection"></a>
 
 ### Leader Activation
 
-Leader activation includes leader election. We currently have two leader election
-algorithms in ZooKeeper: LeaderElection and FastLeaderElection (AuthFastLeaderElection
-is a variant of FastLeaderElection that uses UDP and allows servers to perform a simple
-form of authentication to avoid IP spoofing). ZooKeeper messaging doesn't care about the
-exact method of electing a leader as long as the following holds:
+Leader activation includes leader election (`FastLeaderElection`).
+ZooKeeper messaging doesn't care about the exact method of electing a leader as long as the
following holds:
 
 * The leader has seen the highest zxid of all the followers.
 * A quorum of servers have committed to following the leader.
@@ -170,18 +169,18 @@ we will recover by abandoning leader activation and running another
election.
 
 After leader election a single server will be designated as a leader and start
 waiting for followers to connect. The rest of the servers will try to connect to
-the leader. The leader will sync up with followers by sending any proposals they
+the leader. The leader will sync up with the followers by sending any proposals they
 are missing, or if a follower is missing too many proposals, it will send a full
 snapshot of the state to the follower.
 
-There is a corner case in which a follower that has proposals, U, not seen
-by a leader arrives. Proposals are seen in order, so the proposals of U will have a zxids
+There is a corner case in which a follower that has proposals, `U`, not seen
+by a leader arrives. Proposals are seen in order, so the proposals of `U` will have a zxids
 higher than zxids seen by the leader. The follower must have arrived after the
 leader election, otherwise the follower would have been elected leader given that
 it has seen a higher zxid. Since committed proposals must be seen by a quorum of
-servers, and a quorum of servers that elected the leader did not see U, the proposals
-of you have not been committed, so they can be discarded. When the follower connects
-to the leader, the leader will tell the follower to discard U.
+servers, and a quorum of servers that elected the leader did not see `U`, the proposals
+of `U` have not been committed, so they can be discarded. When the follower connects
+to the leader, the leader will tell the follower to discard `U`.
 
 A new leader establishes a zxid to start using for new proposals by getting the
 epoch, e, of the highest zxid it has seen and setting the next zxid to use to be
@@ -226,9 +225,9 @@ the following operating constraints are observed:
   received. Because we use FIFO channels this means that followers also receive proposals
in order.
 * Followers process messages in the order they are received. This
   means that messages will be ACKed in order and the leader will receive ACKs from
-  followers in order, due to the FIFO channels. It also means that if message $m$
+  followers in order, due to the FIFO channels. It also means that if message `m`
   has been written to non-volatile storage, all messages that were proposed before
-  $m$ have been written to non-volatile storage.
+  `m` have been written to non-volatile storage.
 * The leader will issue a COMMIT to all followers as soon as a
   quorum of followers have ACKed a message. Since messages are ACKed in order,
   COMMITs will be sent by the leader as received by the followers in order.
@@ -267,6 +266,26 @@ all packets, it all falls apart. Also, our leader activation phase is
different
 both of them. In particular, our use of epochs allows us to skip blocks of uncommitted
 proposals and to not worry about duplicate proposals for a given zxid.
 
+<a name="sc_consistency"></a>
+
+
+## Consistency Guarantees
+
+ZooKeeper [consistency](https://jepsen.io/consistency) guarantees lie between sequential
consistency and linearizabiliy. Here, we explain the exact consistency guarantees that ZooKepeer
provides.
+
+Write operations in ZooKeeper are linearizabile. In other words, each write appears to take
effect atomically at some point between its invocation and its response. This means that the
writes performed by all the clients in ZooKeeper can be totally ordered in such a way that
respects the real-time ordering of these writes. However, note that just stating that writes
are linearizable is meaningless unless we also talk about read operations.
+
+Read operations in ZooKeeper are not linearizable since they can return potentially stale
data. This occurs since a read in ZooKeeper is not a quorum operation and a server responds
immediately to a client that is performing a read.
+Nevertheless, ZooKeeper makes this choice because it chooses performance in the trade-off
between performance and consistency. ZooKeeper read operations are sequentially-consistent,
since read operations appear to take effect in some sequential order that furthermore respects
the order of each client's operations. 
+If a client wants to read the freshest data, it is generally assumed that the client should
first perform a sync operation, and then a read.
+However, even with a sync before a read operation, a client might retrieve stale data.
+This can occur because `sync` is [not a quorum operation](https://issues.apache.org/jira/browse/ZOOKEEPER-1675).
Such a scenario might appear if two servers think that they are the leaders at the same time,
which may occur if the time it takes for a TCP connection to drop is smaller than `syncLimit
* tickTime`, something that is [unlikely](https://www.amazon.com/ZooKeeper-Distributed-Coordination-Flavio-Junqueira/dp/1449361307)
to occur in practice.
+
+
+This raises the question on what are the exact consistency guarantees of ZooKeeper?
+Formally, the ZooKeeper consistency guarantees are captured by the notion of [ordered sequential
consistency](http://webee.technion.ac.il/people/idish/ftp/OSC-IPL17.pdf) or `OSC(U)` to be
exact, that lies  between sequential consistency and linearizability.
+Finally, note that the current version of ZooKeeper can provide linearizability for both
reads and writes, if every read is preceded by a write to some dummy znode. 
+
 <a name="sc_quorum"></a>
 
 ## Quorums
@@ -313,8 +332,7 @@ of the [ZooKeeper Administrator's Guide.](zookeeperAdmin.html)
 ### Developer Guidelines
 
 Please follow the  [slf4j manual](http://www.slf4j.org/manual.html) when creating log statements
within code.
-Also read the[FAQ on performance](http://www.slf4j.org/faq.html#logging\_performance)
-, when creating log statements. Patch reviewers will look for the following:
+Also read the [FAQ on performance](http://www.slf4j.org/faq.html#logging\_performance), when
creating log statements. Patch reviewers will look for the following:
 
 <a name="sc_rightLevel"></a>
 


Mime
View raw message