From: andor@apache.org
To: commits@zookeeper.apache.org
Date: Wed, 04 Jul 2018 13:11:23 -0000
Subject: [03/12] zookeeper git commit: ZOOKEEPER-3022: MAVEN MIGRATION 3.4 - Iteration 1 - docs, it

diff --git a/zookeeper-docs/src/documentation/content/xdocs/zookeeperInternals.xml b/zookeeper-docs/src/documentation/content/xdocs/zookeeperInternals.xml
new file mode 100644
+ ZooKeeper Internals + + + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. You may + obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an "AS IS" + BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied. See the License for the specific language governing permissions + and limitations under the License. + + + + This article contains topics which discuss the inner workings of + ZooKeeper. So far, that's logging and atomic broadcast. + + + + +
Introduction

This document contains information on the inner workings of ZooKeeper. So far, it discusses these topics:

Atomic Broadcast
Logging
+ +
+Atomic Broadcast + + +At the heart of ZooKeeper is an atomic messaging system that keeps all of the servers in sync. + +
Guarantees, Properties, and Definitions

The specific guarantees provided by the messaging system used by ZooKeeper are the following:

Reliable delivery
If a message, m, is delivered by one server, it will eventually be delivered by all servers.

Total order
If a message a is delivered before message b by one server, a will be delivered before b by all servers. If a and b are delivered messages, either a will be delivered before b or b will be delivered before a.

Causal order
If a message b is sent after a message a has been delivered by the sender of b, a must be ordered before b. If a sender sends c after sending b, c must be ordered after b.

The ZooKeeper messaging system also needs to be efficient, reliable, and easy to implement and maintain. We make heavy use of messaging, so we need the system to be able to handle thousands of requests per second. Although we can require at least k+1 correct servers to send new messages, we must be able to recover from correlated failures such as power outages. When we implemented the system we had little time and few engineering resources, so we needed a protocol that is accessible to engineers and is easy to implement. We found that our protocol satisfied all of these goals.

Our protocol assumes that we can construct point-to-point FIFO channels between the servers. While similar services usually assume channels that can lose or reorder messages, our assumption of FIFO channels is very practical given that we use TCP for communication. Specifically, we rely on the following properties of TCP:

Ordered delivery
Data is delivered in the same order it is sent, and a message m is delivered only after all messages sent before m have been delivered. (The corollary to this is that if message m is lost, all messages after m will be lost.)

No message after close
Once a FIFO channel is closed, no messages will be received from it.

FLP proved that consensus cannot be achieved in asynchronous distributed systems if failures are possible. To ensure we achieve consensus in the presence of failures we use timeouts. However, we rely on timeouts for liveness, not for correctness. So, if timeouts stop working (clocks malfunction, for example) the messaging system may hang, but it will not violate its guarantees.

When describing the ZooKeeper messaging protocol we will talk of packets, proposals, and messages:

Packet
a sequence of bytes sent through a FIFO channel

Proposal
a unit of agreement. Proposals are agreed upon by exchanging packets with a quorum of ZooKeeper servers. Most proposals contain messages, however the NEW_LEADER proposal is an example of a proposal that does not correspond to a message.

Message
a sequence of bytes to be atomically broadcast to all ZooKeeper servers. A message is put into a proposal and agreed upon before it is delivered.

As stated above, ZooKeeper guarantees a total order of messages, and it also guarantees a total order of proposals. ZooKeeper exposes the total ordering using a ZooKeeper transaction id (zxid). Each proposal is stamped with a zxid when it is proposed, and the zxids exactly reflect the total ordering. Proposals are sent to all ZooKeeper servers and committed when a quorum of them acknowledge the proposal. If a proposal contains a message, the message will be delivered when the proposal is committed. Acknowledgement means the server has recorded the proposal to persistent storage.
Our quorums have the requirement that any pair of quorums must have at least one server in common. We ensure this by requiring that all quorums have size (n/2+1) where n is the number of servers that make up a ZooKeeper service.

The zxid has two parts: the epoch and a counter. In our implementation the zxid is a 64-bit number. We use the high order 32-bits for the epoch and the low order 32-bits for the counter. Because it has two parts, we represent the zxid both as a number and as a pair of integers, (epoch, count). The epoch number represents a change in leadership. Each time a new leader comes into power it will have its own epoch number. We have a simple algorithm to assign a unique zxid to a proposal: the leader simply increments the zxid to obtain a unique zxid for each proposal. Leadership activation will ensure that only one leader uses a given epoch, so our simple algorithm guarantees that every proposal will have a unique id.

ZooKeeper messaging consists of two phases:

Leader activation
In this phase a leader establishes the correct state of the system and gets ready to start making proposals.

Active messaging
In this phase a leader accepts messages to propose and coordinates message delivery.

ZooKeeper is a holistic protocol. We do not focus on individual proposals; rather, we look at the stream of proposals as a whole. Our strict ordering allows us to do this efficiently and greatly simplifies our protocol. Leadership activation embodies this holistic concept. A leader becomes active only when a quorum of followers (the leader counts as a follower as well; you can always vote for yourself) has synced up with the leader, i.e., they have the same state. This state consists of all of the proposals that the leader believes have been committed and the proposal to follow the leader, the NEW_LEADER proposal. (Hopefully you are thinking to yourself, "Does the set of proposals that the leader believes have been committed include all the proposals that really have been committed?" The answer is yes. Below, we make clear why.)
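As an illustration of the zxid layout described above (a sketch, not ZooKeeper's internal code), the epoch and counter can be packed into and unpacked from a single 64-bit value, and a new leader that has seen highest epoch e starts proposing at (e+1, 0):

public final class ZxidExample {
    // Pack an epoch (high order 32 bits) and a counter (low order 32 bits) into one zxid.
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xFFFFFFFFL);
    }

    static long epochOf(long zxid) {
        return zxid >>> 32;           // high order 32 bits
    }

    static long counterOf(long zxid) {
        return zxid & 0xFFFFFFFFL;    // low order 32 bits
    }

    // A new leader that has seen highest epoch e starts proposing at (e + 1, 0).
    static long firstZxidOfNextEpoch(long highestSeenZxid) {
        return makeZxid(epochOf(highestSeenZxid) + 1, 0);
    }

    public static void main(String[] args) {
        long zxid = makeZxid(5, 42);
        System.out.printf("zxid=0x%x epoch=%d counter=%d%n",
                zxid, epochOf(zxid), counterOf(zxid));
        System.out.printf("next epoch starts at 0x%x%n", firstZxidOfNextEpoch(zxid));
    }
}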
+ +
Leader Activation

Leader activation includes leader election. We currently have two leader election algorithms in ZooKeeper: LeaderElection and FastLeaderElection (AuthFastLeaderElection is a variant of FastLeaderElection that uses UDP and allows servers to perform a simple form of authentication to avoid IP spoofing). ZooKeeper messaging doesn't care about the exact method of electing a leader as long as the following holds:

The leader has seen the highest zxid of all the followers.
A quorum of servers have committed to following the leader.

Of these two requirements only the first, the highest zxid among the followers, needs to hold for correct operation. The second requirement, a quorum of followers, just needs to hold with high probability. We are going to recheck the second requirement, so if a failure happens during or after the leader election and quorum is lost, we will recover by abandoning leader activation and running another election.

After leader election a single server will be designated as a leader and start waiting for followers to connect. The rest of the servers will try to connect to the leader. The leader will sync up with followers by sending any proposals they are missing, or if a follower is missing too many proposals, it will send a full snapshot of the state to the follower.

There is a corner case in which a follower that has proposals, U, not seen by the leader arrives. Proposals are seen in order, so the proposals of U will have zxids higher than the zxids seen by the leader. The follower must have arrived after the leader election, otherwise the follower would have been elected leader given that it has seen a higher zxid. Since committed proposals must be seen by a quorum of servers, and the quorum of servers that elected the leader did not see U, the proposals of U have not been committed, so they can be discarded. When the follower connects to the leader, the leader will tell the follower to discard U.

A new leader establishes a zxid to start using for new proposals by getting the epoch, e, of the highest zxid it has seen and setting the next zxid to use to be (e+1, 0). After the leader syncs with a follower, it will propose a NEW_LEADER proposal. Once the NEW_LEADER proposal has been committed, the leader will activate and start receiving and issuing proposals.

It all sounds complicated, but here are the basic rules of operation during leader activation:

A follower will ACK the NEW_LEADER proposal after it has synced with the leader.
A follower will only ACK a NEW_LEADER proposal with a given zxid from a single server.
A new leader will COMMIT the NEW_LEADER proposal when a quorum of followers have ACKed it.
A follower will commit any state it received from the leader when the NEW_LEADER proposal is COMMITTED.
A new leader will not accept new proposals until the NEW_LEADER proposal has been COMMITTED.

If leader election terminates erroneously, we don't have a problem: the NEW_LEADER proposal will not be committed, since the leader will not have quorum. When this happens, the leader and any remaining followers will time out and go back to leader election.
+ +
Active Messaging

Leader Activation does all the heavy lifting. Once the leader is crowned it can start blasting out proposals. As long as it remains the leader, no other leader can emerge, since no other leader will be able to get a quorum of followers. If a new leader does emerge, it means that the old leader has lost quorum, and the new leader will clean up any mess left over during its leadership activation.

ZooKeeper messaging operates similarly to a classic two-phase commit.

All communication channels are FIFO, so everything is done in order. Specifically, the following operating constraints are observed:

The leader sends proposals to all followers using the same order. Moreover, this order follows the order in which requests have been received. Because we use FIFO channels this means that followers also receive proposals in order.

Followers process messages in the order they are received. This means that messages will be ACKed in order and the leader will receive ACKs from followers in order, due to the FIFO channels. It also means that if message m has been written to non-volatile storage, all messages that were proposed before m have been written to non-volatile storage.

The leader will issue a COMMIT to all followers as soon as a quorum of followers have ACKed a message. Since messages are ACKed in order, COMMITs will be sent by the leader and received by the followers in order.

COMMITs are processed in order. Followers deliver a proposal's message when that proposal is committed.
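The commit rule above can be illustrated with a small sketch (a hypothetical class, not ZooKeeper's actual leader code): the leader counts ACKs per proposal and, because ACKs arrive over FIFO channels in zxid order, issues COMMITs strictly in order once a quorum has acknowledged a proposal.

import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal sketch of the leader-side commit rule described above: a proposal is
// committed once a quorum of servers (the leader counts too) has ACKed it, and
// COMMITs are issued strictly in zxid order. Within a single epoch the counter
// part of the zxid increases by one per proposal, so zxids here are consecutive.
public class CommitTracker {
    private final int quorumSize;                        // n/2 + 1 for an ensemble of n
    private final SortedMap<Long, Integer> ackCounts = new TreeMap<>();
    private long lastCommitted;

    public CommitTracker(int ensembleSize, long lastCommittedZxid) {
        this.quorumSize = ensembleSize / 2 + 1;
        this.lastCommitted = lastCommittedZxid;
    }

    // Record an ACK for the given zxid; return the zxids that become committed,
    // in order. The leader would send a COMMIT packet for each returned zxid.
    public synchronized List<Long> ack(long zxid) {
        ackCounts.merge(zxid, 1, Integer::sum);
        List<Long> committed = new ArrayList<>();
        // Only the head of the sequence may commit; later proposals wait, which
        // preserves the total order of commits even if they already have ACKs.
        while (!ackCounts.isEmpty()
                && ackCounts.firstKey() == lastCommitted + 1
                && ackCounts.get(ackCounts.firstKey()) >= quorumSize) {
            lastCommitted = ackCounts.firstKey();
            ackCounts.remove(lastCommitted);
            committed.add(lastCommitted);
        }
        return committed;
    }
}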
+ +
Summary

So there you go. Why does it work? Specifically, why does the set of proposals believed by a new leader always contain any proposal that has actually been committed? First, all proposals have a unique zxid, so unlike other protocols, we never have to worry about two different values being proposed for the same zxid; followers (a leader is also a follower) see and record proposals in order; proposals are committed in order; there is only one active leader at a time, since followers only follow a single leader at a time; a new leader has seen all committed proposals from the previous epoch, since it has seen the highest zxid from a quorum of servers; and any uncommitted proposals from a previous epoch seen by a new leader will be committed by that leader before it becomes active.
+ +
Comparisons

Isn't this just Multi-Paxos? No. Multi-Paxos requires some way of assuring that there is only a single coordinator. We do not count on such assurances. Instead we use leader activation to recover from leadership change or from old leaders believing they are still active.

Isn't this just Paxos? Doesn't the active messaging phase look just like Phase 2 of Paxos? Actually, to us active messaging looks just like two-phase commit without the need to handle aborts. Active messaging is different from both in the sense that it has cross-proposal ordering requirements. If we do not maintain strict FIFO ordering of all packets, it all falls apart. Also, our leader activation phase is different from both of them. In particular, our use of epochs allows us to skip blocks of uncommitted proposals and to not worry about duplicate proposals for a given zxid.
+ +
+ +
Quorums

Atomic broadcast and leader election use the notion of quorum to guarantee a consistent view of the system. By default, ZooKeeper uses majority quorums, which means that every vote that happens in one of these protocols requires a majority to agree. One example is acknowledging a leader proposal: the leader can only commit once it receives an acknowledgement from a quorum of servers.

If we extract the properties that we really need from our use of majorities, we only need to guarantee that the groups of processes used to validate an operation by voting (e.g., acknowledging a leader proposal) pairwise intersect in at least one server. Using majorities guarantees such a property. However, there are other ways of constructing quorums different from majorities. For example, we can assign weights to the votes of servers, and say that the votes of some servers are more important. To obtain a quorum, we get enough votes so that the sum of weights of all votes is larger than half of the total sum of all weights.

A different construction that uses weights and is useful in wide-area deployments (co-locations) is a hierarchical one. With this construction, we split the servers into disjoint groups and assign weights to processes. To form a quorum, we have to get hold of enough servers from a majority of groups G, such that for each group g in G, the sum of votes from g is larger than half of the sum of weights in g. Interestingly, this construction enables smaller quorums. If we have, for example, 9 servers, we split them into 3 groups, and assign a weight of 1 to each server, then we are able to form quorums of size 4. Note that two subsets of processes, each composed of a majority of servers from each of a majority of groups, necessarily have a non-empty intersection. It is reasonable to expect that a majority of co-locations will have a majority of servers available with high probability.

With ZooKeeper, we provide a user with the ability to configure servers to use majority quorums, weights, or a hierarchy of groups.
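As an illustration of the hierarchical rule (a hypothetical helper, not ZooKeeper's actual quorum verifier), the following sketch checks whether a set of acknowledging servers forms a quorum: a majority of groups must each contribute more than half of their group's total weight.

import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative check for hierarchical quorums: a set of servers is a quorum
// if, in a majority of groups, the ackers' weights sum to more than half of
// that group's total weight.
public class HierarchicalQuorum {
    private final List<Map<Long, Long>> groups;   // per group: serverId -> weight

    public HierarchicalQuorum(List<Map<Long, Long>> groups) {
        this.groups = groups;
    }

    public boolean containsQuorum(Set<Long> ackers) {
        int groupsSatisfied = 0;
        for (Map<Long, Long> group : groups) {
            long total = 0, acked = 0;
            for (Map.Entry<Long, Long> e : group.entrySet()) {
                total += e.getValue();
                if (ackers.contains(e.getKey())) {
                    acked += e.getValue();
                }
            }
            if (2 * acked > total) {                // strict majority of weight within the group
                groupsSatisfied++;
            }
        }
        return 2 * groupsSatisfied > groups.size(); // strict majority of groups
    }
}

With 9 servers of weight 1 split into 3 groups of 3, any two servers from each of any two groups (4 servers in total) satisfy this check, matching the example above.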
+ +
Logging

ZooKeeper uses slf4j as an abstraction layer for logging. log4j version 1.2 is chosen as the concrete logging implementation for now. For better embedding support, it is planned in the future to leave the decision of choosing the final logging implementation to the end user. Therefore, always use the slf4j API to write log statements in the code, but configure log4j for how to log at runtime. Note that slf4j has no FATAL level; messages formerly at FATAL level have been moved to ERROR level. For information on configuring log4j for ZooKeeper, see the Logging section of the ZooKeeper Administrator's Guide.
Developer Guidelines

Please follow the slf4j manual when creating log statements within code. Also read the FAQ on performance when creating log statements. Patch reviewers will look for the following:
Logging at the Right Level

There are several levels of logging in slf4j. It's important to pick the right one. In order from highest to lowest severity:

ERROR level designates error events that might still allow the application to continue running.
WARN level designates potentially harmful situations.
INFO level designates informational messages that highlight the progress of the application at a coarse-grained level.
DEBUG level designates fine-grained informational events that are most useful to debug an application.
TRACE level designates finer-grained informational events than DEBUG.

ZooKeeper is typically run in production such that log messages of INFO level severity and higher (more severe) are output to the log.
+ +
Use of Standard slf4j Idioms

Static Message Logging

LOG.debug("process completed successfully!");

However, when parameterized messages are required, use formatting anchors:

LOG.debug("got {} messages in {} minutes", count, time);

Naming

Loggers should be named after the class in which they are used:

public class Foo {
    private static final Logger LOG = LoggerFactory.getLogger(Foo.class);
    ....
    public Foo() {
        LOG.info("constructing Foo");
    }
}

Exception handling

try {
    // code
} catch (XYZException e) {
    // do this
    LOG.error("Something bad happened", e);
    // don't do this (generally)
    // LOG.error(e);
    // why? because the "don't do" case hides the stack trace

    // continue processing here as you need... recover or (re)throw
}
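When producing a log argument is itself expensive (beyond what formatting anchors avoid), guard the statement with a level check. A small illustrative fragment, where expensiveSummary() is a hypothetical method:

// Parameterized logging avoids string concatenation cost, but the argument
// itself may still be expensive to compute. Guard such cases explicitly.
if (LOG.isDebugEnabled()) {
    LOG.debug("session details: {}", expensiveSummary());
}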
+
+ +
+ +
diff --git a/zookeeper-docs/src/documentation/content/xdocs/zookeeperJMX.xml b/zookeeper-docs/src/documentation/content/xdocs/zookeeperJMX.xml
new file mode 100644
+ ZooKeeper JMX + + + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. You may + obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an "AS IS" + BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied. See the License for the specific language governing permissions + and limitations under the License. + + + + ZooKeeper support for JMX + + + +
+ JMX + Apache ZooKeeper has extensive support for JMX, allowing you + to view and manage a ZooKeeper serving ensemble. + + This document assumes that you have basic knowledge of + JMX. See + Sun JMX Technology page to get started with JMX. + + + See the + JMX Management Guide for details on setting up local and + remote management of VM instances. By default the included + zkServer.sh supports only local management - + review the linked document to enable support for remote management + (beyond the scope of this document). + + +
+ +
Starting ZooKeeper with JMX enabled

The class org.apache.zookeeper.server.quorum.QuorumPeerMain will start a JMX manageable ZooKeeper server. This class registers the proper MBeans during initialization to support JMX monitoring and management of the instance. See bin/zkServer.sh for one example of starting ZooKeeper using QuorumPeerMain.
+ +
Run a JMX console

There are a number of JMX consoles available which can connect to the running server. For this example we will use Sun's jconsole.

The Java JDK ships with a simple JMX console named jconsole which can be used to connect to ZooKeeper and inspect a running server. Once you've started ZooKeeper using QuorumPeerMain, start jconsole, which typically resides in JDK_HOME/bin/jconsole.

When the "new connection" window is displayed, either connect to the local process (if jconsole is started on the same host as the server) or use the remote process connection.

By default the "overview" tab for the VM is displayed (this is a great way to get insight into the VM, btw). Select the "MBeans" tab.

You should now see org.apache.ZooKeeperService on the left hand side. Expand this item and, depending on how you've started the server, you will be able to monitor and manage various service related features.

Also note that ZooKeeper will register log4j MBeans as well. In the same section along the left hand side you will see "log4j". Expand that to manage log4j through JMX. Of particular interest is the ability to dynamically change the logging levels used by editing the appender and root thresholds. Log4j MBean registration can be disabled by passing -Dzookeeper.jmx.log4j.disable=true to the JVM when starting ZooKeeper.
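Beyond a graphical console, the same MBeans can be queried programmatically through the standard javax.management API. A minimal sketch, assuming remote JMX has already been enabled on port 9999 as described in the JMX Management Guide (host and port are placeholders):

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Lists the ZooKeeper MBeans registered under the org.apache.ZooKeeperService domain.
public class ListZooKeeperMBeans {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            Set<ObjectName> names =
                mbs.queryNames(new ObjectName("org.apache.ZooKeeperService:*"), null);
            for (ObjectName name : names) {
                System.out.println(name);   // e.g. ReplicatedServer_id1, replica.1, ...
            }
        }
    }
}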
+ +
ZooKeeper MBean Reference

This table details JMX for a server participating in a replicated ZooKeeper ensemble (i.e. not standalone). This is the typical case for a production environment.

MBeans, their names and description

MBean: Quorum
MBean Object Name: ReplicatedServer_id<#>
Description: Represents the Quorum, or Ensemble - parent of all cluster members. Note that the object name includes the "myid" of the server (name suffix) that your JMX agent has connected to.

MBean: LocalPeer|RemotePeer
MBean Object Name: replica.<#>
Description: Represents a local or remote peer (i.e. server participating in the ensemble). Note that the object name includes the "myid" of the server (name suffix).

MBean: LeaderElection
MBean Object Name: LeaderElection
Description: Represents a ZooKeeper cluster leader election which is in progress. Provides information about the election, such as when it started.

MBean: Leader
MBean Object Name: Leader
Description: Indicates that the parent replica is the leader and provides attributes/operations for that server. Note that Leader is a subclass of ZooKeeperServer, so it provides all of the information normally associated with a ZooKeeperServer node.

MBean: Follower
MBean Object Name: Follower
Description: Indicates that the parent replica is a follower and provides attributes/operations for that server. Note that Follower is a subclass of ZooKeeperServer, so it provides all of the information normally associated with a ZooKeeperServer node.

MBean: DataTree
MBean Object Name: InMemoryDataTree
Description: Statistics on the in-memory znode database, also operations to access finer (and more computationally intensive) statistics on the data (such as ephemeral count). InMemoryDataTrees are children of ZooKeeperServer nodes.

MBean: ServerCnxn
MBean Object Name: <session_id>
Description: Statistics on each client connection, also operations on those connections (such as termination). Note the object name is the session id of the connection in hex form.
This table details JMX for a standalone server. Typically standalone is only used in development situations.

MBeans, their names and description

MBean: ZooKeeperServer
MBean Object Name: StandaloneServer_port<#>
Description: Statistics on the running server, also operations to reset these attributes. Note that the object name includes the client port of the server (name suffix).

MBean: DataTree
MBean Object Name: InMemoryDataTree
Description: Statistics on the in-memory znode database, also operations to access finer (and more computationally intensive) statistics on the data (such as ephemeral count).

MBean: ServerCnxn
MBean Object Name: <session_id>
Description: Statistics on each client connection, also operations on those connections (such as termination). Note the object name is the session id of the connection in hex form.
+ +
+ +
diff --git a/zookeeper-docs/src/documentation/content/xdocs/zookeeperObservers.xml b/zookeeper-docs/src/documentation/content/xdocs/zookeeperObservers.xml
new file mode 100644
+ ZooKeeper Observers + + + + Licensed under the Apache License, Version 2.0 (the "License"); you + may not use this file except in compliance with the License. You may + obtain a copy of the License + at http://www.apache.org/licenses/LICENSE-2.0. + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, WITHOUT + WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the + License for the specific language governing permissions and limitations + under the License. + + + + This guide contains information about using non-voting servers, or + observers in your ZooKeeper ensembles. + + + +
+ Observers: Scaling ZooKeeper Without Hurting Write Performance + + + Although ZooKeeper performs very well by having clients connect directly + to voting members of the ensemble, this architecture makes it hard to + scale out to huge numbers of clients. The problem is that as we add more + voting members, the write performance drops. This is due to the fact that + a write operation requires the agreement of (in general) at least half the + nodes in an ensemble and therefore the cost of a vote can increase + significantly as more voters are added. + + + We have introduced a new type of ZooKeeper node called + an Observer which helps address this problem and + further improves ZooKeeper's scalability. Observers are non-voting members + of an ensemble which only hear the results of votes, not the agreement + protocol that leads up to them. Other than this simple distinction, + Observers function exactly the same as Followers - clients may connect to + them and send read and write requests to them. Observers forward these + requests to the Leader like Followers do, but they then simply wait to + hear the result of the vote. Because of this, we can increase the number + of Observers as much as we like without harming the performance of votes. + + + Observers have other advantages. Because they do not vote, they are not a + critical part of the ZooKeeper ensemble. Therefore they can fail, or be + disconnected from the cluster, without harming the availability of the + ZooKeeper service. The benefit to the user is that Observers may connect + over less reliable network links than Followers. In fact, Observers may be + used to talk to a ZooKeeper server from another data center. Clients of + the Observer will see fast reads, as all reads are served locally, and + writes result in minimal network traffic as the number of messages + required in the absence of the vote protocol is smaller. + +
+
How to use Observers

Setting up a ZooKeeper ensemble that uses Observers is very simple, and requires just two changes to your config files. First, in the config file of every node that is to be an Observer, you must place this line:

peerType=observer

This line tells ZooKeeper that the server is to be an Observer. Second, in every server config file, you must add :observer to the server definition line of each Observer. For example:

server.1=localhost:2181:3181:observer

This tells every other server that server.1 is an Observer, and that they should not expect it to vote. This is all the configuration you need to do to add an Observer to your ZooKeeper cluster. Now you can connect to it as though it were an ordinary Follower. Try it out by running:

$ bin/zkCli.sh -server localhost:2181

where localhost:2181 is the hostname and port number of the Observer as specified in every config file. You should see a command line prompt through which you can issue commands like ls to query the ZooKeeper service.
+ +
Example use cases

Two example use cases for Observers are listed below. In fact, wherever you wish to scale the number of clients of your ZooKeeper ensemble, or where you wish to insulate the critical part of an ensemble from the load of dealing with client requests, Observers are a good architectural choice.

As a datacenter bridge: Forming a ZK ensemble between two datacenters is a problematic endeavour as the high variance in latency between the datacenters could lead to false positive failure detection and partitioning. However if the ensemble runs entirely in one datacenter, and the second datacenter runs only Observers, partitions aren't problematic as the ensemble remains connected. Clients of the Observers may still see and issue proposals.

As a link to a message bus: Some companies have expressed an interest in using ZK as a component of a persistent reliable message bus. Observers would give a natural integration point for this work: a plug-in mechanism could be used to attach the stream of proposals an Observer sees to a publish-subscribe system, again without loading the core ensemble.
+
diff --git a/zookeeper-docs/src/documentation/content/xdocs/zookeeperOtherInfo.xml b/zookeeper-docs/src/documentation/content/xdocs/zookeeperOtherInfo.xml
new file mode 100644
+ ZooKeeper + + + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. You may + obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an "AS IS" + BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied. See the License for the specific language governing permissions + and limitations under the License. + + + + currently empty + + + +
+ Other Info + currently empty +
+
diff --git a/zookeeper-docs/src/documentation/content/xdocs/zookeeperOver.xml b/zookeeper-docs/src/documentation/content/xdocs/zookeeperOver.xml
new file mode 100644
+ ZooKeeper + + + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. You may + obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an "AS IS" + BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied. See the License for the specific language governing permissions + and limitations under the License. + + + + This document contains overview information about ZooKeeper. It + discusses design goals, key concepts, implementation, and + performance. + + + +
ZooKeeper: A Distributed Coordination Service for Distributed Applications

ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, and groups and naming. It is designed to be easy to program to, and uses a data model styled after the familiar directory tree structure of file systems. It runs in Java and has bindings for both Java and C.

Coordination services are notoriously hard to get right. They are especially prone to errors such as race conditions and deadlock. The motivation behind ZooKeeper is to relieve distributed applications of the responsibility of implementing coordination services from scratch.
Design Goals

ZooKeeper is simple. ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace which is organized similarly to a standard file system. The name space consists of data registers - called znodes, in ZooKeeper parlance - and these are similar to files and directories. Unlike a typical file system, which is designed for storage, ZooKeeper data is kept in-memory, which means ZooKeeper can achieve high throughput and low latency numbers.

The ZooKeeper implementation puts a premium on high performance, highly available, strictly ordered access. The performance aspects of ZooKeeper mean it can be used in large, distributed systems. The reliability aspects keep it from being a single point of failure. The strict ordering means that sophisticated synchronization primitives can be implemented at the client.

ZooKeeper is replicated. Like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a set of hosts called an ensemble.
+ ZooKeeper Service + + + + + + +
The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.

Clients connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heartbeats. If the TCP connection to the server breaks, the client will connect to a different server.

ZooKeeper is ordered. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions. Subsequent operations can use the order to implement higher-level abstractions, such as synchronization primitives.

ZooKeeper is fast. It is especially fast in "read-dominant" workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.
+ +
+ Data model and the hierarchical namespace + + The name space provided by ZooKeeper is much like that of a + standard file system. A name is a sequence of path elements separated by + a slash (/). Every node in ZooKeeper's name space is identified by a + path. + +
+ ZooKeeper's Hierarchical Namespace + + + + + + +
+
+ +
Nodes and ephemeral nodes

Unlike standard file systems, each node in a ZooKeeper namespace can have data associated with it as well as children. It is like having a file system that allows a file to also be a directory. (ZooKeeper was designed to store coordination data: status information, configuration, location information, etc., so the data stored at each node is usually small, in the byte to kilobyte range.) We use the term znode to make it clear that we are talking about ZooKeeper data nodes.

Znodes maintain a stat structure that includes version numbers for data changes, ACL changes, and timestamps, to allow cache validations and coordinated updates. Each time a znode's data changes, the version number increases. For instance, whenever a client retrieves data it also receives the version of the data.

The data stored at each znode in a namespace is read and written atomically. Reads get all the data bytes associated with a znode and a write replaces all the data. Each node has an Access Control List (ACL) that restricts who can do what.

ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created the znode is active. When the session ends the znode is deleted. Ephemeral nodes are useful when you want to implement [tbd].
+ +
Conditional updates and watches

ZooKeeper supports the concept of watches. Clients can set a watch on a znode. A watch will be triggered and removed when the znode changes. When a watch is triggered the client receives a packet saying that the znode has changed. And if the connection between the client and one of the ZooKeeper servers is broken, the client will receive a local notification. These can be used to [tbd].
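As a brief illustration (a sketch using the standard Java client, with placeholder connect string and path), a client can set a watch on a znode and receive the change notification through its Watcher:

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Sets a watch on /config via exists(); the watch fires once when the znode
// is created, deleted, or its data changes, and must then be re-registered.
public class WatchExample {
    public static void main(String[] args) throws Exception {
        Watcher watcher = (WatchedEvent event) ->
            System.out.println("event " + event.getType() + " on " + event.getPath());

        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, watcher);
        Stat stat = zk.exists("/config", true);   // true = use the default watcher
        System.out.println("/config " + (stat == null ? "does not exist yet" : "exists"));
        Thread.sleep(60_000);                     // wait around for watch events
        zk.close();
    }
}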
+ +
Guarantees

ZooKeeper is very fast and very simple. Since its goal, though, is to be a basis for the construction of more complicated services, such as synchronization, it provides a set of guarantees. These are:

Sequential Consistency - Updates from a client will be applied in the order that they were sent.

Atomicity - Updates either succeed or fail. No partial results.

Single System Image - A client will see the same view of the service regardless of the server that it connects to.

Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.

Timeliness - The client's view of the system is guaranteed to be up-to-date within a certain time bound.

For more information on these, and how they can be used, see [tbd]
+ +
Simple API

One of the design goals of ZooKeeper is to provide a very simple programming interface. As a result, it supports only these operations:

create
creates a node at a location in the tree

delete
deletes a node

exists
tests if a node exists at a location

get data
reads the data from a node

set data
writes data to a node

get children
retrieves a list of children of a node

sync
waits for data to be propagated

For a more in-depth discussion on these, and how they can be used to implement higher level operations, please refer to [tbd]
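To make the operations above concrete, here is a minimal sketch using the Java client class org.apache.zookeeper.ZooKeeper; the connect string, paths, and data are placeholders, and error handling is omitted:

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SimpleApiTour {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // create: make a node at a location in the tree
        zk.create("/demo", "hello".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // exists: test if a node exists (returns null if it does not)
        Stat stat = zk.exists("/demo", false);

        // get data: read the data (and the current stat/version) of a node
        byte[] data = zk.getData("/demo", false, stat);
        System.out.println(new String(data) + " version=" + stat.getVersion());

        // set data: write data, conditional on the expected version
        zk.setData("/demo", "world".getBytes(), stat.getVersion());

        // get children: list the children of a node
        List<String> children = zk.getChildren("/", false);
        System.out.println("children of /: " + children);

        // delete: remove a node (-1 matches any version)
        zk.delete("/demo", -1);

        // sync is asynchronous: zk.sync(path, callback, ctx) waits for data to be propagated
        zk.close();
    }
}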
+ +
Implementation

The figure ZooKeeper Components shows the high-level components of the ZooKeeper service. With the exception of the request processor, each of the servers that make up the ZooKeeper service replicates its own copy of each of the components.
+ ZooKeeper Components + + + + + + +
The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.

Every ZooKeeper server services clients. Clients connect to exactly one server to submit requests. Read requests are serviced from the local replica of each server database. Requests that change the state of the service, write requests, are processed by an agreement protocol.

As part of the agreement protocol all write requests from clients are forwarded to a single server, called the leader. The rest of the ZooKeeper servers, called followers, receive message proposals from the leader and agree upon message delivery. The messaging layer takes care of replacing leaders on failures and syncing followers with leaders.

ZooKeeper uses a custom atomic messaging protocol. Since the messaging layer is atomic, ZooKeeper can guarantee that the local replicas never diverge. When the leader receives a write request, it calculates what the state of the system will be when the write is applied and transforms this into a transaction that captures this new state.
+ +
Uses

The programming interface to ZooKeeper is deliberately simple. With it, however, you can implement higher order operations, such as synchronization primitives, group membership, ownership, etc. Some distributed applications have used it to: [tbd: add uses from white paper and video presentation.] For more information, see [tbd]
+ +
Performance

ZooKeeper is designed to be highly performant. But is it? The results of the ZooKeeper development team at Yahoo! Research indicate that it is. (See the figure ZooKeeper Throughput as the Read-Write Ratio Varies below.) It is especially high performance in applications where reads outnumber writes, since writes involve synchronizing the state of all servers. (Reads outnumbering writes is typically the case for a coordination service.)
+ ZooKeeper Throughput as the Read-Write Ratio Varies + + + + + + +
The figure is a throughput graph of ZooKeeper release 3.2 running on servers with dual 2GHz Xeon processors and two SATA 15K RPM drives. One drive was used as a dedicated ZooKeeper log device. The snapshots were written to the OS drive. Write requests were 1K writes and the reads were 1K reads. "Servers" indicates the size of the ZooKeeper ensemble, the number of servers that make up the service. Approximately 30 other servers were used to simulate the clients. The ZooKeeper ensemble was configured such that leaders do not allow connections from clients.

In version 3.2 r/w performance improved by ~2x compared to the previous 3.1 release.

Benchmarks also indicate that it is reliable. The figure Reliability in the Presence of Errors below shows how a deployment responds to various failures. The events marked in the figure are the following:

Failure and recovery of a follower

Failure and recovery of a different follower

Failure of the leader

Failure and recovery of two followers

Failure of another leader
+ +
+ Reliability + + To show the behavior of the system over time as + failures are injected we ran a ZooKeeper service made up of + 7 machines. We ran the same saturation benchmark as before, + but this time we kept the write percentage at a constant + 30%, which is a conservative ratio of our expected + workloads. + +
+ Reliability in the Presence of Errors + + + + + +
There are a few important observations from this graph. First, if followers fail and recover quickly, then ZooKeeper is able to sustain a high throughput despite the failure. Second, and maybe more importantly, the leader election algorithm allows the system to recover fast enough to prevent throughput from dropping substantially; in our observations, ZooKeeper takes less than 200ms to elect a new leader. Third, as followers recover, ZooKeeper is able to raise throughput again once they start processing requests.
+ +
+ The ZooKeeper Project + + ZooKeeper has been + + successfully used + + in many industrial applications. It is used at Yahoo! as the + coordination and failure recovery service for Yahoo! Message + Broker, which is a highly scalable publish-subscribe system + managing thousands of topics for replication and data + delivery. It is used by the Fetching Service for Yahoo! + crawler, where it also manages failure recovery. A number of + Yahoo! advertising systems also use ZooKeeper to implement + reliable services. + + + All users and developers are encouraged to join the + community and contribute their expertise. See the + + Zookeeper Project on Apache + + for more information. + +
+
+