zookeeper-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From an...@apache.org
Subject [04/13] zookeeper git commit: ZOOKEEPER-3022: MAVEN MIGRATION - Iteration 1 - docs, it
Date Wed, 04 Jul 2018 11:02:32 GMT
http://git-wip-us.apache.org/repos/asf/zookeeper/blob/4607a3e1/zookeeper-docs/src/documentation/content/xdocs/zookeeperHierarchicalQuorums.xml
----------------------------------------------------------------------
diff --git a/zookeeper-docs/src/documentation/content/xdocs/zookeeperHierarchicalQuorums.xml b/zookeeper-docs/src/documentation/content/xdocs/zookeeperHierarchicalQuorums.xml
new file mode 100644
index 0000000..f71c4a8
--- /dev/null
+++ b/zookeeper-docs/src/documentation/content/xdocs/zookeeperHierarchicalQuorums.xml
@@ -0,0 +1,75 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright 2002-2004 The Apache Software Foundation
+
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+<!DOCTYPE article PUBLIC "-//OASIS//DTD Simplified DocBook XML V1.0//EN"
+"http://www.oasis-open.org/docbook/xml/simple/1.0/sdocbook.dtd">
+<article id="zk_HierarchicalQuorums">
+  <title>Introduction to hierarchical quorums</title>
+
+  <articleinfo>
+    <legalnotice>
+      <para>Licensed under the Apache License, Version 2.0 (the "License");
+      you may not use this file except in compliance with the License. You may
+      obtain a copy of the License at <ulink
+      url="http://www.apache.org/licenses/LICENSE-2.0">http://www.apache.org/licenses/LICENSE-2.0</ulink>.</para>
+
+      <para>Unless required by applicable law or agreed to in writing,
+      software distributed under the License is distributed on an "AS IS"
+      BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied. See the License for the specific language governing permissions
+      and limitations under the License.</para>
+    </legalnotice>
+
+    <abstract>
+      <para>This document contains information about hierarchical quorums.</para>
+    </abstract>
+  </articleinfo>
+
+    <para>
+    This document gives an example of how to use hierarchical quorums. The basic idea is
+    very simple. First, we split servers into groups, and add a line for each group listing
+    the servers that form this group. Next we have to assign a weight to each server.  
+    </para>
+    
+    <para>
+    The following example shows how to configure a system with three groups of three servers
+    each, and we assign a weight of 1 to each server:
+    </para>
+    
+    <programlisting>
+    group.1=1:2:3
+    group.2=4:5:6
+    group.3=7:8:9
+   
+    weight.1=1
+    weight.2=1
+    weight.3=1
+    weight.4=1
+    weight.5=1
+    weight.6=1
+    weight.7=1
+    weight.8=1
+    weight.9=1
+ 	</programlisting>
+
+	<para>    
+    When running the system, we are able to form a quorum once we have a majority of votes from
+    a majority of non-zero-weight groups. Groups that have zero weight are discarded and not
+    considered when forming quorums. Looking at the example, we are able to form a quorum once
+    we have votes from at least two servers from each of two different groups.
+    </para> 
+ </article>
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/zookeeper/blob/4607a3e1/zookeeper-docs/src/documentation/content/xdocs/zookeeperInternals.xml
----------------------------------------------------------------------
diff --git a/zookeeper-docs/src/documentation/content/xdocs/zookeeperInternals.xml b/zookeeper-docs/src/documentation/content/xdocs/zookeeperInternals.xml
new file mode 100644
index 0000000..7815bc1
--- /dev/null
+++ b/zookeeper-docs/src/documentation/content/xdocs/zookeeperInternals.xml
@@ -0,0 +1,487 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright 2002-2004 The Apache Software Foundation
+
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+<!DOCTYPE article PUBLIC "-//OASIS//DTD Simplified DocBook XML V1.0//EN"
+"http://www.oasis-open.org/docbook/xml/simple/1.0/sdocbook.dtd">
+<article id="ar_ZooKeeperInternals">
+  <title>ZooKeeper Internals</title>
+
+  <articleinfo>
+    <legalnotice>
+      <para>Licensed under the Apache License, Version 2.0 (the "License");
+      you may not use this file except in compliance with the License. You may
+      obtain a copy of the License at <ulink
+      url="http://www.apache.org/licenses/LICENSE-2.0">http://www.apache.org/licenses/LICENSE-2.0</ulink>.</para>
+
+      <para>Unless required by applicable law or agreed to in writing,
+      software distributed under the License is distributed on an "AS IS"
+      BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied. See the License for the specific language governing permissions
+      and limitations under the License.</para>
+    </legalnotice>
+
+    <abstract>
+      <para>This article contains topics which discuss the inner workings of
+       ZooKeeper. So far, that's logging and atomic broadcast. </para>
+
+    </abstract>
+  </articleinfo>
+
+  <section id="ch_Introduction">
+    <title>Introduction</title>
+
+    <para>This document contains information on the inner workings of ZooKeeper. 
+    So far, it discusses these topics:
+    </para>
+
+<itemizedlist>    
+<listitem><para><xref linkend="sc_atomicBroadcast"/></para></listitem>
+<listitem><para><xref linkend="sc_logging"/></para></listitem>
+</itemizedlist>
+
+</section>
+
+<section id="sc_atomicBroadcast">
+<title>Atomic Broadcast</title>
+
+<para>
+At the heart of ZooKeeper is an atomic messaging system that keeps all of the servers in sync.</para>
+
+<section id="sc_guaranteesPropertiesDefinitions"><title>Guarantees, Properties, and Definitions</title>
+<para>
+The specific guarantees provided by the messaging system used by ZooKeeper are the following:</para>
+
+<variablelist>
+
+<varlistentry><term><emphasis >Reliable delivery</emphasis></term>
+<listitem><para>If a message, m, is delivered 
+by one server, it will be eventually delivered by all servers.</para></listitem></varlistentry>
+
+<varlistentry><term><emphasis >Total order</emphasis></term>
+<listitem><para> If a message is 
+delivered before message b by one server, a will be delivered before b by all 
+servers. If a and b are delivered messages, either a will be delivered before b 
+or b will be delivered before a.</para></listitem></varlistentry>
+
+<varlistentry><term><emphasis >Causal order</emphasis> </term>
+
+<listitem><para>
+If a message b is sent after a message a has been delivered by the sender of b, 
+a must be ordered before b. If a sender sends c after sending b, c must be ordered after b.
+</para></listitem></varlistentry>
+
+</variablelist>
+
+
+<para>
+The ZooKeeper messaging system also needs to be efficient, reliable, and easy to 
+implement and maintain. We make heavy use of messaging, so we need the system to 
+be able to handle thousands of requests per second. Although we can require at 
+least k+1 correct servers to send new messages, we must be able to recover from 
+correlated failures such as power outages. When we implemented the system we had 
+little time and few engineering resources, so we needed a protocol that is 
+accessible to engineers and is easy to implement. We found that our protocol 
+satisfied all of these goals.
+
+</para>
+
+<para>
+Our protocol assumes that we can construct point-to-point FIFO channels between 
+the servers. While similar services usually assume message delivery that can 
+lose or reorder messages, our assumption of FIFO channels is very practical 
+given that we use TCP for communication. Specifically we rely on the following property of TCP:</para>
+
+<variablelist>
+
+<varlistentry>
+<term><emphasis >Ordered delivery</emphasis></term>
+<listitem><para>Data is delivered in the same order it is sent and a message m is 
+delivered only after all messages sent before m have been delivered. 
+(The corollary to this is that if message m is lost all messages after m will be lost.)</para></listitem></varlistentry>
+
+<varlistentry><term><emphasis >No message after close</emphasis></term>
+<listitem><para>Once a FIFO channel is closed, no messages will be received from it.</para></listitem></varlistentry>
+
+</variablelist>
+
+<para>
+FLP proved that consensus cannot be achieved in asynchronous distributed systems 
+if failures are possible. To ensure we achieve consensus in the presence of failures 
+we use timeouts. However, we rely on times for liveness not for correctness. So, 
+if timeouts stop working (clocks malfunction for example) the messaging system may 
+hang, but it will not violate its guarantees.</para>
+
+<para>When describing the ZooKeeper messaging protocol we will talk of packets, 
+proposals, and messages:</para>
+<variablelist>
+<varlistentry><term><emphasis >Packet</emphasis></term>
+<listitem><para>a sequence of bytes sent through a FIFO channel</para></listitem></varlistentry><varlistentry>
+
+<term><emphasis >Proposal</emphasis></term>
+<listitem><para>a unit of agreement. Proposals are agreed upon by exchanging packets 
+with a quorum of ZooKeeper servers. Most proposals contain messages, however the 
+NEW_LEADER proposal is an example of a proposal that does not correspond to a message.</para></listitem>
+</varlistentry><varlistentry>
+
+<term><emphasis >Message</emphasis></term>
+<listitem><para>a sequence of bytes to be atomically broadcast to all ZooKeeper 
+servers. A message put into a proposal and agreed upon before it is delivered.</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+<para>
+As stated above, ZooKeeper guarantees a total order of messages, and it also 
+guarantees a total order of proposals. ZooKeeper exposes the total ordering using
+a ZooKeeper transaction id (<emphasis>zxid</emphasis>). All proposals will be stamped with a zxid when 
+it is proposed and exactly reflects the total ordering. Proposals are sent to all 
+ZooKeeper servers and committed when a quorum of them acknowledge the proposal. 
+If a proposal contains a message, the message will be delivered when the proposal 
+is committed. Acknowledgement means the server has recorded the proposal to persistent storage. 
+Our quorums have the requirement that any pair of quorum must have at least one server 
+in common. We ensure this by requiring that all quorums have size (<emphasis>n/2+1</emphasis>) where 
+n is the number of servers that make up a ZooKeeper service.
+</para>
+
+<para>
+The zxid has two parts: the epoch and a counter. In our implementation the zxid 
+is a 64-bit number. We use the high order 32-bits for the epoch and the low order 
+32-bits for the counter. Because it has two parts represent the zxid both as a 
+number and as a pair of integers, (<emphasis>epoch, count</emphasis>). The epoch number represents a 
+change in leadership. Each time a new leader comes into power it will have its 
+own epoch number. We have a simple algorithm to assign a unique zxid to a proposal: 
+the leader simply increments the zxid to obtain a unique zxid for each proposal. 
+<emphasis>Leadership activation will ensure that only one leader uses a given epoch, so our 
+simple algorithm guarantees that every proposal will have a unique id.</emphasis>
+</para>
+
+<para>
+ZooKeeper messaging consists of two phases:</para>
+
+<variablelist>
+<varlistentry><term><emphasis >Leader activation</emphasis></term>
+<listitem><para>In this phase a leader establishes the correct state of the system 
+and gets ready to start making proposals.</para></listitem>
+</varlistentry>
+
+<varlistentry><term><emphasis >Active messaging</emphasis></term>
+<listitem><para>In this phase a leader accepts messages to propose and coordinates message delivery.</para></listitem>
+</varlistentry>
+</variablelist>
+
+<para>
+ZooKeeper is a holistic protocol. We do not focus on individual proposals, rather 
+look at the stream of proposals as a whole. Our strict ordering allows us to do this 
+efficiently and greatly simplifies our protocol. Leadership activation embodies 
+this holistic concept. A leader becomes active only when a quorum of followers 
+(The leader counts as a follower as well. You can always vote for yourself ) has synced 
+up with the leader, they have the same state. This state consists of all of the 
+proposals that the leader believes have been committed and the proposal to follow 
+the leader, the NEW_LEADER proposal. (Hopefully you are thinking to 
+yourself, <emphasis>Does the set of proposals that the leader believes has been committed 
+included all the proposals that really have been committed?</emphasis> The answer is <emphasis>yes</emphasis>. 
+Below, we make clear why.)
+</para>
+
+</section>
+
+<section id="sc_leaderElection">
+
+<title>Leader Activation</title>
+<para>
+Leader activation includes leader election. We currently have two leader election 
+algorithms in ZooKeeper: LeaderElection and FastLeaderElection (AuthFastLeaderElection 
+is a variant of FastLeaderElection that uses UDP and allows servers to perform a simple
+form of authentication to avoid IP spoofing). ZooKeeper messaging doesn't care about the 
+exact method of electing a leader has long as the following holds:
+</para>
+
+<itemizedlist>
+
+<listitem><para>The leader has seen the highest zxid of all the followers.</para></listitem>
+<listitem><para>A quorum of servers have committed to following the leader.</para></listitem>
+
+</itemizedlist>
+
+<para>
+Of these two requirements only the first, the highest zxid amoung the followers 
+needs to hold for correct operation. The second requirement, a quorum of followers, 
+just needs to hold with high probability. We are going to recheck the second requirement, 
+so if a failure happens during or after the leader election and quorum is lost, 
+we will recover by abandoning leader activation and running another election.
+</para>
+
+<para>
+After leader election a single server will be designated as a leader and start 
+waiting for followers to connect. The rest of the servers will try to connect to 
+the leader. The leader will sync up with followers by sending any proposals they 
+are missing, or if a follower is missing too many proposals, it will send a full 
+snapshot of the state to the follower.
+</para>
+
+<para>
+There is a corner case in which a follower that has proposals, U, not seen 
+by a leader arrives. Proposals are seen in order, so the proposals of U will have a zxids 
+higher than zxids seen by the leader. The follower must have arrived after the 
+leader election, otherwise the follower would have been elected leader given that 
+it has seen a higher zxid. Since committed proposals must be seen by a quorum of 
+servers, and a quorum of servers that elected the leader did not see U, the proposals 
+of you have not been committed, so they can be discarded. When the follower connects 
+to the leader, the leader will tell the follower to discard U.
+</para>
+
+<para>
+A new leader establishes a zxid to start using for new proposals by getting the 
+epoch, e, of the highest zxid it has seen and setting the next zxid to use to be 
+(e+1, 0), fter the leader syncs with a follower, it will propose a NEW_LEADER 
+proposal. Once the NEW_LEADER proposal has been committed, the leader will activate 
+and start receiving and issuing proposals.
+</para>
+
+<para>
+It all sounds complicated but here are the basic rules of operation during leader 
+activation:
+</para>
+
+<itemizedlist>
+<listitem><para>A follower will ACK the NEW_LEADER proposal after it has synced with the leader.</para></listitem>
+<listitem><para>A follower will only ACK a NEW_LEADER proposal with a given zxid from a single server.</para></listitem>
+<listitem><para>A new leader will COMMIT the NEW_LEADER proposal when a quorum of followers have ACKed it.</para></listitem>
+<listitem><para>A follower will commit any state it received from the leader when the NEW_LEADER proposal is COMMIT.</para></listitem>
+<listitem><para>A new leader will not accept new proposals until the NEW_LEADER proposal has been COMMITED.</para></listitem>
+</itemizedlist>
+
+<para>
+If leader election terminates erroneously, we don't have a problem since the 
+NEW_LEADER proposal will not be committed since the leader will not have quorum. 
+When this happens, the leader and any remaining followers will timeout and go back 
+to leader election.
+</para>
+
+</section>
+
+<section id="sc_activeMessaging">
+<title>Active Messaging</title>
+<para>
+Leader Activation does all the heavy lifting. Once the leader is coronated he can 
+start blasting out proposals. As long as he remains the leader no other leader can 
+emerge since no other leader will be able to get a quorum of followers. If a new 
+leader does emerge, 
+it means that the leader has lost quorum, and the new leader will clean up any 
+mess left over during her leadership activation.
+</para>
+
+<para>ZooKeeper messaging operates similar to a classic two-phase commit.</para>
+
+<mediaobject id="fg_2phaseCommit" >
+  <imageobject>
+    <imagedata fileref="images/2pc.jpg"/>
+  </imageobject>
+</mediaobject>
+
+<para>
+All communication channels are FIFO, so everything is done in order. Specifically 
+the following operating constraints are observed:</para>
+
+<itemizedlist>
+
+<listitem><para>The leader sends proposals to all followers using 
+the same order. Moreover, this order follows the order in which requests have been 
+received. Because we use FIFO channels this means that followers also receive proposals in order.
+</para></listitem>
+
+<listitem><para>Followers process messages in the order they are received. This 
+means that messages will be ACKed in order and the leader will receive ACKs from 
+followers in order, due to the FIFO channels. It also means that if message $m$ 
+has been written to non-volatile storage, all messages that were proposed before 
+$m$ have been written to non-volatile storage.</para></listitem>
+
+<listitem><para>The leader will issue a COMMIT to all followers as soon as a 
+quorum of followers have ACKed a message. Since messages are ACKed in order, 
+COMMITs will be sent by the leader as received by the followers in order.</para></listitem>
+
+<listitem><para>COMMITs are processed in order. Followers deliver a proposals 
+message when that proposal is committed.</para></listitem>
+
+</itemizedlist>
+
+</section>
+
+<section id="sc_summary">
+<title>Summary</title>
+<para>So there you go. Why does it work? Specifically, why does a set of proposals 
+believed by a new leader always contain any proposal that has actually been committed? 
+First, all proposals have a unique zxid, so unlike other protocols, we never have 
+to worry about two different values being proposed for the same zxid; followers 
+(a leader is also a follower) see and record proposals in order; proposals are 
+committed in order; there is only one active leader at a time since followers only 
+follow a single leader at a time; a new leader has seen all committed proposals 
+from the previous epoch since it has seen the highest zxid from a quorum of servers; 
+any uncommited proposals from a previous epoch seen by a new leader will be committed 
+by that leader before it becomes active.</para></section>
+
+<section id="sc_comparisons"><title>Comparisons</title>
+<para>
+Isn't this just Multi-Paxos? No, Multi-Paxos requires some way of assuring that 
+there is only a single coordinator. We do not count on such assurances. Instead 
+we use the leader activation to recover from leadership change or old leaders 
+believing they are still active.
+</para>
+
+<para>
+Isn't this just Paxos? Your active messaging phase looks just like phase 2 of Paxos? 
+Actually, to us active messaging looks just like 2 phase commit without the need to 
+handle aborts. Active messaging is different from both in the sense that it has 
+cross proposal ordering requirements. If we do not maintain strict FIFO ordering of 
+all packets, it all falls apart. Also, our leader activation phase is different from 
+both of them. In particular, our use of epochs allows us to skip blocks of uncommitted
+proposals and to not worry about duplicate proposals for a given zxid.
+</para>
+
+</section>
+
+</section>
+
+<section id="sc_quorum">
+<title>Quorums</title>
+
+<para>
+Atomic broadcast and leader election use the notion of quorum to guarantee a consistent
+view of the system. By default, ZooKeeper uses majority quorums, which means that every
+voting that happens in one of these protocols requires a majority to vote on. One example is
+acknowledging a leader proposal: the leader can only commit once it receives an
+acknowledgement from a quorum of servers.
+</para>
+
+<para>
+If we extract the properties that we really need from our use of majorities, we have that we only
+need to guarantee that groups of processes used to validate an operation by voting (e.g., acknowledging
+a leader proposal) pairwise intersect in at least one server. Using majorities guarantees such a property.
+However, there are other ways of constructing quorums different from majorities. For example, we can assign
+weights to the votes of servers, and say that the votes of some servers are more important. To obtain a quorum,
+we get enough votes so that the sum of weights of all votes is larger than half of the total sum of all weights.    
+</para>
+
+<para>
+A different construction that uses weights and is useful in wide-area deployments (co-locations) is a hierarchical
+one. With this construction, we split the servers into disjoint groups and assign weights to processes. To form 
+a quorum, we have to get a hold of enough servers from a majority of groups G, such that for each group g in G,
+the sum of votes from g is larger than half of the sum of weights in g. Interestingly, this construction enables
+smaller quorums. If we have, for example, 9 servers, we split them into 3 groups, and assign a weight of 1 to each
+server, then we are able to form quorums of size 4. Note that two subsets of processes composed each of a majority
+of servers from each of a majority of groups necessarily have a non-empty intersection. It is reasonable to expect
+that a majority of co-locations will have a majority of servers available with high probability. 
+</para>  
+
+<para>
+With ZooKeeper, we provide a user with the ability of configuring servers to use majority quorums, weights, or a 
+hierarchy of groups.
+</para>
+</section>
+
+<section id="sc_logging">
+
+<title>Logging</title>
+<para>
+Zookeeper uses 
+<ulink url="http://www.slf4j.org/index.html">slf4j</ulink> as an abstraction layer for logging. 
+<ulink url="http://logging.apache.org/log4j">log4j</ulink> in version 1.2 is chosen as the final logging implementation for now.
+For better embedding support, it is planned in the future to leave the decision of choosing the final logging implementation to the end user.
+Therefore, always use the slf4j api to write log statements in the code, but configure log4j for how to log at runtime.
+Note that slf4j has no FATAL level, former messages at FATAL level have been moved to ERROR level. 
+For information on configuring log4j for
+ZooKeeper, see the <ulink url="zookeeperAdmin.html#sc_logging">Logging</ulink> section 
+of the <ulink url="zookeeperAdmin.html">ZooKeeper Administrator's Guide.</ulink>
+
+</para>
+
+<section id="sc_developerGuidelines"><title>Developer Guidelines</title>
+
+<para>Please follow the  
+<ulink url="http://www.slf4j.org/manual.html">slf4j manual</ulink> when creating log statements within code.
+Also read the
+<ulink url="http://www.slf4j.org/faq.html#logging_performance">FAQ on performance</ulink>
+, when creating log statements. Patch reviewers will look for the following:</para>
+<section id="sc_rightLevel"><title>Logging at the Right Level</title>
+<para>
+There are several levels of logging in slf4j. 
+It's important to pick the right one. In order of higher to lower severity:</para>
+<orderedlist>
+   <listitem><para>ERROR level designates error events that might still allow the application to continue running.</para></listitem>
+   <listitem><para>WARN level designates potentially harmful situations.</para></listitem>
+   <listitem><para>INFO level designates informational messages that highlight the progress of the application at coarse-grained level.</para></listitem>
+   <listitem><para>DEBUG Level designates fine-grained informational events that are most useful to debug an application.</para></listitem>
+   <listitem><para>TRACE Level designates finer-grained informational events than the DEBUG.</para></listitem>
+</orderedlist>
+
+<para>
+ZooKeeper is typically run in production such that log messages of INFO level 
+severity and higher (more severe) are output to the log.</para>
+
+
+</section>
+
+<section id="sc_slf4jIdioms"><title>Use of Standard slf4j Idioms</title>
+
+<para><emphasis>Static Message Logging</emphasis></para>
+<programlisting>
+LOG.debug("process completed successfully!");
+</programlisting>
+
+<para>
+However when creating parameterized messages are required, use formatting anchors.
+</para>
+
+<programlisting>
+LOG.debug("got {} messages in {} minutes",new Object[]{count,time});    
+</programlisting>
+
+
+<para><emphasis>Naming</emphasis></para>
+
+<para>
+Loggers should be named after the class in which they are used.
+</para>
+
+<programlisting>
+public class Foo {
+    private static final Logger LOG = LoggerFactory.getLogger(Foo.class);
+    ....
+    public Foo() {
+       LOG.info("constructing Foo");
+</programlisting>
+
+<para><emphasis>Exception handling</emphasis></para>
+<programlisting>
+try {
+  // code
+} catch (XYZException e) {
+  // do this
+  LOG.error("Something bad happened", e);
+  // don't do this (generally)
+  // LOG.error(e);
+  // why? because "don't do" case hides the stack trace
+ 
+  // continue process here as you need... recover or (re)throw
+}
+</programlisting>
+</section>
+</section>
+
+</section>
+
+</article>

http://git-wip-us.apache.org/repos/asf/zookeeper/blob/4607a3e1/zookeeper-docs/src/documentation/content/xdocs/zookeeperJMX.xml
----------------------------------------------------------------------
diff --git a/zookeeper-docs/src/documentation/content/xdocs/zookeeperJMX.xml b/zookeeper-docs/src/documentation/content/xdocs/zookeeperJMX.xml
new file mode 100644
index 0000000..f0ea636
--- /dev/null
+++ b/zookeeper-docs/src/documentation/content/xdocs/zookeeperJMX.xml
@@ -0,0 +1,236 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright 2002-2004 The Apache Software Foundation
+
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+<!DOCTYPE article PUBLIC "-//OASIS//DTD Simplified DocBook XML V1.0//EN"
+"http://www.oasis-open.org/docbook/xml/simple/1.0/sdocbook.dtd">
+<article id="bk_zookeeperjmx">
+  <title>ZooKeeper JMX</title>
+
+  <articleinfo>
+    <legalnotice>
+      <para>Licensed under the Apache License, Version 2.0 (the "License");
+      you may not use this file except in compliance with the License. You may
+      obtain a copy of the License at <ulink
+      url="http://www.apache.org/licenses/LICENSE-2.0">http://www.apache.org/licenses/LICENSE-2.0</ulink>.</para>
+
+      <para>Unless required by applicable law or agreed to in writing,
+      software distributed under the License is distributed on an "AS IS"
+      BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied. See the License for the specific language governing permissions
+      and limitations under the License.</para>
+    </legalnotice>
+
+    <abstract>
+      <para>ZooKeeper support for JMX</para>
+    </abstract>
+  </articleinfo>
+
+  <section id="ch_jmx">
+    <title>JMX</title>
+    <para>Apache ZooKeeper has extensive support for JMX, allowing you
+    to view and manage a ZooKeeper serving ensemble.</para>
+
+    <para>This document assumes that you have basic knowledge of
+    JMX. See <ulink
+    url="http://java.sun.com/javase/technologies/core/mntr-mgmt/javamanagement/">
+    Sun JMX Technology</ulink> page to get started with JMX.
+    </para>
+
+    <para>See the <ulink
+    url="http://java.sun.com/javase/6/docs/technotes/guides/management/agent.html">
+    JMX Management Guide</ulink> for details on setting up local and
+    remote management of VM instances. By default the included
+    <emphasis>zkServer.sh</emphasis> supports only local management -
+    review the linked document to enable support for remote management
+    (beyond the scope of this document).
+    </para>
+
+  </section>
+
+  <section id="ch_starting">
+    <title>Starting ZooKeeper with JMX enabled</title>
+
+    <para>The class
+      <emphasis>org.apache.zookeeper.server.quorum.QuorumPeerMain</emphasis>
+      will start a JMX manageable ZooKeeper server. This class
+      registers the proper MBeans during initalization to support JMX
+      monitoring and management of the
+      instance. See <emphasis>bin/zkServer.sh</emphasis> for one
+      example of starting ZooKeeper using QuorumPeerMain.</para>
+  </section>
+
+  <section id="ch_console">
+    <title>Run a JMX console</title>
+
+    <para>There are a number of JMX consoles available which can connect
+      to the running server. For this example we will use Sun's
+      <emphasis>jconsole</emphasis>.</para>
+
+    <para>The Java JDK ships with a simple JMX console
+      named <ulink url="http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html">jconsole</ulink>
+      which can be used to connect to ZooKeeper and inspect a running
+      server. Once you've started ZooKeeper using QuorumPeerMain
+      start <emphasis>jconsole</emphasis>, which typically resides in
+      <emphasis>JDK_HOME/bin/jconsole</emphasis></para>
+
+    <para>When the "new connection" window is displayed either connect
+      to local process (if jconsole started on same host as Server) or
+      use the remote process connection.</para>
+
+    <para>By default the "overview" tab for the VM is displayed (this
+      is a great way to get insight into the VM btw). Select
+      the "MBeans" tab.</para>
+
+    <para>You should now see <emphasis>org.apache.ZooKeeperService</emphasis>
+      on the left hand side. Expand this item and depending on how you've
+      started the server you will be able to monitor and manage various
+      service related features.</para>
+
+    <para>Also note that ZooKeeper will register log4j MBeans as
+    well. In the same section along the left hand side you will see
+    "log4j". Expand that to manage log4j through JMX. Of particular
+    interest is the ability to dynamically change the logging levels
+    used by editing the appender and root thresholds. Log4j MBean
+    registration can be disabled by passing
+    <emphasis>-Dzookeeper.jmx.log4j.disable=true</emphasis> to the JVM
+    when starting ZooKeeper.
+    </para>
+
+  </section>
+
+  <section id="ch_reference">
+    <title>ZooKeeper MBean Reference</title>
+
+    <para>This table details JMX for a server participating in a
+    replicated ZooKeeper ensemble (ie not standalone). This is the
+    typical case for a production environment.</para>
+
+    <table>
+      <title>MBeans, their names and description</title>
+
+      <tgroup cols='4'>
+        <thead>
+          <row>
+            <entry>MBean</entry>
+            <entry>MBean Object Name</entry>
+            <entry>Description</entry>
+          </row>
+        </thead>
+        <tbody>
+          <row>
+            <entry>Quorum</entry>
+            <entry>ReplicatedServer_id&lt;#&gt;</entry>
+            <entry>Represents the Quorum, or Ensemble - parent of all
+            cluster members. Note that the object name includes the
+            "myid" of the server (name suffix) that your JMX agent has
+            connected to.</entry>
+          </row>
+          <row>
+            <entry>LocalPeer|RemotePeer</entry>
+            <entry>replica.&lt;#&gt;</entry>
+            <entry>Represents a local or remote peer (ie server
+            participating in the ensemble). Note that the object name
+            includes the "myid" of the server (name suffix).</entry>
+          </row>
+          <row>
+            <entry>LeaderElection</entry>
+            <entry>LeaderElection</entry>
+            <entry>Represents a ZooKeeper cluster leader election which is
+            in progress. Provides information about the election, such as
+            when it started.</entry>
+          </row>
+          <row>
+            <entry>Leader</entry>
+            <entry>Leader</entry>
+            <entry>Indicates that the parent replica is the leader and
+            provides attributes/operations for that server. Note that
+            Leader is a subclass of ZooKeeperServer, so it provides
+            all of the information normally associated with a
+            ZooKeeperServer node.</entry>
+          </row>
+          <row>
+            <entry>Follower</entry>
+            <entry>Follower</entry>
+            <entry>Indicates that the parent replica is a follower and
+            provides attributes/operations for that server. Note that
+            Follower is a subclass of ZooKeeperServer, so it provides
+            all of the information normally associated with a
+            ZooKeeperServer node.</entry>
+          </row>
+          <row>
+            <entry>DataTree</entry>
+            <entry>InMemoryDataTree</entry>
+            <entry>Statistics on the in memory znode database, also
+            operations to access finer (and more computationally
+            intensive) statistics on the data (such as ephemeral
+            count). InMemoryDataTrees are children of ZooKeeperServer
+            nodes.</entry>
+          </row>
+          <row>
+            <entry>ServerCnxn</entry>
+            <entry>&lt;session_id&gt;</entry>
+            <entry>Statistics on each client connection, also
+            operations on those connections (such as
+            termination). Note the object name is the session id of
+            the connection in hex form.</entry>
+          </row>
+    </tbody></tgroup></table>
+
+    <para>This table details JMX for a standalone server. Typically
+    standalone is only used in development situations.</para>
+
+    <table>
+      <title>MBeans, their names and description</title>
+
+      <tgroup cols='4'>
+        <thead>
+          <row>
+            <entry>MBean</entry>
+            <entry>MBean Object Name</entry>
+            <entry>Description</entry>
+          </row>
+        </thead>
+        <tbody>
+          <row>
+            <entry>ZooKeeperServer</entry>
+            <entry>StandaloneServer_port&lt;#&gt;</entry>
+            <entry>Statistics on the running server, also operations
+            to reset these attributes. Note that the object name
+            includes the client port of the server (name
+            suffix).</entry>
+          </row>
+          <row>
+            <entry>DataTree</entry>
+            <entry>InMemoryDataTree</entry>
+            <entry>Statistics on the in memory znode database, also
+            operations to access finer (and more computationally
+            intensive) statistics on the data (such as ephemeral
+            count).</entry>
+          </row>
+          <row>
+            <entry>ServerCnxn</entry>
+            <entry>&lt;session_id&gt;</entry>
+            <entry>Statistics on each client connection, also
+            operations on those connections (such as
+            termination). Note the object name is the session id of
+            the connection in hex form.</entry>
+          </row>
+    </tbody></tgroup></table>
+
+  </section>
+
+</article>

http://git-wip-us.apache.org/repos/asf/zookeeper/blob/4607a3e1/zookeeper-docs/src/documentation/content/xdocs/zookeeperObservers.xml
----------------------------------------------------------------------
diff --git a/zookeeper-docs/src/documentation/content/xdocs/zookeeperObservers.xml b/zookeeper-docs/src/documentation/content/xdocs/zookeeperObservers.xml
new file mode 100644
index 0000000..fab6769
--- /dev/null
+++ b/zookeeper-docs/src/documentation/content/xdocs/zookeeperObservers.xml
@@ -0,0 +1,145 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright 2002-2004 The Apache Software Foundation
+
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+<!DOCTYPE article PUBLIC "-//OASIS//DTD Simplified DocBook XML V1.0//EN"
+"http://www.oasis-open.org/docbook/xml/simple/1.0/sdocbook.dtd">
+<article id="bk_GettStartedGuide">
+  <title>ZooKeeper Observers</title>
+
+  <articleinfo>
+    <legalnotice>
+      <para>Licensed under the Apache License, Version 2.0 (the "License"); you
+	may not use this file except in compliance with the License. You may
+	obtain a copy of the License
+	at <ulink url="http://www.apache.org/licenses/LICENSE-2.0">http://www.apache.org/licenses/LICENSE-2.0</ulink>.</para>
+      
+      <para>Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.</para>
+    </legalnotice>
+
+    <abstract>
+      <para>This guide contains information about using non-voting servers, or
+      observers in your ZooKeeper ensembles.</para>
+    </abstract>
+  </articleinfo>
+
+  <section id="ch_Introduction">
+    <title>Observers: Scaling ZooKeeper Without Hurting Write Performance
+      </title>
+    <para>
+      Although ZooKeeper performs very well by having clients connect directly
+      to voting members of the ensemble, this architecture makes it hard to
+      scale out to huge numbers of clients. The problem is that as we add more
+      voting members, the write performance drops. This is due to the fact that
+      a write operation requires the agreement of (in general) at least half the
+      nodes in an ensemble and therefore the cost of a vote can increase
+      significantly as more voters are added.
+    </para>
+    <para>
+      We have introduced a new type of ZooKeeper node called
+      an <emphasis>Observer</emphasis> which helps address this problem and
+      further improves ZooKeeper's scalability. Observers are non-voting members
+      of an ensemble which only hear the results of votes, not the agreement
+      protocol that leads up to them. Other than this simple distinction,
+      Observers function exactly the same as Followers - clients may connect to
+      them and send read and write requests to them. Observers forward these
+      requests to the Leader like Followers do, but they then simply wait to
+      hear the result of the vote. Because of this, we can increase the number
+      of Observers as much as we like without harming the performance of votes.
+    </para>
+    <para>
+      Observers have other advantages. Because they do not vote, they are not a
+      critical part of the ZooKeeper ensemble. Therefore they can fail, or be
+      disconnected from the cluster, without harming the availability of the
+      ZooKeeper service. The benefit to the user is that Observers may connect
+      over less reliable network links than Followers. In fact, Observers may be
+      used to talk to a ZooKeeper server from another data center. Clients of
+      the Observer will see fast reads, as all reads are served locally, and
+      writes result in minimal network traffic as the number of messages
+      required in the absence of the vote protocol is smaller.
+    </para>
+  </section>
+  <section id="sc_UsingObservers">
+    <title>How to use Observers</title>
+    <para>Setting up a ZooKeeper ensemble that uses Observers is very simple,
+    and requires just two changes to your config files. Firstly, in the config
+    file of every node that is to be an Observer, you must place this line:
+    </para>
+    <programlisting>
+      peerType=observer
+    </programlisting>
+    
+    <para>
+      This line tells ZooKeeper that the server is to be an Observer. Secondly,
+      in every server config file, you must add :observer to the server
+      definition line of each Observer. For example:
+    </para>
+    
+    <programlisting>
+      server.1:localhost:2181:3181:observer
+    </programlisting>
+    
+    <para>
+      This tells every other server that server.1 is an Observer, and that they
+      should not expect it to vote. This is all the configuration you need to do
+      to add an Observer to your ZooKeeper cluster. Now you can connect to it as
+      though it were an ordinary Follower. Try it out, by running:</para>
+    <programlisting>
+      $ bin/zkCli.sh -server localhost:2181
+    </programlisting>
+    <para>
+      where localhost:2181 is the hostname and port number of the Observer as
+      specified in every config file. You should see a command line prompt
+      through which you can issue commands like <emphasis>ls</emphasis> to query
+      the ZooKeeper service.
+    </para>
+  </section>
+  
+  <section id="ch_UseCases">
+    <title>Example use cases</title>
+    <para>
+      Two example use cases for Observers are listed below. In fact, wherever
+      you wish to scale the number of clients of your ZooKeeper ensemble, or
+      where you wish to insulate the critical part of an ensemble from the load
+      of dealing with client requests, Observers are a good architectural
+      choice.
+    </para>
+    <itemizedlist>
+      <listitem>
+	<para> As a datacenter bridge: Forming a ZK ensemble between two
+	datacenters is a problematic endeavour as the high variance in latency
+	between the datacenters could lead to false positive failure detection
+	and partitioning. However if the ensemble runs entirely in one
+	datacenter, and the second datacenter runs only Observers, partitions
+	aren't problematic as the ensemble remains connected. Clients of the
+	Observers may still see and issue proposals.</para>
+      </listitem>
+      <listitem>
+	<para>As a link to a message bus: Some companies have expressed an
+	interest in using ZK as a component of a persistent reliable message
+	bus. Observers would give a natural integration point for this work: a
+	plug-in mechanism could be used to attach the stream of proposals an
+	Observer sees to a publish-subscribe system, again without loading the
+	core ensemble.
+	</para>
+      </listitem>
+    </itemizedlist>
+  </section>
+</article>

http://git-wip-us.apache.org/repos/asf/zookeeper/blob/4607a3e1/zookeeper-docs/src/documentation/content/xdocs/zookeeperOtherInfo.xml
----------------------------------------------------------------------
diff --git a/zookeeper-docs/src/documentation/content/xdocs/zookeeperOtherInfo.xml b/zookeeper-docs/src/documentation/content/xdocs/zookeeperOtherInfo.xml
new file mode 100644
index 0000000..a2445b1
--- /dev/null
+++ b/zookeeper-docs/src/documentation/content/xdocs/zookeeperOtherInfo.xml
@@ -0,0 +1,46 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright 2002-2004 The Apache Software Foundation
+
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+<!DOCTYPE article PUBLIC "-//OASIS//DTD Simplified DocBook XML V1.0//EN"
+"http://www.oasis-open.org/docbook/xml/simple/1.0/sdocbook.dtd">
+<article id="bk_OtherInfo">
+  <title>ZooKeeper</title>
+
+  <articleinfo>
+    <legalnotice>
+      <para>Licensed under the Apache License, Version 2.0 (the "License");
+      you may not use this file except in compliance with the License. You may
+      obtain a copy of the License at <ulink
+      url="http://www.apache.org/licenses/LICENSE-2.0">http://www.apache.org/licenses/LICENSE-2.0</ulink>.</para>
+
+      <para>Unless required by applicable law or agreed to in writing,
+      software distributed under the License is distributed on an "AS IS"
+      BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied. See the License for the specific language governing permissions
+      and limitations under the License.</para>
+    </legalnotice>
+
+    <abstract>
+      <para> currently empty </para>
+    </abstract>
+  </articleinfo>
+
+  <section id="ch_placeholder">
+    <title>Other Info</title>
+    <para> currently empty </para>
+  </section>
+</article>

http://git-wip-us.apache.org/repos/asf/zookeeper/blob/4607a3e1/zookeeper-docs/src/documentation/content/xdocs/zookeeperOver.xml
----------------------------------------------------------------------
diff --git a/zookeeper-docs/src/documentation/content/xdocs/zookeeperOver.xml b/zookeeper-docs/src/documentation/content/xdocs/zookeeperOver.xml
new file mode 100644
index 0000000..f972657
--- /dev/null
+++ b/zookeeper-docs/src/documentation/content/xdocs/zookeeperOver.xml
@@ -0,0 +1,464 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright 2002-2004 The Apache Software Foundation
+
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+<!DOCTYPE article PUBLIC "-//OASIS//DTD Simplified DocBook XML V1.0//EN"
+"http://www.oasis-open.org/docbook/xml/simple/1.0/sdocbook.dtd">
+<article id="bk_Overview">
+  <title>ZooKeeper</title>
+
+  <articleinfo>
+    <legalnotice>
+      <para>Licensed under the Apache License, Version 2.0 (the "License");
+      you may not use this file except in compliance with the License. You may
+      obtain a copy of the License at <ulink
+      url="http://www.apache.org/licenses/LICENSE-2.0">http://www.apache.org/licenses/LICENSE-2.0</ulink>.</para>
+
+      <para>Unless required by applicable law or agreed to in writing,
+      software distributed under the License is distributed on an "AS IS"
+      BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied. See the License for the specific language governing permissions
+      and limitations under the License.</para>
+    </legalnotice>
+
+    <abstract>
+      <para>This document contains overview information about ZooKeeper. It
+      discusses design goals, key concepts, implementation, and
+      performance.</para>
+    </abstract>
+  </articleinfo>
+
+  <section id="ch_DesignOverview">
+    <title>ZooKeeper: A Distributed Coordination Service for Distributed
+    Applications</title>
+
+    <para>ZooKeeper is a distributed, open-source coordination service for
+    distributed applications. It exposes a simple set of primitives that
+    distributed applications can build upon to implement higher level services
+    for synchronization, configuration maintenance, and groups and naming. It
+    is designed to be easy to program to, and uses a data model styled after
+    the familiar directory tree structure of file systems. It runs in Java and
+    has bindings for both Java and C.</para>
+
+    <para>Coordination services are notoriously hard to get right. They are
+    especially prone to errors such as race conditions and deadlock. The
+    motivation behind ZooKeeper is to relieve distributed applications the
+    responsibility of implementing coordination services from scratch.</para>
+
+    <section id="sc_designGoals">
+      <title>Design Goals</title>
+
+      <para><emphasis role="bold">ZooKeeper is simple.</emphasis> ZooKeeper
+      allows distributed processes to coordinate with each other through a
+      shared hierarchal namespace which is organized similarly to a standard
+      file system. The name space consists of data registers - called znodes,
+      in ZooKeeper parlance - and these are similar to files and directories.
+      Unlike a typical file system, which is designed for storage, ZooKeeper
+      data is kept in-memory, which means ZooKeeper can achieve high
+      throughput and low latency numbers.</para>
+
+      <para>The ZooKeeper implementation puts a premium on high performance,
+      highly available, strictly ordered access. The performance aspects of
+      ZooKeeper means it can be used in large, distributed systems. The
+      reliability aspects keep it from being a single point of failure. The
+      strict ordering means that sophisticated synchronization primitives can
+      be implemented at the client.</para>
+
+      <para><emphasis role="bold">ZooKeeper is replicated.</emphasis> Like the
+      distributed processes it coordinates, ZooKeeper itself is intended to be
+      replicated over a sets of hosts called an ensemble.</para>
+
+      <figure>
+        <title>ZooKeeper Service</title>
+
+        <mediaobject>
+          <imageobject>
+            <imagedata fileref="images/zkservice.jpg" />
+          </imageobject>
+        </mediaobject>
+      </figure>
+
+      <para>The servers that make up the ZooKeeper service must all know about
+      each other. They maintain an in-memory image of state, along with a
+      transaction logs and snapshots in a persistent store. As long as a
+      majority of the servers are available, the ZooKeeper service will be
+      available.</para>
+
+      <para>Clients connect to a single ZooKeeper server. The client maintains
+      a TCP connection through which it sends requests, gets responses, gets
+      watch events, and sends heart beats. If the TCP connection to the server
+      breaks, the client will connect to a different server.</para>
+
+      <para><emphasis role="bold">ZooKeeper is ordered.</emphasis> ZooKeeper
+      stamps each update with a number that reflects the order of all
+      ZooKeeper transactions. Subsequent operations can use the order to
+      implement higher-level abstractions, such as synchronization
+      primitives.</para>
+
+      <para><emphasis role="bold">ZooKeeper is fast.</emphasis> It is
+      especially fast in "read-dominant" workloads. ZooKeeper applications run
+      on thousands of machines, and it performs best where reads are more
+      common than writes, at ratios of around 10:1.</para>
+    </section>
+
+    <section id="sc_dataModelNameSpace">
+      <title>Data model and the hierarchical namespace</title>
+
+      <para>The name space provided by ZooKeeper is much like that of a
+      standard file system. A name is a sequence of path elements separated by
+      a slash (/). Every node in ZooKeeper's name space is identified by a
+      path.</para>
+
+      <figure>
+        <title>ZooKeeper's Hierarchical Namespace</title>
+
+        <mediaobject>
+          <imageobject>
+            <imagedata fileref="images/zknamespace.jpg" />
+          </imageobject>
+        </mediaobject>
+      </figure>
+    </section>
+
+    <section>
+      <title>Nodes and ephemeral nodes</title>
+
+      <para>Unlike standard file systems, each node in a ZooKeeper
+      namespace can have data associated with it as well as children. It is
+      like having a file-system that allows a file to also be a directory.
+      (ZooKeeper was designed to store coordination data: status information,
+      configuration, location information, etc., so the data stored at each
+      node is usually small, in the byte to kilobyte range.) We use the term
+      <emphasis>znode</emphasis> to make it clear that we are talking about
+      ZooKeeper data nodes.</para>
+
+      <para>Znodes maintain a stat structure that includes version numbers for
+      data changes, ACL changes, and timestamps, to allow cache validations
+      and coordinated updates. Each time a znode's data changes, the version
+      number increases. For instance, whenever a client retrieves data it also
+      receives the version of the data.</para>
+
+      <para>The data stored at each znode in a namespace is read and written
+      atomically. Reads get all the data bytes associated with a znode and a
+      write replaces all the data. Each node has an Access Control List (ACL)
+      that restricts who can do what.</para>
+
+      <para>ZooKeeper also has the notion of ephemeral nodes. These znodes
+      exists as long as the session that created the znode is active. When the
+      session ends the znode is deleted. Ephemeral nodes are useful when you
+      want to implement <emphasis>[tbd]</emphasis>.</para>
+    </section>
+
+    <section>
+      <title>Conditional updates and watches</title>
+
+      <para>ZooKeeper supports the concept of <emphasis>watches</emphasis>.
+      Clients can set a watch on a znode. A watch will be triggered and
+      removed when the znode changes. When a watch is triggered, the client
+      receives a packet saying that the znode has changed. If the
+      connection between the client and one of the Zoo Keeper servers is
+      broken, the client will receive a local notification. These can be used
+      to <emphasis>[tbd]</emphasis>.</para>
+    </section>
+
+    <section>
+      <title>Guarantees</title>
+
+      <para>ZooKeeper is very fast and very simple. Since its goal, though, is
+      to be a basis for the construction of more complicated services, such as
+      synchronization, it provides a set of guarantees. These are:</para>
+
+      <itemizedlist>
+        <listitem>
+          <para>Sequential Consistency - Updates from a client will be applied
+          in the order that they were sent.</para>
+        </listitem>
+
+        <listitem>
+          <para>Atomicity - Updates either succeed or fail. No partial
+          results.</para>
+        </listitem>
+
+        <listitem>
+          <para>Single System Image - A client will see the same view of the
+          service regardless of the server that it connects to.</para>
+        </listitem>
+      </itemizedlist>
+
+      <itemizedlist>
+        <listitem>
+          <para>Reliability - Once an update has been applied, it will persist
+          from that time forward until a client overwrites the update.</para>
+        </listitem>
+      </itemizedlist>
+
+      <itemizedlist>
+        <listitem>
+          <para>Timeliness - The clients view of the system is guaranteed to
+          be up-to-date within a certain time bound.</para>
+        </listitem>
+      </itemizedlist>
+
+      <para>For more information on these, and how they can be used, see
+      <emphasis>[tbd]</emphasis></para>
+    </section>
+
+    <section>
+      <title>Simple API</title>
+
+      <para>One of the design goals of ZooKeeper is provide a very simple
+      programming interface. As a result, it supports only these
+      operations:</para>
+
+      <variablelist>
+        <varlistentry>
+          <term>create</term>
+
+          <listitem>
+            <para>creates a node at a location in the tree</para>
+          </listitem>
+        </varlistentry>
+
+        <varlistentry>
+          <term>delete</term>
+
+          <listitem>
+            <para>deletes a node</para>
+          </listitem>
+        </varlistentry>
+
+        <varlistentry>
+          <term>exists</term>
+
+          <listitem>
+            <para>tests if a node exists at a location</para>
+          </listitem>
+        </varlistentry>
+
+        <varlistentry>
+          <term>get data</term>
+
+          <listitem>
+            <para>reads the data from a node</para>
+          </listitem>
+        </varlistentry>
+
+        <varlistentry>
+          <term>set data</term>
+
+          <listitem>
+            <para>writes data to a node</para>
+          </listitem>
+        </varlistentry>
+
+        <varlistentry>
+          <term>get children</term>
+
+          <listitem>
+            <para>retrieves a list of children of a node</para>
+          </listitem>
+        </varlistentry>
+
+        <varlistentry>
+          <term>sync</term>
+
+          <listitem>
+            <para>waits for data to be propagated</para>
+          </listitem>
+        </varlistentry>
+      </variablelist>
+
+      <para>For a more in-depth discussion on these, and how they can be used
+      to implement higher level operations, please refer to
+      <emphasis>[tbd]</emphasis></para>
+    </section>
+
+    <section>
+      <title>Implementation</title>
+
+      <para><xref linkend="fg_zkComponents" /> shows the high-level components
+      of the ZooKeeper service. With the exception of the request processor,
+     each of
+      the servers that make up the ZooKeeper service replicates its own copy
+      of each of the components.</para>
+
+      <figure id="fg_zkComponents">
+        <title>ZooKeeper Components</title>
+
+        <mediaobject>
+          <imageobject>
+            <imagedata fileref="images/zkcomponents.jpg" />
+          </imageobject>
+        </mediaobject>
+      </figure>
+
+      <para>The replicated database is an in-memory database containing the
+      entire data tree. Updates are logged to disk for recoverability, and
+      writes are serialized to disk before they are applied to the in-memory
+      database.</para>
+
+      <para>Every ZooKeeper server services clients. Clients connect to
+      exactly one server to submit irequests. Read requests are serviced from
+      the local replica of each server database. Requests that change the
+      state of the service, write requests, are processed by an agreement
+      protocol.</para>
+
+      <para>As part of the agreement protocol all write requests from clients
+      are forwarded to a single server, called the
+      <emphasis>leader</emphasis>. The rest of the ZooKeeper servers, called
+      <emphasis>followers</emphasis>, receive message proposals from the
+      leader and agree upon message delivery. The messaging layer takes care
+      of replacing leaders on failures and syncing followers with
+      leaders.</para>
+
+      <para>ZooKeeper uses a custom atomic messaging protocol. Since the
+      messaging layer is atomic, ZooKeeper can guarantee that the local
+      replicas never diverge. When the leader receives a write request, it
+      calculates what the state of the system is when the write is to be
+      applied and transforms this into a transaction that captures this new
+      state.</para>
+    </section>
+
+    <section>
+      <title>Uses</title>
+
+      <para>The programming interface to ZooKeeper is deliberately simple.
+      With it, however, you can implement higher order operations, such as
+      synchronizations primitives, group membership, ownership, etc. Some
+      distributed applications have used it to: <emphasis>[tbd: add uses from
+      white paper and video presentation.]</emphasis> For more information, see
+      <emphasis>[tbd]</emphasis></para>
+    </section>
+
+    <section>
+      <title>Performance</title>
+
+      <para>ZooKeeper is designed to be highly performant. But is it? The
+      results of the ZooKeeper's development team at Yahoo! Research indicate
+      that it is. (See <xref linkend="fg_zkPerfRW" />.) It is especially high
+      performance in applications where reads outnumber writes, since writes
+      involve synchronizing the state of all servers. (Reads outnumbering
+      writes is typically the case for a coordination service.)</para>
+
+      <figure id="fg_zkPerfRW">
+        <title>ZooKeeper Throughput as the Read-Write Ratio Varies</title>
+
+        <mediaobject>
+          <imageobject>
+            <imagedata fileref="images/zkperfRW-3.2.jpg" />
+          </imageobject>
+        </mediaobject>
+      </figure>
+      <para>The figure <xref linkend="fg_zkPerfRW"/> is a throughput
+      graph of ZooKeeper release 3.2 running on servers with dual 2Ghz
+      Xeon and two SATA 15K RPM drives.  One drive was used as a
+      dedicated ZooKeeper log device. The snapshots were written to
+      the OS drive. Write requests were 1K writes and the reads were
+      1K reads.  "Servers" indicate the size of the ZooKeeper
+      ensemble, the number of servers that make up the
+      service. Approximately 30 other servers were used to simulate
+      the clients. The ZooKeeper ensemble was configured such that
+      leaders do not allow connections from clients.</para>
+
+      <note><para>In version 3.2 r/w performance improved by ~2x
+      compared to the <ulink
+      url="http://zookeeper.apache.org/docs/r3.1.1/zookeeperOver.html#Performance">previous
+      3.1 release</ulink>.</para></note>
+
+      <para>Benchmarks also indicate that it is reliable, too. <xref
+      linkend="fg_zkPerfReliability" /> shows how a deployment responds to
+      various failures. The events marked in the figure are the
+      following:</para>
+
+      <orderedlist>
+        <listitem>
+          <para>Failure and recovery of a follower</para>
+        </listitem>
+
+        <listitem>
+          <para>Failure and recovery of a different follower</para>
+        </listitem>
+
+        <listitem>
+          <para>Failure of the leader</para>
+        </listitem>
+
+        <listitem>
+          <para>Failure and recovery of two followers</para>
+        </listitem>
+
+        <listitem>
+          <para>Failure of another leader</para>
+        </listitem>
+      </orderedlist>
+    </section>
+
+    <section>
+      <title>Reliability</title>
+
+      <para>To show the behavior of the system over time as
+        failures are injected we ran a ZooKeeper service made up of
+        7 machines. We ran the same saturation benchmark as before,
+        but this time we kept the write percentage at a constant
+        30%, which is a conservative ratio of our expected
+        workloads.
+      </para>
+      <figure id="fg_zkPerfReliability">
+        <title>Reliability in the Presence of Errors</title>
+        <mediaobject>
+          <imageobject>
+            <imagedata fileref="images/zkperfreliability.jpg" />
+          </imageobject>
+        </mediaobject>
+      </figure>
+
+      <para>The are a few important observations from this graph. First, if
+      followers fail and recover quickly, then ZooKeeper is able to sustain a
+      high throughput despite the failure. But maybe more importantly, the
+      leader election algorithm allows for the system to recover fast enough
+      to prevent throughput from dropping substantially. In our observations,
+      ZooKeeper takes less than 200ms to elect a new leader. Third, as
+      followers recover, ZooKeeper is able to raise throughput again once they
+      start processing requests.</para>
+    </section>
+
+    <section>
+      <title>The ZooKeeper Project</title>
+
+      <para>ZooKeeper has been
+          <ulink url="https://cwiki.apache.org/confluence/display/ZOOKEEPER/PoweredBy">
+          successfully used
+        </ulink>
+        in many industrial applications.  It is used at Yahoo! as the
+        coordination and failure recovery service for Yahoo! Message
+        Broker, which is a highly scalable publish-subscribe system
+        managing thousands of topics for replication and data
+        delivery.  It is used by the Fetching Service for Yahoo!
+        crawler, where it also manages failure recovery. A number of
+        Yahoo! advertising systems also use ZooKeeper to implement
+        reliable services.
+      </para>
+
+      <para>All users and developers are encouraged to join the
+        community and contribute their expertise. See the
+        <ulink url="http://zookeeper.apache.org/">
+          Zookeeper Project on Apache
+        </ulink>
+        for more information.
+      </para>
+    </section>
+  </section>
+</article>


Mime
View raw message