Getting Started: Coordinating Distributed Applications with ZooKeeper

This document contains information to get you started quickly with ZooKeeper. It is aimed primarily at developers hoping to try it out, and contains simple installation instructions for a single ZooKeeper server, a few commands to verify that it is running, and a simple programming example. Finally, as a convenience, there are a few sections regarding more complicated installations, for example running replicated deployments, and optimizing the transaction log. However, for the complete instructions for commercial deployments, please refer to the ZooKeeper Administrator's Guide.
To get a ZooKeeper distribution, download a recent stable release from one of the Apache Download Mirrors.
Standalone Operation

Setting up a ZooKeeper server in standalone mode is straightforward. The server is contained in a single JAR file, so installation consists of creating a configuration.

Once you've downloaded a stable ZooKeeper release, unpack it and cd to the root.
To start ZooKeeper you need a configuration file. Here is a sample; create it in conf/zoo.cfg:
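A minimal file using the three settings explained below looks like this (the dataDir path is only an example; point it at any existing, initially empty directory):

    tickTime=2000
    dataDir=/var/lib/zookeeper
    clientPort=2181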
This file can be called anything, but for the sake of this discussion call it conf/zoo.cfg. Change the value of dataDir to specify an existing (empty to start with) directory. Here are the meanings for each of the fields:
tickTime
    the basic time unit in milliseconds used by ZooKeeper. It is used to do heartbeats, and the minimum session timeout will be twice the tickTime.

dataDir
    the location to store the in-memory database snapshots and, unless specified otherwise, the transaction log of updates to the database.

clientPort
    the port to listen for client connections.
Now that you created the configuration file, you can start ZooKeeper:

    bin/zkServer.sh start
ZooKeeper logs messages using log4j -- more detail available in the Logging section of the Programmer's Guide. You will see log messages coming to the console (default) and/or a log file depending on the log4j configuration.
The steps outlined here run ZooKeeper in standalone mode. There is no replication, so if the ZooKeeper process fails, the service will go down. This is fine for most development situations, but to run ZooKeeper in replicated mode, please see Running Replicated ZooKeeper.
Managing ZooKeeper Storage

For long running production systems ZooKeeper storage must be managed externally (dataDir and logs). See the section on maintenance for more details.
Connecting to ZooKeeper

Once ZooKeeper is running, you have several options for connecting to it:

Java: Use

    bin/zkCli.sh -server 127.0.0.1:2181

This lets you perform simple, file-like operations.

C: compile cli_mt (multi-threaded) or cli_st (single-threaded) by running make cli_mt or make cli_st in the src/c subdirectory in the ZooKeeper sources. See the README contained within src/c for full details.

You can run the program from src/c using:

    LD_LIBRARY_PATH=. cli_mt 127.0.0.1:2181

or

    LD_LIBRARY_PATH=. cli_st 127.0.0.1:2181

This will give you a simple shell to execute file system like operations on ZooKeeper.
Once you have connected, you should see something like:

    Connecting to localhost:2181
    log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
    log4j:WARN Please initialize the log4j system properly.
    Welcome to ZooKeeper!
    JLine support is enabled
    [zkshell: 0]
From the shell, type help to get a listing of commands that can be executed from the client, as in:

    [zkshell: 0] help
    ZooKeeper host:port cmd args
            get path [watch]
            ls path [watch]
            set path data [version]
            delquota [-n|-b] path
            quit
            printwatches on|off
            create path data acl
            stat path [watch]
            listquota path
            history
            setAcl path acl
            getAcl path
            sync path
            redo cmdno
            addauth scheme auth
            delete path [version]
            setquota -n|-b val path
From here, you can try a few simple commands to get a feel for this simple command line interface. First, start by issuing the list command, as in ls, yielding:

    [zkshell: 8] ls /
    [zookeeper]
Next, create a new znode by running create /zk_test my_data. This creates a new znode and associates the string "my_data" with the node. You should see:
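For example (the zkshell counter shown here will vary with how many commands you have already issued):

    [zkshell: 9] create /zk_test my_data
    Created /zk_test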
That's it for now. To explore more, continue with the rest of this document and see the Programmer's Guide.
Programming to ZooKeeper

ZooKeeper has Java bindings and C bindings. They are functionally equivalent. The C bindings exist in two variants: single threaded and multi-threaded. These differ only in how the messaging loop is done. For more information, see the Programming Examples in the ZooKeeper Programmer's Guide for sample code using the different APIs.
Running Replicated ZooKeeper

Running ZooKeeper in standalone mode is convenient for evaluation, some development, and testing. But in production, you should run ZooKeeper in replicated mode. A replicated group of servers in the same application is called a quorum, and in replicated mode, all servers in the quorum have copies of the same configuration file. The file is similar to the one used in standalone mode, but with a few differences. Here is an example:
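A sketch of such a file, combining the standalone settings above with the new entries explained below (the host names are placeholders, and the syncLimit value shown is illustrative):

    tickTime=2000
    dataDir=/var/lib/zookeeper
    clientPort=2181
    initLimit=5
    syncLimit=2
    server.1=zoo1:2888:3888
    server.2=zoo2:2888:3888
    server.3=zoo3:2888:3888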
The new entry initLimit is a timeout ZooKeeper uses to limit the length of time the ZooKeeper servers in the quorum have to connect to a leader. The entry syncLimit limits how far out of date a server can be from a leader.

With both of these timeouts, you specify the unit of time using tickTime. In this example, the timeout for initLimit is 5 ticks at 2000 milliseconds a tick, or 10 seconds.
The entries of the form server.X list the servers that make up the ZooKeeper service. When the server starts up, it knows which server it is by looking for the file myid in the data directory. That file contains the server number, in ASCII.
Finally, note the two port numbers after each server name: "2888" and "3888". Peers use the former port to connect to other peers. Such a connection is necessary so that peers can communicate, for example, to agree upon the order of updates. More specifically, a ZooKeeper server uses this port to connect followers to the leader. When a new leader arises, a follower opens a TCP connection to the leader using this port. Because the default leader election also uses TCP, we currently require another port for leader election. This is the second port in the server entry.
Note

If you want to test multiple servers on a single machine, specify the servername as localhost with unique quorum & leader election ports (i.e. 2888:3888, 2889:3889, 2890:3890 in the example above) for each server.X in that server's config file. Of course separate dataDirs and distinct clientPorts are also necessary (in the above replicated example, running on a single localhost, you would still have three config files). A sketch of such a setup follows this note.
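For illustration, assuming three config files named zoo1.cfg, zoo2.cfg, and zoo3.cfg on one machine (the file names, paths, and extra client ports are hypothetical), the server lines are identical in all three files while dataDir and clientPort differ:

    # common to zoo1.cfg, zoo2.cfg and zoo3.cfg
    server.1=localhost:2888:3888
    server.2=localhost:2889:3889
    server.3=localhost:2890:3890

    # per-file settings, e.g. in zoo2.cfg (its dataDir holds a myid file containing "2"):
    dataDir=/tmp/zookeeper/2
    clientPort=2182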
Other Optimizations

There are a couple of other configuration parameters that can greatly increase performance:

To get low latencies on updates it is important to have a dedicated transaction log directory. By default transaction logs are put in the same directory as the data snapshots and myid file. The dataLogDir parameter indicates a different directory to use for the transaction logs.
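For example, a sketch assuming a dedicated device is mounted at /var/lib/zookeeper-txnlog (both paths are illustrative):

    dataDir=/var/lib/zookeeper
    dataLogDir=/var/lib/zookeeper-txnlog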
Programming with ZooKeeper - A Basic Tutorial

In this tutorial, we show simple implementations of barriers and producer-consumer queues using ZooKeeper. We call the respective classes Barrier and Queue. These examples assume that you have at least one ZooKeeper server running.
Both primitives use the following common excerpt of code:
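A minimal sketch of such a base class, matching the description in the following paragraphs (the 3000 ms session timeout is illustrative):

    import java.io.IOException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class SyncPrimitive implements Watcher {

        // One ZooKeeper handle shared by all Barrier and Queue instances
        static ZooKeeper zk = null;
        static Integer mutex;

        String root;

        SyncPrimitive(String address) {
            if (zk == null) {
                try {
                    zk = new ZooKeeper(address, 3000, this);
                    mutex = new Integer(-1);
                } catch (IOException e) {
                    System.out.println(e.toString());
                    zk = null;
                }
            }
        }

        // Notifications triggered by watches arrive here; wake up any waiting thread.
        synchronized public void process(WatchedEvent event) {
            synchronized (mutex) {
                mutex.notify();
            }
        }
    }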
Both classes extend SyncPrimitive. In this way, we execute steps that are common to all primitives in the constructor of SyncPrimitive. To keep the examples simple, we create a ZooKeeper object the first time we instantiate either a barrier object or a queue object, and we declare a static variable that is a reference to this object. The subsequent instances of Barrier and Queue check whether a ZooKeeper object exists. Alternatively, we could have the application create a ZooKeeper object and pass it to the constructors of Barrier and Queue.

We use the process() method to process notifications triggered due to watches. In the following discussion, we present code that sets watches. A watch is an internal structure that enables ZooKeeper to notify a client of a change to a node. For example, if a client is waiting for other clients to leave a barrier, then it can set a watch and wait for modifications to a particular node, which can indicate that it is the end of the wait. This point becomes clear once we go over the examples.
Barriers

A barrier is a primitive that enables a group of processes to synchronize the beginning and the end of a computation. The general idea of this implementation is to have a barrier node that serves the purpose of being a parent for individual process nodes. Suppose that we call the barrier node "/b1". Each process "p" then creates a node "/b1/p". Once enough processes have created their corresponding nodes, joined processes can start the computation.
In this example, each process instantiates a Barrier object, and its constructor takes as parameters:

    - the address of a ZooKeeper server (e.g., "zoo1.foo.com:2181")
    - the path of the barrier node on ZooKeeper (e.g., "/b1")
    - the size of the group of processes
The constructor of Barrier passes the address of the ZooKeeper server to the constructor of the parent class. The parent class creates a ZooKeeper instance if one does not exist. The constructor of Barrier then creates a barrier node on ZooKeeper, which is the parent node of all process nodes; we call it root. (Note: this is not the ZooKeeper root "/".)
To enter the barrier, a process calls enter(). The process creates a node under the root to represent it, using its host name to form the node name. It then waits until enough processes have entered the barrier. A process does this by checking the number of children the root node has with "getChildren()", and waiting for notifications if it does not yet have enough. To receive a notification when there is a change to the root node, a process has to set a watch, and it does so through the call to "getChildren()". "getChildren()" has two parameters: the first one states the node to read from, and the second is a boolean flag that enables the process to set a watch. In the code below, the flag is true.
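A sketch of a Barrier along these lines, extending the SyncPrimitive shown earlier (node data, the ephemeral flag, and exception handling details are illustrative):

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs.Ids;

    public class Barrier extends SyncPrimitive {
        int size;
        String name;

        Barrier(String address, String root, int size) {
            super(address);
            this.root = root;
            this.size = size;

            // Create the barrier node if it does not exist yet
            if (zk != null) {
                try {
                    if (zk.exists(root, false) == null) {
                        zk.create(root, new byte[0], Ids.OPEN_ACL_UNSAFE,
                                CreateMode.PERSISTENT);
                    }
                    // The node representing this process is named after its host
                    name = root + "/" + InetAddress.getLocalHost().getCanonicalHostName();
                } catch (KeeperException | InterruptedException | UnknownHostException e) {
                    System.out.println(e.toString());
                }
            }
        }

        // Wait until "size" processes have created their nodes under the barrier node
        boolean enter() throws KeeperException, InterruptedException {
            zk.create(name, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            while (true) {
                synchronized (mutex) {
                    // true: also set a watch so that process() wakes us up on changes
                    List<String> list = zk.getChildren(root, true);
                    if (list.size() < size) {
                        mutex.wait();
                    } else {
                        return true;
                    }
                }
            }
        }
    }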
Note that enter() throws both KeeperException and InterruptedException, so it is the responsibility of the application to catch and handle such exceptions.
Once the computation is finished, a process calls leave() to leave the barrier. First it deletes its corresponding node, and then it gets the children of the root node. If there is at least one child, then it waits for a notification (note that the second parameter of the call to getChildren() is true, meaning that ZooKeeper sets a watch on the root node). Upon reception of a notification, it checks once more whether the root node has any children.
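Continuing the sketch above, leave() might look like this:

    // Delete this process's node, then wait until every node under the barrier is gone
    boolean leave() throws KeeperException, InterruptedException {
        zk.delete(name, -1);     // -1: delete regardless of version
        while (true) {
            synchronized (mutex) {
                List<String> list = zk.getChildren(root, true);
                if (list.size() > 0) {
                    mutex.wait();
                } else {
                    return true;
                }
            }
        }
    }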
Producer-Consumer Queues

A producer-consumer queue is a distributed data structure that a group of processes use to generate and consume items. Producer processes create new elements and add them to the queue. Consumer processes remove elements from the list, and process them. In this implementation, the elements are simple integers. The queue is represented by a root node, and to add an element to the queue, a producer process creates a new node, a child of the root node.
The following excerpt of code corresponds to the constructor of the object. As with Barrier objects, it first calls the constructor of the parent class, SyncPrimitive, that creates a ZooKeeper object if one doesn't exist. It then verifies whether the root node of the queue exists, and creates it if it doesn't.
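A sketch of that constructor, reusing the imports and conventions of the Barrier sketch above:

    public class Queue extends SyncPrimitive {

        Queue(String address, String name) {
            super(address);
            this.root = name;

            // Create the queue's root node if it does not exist yet
            if (zk != null) {
                try {
                    if (zk.exists(root, false) == null) {
                        zk.create(root, new byte[0], Ids.OPEN_ACL_UNSAFE,
                                CreateMode.PERSISTENT);
                    }
                } catch (KeeperException | InterruptedException e) {
                    System.out.println(e.toString());
                }
            }
        }
    }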
A producer process calls "produce()" to add an element to the queue, and passes an integer as an argument. To add an element to the queue, the method creates a new node using "create()", and uses the SEQUENCE flag to instruct ZooKeeper to append the value of the sequence counter associated with the root node. In this way, we impose a total order on the elements of the queue, thus guaranteeing that the oldest element of the queue is the next one consumed.
    /**
     * Add element to the queue.
     *
     * @param i the integer value to enqueue
     * @return true once the element has been added
     */
    boolean produce(int i) throws KeeperException, InterruptedException {
        ByteBuffer b = ByteBuffer.allocate(4);
        byte[] value;

        // Add child with value i
        b.putInt(i);
        value = b.array();
        zk.create(root + "/element", value, Ids.OPEN_ACL_UNSAFE,
                CreateMode.PERSISTENT_SEQUENTIAL);

        return true;
    }
To consume an element, a consumer process obtains the children of the root node, reads the node with the smallest counter value, and returns the element. Note that if there is a conflict, then one of the two contending processes won't be able to delete the node and the delete operation will throw an exception.

A call to getChildren() returns the list of children in lexicographic order. As lexicographic order does not necessarily follow the numerical order of the counter values, we need to decide which element is the smallest. To decide which one has the smallest counter value, we traverse the list, and remove the prefix "element" from each one.
    /**
     * Remove the first element from the queue.
     *
     * @return the value of the consumed element
     * @throws KeeperException
     * @throws InterruptedException
     */
    int consume() throws KeeperException, InterruptedException {
        int retvalue = -1;
        Stat stat = null;

        // Get the first element available
        while (true) {
            synchronized (mutex) {
                List<String> list = zk.getChildren(root, true);
                if (list.size() == 0) {
                    System.out.println("Going to wait");
                    mutex.wait();
                } else {
                    // Child names look like "element0000000001"; compare the numeric
                    // suffixes, but keep the full name of the smallest child so the
                    // path retains its leading zeros.
                    String minName = list.get(0);
                    int min = Integer.parseInt(minName.substring(7));
                    for (String s : list) {
                        int tempValue = Integer.parseInt(s.substring(7));
                        if (tempValue < min) {
                            min = tempValue;
                            minName = s;
                        }
                    }
                    System.out.println("Consuming: " + root + "/" + minName);
                    byte[] b = zk.getData(root + "/" + minName, false, stat);
                    zk.delete(root + "/" + minName, 0);
                    ByteBuffer buffer = ByteBuffer.wrap(b);
                    retvalue = buffer.getInt();

                    return retvalue;
                }
            }
        }
    }
ZooKeeper Recipes and Solutions

... such as event handles or queues, a more practical means of performing the same function. In general, the examples in this section are designed to stimulate thought.
Important Note About Error Handling

When implementing the recipes you must handle recoverable exceptions (see the FAQ). In particular, several of the recipes employ sequential ephemeral nodes. When creating a sequential ephemeral node there is an error case in which the create() succeeds on the server but the server crashes before returning the name of the node to the client. When the client reconnects its session is still valid and, thus, the node is not removed. The implication is that it is difficult for the client to know if its node was created or not. The recipes below include measures to handle this.
Out of the Box Applications: Name Service, Configuration, Group Membership
Locks

Call create( ) with a pathname of "_locknode_/guid-lock-" and the sequence and ephemeral flags set. The guid is needed in case the create() result is missed. See the note below.
If a recoverable error occurs calling create() the client should call getChildren() and check for a node containing the guid used in the path name. This handles the case (noted above) of the create() succeeding on the server but the server crashing before returning the name of the new node. A sketch of this create-and-verify step follows.
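A minimal sketch, assuming a ZooKeeper handle zk, a lock directory such as "/_locknode_", and treating connection loss as the recoverable error (the helper name createLockNode is hypothetical):

    import java.util.UUID;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    // Create the sequential ephemeral lock node, recovering if the reply was lost.
    static String createLockNode(ZooKeeper zk, String lockDir)
            throws KeeperException, InterruptedException {
        String guid = UUID.randomUUID().toString();
        try {
            return zk.create(lockDir + "/" + guid + "-lock-", new byte[0],
                    Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        } catch (KeeperException.ConnectionLossException e) {
            // The create() may have succeeded on the server even though the reply
            // was lost; look for a child carrying our guid before giving up.
            for (String child : zk.getChildren(lockDir, false)) {
                if (child.startsWith(guid)) {
                    return lockDir + "/" + child;
                }
            }
            throw e;    // the node really was not created; the caller may retry
        }
    }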
Shared Locks

You can implement shared locks with a few changes to the lock protocol:

Obtaining a read lock: Call create( ) to create a node with pathname "guid-/read-". This is the lock node used later in the protocol. Make sure to set both the sequence and ephemeral flags.

Obtaining a write lock: Call create( ) to create a node with pathname "guid-/write-". This is the lock node spoken of later in the protocol. Make sure to set both sequence and ephemeral flags.
It might appear that this recipe creates a herd effect: when there is a large group of clients waiting for a read lock, they are all notified at once when the write lock is released. In fact, that is valid behavior: all those waiting reader clients should be released, since they have the lock. The herd effect refers to releasing a "herd" when in fact only a single or a small number of machines can proceed.

See the note for Locks on how to use the guid in the node.
Recoverable Shared Locks

With minor modifications to the Shared Lock protocol, you make shared locks revocable.
Leader Election

A simple way of doing leader election with ZooKeeper is to use the SEQUENCE|EPHEMERAL flags when creating znodes that represent "proposals" of clients. The idea is to have a znode, say "/election", such that each znode creates a child znode "/election/guid-n_" with both flags SEQUENCE|EPHEMERAL. With the sequence flag, ZooKeeper automatically appends a sequence number that is greater than any one previously appended to a child of "/election". The process that created the znode with the smallest appended sequence number is the leader.
The protocol to volunteer to be a leader includes, among other steps:

    - Create znode z with path "ELECTION/guid-n_" with both SEQUENCE and EPHEMERAL flags;
    - Otherwise, watch for changes on "ELECTION/guid-n_j", where j is the largest sequence number such that j < i (the sequence number of z) and n_j is a znode in C (the children of "ELECTION");

A sketch of this volunteering step follows.
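A sketch of that step, assuming a ZooKeeper handle zk, an existing "/ELECTION" parent znode, and a caller-supplied Watcher that will be notified when the watched znode changes (the helper name volunteer is hypothetical):

    import java.util.UUID;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    // Volunteer to be leader; returns true if this process is currently the leader.
    static boolean volunteer(ZooKeeper zk, Watcher watcher)
            throws KeeperException, InterruptedException {
        String guid = UUID.randomUUID().toString();
        String z = zk.create("/ELECTION/" + guid + "-n_", new byte[0],
                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // ZooKeeper appends a zero-padded 10-digit sequence number to the name.
        String mySeq = z.substring(z.length() - 10);
        String watchTarget = null;
        String largestBelow = null;
        for (String child : zk.getChildren("/ELECTION", false)) {
            String seq = child.substring(child.length() - 10);
            if (seq.compareTo(mySeq) < 0
                    && (largestBelow == null || seq.compareTo(largestBelow) > 0)) {
                largestBelow = seq;
                watchTarget = child;
            }
        }
        if (watchTarget == null) {
            return true;    // no smaller sequence number exists: we are the leader
        }
        // Watch the znode with the largest sequence number smaller than ours.
        zk.exists("/ELECTION/" + watchTarget, watcher);
        return false;
    }

When the watch fires, the client re-examines the children of "/ELECTION" in the same way to decide whether it has become the leader.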
Notes:
Note that the znode having no preceding znode on the list of children does not imply that the creator of this znode is aware that it is the current leader. Applications may consider creating a separate znode to acknowledge that the leader has executed the leader procedure.

See the note for Locks on how to use the guid in the node.
fsync.warningthresholdms
(Java system property: fsync.warningthresholdms)

New in 3.3.4: A warning message will be output to the log whenever an fsync in the Transactional Log (WAL) takes longer than this value. The value is specified in milliseconds and defaults to 1000. This value can only be set as a system property.
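Since it can only be set as a Java system property, the value is passed to the server JVM; for example (assuming the server is started via zkServer.sh, which picks up the JVMFLAGS environment variable, and using an illustrative threshold of 50 ms):

    export JVMFLAGS="-Dfsync.warningthresholdms=50"
    bin/zkServer.sh start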
New features that are currently considered experimental:

Read Only Mode Server
(Java system property: readonlymode.enabled)

New in 3.4.0: Setting this value to true enables Read Only Mode server support (disabled by default). ROM allows client sessions which requested ROM support to connect to the server even when the server might be partitioned from the quorum. In this mode ROM clients can still read values from the ZK service, but will be unable to write values or see changes from other clients. See ZOOKEEPER-784 for more details.
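On the client side, a session must ask for read-only support when the handle is created. A sketch using the Java client (the canBeReadOnly boolean constructor argument is available from 3.4.0; the connect string and timeout are illustrative):

    import java.io.IOException;
    import org.apache.zookeeper.ZooKeeper;

    static ZooKeeper connectReadOnlyCapable(String hostPort) throws IOException {
        // Last argument true: this session may attach to a read-only (partitioned) server.
        return new ZooKeeper(hostPort, 3000, event -> { }, true);
    }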
Unsafe Options

The following options can be useful, but be careful when you use them.
News

This release fixes a critical bug with data loss in 3.4.0. See the ZooKeeper 3.4.1 Release Notes for details. In case you are already using the 3.4.0 release, please upgrade ASAP. Please note that this is an alpha release and not ready for production as of now.
26 Nov, 2011: release 3.3.4 available

The release fixes a number of critical bugs that could cause data corruption. See the ZooKeeper 3.3.4 Release Notes for details.
22 Nov, 2011: release 3.4.0 available

This is a major release coming after a long time. It has many features including security, multi-update APIs, rpm/deb support, and native Windows support for the C client. Due to data loss issues, this release has been removed from the downloads page. Release 3.4.1 is now available.