bookkeeper-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From iv...@apache.org
Subject svn commit: r1645996 - in /bookkeeper/site/trunk/content/docs/master: bookkeeperLedgers2Logs.textile bookkeeperTutorial.textile
Date Tue, 16 Dec 2014 16:22:07 GMT
Author: ivank
Date: Tue Dec 16 16:22:06 2014
New Revision: 1645996

URL: http://svn.apache.org/r1645996
Log:
Syncing website with git

Added:
    bookkeeper/site/trunk/content/docs/master/bookkeeperLedgers2Logs.textile
    bookkeeper/site/trunk/content/docs/master/bookkeeperTutorial.textile

Added: bookkeeper/site/trunk/content/docs/master/bookkeeperLedgers2Logs.textile
URL: http://svn.apache.org/viewvc/bookkeeper/site/trunk/content/docs/master/bookkeeperLedgers2Logs.textile?rev=1645996&view=auto
==============================================================================
--- bookkeeper/site/trunk/content/docs/master/bookkeeperLedgers2Logs.textile (added)
+++ bookkeeper/site/trunk/content/docs/master/bookkeeperLedgers2Logs.textile Tue Dec 16 16:22:06
2014
@@ -0,0 +1,56 @@
+Title:     From Ledgers to Logs
+Notice:    Licensed to the Apache Software Foundation (ASF) under one
+           or more contributor license agreements.  See the NOTICE file
+           distributed with this work for additional information
+           regarding copyright ownership.  The ASF licenses this file
+           to you under the Apache License, Version 2.0 (the
+           "License"); you may not use this file except in compliance
+           with the License.  You may obtain a copy of the License at
+           .
+             http://www.apache.org/licenses/LICENSE-2.0
+           .
+           Unless required by applicable law or agreed to in writing,
+           software distributed under the License is distributed on an
+           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+           KIND, either express or implied.  See the License for the
+           specific language governing permissions and limitations
+           under the License.
+
+This documents describes the bookkeeper replication protocol, and the guarantees it gives.
It assumes you have a general idea about leader election and log replication and how you can
use these in your system. If not, have a look at the bookkeeper "tutorial":https://github.com/ivankelly/bookkeeper-tutorial
first.
+
+h1. Ledgers to Logs
+
+Bookkeeper provides a primitive, ledgers, which can be used to build a replicated log for
your system. All guarantees provided by bookkeeper are on ledgers. You can learn about the
guarantees of ledgers "here":./bookkeeperProtocol.html. Guarantees on the whole log can be
built using the ledger guarantees and any consistent datastore with a compare-and-swap(CAS)
primitive. In this case, we describe a log using zookeeper as the datastore, but others could
theoretically be used. 
+
+A log in bookkeeper is built from a number of ledgers, with a fixed order. A ledger represents
a single segment of the log. A ledger could be the whole period that one node was the leader,
or there could be multiple ledgers for a single period of leadership. However, there can only
ever been one leader that adds entries to a single ledger. Ledgers cannot be reopened for
writing once they have been closed/recovered.
+
+It's important to note that bookkeeper doesn't provide leader election. You must use a system
like Zookeeper for this.
+
+In many cases, leader election is really leader suggestion. Multiple nodes could think that
they are leader at any one time. It is the job of the log to guarantee that only one can write
changes to the system.
+
+h3. Opening a log
+
+Once a node thinks it is leader for a particular log, it must take the following steps.
+
+# read the list of ledgers for the log
+# fence the last 2 ledgers[1] in the list
+# create a new ledger
+# add the new ledger to the ledger list
+# write the new ledger list back to the datastore using a CAS operation.
+
+The fencing in step 2 and the compare-and-swap operation in step 5 prevents two nodes thinking
they have leadership at any one time. Ledger fencing is described in "Bookkeeper Protocol":./bookkeeperProtocol.html.
The compare-and-swap operation will fail if the list of ledgers has changed between reading
it and writing back the new list. When the CAS operation fails, the leader must start at step
1 again. Even better, they should check that they are in fact still the leader with the system
that is providing leader election. The protocol will work correctly without this step, though
it will be able to make very little progress if two nodes think they are leader and are duelling
for the log. 
+
+The node must not serve any writes until step 5 completes successfully.
+
+h3. Rolling ledgers
+
+The leader may wish to close the current ledger and open a new one every so often. Ledgers
can only be deleted as a whole. If you don't roll the log, you won't be able to clean up old
entries in the log without a leader change. By closing the current ledger and adding a new
one, the leader allows the log to be truncated whenever that data is no longer needed. The
steps for rolling the log is similar to those for creating a new ledger.  
+
+# create a new ledger
+# add the new ledger to the ledger list
+# write the new ledger list to the datastore using CAS
+# close the previous ledger
+
+By deferring the closing of the previous ledger until step 4, we can continue writing to
the log while we perform metadata update operations to add the new ledger. This is safe as
long as you fence the last _2_ ledgers when acquiring leadership.
+
+fn1. We fence 2 ledgers, as the write may be writing to the penultimate, while adding the
last ledger to the ledger list.

Added: bookkeeper/site/trunk/content/docs/master/bookkeeperTutorial.textile
URL: http://svn.apache.org/viewvc/bookkeeper/site/trunk/content/docs/master/bookkeeperTutorial.textile?rev=1645996&view=auto
==============================================================================
--- bookkeeper/site/trunk/content/docs/master/bookkeeperTutorial.textile (added)
+++ bookkeeper/site/trunk/content/docs/master/bookkeeperTutorial.textile Tue Dec 16 16:22:06
2014
@@ -0,0 +1,552 @@
+Title:     Bookkeeper Client tutorial
+Notice:    Licensed to the Apache Software Foundation (ASF) under one
+           or more contributor license agreements.  See the NOTICE file
+           distributed with this work for additional information
+           regarding copyright ownership.  The ASF licenses this file
+           to you under the Apache License, Version 2.0 (the
+           "License"); you may not use this file except in compliance
+           with the License.  You may obtain a copy of the License at
+           .
+             http://www.apache.org/licenses/LICENSE-2.0
+           .
+           Unless required by applicable law or agreed to in writing,
+           software distributed under the License is distributed on an
+           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+           KIND, either express or implied.  See the License for the
+           specific language governing permissions and limitations
+           under the License.
+
+This tutorial aims to show you how to build a replicated distributed system using Bookkeeper
as the replicated log. Before we start, you will need to have a bookkeeper cluster up and
running. You can download the bookkeeper distribution at "https://zookeeper.apache.org/bookkeeper/releases.html":https://zookeeper.apache.org/bookkeeper/releases.html.
The binary distribution, bookkeeper-server-4.x.x-bin.tar.gz, will be sufficient for the tutorial.
+This tutorial does not cover the setup of a distributed cluster, but you can run a local
cluster on your machine by running:
+
+<pre>
+$ bookkeeper-server/bin/bookkeeper localbookie 6
+</pre>
+
+This will start up a local zookeeper instance with 6 bookie servers, as bookkeeper storage
servers are known. Any data written to this cluster will be removed when you kill the process.
+
+The code for this tutorial is available at "https://github.com/ivankelly/bookkeeper-tutorial/":https://github.com/ivankelly/bookkeeper-tutorial/.
Each section has a link with points to a tag for the completed code for that section.
+
+h1. The base application
+
+"(full code)":https://github.com/ivankelly/bookkeeper-tutorial/tree/basic
+
+We have a dice application. It generates a new number between 1 and 6 every second. 
+
+<pre class="prettyprint">
+public class Dice {
+
+    Random r = new Random();
+
+    void playDice() throws InterruptedException {
+        while (true) {
+            Thread.sleep(1000);
+            System.out.println("Value = " + (r.nextInt(6) + 1));
+        }
+    }
+
+    public static void main(String[] args) throws InterruptedException {
+        Dice d = new Dice();
+        d.playDice();
+    }
+}
+</pre>
+
+Our goal is to have multiple instances of this application, possibly running on different
machine, which each display the exact same sequence of numbers. If one the the instances crashes
or becomes unable to communicate with the others in any way, it should still not diverge from
the sequence of numbers. This tutorial will show you how to achieve this.
+
+To start, download the base application, compile and run it.
+<pre>
+$ git clone https://github.com/ivankelly/bookkeeper-tutorial.git
+$ mvn package
+$ mvn exec:java -Dexec.mainClass=org.apache.bookkeeper.Dice
+[INFO] Scanning for projects...
+[INFO]                                                                         
+[INFO] ------------------------------------------------------------------------
+[INFO] Building tutorial 1.0-SNAPSHOT
+[INFO] ------------------------------------------------------------------------
+[INFO] 
+[INFO] --- exec-maven-plugin:1.3.2:java (default-cli) @ tutorial ---
+[WARNING] Warning: killAfter is now deprecated. Do you need it ? Please comment on MEXEC-6.
+Value = 4
+Value = 5
+Value = 3
+...
+...
+</pre>
+
+
+h1. Leaders and followers (and a little bit of background)
+
+To achieve this common view in multiple instances of the program, we need each instance to
agree on what the next number in the sequence will be. For example, the instances must agree
that 4 is the first number and 2 is the second number and 5 is the third number and so on.
This is a difficult problem, especially in the case that any instance may go away at any time,
and messages between the instances can be lost or reordered.
+
+Luckily, there are already algorithms to solve this. "Paxos":http://en.wikipedia.org/wiki/Paxos_%28computer_science%29
is an abstract algorithm to implement this kind of agreement, while "Zab":http://zookeeper.apache.org
and "Raft":http://en.wikipedia.org/wiki/Raft_%28computer_science%29 are more practical protocols.
"This video":https://www.youtube.com/watch?v=JEpsBg0AO6o gives a good overview about how these
algorithms usually look. They all have a similar core.
+
+It would be possible to run the Paxos to agree on each number in the sequence. However, running
Paxos each time can be expensive. What Zab and Raft do is that they use a Paxos-like algorithm
to elect a leader. The leader then decides what the sequence of events should be, putting
them in a log, which the other instances can then follow to maintain the same state as the
leader.
+
+Bookkeeper provides the functionality for the second part of the protocol, allowing a leader
to write events to a log and have multiple followers tailing the log. However, bookkeeper
does not do leader election. You will need a zookeeper or raft instance for that purpose.
+
+h2. Why not just use zookeeper for everything?
+
+There are a number of reasons:
+
+ 1. Zookeeper's log is only exposed through a tree like interface. It can be hard to shoehorn
your application into this. 
+ 2. A zookeeper ensemble of multiple machines is limited to one log. You may want one log
per resource, which will become expensive very quickly.
+ 3. Adding extra machines to a zookeeper ensemble does not increase capacity nor throughput.
+
+Bookkeeper can be viewed as a means of exposing zookeeper's replicated log to applications
in a scalable fashion. However, we still use zookeeper to maintain consistency guarantees.
+
+**TL;DR You need to elect a leader instance**
+
+h1. Electing a leader
+
+"(full code)":https://github.com/ivankelly/bookkeeper-tutorial/tree/election
+
+We'll use zookeeper to elect a leader. A zookeeper instance will have started locally when
you started the localbookie application above. To verify it's running, run the following command.
+
+<pre>
+$ echo stat | nc localhost 2181
+Zookeeper version: 3.4.6-1569965, built on 02/20/2014 09:09 GMT
+Clients:
+ /127.0.0.1:59343[1](queued=0,recved=40,sent=41)
+ /127.0.0.1:49354[1](queued=0,recved=11,sent=11)
+ /127.0.0.1:49361[0](queued=0,recved=1,sent=0)
+ /127.0.0.1:59344[1](queued=0,recved=38,sent=39)
+ /127.0.0.1:59345[1](queued=0,recved=38,sent=39)
+ /127.0.0.1:59346[1](queued=0,recved=38,sent=39)
+
+Latency min/avg/max: 0/0/23
+Received: 167
+Sent: 170
+Connections: 6
+Outstanding: 0
+Zxid: 0x11
+Mode: standalone
+Node count: 16
+</pre>
+
+To interact with zookeeper, we'll use the "Curator":https://curator.apache.org/ client rather
than the stock zookeeper client. Getting things right with the zookeeper client can be tricky,
and curator removes a lot of the pointy corners for you. In fact, curator even provides a
leader election recipe, so we need to do very little work to get leader election in our application.
+
+<pre class="prettyprint">
+public class Dice extends LeaderSelectorListenerAdapter implements Closeable {
+
+    final static String ZOOKEEPER_SERVER = "127.0.0.1:2181";
+    final static String ELECTION_PATH = "/dice-elect";
+
+    ...
+
+    Dice() throws InterruptedException {
+        curator = CuratorFrameworkFactory.newClient(ZOOKEEPER_SERVER,
+                2000, 10000, new ExponentialBackoffRetry(1000, 3));
+        curator.start();
+        curator.blockUntilConnected();
+
+        leaderSelector = new LeaderSelector(curator, ELECTION_PATH, this);
+        leaderSelector.autoRequeue();
+        leaderSelector.start();
+    }
+</pre>
+
+In the constructor for Dice, we need to create the curator client. We specify four things
when creating the client, the location of the zookeeper service, the session timeout, the
connect timeout and the retry policy.
+
+The session timeout is a zookeeper concept. If the zookeeper server doesn't hear anything
from the client for this amount of time, any leases which the client holds will be timed out.
This is important in leader election. For leader election, the curator client will take a
lease on ELECTION_PATH. The first instance to take the lease will become leader and the rest
will become followers. However, their claim on the lease will remain in the cue. If the first
instance then goes away, due to a crash etc., its session will timeout. Once the session times
out, the lease will be released and the next instance in the queue will become the leader.
The call to <mark>autoRequeue()</mark> will make the client queue itself again
if it loses the lease for some other reason, such as if it was still alive, but it a garbage
collection cycle caused it to lose its session, and thereby its lease. I've set the lease
to be quite low so that when we test out leader election, transitions will be quite quic
 k. The optimum length for session timeout depends very much on the use case. The other parameters
are the connection timeout, i.e. the amount of time it will spend trying to connect to a zookeeper
server before giving up, and the retry policy. The retry policy specifies how the client should
respond to transient errors, such as connection loss. Operations that fail with transient
errors can be retried, and this argument specifies how often the retries should occur.
+
+Finally, you'll have noticed that Dice now extends <mark>LeaderSelectorListenerAdapter</mark>
and implements <mark>Closeable</mark>. <mark>Closeable</mark> is there
to close the resource we have initialized in the constructor, the curator client and the <mark>leaderSelector</mark>.
<mark>LeaderSelectorListenerAdapter</mark> is a callback that the <mark>leaderSelector</mark>
uses to notify the instance that it is now the leader. It is passed as the third argument
to the <mark>LeaderSelector</mark> constructor.
+
+<pre class="prettyprint">
+    @Override
+    public void takeLeadership(CuratorFramework client)
+            throws Exception {
+        synchronized (this) {
+            leader = true;
+            try {
+                while (true) {
+                    this.wait();
+                }
+            } catch (InterruptedException ie) {
+                Thread.currentThread().interrupt();
+                leader = false;
+            }
+        }
+    }
+</pre>
+
+<mark>takeLeadership()</mark> is the callback called by <mark>LeaderSelector</mark>
when the instance is leader. It should only return when the instance wants to give up leadership.
In our case, we never do so we wait on the current object until we're interrupted. To signal
to the rest of the program that we are leader we set a volatile boolean called leader to true.
This is unset after we are interrupted.
+
+<pre class="prettyprint">
+    void playDice() throws InterruptedException {
+        while (true) {
+            while (leader) {
+                Thread.sleep(1000);
+                System.out.println("Value = " + (r.nextInt(6) + 1)
+                                   + ", isLeader = " + leader);
+            }
+        }
+    }
+</pre>
+
+Finally we modify <mark>playDice()</mark> to only generate random numbers when
it is the leader.
+
+Run two instances of the program in two different terminals. You'll see that one becomes
leader and prints numbers and the other just sits there.
+
+Now stop the leader using Control-Z. This will pause the process, but it won't kill it. You
will be dropped back to the shell in that terminal. After a couple of seconds, the session
timeout, you will see that the other instance has become the leader. Zookeeper will guarantee
that only one instance is selected as leader at any time.
+
+Now go back to the shell that the original leader was on and wake up the process using fg.
You'll see something like the following:
+
+<pre>
+...
+...
+Value = 4, isLeader = true
+Value = 4, isLeader = true
+^Z
+[1]+  Stopped                 mvn exec:java -Dexec.mainClass=org.apache.bookkeeper.Dice
+$ fg
+mvn exec:java -Dexec.mainClass=org.apache.bookkeeper.Dice
+Value = 3, isLeader = true
+Value = 1, isLeader = false
+</pre>
+
+Whats this!?! The other instance is leader, but this instance first of all thinks it is leader
and generates a number, and then generates a number even though it knows it is not leader.
In fact this is perfectly natural. The leader election happens on zookeeper, but it takes
time changes in the leader to be propagated to all instances. So a race occurs where an instance
thinks it is the leader while zookeeper thinks otherwise.
+
+To solve this problem we need to some way to prevent previous leaders from continuing to
think they are the leader. Sending a message to the previous leader isn't an option. Messages
may get lost or delayed or the previous leader may be temporarily down. Another way is to
use a shared log. All updates are written to the shared log before being applied. A new leader
can tell this log to block writes from previous leaders. This is exactly what bookkeeper does!
+
+h1. Writing to the log
+
+"(full code)":https://github.com/ivankelly/bookkeeper-tutorial/tree/storing
+
+Before we get into the business of blocking previous leaders from writing we need to first
implement the logic for writing to the log.
+
+<pre class="prettyprint">
+    Dice() throws Exception {
+    	...
+
+        ClientConfiguration conf = new ClientConfiguration()
+            .setZkServers(ZOOKEEPER_SERVER).setZkTimeout(30000);
+        bookkeeper = new BookKeeper(conf);
+    }
+
+</pre>
+
+We construct the bookkeeper client in the <mark>Dice</mark> constructor and configure
the zookeeper server and zookeeper session timeout that it should use. The zookeeper session
timeout can be quite large for bookkeeper, as it doesn't use anything that depends on the
session timeout logic. The bookkeeper client should also be closed in <mark>Dice#close()</mark>.
+
+<pre class="prettyprint">
+    void lead() throws Exception {
+        LedgerHandle lh = bookkeeper.createLedger(3, 3, 2,
+                BookKeeper.DigestType.MAC, DICE_PASSWD);
+        try {
+            while (leader) {
+                Thread.sleep(1000);
+                int nextInt = r.nextInt(6) + 1;
+                lh.addEntry(Ints.toByteArray(nextInt));
+                System.out.println("Value = " + nextInt
+                                   + ", isLeader = " + leader);
+            }
+        } finally {
+            lh.close();
+        }
+    }
+
+    void playDice() throws Exception {
+        while (true) {
+            if (leader) {
+                lead();
+            }
+        }
+    }
+</pre>
+
+When we become the leader, we create a new ledger. A ledger is the basic unit of bookkeeper.
It can be thought of as a segment of a larger log. At this moment we are only creating a single
ledger, but later we will be creating multiple ledgers and connecting them together to create
a shared log. For now, we just want to get data into a ledger.
+
+The ledger is created with a 3-3-2 configuration. These are the ensemble, the write quorum
and the ack quorum. The ensemble is the number of bookies the data in the ledger will be stored
on. All entries may not be stored on all bookies if the ensemble is larger than the write
quorum. The write quorum is the number of bookies each entry is written to. The ack quorum
is the number of bookies we must get a response from before we acknowledge the write to the
client. In this case, there are 3 bookies, we write to all 3 every time, but we acknowledge
to the client when we've received a response from 2. If the ensemble is larger than the write
quorum, then entries will be striped across the bookies.
+
+The digest type and password are used for checksumming. They prevent clients from overwriting
each others data in a misconfigured system. They're actually unnecessary in this example,
but the client api requires them.
+
+Once the ledger is created we can write to it. <mark>addEntry()</mark> will append
an entry onto the end of the ledger. Entries are byte arrays, so we convert the randomly generated
integer into a byte array, using "Guava":https://code.google.com/p/guava-libraries/'s Ints
utility, before adding it to the ledger.
+
+Once we are finished with a ledger we must close it. This is actually an important step and
it fixes the content of the ledger. From this point on the ledger is immutable. It cannot
be reopened for writing and its contents cannot be modified.
+
+Of course, we don't save a reference to the ledger anywhere, so once we have written it,
no one else can ever access it, even to read it. This is what we will deal with in the next
section.
+
+h1. Making the log available to others
+
+"(full code)":https://github.com/ivankelly/bookkeeper-tutorial/tree/sharing
+
+Previously we have written to a single ledger. However, we have not provided a way to share
this between instances. What's more, as a ledger is immutable, each leader will have to create
its own ledger. So ultimately, when the application has run for a while, having changed leaders
multiple times, we will end up with a list of ledgers. This list of ledgers represents the
log of the application. Any new instance can print the same output as any preexisting instance
by simply reading this log.
+
+This list of logs needs to be shared among all instances of the application. For this we
will use zookeeper. 
+
+<pre class="prettyprint">
+public class Dice extends LeaderSelectorListenerAdapter implements Closeable {
+    ...
+
+    final static String DICE_LOG = "/dice-log";
+</pre>
+
+We define the path of the zookeeper znode in which we want to store the log. A znode in zookeeper
is like a file. You can write and read byte arrays from a znode. However, the contents of
a znode must be written and read as a whole, so it's best to only store small pieces of data
there. Each time a znode is updated, a new version is assigned. This can be used for check-and-set
operations, which is important to avoid race conditions in distributed systems.
+
+<pre class="prettyprint">
+    void lead() throws Exception {
+        Stat stat = new Stat();
+        List<Long> ledgers;
+        boolean mustCreate = false;
+        try {
+            byte[] ledgerListBytes = curator.getData()
+                .storingStatIn(stat).forPath(DICE_LOG);
+            ledgers = listFromBytes(ledgerListBytes);
+        } catch (KeeperException.NoNodeException nne) {
+            ledgers = new ArrayList<Long>();
+            mustCreate = true;
+        }
+        for (Long previous : ledgers) {
+            LedgerHandle lh;
+            try {
+                lh = bookkeeper.openLedger(previous,
+                        BookKeeper.DigestType.MAC, DICE_PASSWD);
+            } catch (BKException.BKLedgerRecoveryException e) {
+                return;
+            }
+            Enumeration<LedgerEntry> entries
+                = lh.readEntries(0, lh.getLastAddConfirmed());
+
+            while (entries.hasMoreElements()) {
+                byte[] entryData = entries.nextElement().getEntry();
+                System.out.println("Value = " + Ints.fromByteArray(entryData)
+                                   + ", epoch = " + lh.getId()
+                                   + ", catchup");
+            }
+        }
+</pre>
+
+We read the list of ledgers from DICE_LOG and store the version in stat. As the list of ledgers
is in byte form, we need to convert into a java list. If this is the first time running, there
will be no list of ledgers, and therefore no znode containing them. In this case a <mark>NoNodeException</mark>
will occur. We take note of this using <mark>mustCreate</mark>, as it affects
how will will update the list later.
+
+Once we have the list, we loop through them, opening the ledgers and printing their contents.
It's important to note that the default open operation in bookkeeper is a fencing open. In
a fencing open, anyone who is writing to the ledger will receive an exception when they try
to write again. This is how we exclude other leaders.
+
+<pre class="prettyprint">
+    void lead() throws Exception {
+        ...
+
+        LedgerHandle lh = bookkeeper.createLedger(3, 3, 2,
+                BookKeeper.DigestType.MAC, DICE_PASSWD);
+        ledgers.add(lh.getId());
+        byte[] ledgerListBytes = listToBytes(ledgers);
+        if (mustCreate) {
+            try {
+                curator.create().forPath(DICE_LOG, ledgerListBytes);
+            } catch (KeeperException.NodeExistsException nne) {
+                return;
+            }
+        } else {
+            try {
+                curator.setData()
+                    .withVersion(stat.getVersion())
+                    .forPath(DICE_LOG, ledgerListBytes);
+            } catch (KeeperException.BadVersionException bve) {
+                return;
+            }
+        }
+
+        try {
+            while (leader) {
+                Thread.sleep(1000);
+                int nextInt = r.nextInt(6) + 1;
+                lh.addEntry(Ints.toByteArray(nextInt));
+                System.out.println("Value = " + nextInt
+                                   + ", epoch = " + lh.getId()
+                                   + ", leading");
+            }
+            lh.close();
+        } catch (BKException e) {
+            return;
+        }
+    }
+</pre>
+
+Once we have read all the previous ledgers, we create a new one and add it to the list. We
must make sure this list is updated before writing to the ledger to avoid losing data. If
<mark>create()</mark> or <mark>setData()</mark> throw an exception,
it means that someone is trying to update the list concurrently. We must examine if we are
still leader, and try again if we are. The retry is handled by the loop in <mark>playDice()</mark>.
+
+We can then write to the ledger as before. However, now we have to take care to handle the
<mark>BKException</mark>. If we receive an exception, it may mean that someone
has fenced the ledger we are writing to. This means that someone else has opened it using
<mark>openLedger()</mark>, so they must think that they are the leader. Like in
the case of concurrent modifications to the ledger list, we must examine if we are still leader
and then try again if so.
+
+Run a couple of instances of this on your machine. You'll see that when the leader changes,
it will print out the history of what was written by previous leaders.
+
+However, we have a bug! When an instance becomes leader, it will print out the whole history,
even if it has been leader before. So it is necessary to keep track of which updates we have
seen been changes of leadership.
+
+h1. Tracking the updates
+
+"(full code)":https://github.com/ivankelly/bookkeeper-tutorial/tree/tracking
+
+Tracking the updates is fairly simple. We just need to keep a record of the last thing we
printed, and skip past it any time we become leader. 
+
+<pre class="prettyprint">
+    EntryId lead(EntryId skipPast) throws Exception {
+        EntryId lastDisplayedEntry = skipPast;
+</pre>
+
+The signature for <mark>lead()</mark> needs to change so that the last displayed
update is passed between different invocations. <mark>EntryId</mark> is a simple
data structure, inside which we can store the ledger id and the entry id of the last update
we have displayed.
+
+<pre class="prettyprint">
+    EntryId lead(EntryId skipPast) throws Exception {
+        ...
+        List<Long> toRead = ledgers;
+        if (skipPast.getLedgerId() != -1) {
+            toRead = ledgers.subList(ledgers.indexOf(skipPast.getLedgerId()),
+                                     ledgers.size());
+        }
+
+        long nextEntry = skipPast.getEntryId() + 1;
+        for (Long previous : toRead) {
+            LedgerHandle lh;
+            try {
+                lh = bookkeeper.openLedger(previous,
+                        BookKeeper.DigestType.MAC, DICE_PASSWD);
+            } catch (BKException.BKLedgerRecoveryException e) {
+                return lastDisplayedEntry;
+            }
+
+            if (nextEntry > lh.getLastAddConfirmed()) {
+                nextEntry = 0;
+                continue;
+            }
+            Enumeration<LedgerEntry> entries
+                = lh.readEntries(nextEntry, lh.getLastAddConfirmed());
+
+            while (entries.hasMoreElements()) {
+                LedgerEntry e = entries.nextElement();
+                byte[] entryData = e.getEntry();
+                System.out.println("Value = " + Ints.fromByteArray(entryData)
+                                   + ", epoch = " + lh.getId()
+                                   + ", catchup");
+                lastDisplayedEntry = new EntryId(lh.getId(), e.getEntryId());
+            }
+        }
+        ...
+</pre>
+
+The algorithm for reading also changes. Instead of iterating through all the ledgers in the
list we only iterate through any ledger which is greater to or equal to the ledger of the
last displayed entry. We also skip past the entry id of the last displayed entry when calling
<mark>readEntries()</mark>. The only special case we need to handle is if the
last displayed entry is the last entry of a ledger. In this case, we set <mark>nextEntry</mark>
to zero, and skip to the next ledger.
+
+Any time we do read an entry and display it, we update the last displayed entry to reflect
this.
+
+<pre class="prettyprint">
+    EntryId lead(EntryId skipPast) throws Exception {
+        ...
+
+        try {
+            while (leader) {
+                Thread.sleep(1000);
+                int nextInt = r.nextInt(6) + 1;
+                long entryId = lh.addEntry(Ints.toByteArray(nextInt));
+                System.out.println("Value = " + nextInt
+                                   + ", epoch = " + lh.getId()
+                                   + ", leading");
+                lastDisplayedEntry = new EntryId(lh.getId(), entryId);
+            }
+            lh.close();
+        } catch (BKException e) {
+            // let it fall through to the return
+        }
+        return lastDisplayedEntry;
+    }
+</pre>
+
+Finally, we also update the last displayed entry any time we add a new entry to the log.
With this change, new leaders will only print numbers which they haven't seen before. You
can test this for yourself. Run two instances of the application. Stop the leader with Control-Z,
and once the other instance has become leader, resume the first one (<mark>fg</mark>).
Then kill the second leader. When the first leader becomes leader again, it will only print
the number which it missed.
+
+h1. Tailing the log
+
+"(full code)":https://github.com/ivankelly/bookkeeper-tutorial/tree/tailing
+
+Of course, it would be nicer if the followers could keep up to date with the leader in the
background without having to wait to become leaders themselves. To do this we need to tail
the log. For the most part this is very similar to how we read the previous ledgers when we
become leader. However, how we open the ledgers is different. When we open the ledgers as
leader, we need to ensure that no other instance can write to the ledgers from that point
onwards. Therefore, we use a fencing open, which is the default <mark>openLedger()</mark>
call in Bookkeeper. However, for tailing the log, we don't want to stop the leader from writing
new updates, so we use a non-fenching open, which is the <mark>openLedgerNoRecovery()</mark>
call in Bookkeeper.
+
+First we must modify <mark>playDice()</mark> to go into a following state when
we're not the leader.
+<pre class="prettyprint">
+    void playDice() throws Exception {
+        EntryId lastDisplayedEntry = new EntryId(-1, -1);
+        while (true) {
+            if (leader) {
+                lastDisplayedEntry = lead(lastDisplayedEntry);
+            } else {
+                lastDisplayedEntry = follow(lastDisplayedEntry);
+            }
+        }
+    }
+
+    EntryId follow(EntryId skipPast) throws Exception {
+        List<Long> ledgers = null;
+        while (ledgers == null) {
+            try {
+                byte[] ledgerListBytes = curator.getData()
+                    .forPath(DICE_LOG);
+                ledgers = listFromBytes(ledgerListBytes);
+                if (skipPast.getLedgerId() != -1) {
+                    ledgers = ledgers.subList(ledgers.indexOf(skipPast.getLedgerId()),
+                                              ledgers.size());
+                }
+            } catch (KeeperException.NoNodeException nne) {
+                Thread.sleep(1000);
+            }
+        }
+</pre>
+
+The first part of following is almost identical to leading. We read the list of ledgers from
zookeeper and trim the list to only include ledgers which we have displayed already. A thing
to note here, is that if we go into following mode during the first run of the application,
and the leader hasn't created the list of ledgers in zookeeper yet we will get an exception.
If this occurs we try again after 1 second.
+
+Once we have the list, we go into the main tailing loop.
+
+<pre class="prettyprint">
+    EntryId follow(EntryId skipPast) throws Exception {
+        ...
+
+        EntryId lastReadEntry = skipPast;
+        while (!leader) {
+            for (long previous : ledgers) {
+                ...
+            }
+            byte[] ledgerListBytes = curator.getData()
+                .forPath(DICE_LOG);
+            ledgers = listFromBytes(ledgerListBytes);
+            ledgers = ledgers.subList(ledgers.indexOf(lastReadEntry.getLedgerId())+1,
+                                      ledgers.size());
+        }
+        return lastReadEntry;
+    }
+</pre>
+
+While we are still leader, we loop over all ledgers in the ledgers list, printing their content.
Once we have finished with the current list of ledgers, we check zookeeper to see if any new
ledgers have been added to the list. This looks like it would run in a tight loop, but that
is not the case. Ledger reading loop will wait until the last ledger in the list is closed
before exiting the loop. When the last ledger in the list is closed, it means that the leader
must have changed, so there must be a new ledger in the list to read.
+
+<pre class="prettyprint">
+            for (long previous : ledgers) {
+                boolean isClosed = false;
+                long nextEntry = 0;
+                while (!isClosed && !leader) {
+                    if (lastReadEntry.getLedgerId() == previous) {
+                        nextEntry = lastReadEntry.getEntryId() + 1;
+                    }
+                    isClosed = bookkeeper.isClosed(previous);
+                    LedgerHandle lh = bookkeeper.openLedgerNoRecovery(previous,
+                            BookKeeper.DigestType.MAC, DICE_PASSWD);
+
+                    if (nextEntry <= lh.getLastAddConfirmed()) {
+                        ... // read all entries from nextEntry to last add confirmed
+                    }
+                    if (isClosed) {
+                        break;
+                    }
+                    Thread.sleep(1000);
+                } 
+            }
+</pre>
+
+For each ledger we enter into an inner loop. First we check if the ledger has been closed.
If so, once we have read all the entries that we can, we need to reopen the ledger to check
for any new entries. We continue like this until the ledger is either closed, or we become
leader.
+
+Note that we are using <mark>openLedgerNoRecovery()</mark> here. The value returned
by last add confirmed will change after each opening if there are new entries which can be
read. The last add confirmed is a variable maintained by the leader. It is the last entry
written for which it has received an ACK quorum of acknowledgements. In our case, this means
that the entry has been acknowledged on at least 2 bookies. It also guarantees that each entry
before it in that ledger has been acknowledged on 2 bookies.
+
+Once we have read all entries, we check isClosed to see if we need to check this ledger again.
If not, we break out of the loop and move onto the next ledger. Otherwise, we wait a second
and try again.
+
+h1. Wrap up
+
+Now you have a fully distributed dice application. Not very useful, but it should give you
some idea of what is required to make an application fault tolerant without losing consistency.
Play around with the application. Run many instances. Kill a few leaders. You will always
see the same sequence of number printed to the screen. If not, then you have found a bug,
please let us know.
+
+h1. What's next?
+
+The dice application we've written is just and example and is pretty useless in the real
world. But the principles contained therein could be used to replicate pretty much any service.
Imagine a simple key value store. This could be made replicated by adding all create, put
and delete operations to a replicated log. Multiple logs could be used if you want to shard
your store across many servers. And there are many possibilities.
+
+However, this tutorial doesn't address some issues that would be important in a real implementation.
For starters, the log of the dice application will keep growing forever, eventually filling
up all your disks and grinding you to a halt. Avoiding this problem depends on your individual
usecase. For example, if you have a key value store, you can take a snapshot of the store
every so often, and then trim the start of the log to remove anything that had been applied
by the time the snapshot was taken. Trimming simply means removing ledgers from the start
of the ledger list. For a messaging application, you could keep a record of what each subscriber
has consumed and then trim the log based on that.
+
+Note that the tutorial application only uses synchronous APIs. The bookkeeper client does
also have asynchronous APIs, which allow for higher throughput when writing. However, this
means that you have to manage your state more carefully.
+
+<script src="https://google-code-prettify.googlecode.com/svn/loader/run_prettify.js"></script>
\ No newline at end of file



Mime
View raw message