bookkeeper-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
Subject bookkeeper git commit: BOOKKEEPER-802: Bookkeeper protocol documentation (ivank via sijie)
Date Mon, 07 Sep 2015 15:58:55 GMT
Repository: bookkeeper
Updated Branches:
  refs/heads/master c29a4f260 -> a59bd5687

BOOKKEEPER-802: Bookkeeper protocol documentation (ivank via sijie)


Branch: refs/heads/master
Commit: a59bd5687e2f5c4a4ac3f2a957543b5a1a3148d8
Parents: c29a4f2
Author: Sijie Guo <>
Authored: Mon Sep 7 08:58:26 2015 -0700
Committer: Sijie Guo <>
Committed: Mon Sep 7 08:58:26 2015 -0700

 CHANGES.txt                    |   2 +
 doc/bookkeeperProtocol.textile | 115 ++++++++++++++++++++++++++++++++++++
 2 files changed, 117 insertions(+)
diff --git a/CHANGES.txt b/CHANGES.txt
index 8547c26..046fab3 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -89,6 +89,8 @@ Trunk (unreleased changes)
       BOOKKEEPER-863: Potential resource leak with unclosed LedgerManager in BookieShell
(Ted Yu via sijie)
+      BOOKKEEPER-802: Bookkeeper protocol documentation (ivank via sijie)
         BOOKKEEPER-810: Allow to configure TCP connect timeout (Charles Xie via sijie)
diff --git a/doc/bookkeeperProtocol.textile b/doc/bookkeeperProtocol.textile
new file mode 100644
index 0000000..d424c98
--- /dev/null
+++ b/doc/bookkeeperProtocol.textile
@@ -0,0 +1,115 @@
+Title:     BookKeeper Replication Protocol
+Notice:    Licensed to the Apache Software Foundation (ASF) under one
+           or more contributor license agreements.  See the NOTICE file
+           distributed with this work for additional information
+           regarding copyright ownership.  The ASF licenses this file
+           to you under the Apache License, Version 2.0 (the
+           "License"); you may not use this file except in compliance
+           with the License.  You may obtain a copy of the License at
+           .
+           .
+           Unless required by applicable law or agreed to in writing,
+           software distributed under the License is distributed on an
+           KIND, either express or implied.  See the License for the
+           specific language governing permissions and limitations
+           under the License.
+This documents describes the bookkeeper replication protocol, and the guarantees it gives.
It assumes you have a general idea about leader election and log replication and how you can
use these in your system. If not, have a look at the bookkeeper "tutorial":./bookkeeperTutorial.html
+h1. Ledgers
+A ledger is the basic building block in Bookkeeper. All guarantees we provide are on ledgers.
A replicated log is composed of an ordered list of ledgers. See "From Ledgers to Logs":./bookkeeperLedgers2Logs.html
on how to build a replicated log from ledgers.
+Ledgers are composed of metadata and entries. The metadata is stored in a datastore which
provides a compare-and-swap operation (generally ZooKeeper). Entries are stored on storage
nodes known as bookies.
+A ledger has a single write and multiple readers (SWMR).
+A ledger's metadata contains:
+* id: a 64bit integer, unique within the system
+* ensemble size (E): the number of nodes the ledger is stored on.
+* write quorum size (Q[~w~]): the number of nodes each entry is written. In effect, the max
replication for an entry.
+* ack quorum size (Q[~a~]): the number of nodes a entry must be acknowledge on. In effect,
the min replication for an entry.
+* last entry: the last entry in the ledger, or NULL if state != CLOSED
+* one or more fragments, which each consist of:
+** First entry of fragment, list of bookies for fragment
+When creating the ledger, the following invariant must hold.
+   E >= Q[~w~] >= Q[~a~]
+h4. Ensembles
+When the ledger is created, E bookies are chosen for the entries of that ledger. The bookies
are the initial ensemble of the ledger. A ledger can have multiple ensembles, but an entry
has only one ensemble. Changes in the ensemble, involves a new fragment being added to the
+Take the following example. In this ledger, with ensemble size of 3, there are two fragments
and thus two ensembles, one starting at entry 0, the second at entry 12. The second ensemble
differs from the first only by its first element. This could be because bookie1 has failed
and therefore had to be replaced.
+table(table table-bordered table-hover).
+|_. FirstEntry |_. Bookies  |
+| 0            | B1, B2, B3 |
+| 12           | B4, B2, B3 |
+h4. Write Quorums
+Each entry in the log is written to Q[~w~] nodes. This is considered the write quorum for
that entry. The write quorum is the subsequence of the ensemble, Q[~w~] in length, and starting
at the bookie at index (entryid % E).
+For example, in a ledger of E = 4, Q[~w~] = 3 & Q[~a~] = 2, with an ensemble consisting
of B1, B2, B3 & B4, the write quorums for the first 6 entries will be.
+table(table table-bordered table-hover).
+|_. Entry |_. Write quorum |
+| 0       | B1, B2, B3     |
+| 1       | B2, B3, B4     |
+| 2       | B3, B4, B1     |
+| 3       | B4, B1, B2     |
+| 4       | B1, B2, B3     |
+| 5       | B2, B3, B4     |
+There are only E distinct write quorums in any ensemble. If Q[~w~] = Q[~a~], then there is
only one, as no striping occurs.  
+h4. Ack Quorums
+The ack quorum for an entry is any subset of the write quorum of size Q[~a~]. If Q[~a~] bookies
acknowledge an entry, it means it has been fully replicated.
+h4. Guarantees
+The system can tolerate Q[~a~] - 1 failures without data loss.
+Bookkeeper guarantees that:
+ 1. all updates to a ledger will be read in the same order as they  were written 
+ 2. all clients will read the same sequence of updates from the ledger
+h1. Writing to the ledger
+When an entry is written to a ledger, it is assigned an entry id the write quorum is calculated.
As there is only a single writer, ensuring that entry ids are sequential is trivial. A bookie
acknowledges a write once it has been persisted to disk and is therefore durable. Once Q[~a~]
bookies from the write quorum acknowledge the write, the write is acknowledged to the client,
but only if all entries with lower entry ids in the ledger have already been acknowledged
to the client.
+The entry written contains the ledger id, the entry id, the last add confirmed and the payload.
The last add confirmed is the last entry which had been acknowledged to the client when this
entry was written. Sending this with the entry speeds up recovery of the ledger in the case
that the writer crashes.
+Another client can also read entries in the ledger up as far as the last add confirmed, as
we guarantee that all entries thus far have been replicated on Q[~a~] nodes, and therefore
all future readers will be able to also read it. However, to read like this, the ledger should
be opened with a non-fencing open. Otherwise, it would kill the writer.
+If a node fails to acknowledge a write, the writer will create a new ensemble by replacing
the failed node in the current ensemble. It creates a new fragment with this ensemble, starting
from the first message that has not been acknowledged to the client. Creating the new fragment
involves making a CAS write to the metadata. If the CAS write fails, someone else has modified
something in the ledger metadata. This concurrent modification could have been caused by recovery
or rereplication[1]. We reread the metadata. If the state of the ledger is no longer OPEN,
we send an error to the client for any outstanding writes. Otherwise, we try to replace the
failed node again.
+h1. Closing a ledger as a writer
+Closing a ledger is straight forward for a writer. The writer makes a CAS write to the metadata,
changing the state to CLOSED, and setting the last entry of the ledger to the last entry which
we have acknowledged to the client.
+If the CAS write fails, it means someone else has modified the metadata. We reread the metadata,
and retry closing as long as the state of the ledger is still OPEN. If the state is IN_RECOVERY
we send an error to the client. If the state is CLOSED and the last entry is the same as the
last entry we have acknowledged to the client, we complete the close operation successfully.
If the last entry is different to what we have acknowledged to the client, we send an error
to the client.
+h1. Closing a ledger as a reader
+A reader can also force a ledger to close. Forcing the ledger to close will prevent any writer
from adding new entries to the ledger. This is called *Fencing*. This can occur when a writer
has crashed or has become unavailable, and a new writer wants to take over writing to the
log. The new writer must ensure that it has seen all updates from the previous writer, and
prevent the previous writer from making any new updates before making any updates of its own.
+To recover a ledger, we first update the state in the metadata to IN_RECOVERY. We then send
a fence message to all the bookies in the last fragment of the ledger. When a bookie receives
a fence message for a ledger, the fenced state of the ledger is persisted to disk. Once we
receive a response from at least (Q[~w~]-Q[~a~])+1 bookies from each write quorum in the ensemble,
the ledger is fenced.
+By ensuring we have received a response from at last (Q[~w~]-Q[~a~])+1 bookies in each write
quorum, we ensure that, if the old writer is alive and tries to add a new entry there will
be no write quorum in which Q[~a~] bookies will accept the write. If the old writer tries
to update the ensemble, it will fail on the CAS metadata write, and then see that the ledger
is in IN_RECOVERY state, and that it therefore shouldn't try to write to it.
+The old writer will be able to write entries to individual bookies (we can't guarantee that
the fence message reaches all bookies), but as it will not be able reach ack quorum, it will
not be able to send a success response to its client. The client will get a LedgerFenced error
+It is important to note that when you get a ledger fenced message for an entry, it doesn't
mean that the entry has _not_ been written. It means that the entry may or may not have been
written, and this can only be determined after the ledger is recovered. In effect, LedgerFenced
should be treated like a timeout.
+Once the ledger is fenced, recovery can begin. Recovery means finding the last entry of the
ledger and closing the ledger. To find the last entry of the ledger, the client asks all bookies
for the highest last add confirmed value they have seen. It waits until it has received a
response at least (Q[~w~]-Q[~a~])+1 bookies from each write quorum, and takes the highest
response as the entry id to start reading forward from. It then starts reading forward in
the ledger, one entry at a time, replicating all entries it sees to the entire write quorum
for that entry. Once it can no longer read any more entries, it updates the state in the metadata
to CLOSED, and sets the last entry of the ledger to the last entry it wrote. Multiple readers
can try to recovery a ledger at the same time, but as the metadata write is CAS, they will
all converge on the same last entry of the ledger.
+fn1. Rereplication is a subsystem that runs in the background on bookies to ensure that ledgers
are fully replicated even if one bookie from their ensemble is down

View raw message