bookkeeper-commits mailing list archives

From mme...@apache.org
Subject svn commit: r1743979 [34/35] - in /bookkeeper/site/trunk: content/ content/docs/r4.4.0/ content/docs/r4.4.0/apidocs/ content/docs/r4.4.0/apidocs/org/ content/docs/r4.4.0/apidocs/org/apache/ content/docs/r4.4.0/apidocs/org/apache/bookkeeper/ content/doc...
Date Sun, 15 May 2016 21:38:39 GMT
Added: bookkeeper/site/trunk/content/docs/r4.4.0/bookieConfigParams.textile
URL: http://svn.apache.org/viewvc/bookkeeper/site/trunk/content/docs/r4.4.0/bookieConfigParams.textile?rev=1743979&view=auto
==============================================================================
--- bookkeeper/site/trunk/content/docs/r4.4.0/bookieConfigParams.textile (added)
+++ bookkeeper/site/trunk/content/docs/r4.4.0/bookieConfigParams.textile Sun May 15 21:38:37 2016
@@ -0,0 +1,94 @@
+Title:        Bookie Configuration Parameters
+Notice: Licensed under the Apache License, Version 2.0 (the "License");
+        you may not use this file except in compliance with the License. You may
+        obtain a copy of the License at "http://www.apache.org/licenses/LICENSE-2.0":http://www.apache.org/licenses/LICENSE-2.0.
+        .        
+        Unless required by applicable law or agreed to in writing,
+        software distributed under the License is distributed on an "AS IS"
+        BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+        implied. See the License for the specific language governing permissions
+        and limitations under the License.
+        .
+
+h1. Bookie Configuration Parameters
+
+This page contains detailed information about configuration parameters used for configuring a bookie server. There is an example in "bookkeeper-server/conf/bk_server.conf". 
+
+h2. Server parameters
+
+| @bookiePort@        |Port that bookie server listens on. The default value is 3181.|
+| @journalDirectory@        | Directory to which BookKeeper outputs its write ahead log, ideally on a dedicated device. The default value is "/tmp/bk-txn". |
+| @ledgerDirectories@        | Directory to which BookKeeper outputs ledger snapshots.  Multiple directories can be defined, separated by commas, e.g. /tmp/bk1-data,/tmp/bk2-data. Ideally ledger dirs and journal dir are each on a different device, which reduces the contention between random I/O and sequential writes. It is possible to run with a single disk, but performance will be significantly lower.|
+| @indexDirectories@  | Directories to store index files. If not specified, bookie will use ledgerDirectories to store index files. |
+| @bookieDeathWatchInterval@ | Interval to check whether a bookie is dead or not, in milliseconds. |
+| @gcWaitTime@        | Interval to trigger next garbage collection, in milliseconds. Since garbage collection is running in the background, running the garbage collector too frequently hurts performance. It is best to set its value high enough if there is sufficient disk capacity.|
+| @flushInterval@ | Interval to flush ledger index pages to disk, in milliseconds. Flushing index files will introduce random disk I/O. Consequently, it is important to have journal dir and ledger dirs each on different devices. However, if it is necessary to have journal dir and ledger dirs on the same device, one option is to increase the flush interval to get higher performance. The trade-off is that, upon a failure, the bookie will take longer to recover. |
+| @numAddWorkerThreads@ | Number of threads that should handle write requests. If zero, the writes are handled by netty threads directly. |
+| @numReadWorkerThreads@ | Number of threads that should handle read requests. If zero, the reads are handled by netty threads directly. |
+
+h2. NIO server settings
+
+| @serverTcpNoDelay@ | This setting is used to enable or disable Nagle's algorithm, which is a means of improving the efficiency of TCP/IP networks by reducing the number of packets that need to be sent over the network. If you are sending many small messages, such that more than one can fit in a single IP packet, setting @serverTcpNoDelay@ to false to enable Nagle's algorithm can provide better performance. Default value is true. |
+
+h2. Journal settings
+
+| @journalMaxSizeMB@  |  Maximum file size of a journal file, in megabytes. A new journal file will be created when the old one reaches the file size limit. The default value is 2GB (2048 MB). |
+| @journalMaxBackups@ |  Max number of old journal files to keep. Keeping a number of old journal files might help data recovery in some special cases. The default value is 5. |
+| @journalPreAllocSizeMB@ | The space that the bookie pre-allocates at a time in the journal. |
+| @journalWriteBufferSizeKB@ | Size of the write buffers used for the journal. |
+| @journalRemoveFromPageCache@ | Whether the bookie removes pages from the page cache after a force write. Used to keep the journal from polluting the OS page cache. |
+| @journalAdaptiveGroupWrites@ | Whether to group journal force writes, which optimizes group commit for higher throughput. |
+| @journalMaxGroupWaitMSec@ | Maximum latency to impose on a journal write to achieve grouping. |
+| @journalBufferedWritesThreshold@ | Maximum writes to buffer to achieve grouping. |
+| @journalFlushWhenQueueEmpty@ | Whether to flush the journal when journal queue is empty. Disabling it would provide sustained journal adds throughput. |
+| @numJournalCallbackThreads@ | The number of threads that should handle journal callbacks. |
+
+h2. Ledger cache settings
+
+| @openFileLimit@ | Maximum number of ledger index files that can be opened in a bookie. If the number of ledger index files reaches this limit, the bookie starts to flush some ledger indexes from memory to disk. If flushing happens too frequently, then performance is affected. You can tune this number to improve performance accordingly. |
+| @pageSize@ | Size of an index page in ledger cache, in bytes. A larger index page can improve performance when writing pages to disk, which is efficient when you have a small number of ledgers and these ledgers have a similar number of entries. With a large number of ledgers and a few entries per ledger, a smaller index page would improve memory usage. |
+| @pageLimit@ | Maximum number of index pages to store in the ledger cache. If the number of index pages reaches this limit, bookie server starts to flush ledger indexes from memory to disk. Increasing this value is an option when flushing becomes frequent. It is important to make sure, though, that pageLimit*pageSize is not more than the JVM max memory limit; otherwise the JVM will raise an OutOfMemoryError. In general, increasing pageLimit and using a smaller index page gives better performance in the case of a large number of ledgers with few entries per ledger. If pageLimit is -1, a bookie uses 1/3 of the JVM memory to compute the maximum number of index pages. |
+
+h2. Ledger manager settings
+
+| @ledgerManagerType@ | What kind of ledger manager is used to manage how ledgers are stored, managed and garbage collected. See "BookKeeper Internals":./bookkeeperInternals.html for detailed info. Default is flat. |
+| @zkLedgersRootPath@ | Root zookeeper path to store ledger metadata. Default is /ledgers. |
+
+h2. Entry Log settings
+
+| @logSizeLimit@      | Maximum file size of entry logger, in bytes. A new entry log file will be created when the old one reaches the file size limitation. The default value is 2GB. |
+| @entryLogFilePreallocationEnabled@ | Enable/Disable entry logger preallocation. Enabling this provides sustained higher throughput and reduces the latency impact. |
+| @readBufferSizeBytes@ | The number of bytes used as capacity for BufferedReadChannel. Default is 512 bytes. |
+| @writeBufferSizeBytes@ | The number of bytes used as capacity for the write buffer. Default is 64KB. |
+
+h2. Entry Log compaction settings
+
+| @minorCompactionInterval@ | Interval to run minor compaction, in seconds. If it is set to less than or equal to zero, then minor compaction is disabled. Default is 1 hour. |
+| @minorCompactionThreshold@ | Entry log files with remaining size under this threshold value will be compacted in a minor compaction. If it is set to less than or equal to zero, the minor compaction is disabled. Default is 0.2. |
+| @majorCompactionInterval@ | Interval to run major compaction, in seconds. If it is set to less than or equal to zero, then major compaction is disabled. Default is 1 day. |
+| @majorCompactionThreshold@ | Entry log files with remaining size below this threshold value will be compacted in a major compaction. Those entry log files whose remaining size percentage is still higher than the threshold value will never be compacted. If it is set to less than or equal to zero, the major compaction is disabled. Default is 0.8. |
+| @compactionMaxOutstandingRequests@ | The maximum number of entries which can be compacted without flushing. When compacting, the entries are written to the entrylog and the new offsets are cached in memory. Once the entrylog is flushed the index is updated with the new offsets. This parameter controls the number of entries added to the entrylog before a flush is forced. A higher value for this parameter means more memory will be used for offsets. Each offset consists of 3 longs. This parameter should _not_ be modified unless you know what you're doing. |
+| @compactionRate@ | The rate at which compaction will re-add entries. The unit is adds per second. |
+
+h2. Statistics
+
+| @enableStatistics@ | Enables the collection of statistics. Default is on. |
+
+h2. Auto-replication
+
+| @openLedgerRereplicationGracePeriod@ | This is the grace period which the rereplication worker waits before fencing and replicating a ledger fragment which is still being written to upon a bookie failure. The default is 30s. |
+
+h2. Read-only mode support
+
+| @readOnlyModeEnabled@ | Enables/disables the read-only Bookie feature. A bookie goes into read-only mode when it finds integrity issues with stored data. If @readOnlyModeEnabled@ is false, the bookie shuts down if it finds integrity issues. By default it is enabled. |
+
+h2. Disk utilization
+
+| @diskUsageThreshold@ | Fraction of the total disk space that must be in use for the disk to be declared full. The available disk space is obtained with File.getUsableSpace(). Default is 0.95. |
+| @diskCheckInterval@ | Interval between consecutive checks of disk utilization. Default is 10s. |
+
+h2. ZooKeeper parameters
+
+| @zkServers@ | A list of one or more servers on which zookeeper is running. The server list is comma separated, e.g., zk1:2181,zk2:2181,zk3:2181 |
+| @zkTimeout@ | ZooKeeper client session timeout in milliseconds. Bookie server will exit if it receives SESSION_EXPIRED because it was partitioned off from ZooKeeper for more than the session timeout. JVM garbage collection or disk I/O can cause SESSION_EXPIRED. Increasing this value could help avoid this issue. The default value is 10,000. |
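
Note on @pageLimit@ and @pageSize@ (not part of the committed file): the sizing rule described for the ledger cache can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, assuming only what the table states: pageLimit*pageSize should not exceed the JVM max memory, and a pageLimit of -1 means one third of the JVM memory is used to compute the page count. The function names and the 8192-byte page size are illustrative assumptions, not BookKeeper APIs or documented defaults.

```python
def effective_page_limit(page_size: int, page_limit: int, jvm_max_bytes: int) -> int:
    """Number of index pages the ledger cache may hold, per the rule in the table."""
    if page_size <= 0:
        raise ValueError("pageSize must be positive")
    if page_limit == -1:
        # Assumption taken from the table: 1/3 of JVM memory is budgeted for pages.
        return (jvm_max_bytes // 3) // page_size
    return page_limit


def index_cache_fits(page_size: int, page_limit: int, jvm_max_bytes: int) -> bool:
    """True when pageLimit * pageSize does not exceed the JVM max memory."""
    limit = effective_page_limit(page_size, page_limit, jvm_max_bytes)
    return limit * page_size <= jvm_max_bytes


if __name__ == "__main__":
    heap = 3 * 1024 ** 3  # hypothetical 3 GiB heap
    print(effective_page_limit(8192, -1, heap))
    print(index_cache_fits(8192, 1_000_000, 1024 ** 3))
```

A check like this is only a sanity test on the configuration values; the bookie's real memory use also includes everything else on the heap.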

Added: bookkeeper/site/trunk/content/docs/r4.4.0/bookieRecovery.textile
URL: http://svn.apache.org/viewvc/bookkeeper/site/trunk/content/docs/r4.4.0/bookieRecovery.textile?rev=1743979&view=auto
==============================================================================
--- bookkeeper/site/trunk/content/docs/r4.4.0/bookieRecovery.textile (added)
+++ bookkeeper/site/trunk/content/docs/r4.4.0/bookieRecovery.textile Sun May 15 21:38:37 2016
@@ -0,0 +1,79 @@
+Title:     BookKeeper Bookie Recovery
+Notice:    Licensed to the Apache Software Foundation (ASF) under one
+           or more contributor license agreements.  See the NOTICE file
+           distributed with this work for additional information
+           regarding copyright ownership.  The ASF licenses this file
+           to you under the Apache License, Version 2.0 (the
+           "License"); you may not use this file except in compliance
+           with the License.  You may obtain a copy of the License at
+           .
+             http://www.apache.org/licenses/LICENSE-2.0
+           .
+           Unless required by applicable law or agreed to in writing,
+           software distributed under the License is distributed on an
+           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+           KIND, either express or implied.  See the License for the
+           specific language governing permissions and limitations
+           under the License.
+h1. Bookie Ledger Recovery
+
+p. When a Bookie crashes, any ledgers with data on that Bookie become underreplicated. There are two options for bringing the ledgers back to full replication: Autorecovery and Manual Bookie Recovery.
+
+h2. Autorecovery
+
+p. Autorecovery runs as a daemon alongside the Bookie daemon on each Bookie. Autorecovery detects when a bookie in the cluster has become unavailable, and rereplicates all the ledgers which were on that bookie, so that those ledgers are brought back to full replication. See the "Admin Guide":./bookkeeperConfig.html for instructions on how to start autorecovery.
+
+h2. Manual Bookie Recovery
+
+p. If autorecovery is not enabled, it is possible for the administrator to manually rereplicate the data from the failed bookie.
+
+To run recovery, with zk1.example.com as the zookeeper ensemble, and 192.168.1.10 as the failed bookie, do the following:
+
+@bookkeeper-server/bin/bookkeeper org.apache.bookkeeper.tools.BookKeeperTools zk1.example.com:2181 192.168.1.10:3181@
+
+It is necessary to specify both the host and the port of the failed bookie, as this is how it identifies itself to zookeeper. It is possible to specify a third argument, which is the bookie to replicate to. If this is omitted, as in our example, a random bookie is chosen for each ledger segment. A ledger segment is a continuous sequence of entries in a bookie, which share the same ensemble.
+
+h2. AutoRecovery Internals
+
+Auto-Recovery has two components:
+
+* *Auditor*, a singleton node which watches for bookie failure, and creates rereplication tasks for the ledgers on failed bookies.
+* *ReplicationWorker*, runs on each Bookie, takes rereplication tasks and executes them.
+
+Both components run as threads in the *AutoRecoveryMain* process. The *AutoRecoveryMain* process runs on each Bookie in the cluster. All recovery nodes will participate in leader election to decide which node becomes the auditor. Those which fail to become the auditor will watch the elected auditor, and will run election again if they see that it has failed.
+
+h3. Auditor
+
+The auditor watches the list of bookies registered with ZooKeeper in the cluster. A Bookie registers with ZooKeeper during startup. If the bookie crashes or is killed, the bookie's registration disappears. The auditor is notified of changes in the registered bookies list.
+
+When the auditor sees that a bookie has disappeared from the list, it immediately scans the complete ledger list to find ledgers which have stored data on the failed bookie. Once it has a list of ledgers which need to be rereplicated, it will publish a rereplication task for each ledger under the /underreplicated/ znode in ZooKeeper.
+
+h3. ReplicationWorker
+
+Each replication worker watches for tasks being published in the /underreplicated/ znode. When a new task appears, it will try to get a lock on it. If it cannot acquire the lock, it tries the next entry. The locks are implemented using ZooKeeper ephemeral znodes.
+
+The replication worker will scan through the rereplication task's ledger for underreplicated segments of which its local bookie is not a member. When it finds segments matching this criterion it will replicate the entries of that segment to the local bookie.  If, after this process, the ledger is fully replicated, the ledger's entry under /underreplicated/ is deleted, and the lock is released. If there is a problem replicating, or there are segments in the ledger which are still underreplicated (due to the local bookie already being part of the ensemble for the segment), then the lock is simply released.
+
+If the replication worker finds a segment which needs rereplication, but does not have a defined endpoint (i.e. the final segment of a ledger currently being written to), it will wait for a grace period before attempting rereplication. If the segment needing rereplication still does not have a defined endpoint, the ledger is fenced and rereplication then takes place. This avoids the case where a client is writing to a ledger, and one of the bookies goes down, but the client has not written an entry to that bookie before rereplication takes place. The client could continue writing to the old segment, even though the ensemble for the segment had changed. This could lead to data loss. Fencing prevents this scenario from happening. In the normal case, the client will try to write to the failed bookie within the grace period, and will have started a new segment before rereplication starts. See the "Admin Guide":./bookkeeperConfig.html for how to configure this grace period.
+
+h2. The Rereplication process
+
+The ledger rereplication process is as follows.
+
+# The client goes through all ledger segments in the ledger, selecting those which contain the failed bookie;
+# A recovery process is initiated for each ledger segment in this list;
+## The client selects a bookie to which all entries in the ledger segment will be replicated; in the case of autorecovery, this will always be the local bookie;
+## The client reads entries that belong to the ledger segment from other bookies in the ensemble and writes them to the selected bookie;
+## Once all entries have been replicated, the zookeeper metadata for the segment is updated to reflect the new ensemble;
+## The segment is marked as fully replicated in the recovery tool;
+# Once all ledger segments are marked as fully replicated, the ledger is marked as fully replicated.
+
+h2. The Manual Bookie Recovery process
+
+The manual bookie recovery process is as follows.
+
+# The client reads the metadata of active ledgers from zookeeper;
+# From this, the ledgers which contain segments using the failed bookie in their ensemble are selected;
+# A recovery process is initiated for each ledger in this list;
+## The Ledger rereplication process is run for each ledger;
+# Once all ledgers are marked as fully replicated, bookie recovery is finished.
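
Note on the rereplication process (not part of the committed file): the numbered steps above can be sketched as a small in-memory model. This is a hedged illustration in Python, not the BookKeeper implementation: @Segment@, @segments_needing_recovery@, @rereplicate_ledger@ and the @copy_entries@ callback are hypothetical names, and the real tool reads and writes entries over the network and updates ZooKeeper metadata rather than mutating a list.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    first_entry: int     # id of the first entry in the segment
    ensemble: List[str]  # bookie addresses storing this segment


def segments_needing_recovery(segments: List[Segment], failed: str) -> List[Segment]:
    """Step 1: select the segments whose ensemble contains the failed bookie."""
    return [s for s in segments if failed in s.ensemble]


def rereplicate_ledger(segments: List[Segment], failed: str, target: str,
                       copy_entries: Callable[[Segment, List[str], str], None]) -> bool:
    """Steps 2 and 3: recover each affected segment; return True if none remain."""
    for seg in segments_needing_recovery(segments, failed):
        if target in seg.ensemble:
            # The target already stores this segment, so it cannot take the
            # failed bookie's place; the segment stays underreplicated.
            continue
        sources = [b for b in seg.ensemble if b != failed]
        copy_entries(seg, sources, target)  # read from survivors, write to target
        # Update the (in-memory stand-in for the) segment metadata.
        seg.ensemble = [target if b == failed else b for b in seg.ensemble]
    return not segments_needing_recovery(segments, failed)


if __name__ == "__main__":
    ledger = [Segment(0, ["a:3181", "b:3181", "c:3181"]),
              Segment(10, ["a:3181", "d:3181", "e:3181"])]
    done = rereplicate_ledger(ledger, "a:3181", "d:3181", lambda s, src, dst: None)
    print(done, [s.ensemble for s in ledger])
```

The skipped-segment branch corresponds to the autorecovery case described under ReplicationWorker, where the local bookie is already part of a segment's ensemble and the ledger therefore stays underreplicated until another worker picks it up.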

Added: bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperConfig.textile
URL: http://svn.apache.org/viewvc/bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperConfig.textile?rev=1743979&view=auto
==============================================================================
--- bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperConfig.textile (added)
+++ bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperConfig.textile Sun May 15 21:38:37 2016
@@ -0,0 +1,167 @@
+Title:        BookKeeper Administrator's Guide
+Notice: Licensed under the Apache License, Version 2.0 (the "License");
+        you may not use this file except in compliance with the License. You may
+        obtain a copy of the License at "http://www.apache.org/licenses/LICENSE-2.0":http://www.apache.org/licenses/LICENSE-2.0.
+        .
+        .        
+        Unless required by applicable law or agreed to in writing,
+        software distributed under the License is distributed on an "AS IS"
+        BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+        implied. See the License for the specific language governing permissions
+        and limitations under the License.
+        .
+        .
+
+h1. Abstract
+
+This document contains information about deploying, administering and maintaining BookKeeper. It also discusses best practices and common problems. 
+
+h1. Running a BookKeeper instance
+
+h2. System requirements
+
+A typical BookKeeper installation comprises a set of bookies and a set of ZooKeeper replicas. The exact number of bookies depends on the quorum mode, desired throughput, and number of clients using this installation simultaneously. The minimum number of bookies is three for self-verifying (stores a message authentication code along with each entry) and four for generic (does not store a message authentication code with each entry), and there is no upper limit on the number of bookies. Increasing the number of bookies will, in fact, enable higher throughput.
+
+For performance, we require each server to have at least two disks. It is possible to run a bookie with a single disk, but performance will be significantly lower in this case.
+
+For ZooKeeper, there is no constraint with respect to the number of replicas. Having a single machine running ZooKeeper in standalone mode is sufficient for BookKeeper. For resilience purposes, it might be a good idea to run ZooKeeper in quorum mode with multiple servers. Please refer to the ZooKeeper documentation for detail on how to configure ZooKeeper with multiple replicas.
+
+h2. Starting and Stopping Bookies
+
+To *start* a bookie, execute one of the following commands:
+
+* To run a bookie in the foreground:
+@bookkeeper-server/bin/bookkeeper bookie@
+
+* To run a bookie in the background:
+@bookkeeper-server/bin/bookkeeper-daemon.sh start bookie@
+
+The configuration parameters can be set in bookkeeper-server/conf/bk_server.conf.
+
+The important parameters are:
+
+* @bookiePort@, Port number that the bookie listens on; 
+* @zkServers@, Comma separated list of ZooKeeper servers with a hostname:port format; 
+* @journalDirectory@, Path for Log Device (stores bookie write-ahead log); 
+* @ledgerDirectories@, Path for Ledger Device (stores ledger entries); 
+
+Ideally, @journalDirectory@ and @ledgerDirectories@ are each on a different device. See "Bookie Configuration Parameters":./bookieConfigParams.html for a full list of configuration parameters.
+
+To *stop* a bookie running in the background, execute the following command:
+
+@bookkeeper-server/bin/bookkeeper-daemon.sh stop bookie [-force]@
+@-force@ is optional. It stops the bookie forcefully if the bookie server has not stopped gracefully within _BOOKIE_STOP_TIMEOUT_ (an environment variable), which is 30 seconds by default.
+
+h3. Upgrading
+
+From time to time, we may make changes to the filesystem layout of the bookie, which are incompatible with previous versions of bookkeeper and require that directories used with previous versions are upgraded. If you upgrade your bookkeeper software, and an upgrade is required, then the bookie will fail to start and print an error such as:
+
+@2012-05-25 10:41:50,494 - ERROR - [main:Bookie@246] - Directory layout version is less than 3, upgrade needed@
+
+BookKeeper provides a utility for upgrading the filesystem.
+@bookkeeper-server/bin/bookkeeper upgrade@
+
+The upgrade application takes 3 possible switches, @--upgrade@, @--rollback@ or @--finalize@. A normal upgrade process looks like this:
+
+# @bookkeeper-server/bin/bookkeeper upgrade --upgrade@
+# @bookkeeper-server/bin/bookkeeper bookie@
+# Check everything is working. Kill bookie, ^C
+# If everything is ok, @bookkeeper-server/bin/bookkeeper upgrade --finalize@
+# Start bookie again @bookkeeper-server/bin/bookkeeper bookie@
+# If something is amiss, you can roll back the upgrade @bookkeeper-server/bin/bookkeeper upgrade --rollback@
+
+h3. Formatting
+
+To format the bookie metadata in Zookeeper, execute the following command once:
+
+@bookkeeper-server/bin/bookkeeper shell metaformat [-nonInteractive] [-force]@
+
+To format the bookie local filesystem data, execute the following command on each bookie node:
+
+@bookkeeper-server/bin/bookkeeper shell bookieformat [-nonInteractive] [-force]@
+
+The @-nonInteractive@ and @-force@ switches are optional.
+
+If @-nonInteractive@ is set, the user will not be asked to confirm the format operation. If old data exists, the format operation will abort, unless the @-force@ switch has been specified, in which case it will proceed.
+
+By default, the user will be prompted to confirm the format operation if old data exists.
+
+h3. Logging
+
+BookKeeper uses "slf4j":http://www.slf4j.org for logging, with the log4j bindings enabled by default. To enable logging from a bookie, create a log4j.properties file and point the environment variable BOOKIE_LOG_CONF to the configuration file. The path to the log4j.properties file must be absolute.
+
+@export BOOKIE_LOG_CONF=/tmp/log4j.properties@
+@bookkeeper-server/bin/bookkeeper bookie@
+
+h3. Missing disks or directories
+
+Replacing disks or removing directories accidentally can cause a bookie to fail while trying to read a ledger fragment which the ledger metadata has claimed exists on the bookie. For this reason, when a bookie is started for the first time, its disk configuration is fixed for the lifetime of that bookie. Any change to the disk configuration of the bookie, such as a crashed disk or an accidental configuration change, will result in the bookie being unable to start with the following error:
+
+@2012-05-29 18:19:13,790 - ERROR - [main:BookieServer@314] - Exception running bookie server : @
+@org.apache.bookkeeper.bookie.BookieException$InvalidCookieException@
+@.......at org.apache.bookkeeper.bookie.Cookie.verify(Cookie.java:82)@
+@.......at org.apache.bookkeeper.bookie.Bookie.checkEnvironment(Bookie.java:275)@
+@.......at org.apache.bookkeeper.bookie.Bookie.<init>(Bookie.java:351)@
+
+If the change was the result of an accidental configuration change, the change can be reverted and the bookie can be restarted. However, if the change cannot be reverted, such as is the case when you want to add a new disk or replace a disk, the bookie must be wiped and then all its data re-replicated onto it. To do this, do the following:
+
+# Increment the _bookiePort_ in _bk_server.conf_.
+# Ensure that all directories specified by _journalDirectory_ and _ledgerDirectories_ are empty.
+# Start the bookie.
+# Run @bin/bookkeeper org.apache.bookkeeper.tools.BookKeeperTools <zkserver> <oldbookie> <newbookie>@ to re-replicate data. <oldbookie> and <newbookie> are identified by their external IP and bookiePort. For example if this process is being run on a bookie with an external IP of 192.168.1.10, with an old _bookiePort_ of 3181 and a new _bookiePort_ of 3182, and with zookeeper running on _zk1.example.com_, the command to run would be <br/>@bin/bookkeeper org.apache.bookkeeper.tools.BookKeeperTools zk1.example.com 192.168.1.10:3181 192.168.1.10:3182@. See "Bookie Recovery":./bookieRecovery.html for more details on the re-replication process.
+
+The mechanism to prevent the bookie from starting up in the case of configuration changes exists to prevent the following silent failures:
+
+# A strict subset of the ledger devices (among multiple ledger devices) has been replaced, consequently making the content of the replaced devices unavailable;
+# A strict subset of the ledger directories has been accidentally deleted.
+
+h3. Full or failing disks
+
+A bookie can go into read-only mode if it detects problems with its disks. In read-only mode, the bookie will serve read requests, but will not allow any writes. Any ledger currently writing to the bookie will replace the bookie in its ensemble. No new ledgers will select the read-only bookie for writing.
+
+The bookie goes into read-only mode in the following conditions.
+
+# All disks are full.
+# An error occurred flushing to the ledger disks.
+# An error occurred writing to the journal disk.
+
+Important parameters are:
+
+* @readOnlyModeEnabled@, whether read-only mode is enabled. If read-only mode is not enabled, the bookie will shut down on encountering any of the above conditions. By default, read-only mode is disabled.
+* @diskUsageThreshold@, fraction of disk usage at which a disk will be considered full. This value must be between 0 and 1.0. By default, the value is 0.95.
+* @diskCheckInterval@, interval at which the disks are checked to see if they are full. Specified in milliseconds. By default the check occurs every 10000 milliseconds (10 seconds).
+
+h2. Running Autorecovery nodes
+
+To run autorecovery nodes, we execute the following command in every Bookie node:
+ @bookkeeper-server/bin/bookkeeper autorecovery@
+
+Configuration parameters for autorecovery can be set in *bookkeeper-server/conf/bk_server.conf*.
+
+Important parameters are:
+
+* @auditorPeriodicCheckInterval@, interval at which the auditor will do a check of all ledgers in the cluster. By default this runs once a week. The interval is set in seconds. To disable the periodic check completely, set this to 0. Note that periodic checking will put extra load on the cluster, so it should not be run more frequently than once a day.
+
+* @rereplicationEntryBatchSize@, the number of entries which a replication worker will rereplicate in parallel. The default value is 10. A larger value for this parameter will increase the speed at which autorecovery occurs but will increase the memory requirement of the autorecovery process, and create more load on the cluster.
+
+* @openLedgerRereplicationGracePeriod@, the amount of time, in milliseconds, which a recovery worker will wait before recovering a ledger segment which has no defined end, i.e. the client is still writing to that segment. If the client is still active, it should detect the bookie failure, and start writing to a new ledger segment, and a new ensemble, which doesn't include the failed bookie. Creating a new ledger segment will define the end of the previous segment. If, after the grace period, the ledger segment's end has not been defined, we assume the writing client has crashed. The ledger is fenced and the client is blocked from writing any more entries to the ledger. The default value is 30000ms.
+
+
+h3. Disabling Autorecovery during maintenance
+
+It is useful to disable autorecovery during maintenance, for example, to avoid a Bookie's data being unnecessarily rereplicated when it is only being taken down for a short period to update the software, or change the configuration.
+
+To disable autorecovery, run:
+@bookkeeper-server/bin/bookkeeper shell autorecovery -disable@
+
+To reenable, run:
+@bookkeeper-server/bin/bookkeeper shell autorecovery -enable@
+
+Autorecovery enable/disable only needs to be run once for the whole cluster, and not individually on each Bookie in the cluster.
+
+h2. Setting up a test ensemble
+
+Sometimes it is useful to run an ensemble of bookies on your local machine for testing. We provide a utility for doing this. It will set up N bookies and a zookeeper instance locally. The data on these bookies and on the zookeeper instance is not persisted over restarts, so this should never be used in a production environment. To run a test ensemble of 10 bookies, do the following:
+
+@bookkeeper-server/bin/bookkeeper localbookie 10@
+
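
Note on full or failing disks (not part of the committed file): the interaction between @diskUsageThreshold@ and @readOnlyModeEnabled@ described in the guide can be illustrated with a short Python sketch. It assumes that usage is computed as one minus the usable fraction of each disk and that a disk counts as full when usage is strictly above the threshold; both the strict comparison and the function names are assumptions made for illustration, not taken from the BookKeeper source. Only the "all disks are full" condition is modelled, not flush or journal write errors.

```python
from typing import List, Tuple


def disk_is_full(total_bytes: int, usable_bytes: int, threshold: float = 0.95) -> bool:
    """True when the used fraction of the disk exceeds the threshold."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_fraction = 1.0 - usable_bytes / total_bytes
    return used_fraction > threshold  # strict comparison is an assumption


def bookie_state(disks: List[Tuple[int, int]], read_only_mode_enabled: bool,
                 threshold: float = 0.95) -> str:
    """Map (total, usable) pairs for each ledger disk to a bookie state."""
    if not disks:
        raise ValueError("at least one disk is required")
    if not all(disk_is_full(total, usable, threshold) for total, usable in disks):
        return "writable"
    # All disks are full: go read-only if allowed, otherwise shut down.
    return "read-only" if read_only_mode_enabled else "shutdown"


if __name__ == "__main__":
    print(bookie_state([(100, 2), (100, 50)], read_only_mode_enabled=True))
    print(bookie_state([(100, 2), (100, 1)], read_only_mode_enabled=False))
```

In a running bookie this check would be repeated every @diskCheckInterval@; the sketch only shows a single evaluation.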

Added: bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperConfigParams.textile
URL: http://svn.apache.org/viewvc/bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperConfigParams.textile?rev=1743979&view=auto
==============================================================================
--- bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperConfigParams.textile (added)
+++ bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperConfigParams.textile Sun May 15 21:38:37 2016
@@ -0,0 +1,39 @@
+Title:        BookKeeper Configuration Parameters
+Notice: Licensed under the Apache License, Version 2.0 (the "License");
+        you may not use this file except in compliance with the License. You may
+        obtain a copy of the License at "http://www.apache.org/licenses/LICENSE-2.0":http://www.apache.org/licenses/LICENSE-2.0.
+        .        
+        Unless required by applicable law or agreed to in writing,
+        software distributed under the License is distributed on an "AS IS"
+        BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+        implied. See the License for the specific language governing permissions
+        and limitations under the License.
+        .
+
+h1. BookKeeper Configuration Parameters
+
+This page contains detailed information about configuration parameters used for configuring a BookKeeper client.
+
+h3. General parameters
+
+| @zkServers@ | A list of one or more servers on which ZooKeeper is running. The list is given as comma-separated values, e.g., zk1:2181,zk2:2181,zk3:2181. |
+| @zkTimeout@ | ZooKeeper client session timeout in milliseconds. The default value is 10,000. |
+| @throttle@ | A throttle value used to prevent the client from running out of memory when it produces more requests than the bookie servers can handle. The default is 5,000. |
+| @readTimeout@ | The number of seconds the BookKeeper client waits without hearing a response from a bookie before it considers the bookie failed. The default is 5 seconds. |
+| @numWorkerThreads@ | The number of worker threads used by the BookKeeper client to submit operations. The default value is the number of available processors. |
+
+h3. NIO server settings
+
+| @clientTcpNoDelay@ | This setting enables or disables Nagle's algorithm, which improves the efficiency of TCP/IP networks by reducing the number of packets that need to be sent over the network. If you are sending many small messages, such that more than one can fit in a single IP packet, setting @clientTcpNoDelay@ to false to enable Nagle's algorithm can provide better performance. Default value is true. |
+
+h3. Ledger manager settings
+
+| @ledgerManagerType@ | This parameter determines the type of ledger manager used to manage how ledgers are stored, manipulated, and garbage collected. See "BookKeeper Internals":./bookkeeperInternals.html for detailed info. Default value is flat. |
+| @zkLedgersRootPath@ | Root zookeeper path to store ledger metadata. Default is /ledgers. |
+
+h3. Bookie recovery settings
+
+The bookie recovery tool currently needs a digest type and a password to open ledgers for recovery. BookKeeper assumes that all ledgers were created with the same DigestType and Password. In the future, it will need to know, for each ledger, which DigestType and Password were used to create it before opening it.
+
+| @digestType@ | Digest type used to open ledgers from bookie recovery tool. |
+| @passwd@ | Password used to open ledgers from bookie recovery tool. |

Added: bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperInternals.textile
URL: http://svn.apache.org/viewvc/bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperInternals.textile?rev=1743979&view=auto
==============================================================================
--- bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperInternals.textile (added)
+++ bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperInternals.textile Sun May 15 21:38:37 2016
@@ -0,0 +1,84 @@
+Title:        BookKeeper Internals
+Notice: Licensed under the Apache License, Version 2.0 (the "License");
+        you may not use this file except in compliance with the License. You may
+        obtain a copy of the License at "http://www.apache.org/licenses/LICENSE-2.0":http://www.apache.org/licenses/LICENSE-2.0.
+        .        
+        Unless required by applicable law or agreed to in writing,
+        software distributed under the License is distributed on an "AS IS"
+        BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+        implied. See the License for the specific language governing permissions
+        and limitations under the License.
+        .
+
+h2. Bookie Internals
+
+p. A bookie server stores its data in multiple ledger directories and its journal files in a journal directory. Ideally, journal files are stored separately from data files, which increases throughput and decreases latency.
+
+h3. The Bookie Journal
+
+p. The journal directory contains one kind of file:
+
+* @{timestamp}.txn@ - holds transactions executed in the bookie server.
+
+p. Before persisting ledger index and data to disk, a bookie ensures that the transaction that represents the update is written to a journal in non-volatile storage. A new journal file, named with the current timestamp, is created when a bookie starts or when the old journal file reaches its maximum size.
+
+p. A bookie supports journal rolling to remove old journal files. In order to remove old journal files safely, the bookie server records a LastLogMark in the Ledger Device, which indicates that all updates (including index and data) before LastLogMark have been persisted to the Ledger Device.
+
+p. LastLogMark contains two parts:
+
+* @LastLogId@ - indicates the journal file in which the transaction was persisted.
+* @LastLogPos@ - indicates the position at which the transaction was persisted in the LastLogId journal file.
+
+p. You may use the following settings to fine-tune the behavior of journalling on bookies:
+
+| @journalMaxSizeMB@ | Journal file size limit. When a journal reaches this limit, it is closed and a new journal file is created. |
+| @journalMaxBackups@ | How many old journal files to keep whose id is less than the journal id of LastLogMark. |
+
+bq. NOTE: keeping a number of old journal files can be useful for manual recovery in special cases.
+
+h1. ZooKeeper Metadata
+
+p. BookKeeper requires a ZooKeeper installation to store metadata, and the list of ZooKeeper servers must be passed as a parameter to the constructor of the BookKeeper class (@org.apache.bookkeeper.client.BookKeeper@). To set up ZooKeeper, please check the "ZooKeeper documentation":http://zookeeper.apache.org/doc/trunk/index.html. 
+
+p. BookKeeper provides two mechanisms to organize its metadata in ZooKeeper. By default, the @FlatLedgerManager@ is used, and 99% of users should never need to look at anything else. However, when there are many concurrently active ledgers (> 50,000), @HierarchicalLedgerManager@ should be used. For that many ledgers, a hierarchical approach is needed because of a limit ZooKeeper places on packet sizes; see the "JIRA Issue":https://issues.apache.org/jira/browse/BOOKKEEPER-39.
+
+| @FlatLedgerManager@ | All ledger metadata are placed as children in a single zookeeper path. |
+| @HierarchicalLedgerManager@ | All ledger metadata are partitioned into 2-level znodes. |
+
+h2. Flat Ledger Manager
+
+p. All ledgers' metadata are put in a single ZooKeeper path. Each ledger node is created as a ZooKeeper sequential node, which ensures the uniqueness of the ledger id, and is prefixed with 'L'.
+
+p. A bookie server keeps the active ledgers it owns in a hash map, so it is easy for the bookie server to find which ledgers have been deleted from ZooKeeper and garbage collect them. The garbage collection flow is as follows:
+
+* Fetch all existing ledgers from zookeeper (@zkActiveLedgers@).
+* Fetch all ledgers currently active within the Bookie (@bkActiveLedgers@).
+* Loop over @bkActiveLedgers@ to find those ledgers which do not exist in @zkActiveLedgers@ and garbage collect them.
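+
+p. As a rough illustration of the flow above, the comparison amounts to a set difference. The following Python sketch is not BookKeeper code: the function name @ledgers_to_collect@ and the sample ledger ids are hypothetical, and the two inputs stand in for the lists fetched in the first two steps.
+
```python
def ledgers_to_collect(zk_active_ledgers, bk_active_ledgers):
    """Return the ledger ids a bookie still holds that no longer exist in ZooKeeper.

    Both arguments are iterables of ledger ids (hypothetical stand-ins for
    zkActiveLedgers and bkActiveLedgers).
    """
    zk_ids = set(zk_active_ledgers)
    # Loop over the bookie's ledgers and keep those missing from ZooKeeper.
    return sorted(lid for lid in bk_active_ledgers if lid not in zk_ids)


if __name__ == "__main__":
    # Ledgers 3 and 4 were deleted from the metadata store, so they can be collected.
    print(ledgers_to_collect([1, 2, 5], [1, 2, 3, 4, 5]))
```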
+
+h2. Hierarchical Ledger Manager
+
+p. @HierarchicalLedgerManager@ first obtains a globally unique id from ZooKeeper using an EPHEMERAL_SEQUENTIAL znode.
+
+p. Since ZooKeeper sequential counter has a format of %10d -- that is 10 digits with 0 (zero) padding, i.e. "&lt;path&gt;0000000001", @HierarchicalLedgerManager@ splits the generated id into 3 parts :
+
+@{level1 (2 digits)}{level2 (4 digits)}{level3 (4 digits)}@
+
+p. These 3 parts are used to form the actual ledger node path used to store ledger metadata:
+
+@{ledgers_root_path}/{level1}/{level2}/L{level3}@
+
+p. E.g. Ledger 0000000001 is split into 3 parts 00, 0000, 0001, and is stored in znode /{ledgers_root_path}/00/0000/L0001. So each znode can have at most 10000 ledgers, which avoids the problem of the child list being larger than the maximum ZooKeeper packet size.
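+
+p. The split described above can be sketched in a few lines of Python. This is only an illustration of the path layout, not the actual @HierarchicalLedgerManager@ code; the function name @ledger_znode_path@ is hypothetical, and @/ledgers@ is used as the root path only because it is the documented default of @zkLedgersRootPath@.
+
```python
def ledger_znode_path(ledgers_root_path, ledger_id):
    """Build the hierarchical znode path for a ledger id.

    The id is zero-padded to 10 digits and split into
    level1 (2 digits), level2 (4 digits) and level3 (4 digits).
    """
    padded = "%010d" % ledger_id
    if ledger_id < 0 or len(padded) != 10:
        raise ValueError("ledger id does not fit in 10 digits: %r" % ledger_id)
    level1, level2, level3 = padded[:2], padded[2:6], padded[6:]
    return "%s/%s/%s/L%s" % (ledgers_root_path, level1, level2, level3)


if __name__ == "__main__":
    print(ledger_znode_path("/ledgers", 1))
    print(ledger_znode_path("/ledgers", 123456789))
```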
+
+p. Bookie server manages its active ledgers in a sorted map, which simplifies access to active ledgers in a particular (level1, level2) partition.
+
+p. Garbage collection in bookie server is processed node by node as follows:
+
+* Fetch all level1 nodes by calling zk#getChildren(ledgerRootPath).
+** For each level1 node, fetch its level2 nodes.
+** For each partition (level1, level2):
+*** Fetch all existing ledgers in ZooKeeper that belong to partition (level1, level2) (@zkActiveLedgers@).
+*** Fetch all ledgers currently active in the bookie that belong to partition (level1, level2) (@bkActiveLedgers@).
+*** Loop over @bkActiveLedgers@ to find those ledgers which do not exist in @zkActiveLedgers@, and garbage collect them.
+
+bq. NOTE: Hierarchical Ledger Manager is better suited to managing a large number of ledgers in BookKeeper.
+

Added: bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperJMX.textile
URL: http://svn.apache.org/viewvc/bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperJMX.textile?rev=1743979&view=auto
==============================================================================
--- bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperJMX.textile (added)
+++ bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperJMX.textile Sun May 15 21:38:37 2016
@@ -0,0 +1,32 @@
+Title:        BookKeeper JMX
+Notice: Licensed under the Apache License, Version 2.0 (the "License");
+        you may not use this file except in compliance with the License. You may
+        obtain a copy of the License at "http://www.apache.org/licenses/LICENSE-2.0":http://www.apache.org/licenses/LICENSE-2.0.
+        .
+        .        
+        Unless required by applicable law or agreed to in writing,
+        software distributed under the License is distributed on an "AS IS"
+        BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+        implied. See the License for the specific language governing permissions
+        and limitations under the License.
+        .
+        .
+
+h1. JMX
+
+Apache BookKeeper has extensive support for JMX, which allows viewing and managing a BookKeeper cluster.
+
+This document assumes that you have basic knowledge of JMX. See the "Sun JMX Technology":http://java.sun.com/javase/technologies/core/mntr-mgmt/javamanagement/ page to get started with JMX.
+
+See the "JMX Management Guide":http://java.sun.com/javase/6/docs/technotes/guides/management/agent.html for details on setting up local and remote management of VM instances. By default the included __bookkeeper__ script supports only local management - review the linked document to enable support for remote management (beyond the scope of this document).
+
+__Bookie Server__ is a JMX manageable server, which registers the proper MBeans during initialization to support JMX monitoring and management of the instance.
+
+h1. Bookie Server MBean Reference
+
+This table details JMX for a bookie server.
+
+| _.MBean | _.MBean Object Name | _.Description |
+| BookieServer | BookieServer_<port> | Represents a bookie server. Note that the object name includes the port that the bookie server listens on. It is the root MBean for a bookie server and includes statistics for that server, e.g. the number of packets sent/received and statistics for add/read operations. |
+| Bookie | Bookie | Provides bookie statistics. Currently it only returns the length of the journal queue waiting to be committed. |
+| LedgerCache | LedgerCache | Provides ledger cache statistics, e.g. the number of pages cached in the page cache and the number of files opened for ledger index files. |

Added: bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperLedgers2Logs.textile
URL: http://svn.apache.org/viewvc/bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperLedgers2Logs.textile?rev=1743979&view=auto
==============================================================================
--- bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperLedgers2Logs.textile (added)
+++ bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperLedgers2Logs.textile Sun May 15 21:38:37 2016
@@ -0,0 +1,56 @@
+Title:     From Ledgers to Logs
+Notice:    Licensed to the Apache Software Foundation (ASF) under one
+           or more contributor license agreements.  See the NOTICE file
+           distributed with this work for additional information
+           regarding copyright ownership.  The ASF licenses this file
+           to you under the Apache License, Version 2.0 (the
+           "License"); you may not use this file except in compliance
+           with the License.  You may obtain a copy of the License at
+           .
+             http://www.apache.org/licenses/LICENSE-2.0
+           .
+           Unless required by applicable law or agreed to in writing,
+           software distributed under the License is distributed on an
+           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+           KIND, either express or implied.  See the License for the
+           specific language governing permissions and limitations
+           under the License.
+
+This document describes the bookkeeper replication protocol, and the guarantees it gives. It assumes you have a general idea about leader election and log replication and how you can use these in your system. If not, have a look at the bookkeeper "tutorial":https://github.com/ivankelly/bookkeeper-tutorial first.
+
+h1. Ledgers to Logs
+
+Bookkeeper provides a primitive, ledgers, which can be used to build a replicated log for your system. All guarantees provided by bookkeeper are on ledgers. You can learn about the guarantees of ledgers "here":./bookkeeperProtocol.html. Guarantees on the whole log can be built using the ledger guarantees and any consistent datastore with a compare-and-swap(CAS) primitive. In this case, we describe a log using zookeeper as the datastore, but others could theoretically be used. 
+
+A log in bookkeeper is built from a number of ledgers, with a fixed order. A ledger represents a single segment of the log. A ledger could be the whole period that one node was the leader, or there could be multiple ledgers for a single period of leadership. However, there can only ever be one leader that adds entries to a single ledger. Ledgers cannot be reopened for writing once they have been closed/recovered.
+
+It's important to note that bookkeeper doesn't provide leader election. You must use a system like Zookeeper for this.
+
+In many cases, leader election is really leader suggestion. Multiple nodes could think that they are leader at any one time. It is the job of the log to guarantee that only one can write changes to the system.
+
+h3. Opening a log
+
+Once a node thinks it is leader for a particular log, it must take the following steps.
+
+# read the list of ledgers for the log
+# fence the last 2 ledgers[1] in the list
+# create a new ledger
+# add the new ledger to the ledger list
+# write the new ledger list back to the datastore using a CAS operation.
+
+The fencing in step 2 and the compare-and-swap operation in step 5 together prevent two nodes from holding leadership of the log at the same time. Ledger fencing is described in "Bookkeeper Protocol":./bookkeeperProtocol.html. The compare-and-swap operation will fail if the list of ledgers has changed between reading it and writing back the new list. When the CAS operation fails, the leader must start at step 1 again. Even better, it should first check with the system that provides leader election that it is in fact still the leader. The protocol will work correctly without this check, though it will make very little progress if two nodes think they are leader and are duelling for the log. 
+
+The node must not serve any writes until step 5 completes successfully.
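+
+The five steps above can be sketched as a retry loop around a compare-and-swap. The Python below is a minimal illustration and not BookKeeper code: @VersionedStore@ is a toy in-memory stand-in for the datastore, and @open_log@, @fence_ledger@, @create_ledger@ and @max_attempts@ are hypothetical names for hooks that the application would supply.
+
```python
class VersionedStore:
    """Toy in-memory stand-in for a datastore with compare-and-swap (assumption)."""

    def __init__(self, ledgers=None):
        self.value = list(ledgers or [])
        self.version = 0

    def read(self):
        # Return a copy of the ledger list together with its version.
        return list(self.value), self.version

    def compare_and_swap(self, new_value, expected_version):
        # The write succeeds only if nobody changed the list since it was read.
        if expected_version != self.version:
            return False
        self.value = list(new_value)
        self.version += 1
        return True


def open_log(store, fence_ledger, create_ledger, max_attempts=10):
    """Follow steps 1-5; return the id of the new ledger that may accept writes."""
    for _ in range(max_attempts):
        ledgers, version = store.read()                # step 1: read the ledger list
        for ledger_id in ledgers[-2:]:                 # step 2: fence the last 2 ledgers
            fence_ledger(ledger_id)
        new_ledger = create_ledger()                   # step 3: create a new ledger
        updated = ledgers + [new_ledger]               # step 4: add it to the list
        if store.compare_and_swap(updated, version):   # step 5: CAS the list back
            return new_ledger
        # CAS failed: the list changed underneath us, so start again at step 1.
    raise RuntimeError("could not update the ledger list")


if __name__ == "__main__":
    store = VersionedStore([1, 2, 3])
    fenced = []
    ids = iter([10, 11])
    print(open_log(store, fenced.append, lambda: next(ids)), fenced, store.read())
```
+
+In this sketch no write is served before @open_log@ returns, which mirrors the rule that the node must wait for step 5 to complete successfully.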
+
+h3. Rolling ledgers
+
+The leader may wish to close the current ledger and open a new one every so often. Ledgers can only be deleted as a whole. If you don't roll the log, you won't be able to clean up old entries in the log without a leader change. By closing the current ledger and adding a new one, the leader allows the log to be truncated whenever that data is no longer needed. The steps for rolling the log are similar to those for creating a new ledger.  
+
+# create a new ledger
+# add the new ledger to the ledger list
+# write the new ledger list to the datastore using CAS
+# close the previous ledger
+
+By deferring the closing of the previous ledger until step 4, we can continue writing to the log while we perform metadata update operations to add the new ledger. This is safe as long as you fence the last _2_ ledgers when acquiring leadership.
+
+fn1. We fence 2 ledgers, as the writer may be writing to the penultimate ledger while adding the last ledger to the ledger list.

Added: bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperMetadata.textile
URL: http://svn.apache.org/viewvc/bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperMetadata.textile?rev=1743979&view=auto
==============================================================================
--- bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperMetadata.textile (added)
+++ bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperMetadata.textile Sun May 15 21:38:37 2016
@@ -0,0 +1,40 @@
+Title:        BookKeeper Metadata Management
+Notice: Licensed under the Apache License, Version 2.0 (the "License");
+        you may not use this file except in compliance with the License. You may
+        obtain a copy of the License at "http://www.apache.org/licenses/LICENSE-2.0":http://www.apache.org/licenses/LICENSE-2.0.
+        .
+        .
+        Unless required by applicable law or agreed to in writing,
+        software distributed under the License is distributed on an "AS IS"
+        BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+        implied. See the License for the specific language governing permissions
+        and limitations under the License.
+        .
+        .
+
+h1. Metadata Management
+
+There are two kinds of metadata that need to be managed in BookKeeper: one is the __list of available bookies__, which is used to track server availability (ZooKeeper is a natural fit for this); the other is __ledger metadata__, which can be handled efficiently by different kinds of key/value stores that offer __CAS (Compare And Set)__ semantics.
+
+__Ledger metadata__ is handled by __LedgerManager__ and can be plugged with various storage mediums.
+
+h2. Ledger Metadata Management
+
+The operations on the metadata of a ledger are quite straightforward. They are:
+
+* @createLedger@: create a new entry to store the given ledger metadata. A unique id should be generated as the ledger id for the new ledger.
+* @removeLedgerMetadata@: remove the entry of a ledger from the metadata store. A __Version__ object is provided to do a conditional remove. If the given __Version__ object doesn't match the current __Version__ in the metadata store, __MetadataVersionException__ should be thrown to indicate a version conflict. __NoSuchLedgerExistsException__ should be returned if the ledger metadata entry doesn't exist.
+* @readLedgerMetadata@: read the metadata of a ledger from the metadata store. The new __version__ should be set on the returned __LedgerMetadata__ object. __NoSuchLedgerExistsException__ should be returned if the entry of the ledger metadata doesn't exist.
+* @writeLedgerMetadata@: update the metadata of a ledger matching the given __Version__. The update should be rejected and __MetadataVersionException__ should be returned when the given __Version__ doesn't match the current __Version__ in the metadata store. __NoSuchLedgerExistsException__ should be returned if the entry of the ledger metadata doesn't exist. The version of the __LedgerMetadata__ object should be set to the new __Version__ generated by applying this update.
+* @asyncProcessLedgers@: loop through all existing ledgers in the metadata store and apply a __Processor__. The __Processor__ provided is executed for each ledger. If a failure happens during iteration, the iteration should be terminated and the __final callback__ triggered with failure. Otherwise, the __final callback__ is triggered after all ledgers are processed. Neither ordering nor transactional guarantees need to be provided by implementations of this interface.
+* @getLedgerRanges@: return a list of ranges for ledgers in the metadata store. The ledger metadata itself does not need to be fetched. Only the ledger ids are needed. No ordering is required, but there must be no overlap between ledger ranges, and each ledger range must contain all the ledgers in the metadata store between its endpoints (i.e. for a ledger range [x, y], all ledger ids larger than or equal to x and smaller than or equal to y should exist only in this range). __getLedgerRanges__ is used in the __ScanAndCompare__ gc algorithm.
+
+h1. How to choose a metadata storage medium for BookKeeper.
+
+From the interface, several requirements need to be met when choosing a metadata storage medium for BookKeeper:
+
+* @Check and Set (CAS)@: The ability to apply an update only when a specific condition holds, e.g., a specific version (ZooKeeper) or the same content (HBase).
+* @Optimized for Writes@: The metadata access pattern for BookKeeper is a read first, followed by continuous updates.
+* @Optimized for Scans@: Scans are required for a __ScanAndCompare__ gc algorithm.
+
+__ZooKeeper__ is the default implementation for BookKeeper metadata management. __ZooKeeper__ holds data in memory, provides a filesystem-like namespace, and meets all the above requirements. __ZooKeeper__ can cover most uses of BookKeeper. However, if your application needs to manage millions of ledgers, a more scalable solution would be __HBase__, which also meets the above requirements but is more complicated to set up.

Added: bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperOverview.textile
URL: http://svn.apache.org/viewvc/bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperOverview.textile?rev=1743979&view=auto
==============================================================================
--- bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperOverview.textile (added)
+++ bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperOverview.textile Sun May 15 21:38:37 2016
@@ -0,0 +1,185 @@
+Title:        BookKeeper overview
+Notice: Licensed under the Apache License, Version 2.0 (the "License");
+        you may not use this file except in compliance with the License. You may
+        obtain a copy of the License at "http://www.apache.org/licenses/LICENSE-2.0":http://www.apache.org/licenses/LICENSE-2.0.
+        .        
+        Unless required by applicable law or agreed to in writing,
+        software distributed under the License is distributed on an "AS IS"
+        BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+        implied. See the License for the specific language governing permissions
+        and limitations under the License.
+        .
+
+h1. Abstract
+
+This guide contains detailed information about using BookKeeper for logging. It discusses the basic operations BookKeeper supports, and how to create logs and perform basic read and write operations on these logs.
+
+h1. BookKeeper introduction
+
+p. BookKeeper is a replicated service to reliably log streams of records. In BookKeeper, servers are "bookies", log streams are "ledgers", and each unit of a log (aka record) is a "ledger entry". BookKeeper is designed to be reliable; bookies, the servers that store ledgers, can crash, corrupt data, discard data, but as long as there are enough bookies behaving correctly the service as a whole behaves correctly. 
+
+p. The initial motivation for BookKeeper comes from the namenode of HDFS. Namenodes have to log operations in a reliable fashion so that recovery is possible in the case of crashes. We have found the applications for BookKeeper extend far beyond HDFS, however. Essentially, any application that requires append-only storage can replace its implementation with BookKeeper. BookKeeper has the advantage of writing efficiently, replicating for fault tolerance, and scaling throughput with the number of servers through striping. 
+
+p. At a high level, a bookkeeper client receives entries from a client application and stores them on sets of bookies, and there are a few advantages in having such a service: 
+
+* We can use hardware that is optimized for such a service. We currently believe that such a system has to be optimized only for disk I/O; 
+* We can have a pool of servers implementing such a log system, and shared among a number of servers; 
+* We can have a higher degree of replication with such a pool, which makes sense if the hardware necessary for it is cheaper compared to the one the application uses. 
+
+
+h1. In slightly more detail...
+
+p. BookKeeper implements highly available logs, and it has been designed with write-ahead logging in mind. Besides high availability due to the replicated nature of the service, it provides high throughput due to striping. As we write entries in a subset of bookies of an ensemble and rotate writes across available quorums, we are able to increase throughput with the number of servers for both reads and writes. Scalability is a property that is possible to achieve in this case due to the use of quorums. Other replication techniques, such as state-machine replication, do not enable such a property. 
+
+p. An application first creates a ledger before writing to bookies through a local BookKeeper client instance. Upon creating a ledger, a BookKeeper client writes metadata about the ledger to ZooKeeper. Each ledger currently has a single writer. This writer has to execute a close ledger operation before any other client can read from it. If the writer of a ledger does not close a ledger properly because, for example, it has crashed before having the opportunity of closing the ledger, then the next client that tries to open a ledger executes a procedure to recover it. As closing a ledger consists essentially of writing the last entry written to a ledger to ZooKeeper, the recovery procedure simply finds the last entry written correctly and writes it to ZooKeeper. 
+
+p. Note that currently this recovery procedure is executed automatically upon trying to open a ledger and no explicit action is necessary. Although two clients may try to recover a ledger concurrently, only one will succeed, the first one that is able to create the close znode for the ledger. 
+
+h1. Bookkeeper elements and concepts
+
+p. BookKeeper uses four basic elements: 
+
+*  _Ledger_ : A ledger is a sequence of entries, and each entry is a sequence of bytes. Entries are written sequentially to a ledger and at most once. Consequently, ledgers have append-only semantics; 
+*  _BookKeeper client_ : A client runs along with a BookKeeper application, and it enables applications to execute operations on ledgers, such as creating a ledger and writing to it; 
+*  _Bookie_ : A bookie is a BookKeeper storage server. Bookies store the content of ledgers. For any given ledger L, we call an _ensemble_ the group of bookies storing the content of L. For performance, we store on each bookie of an ensemble only a fragment of a ledger. That is, we stripe when writing entries to a ledger such that each entry is written to a sub-group of bookies of the ensemble. 
+*  _Metadata storage service_ : BookKeeper requires a metadata storage service to store information related to ledgers and available bookies. We currently use ZooKeeper for such a task. 
+
+
+h1. Bookkeeper initial design
+
+p. A set of bookies implements BookKeeper, and we use a quorum-based protocol to replicate data across the bookies. There are basically two operations to an existing ledger: read and append. Here is the complete API list (more detail "here":./bookkeeperProgrammer.html):
+
+* Create ledger: creates a new empty ledger; 
+* Open ledger: opens an existing ledger for reading; 
+* Add entry: adds a record to a ledger either synchronously or asynchronously; 
+* Read entries: reads a sequence of entries from a ledger either synchronously or asynchronously 
+
+
+p. There is only a single client that can write to a ledger. Once that ledger is closed or the client fails, no more entries can be added. (We take advantage of this behavior to provide our strong guarantees.) There will not be gaps in the ledger. Fingers get broken, people get roughed up or end up in prison when books are manipulated, so there is no deleting or changing of entries. 
+
+!images/bk-overview.jpg!
+p. A simple use of BookKeeper is to implement a write-ahead transaction log. A server maintains an in-memory data structure (with periodic snapshots for example) and logs changes to that structure before it applies the change. The application server creates a ledger at startup and stores the ledger id and password in a well known place (ZooKeeper maybe). When it needs to make a change, the server adds an entry with the change information to a ledger and applies the change when BookKeeper adds the entry successfully. The server can even use asyncAddEntry to queue up many changes for high change throughput. BookKeeper meticulously logs the changes in order and calls the completion functions in order.
+
+p. When the application server dies, a backup server will come online, get the last snapshot and then it will open the ledger of the old server and read all the entries from the time the snapshot was taken. (Since it doesn't know the last entry number it will use MAX_INTEGER). Once all the entries have been processed, it will close the ledger and start a new one for its use. 
+
+p. A client library takes care of communicating with bookies and managing entry numbers. An entry has the following fields: 
+
+|Field|Type|Description|
+|Ledger number|long|The id of the ledger of this entry|
+|Entry number|long|The id of this entry|
+|last confirmed ( _LC_ )|long|id of the last recorded entry|
+|data|byte[]|the entry data (supplied by application)|
+|authentication code|byte[]|Message authentication code that includes all other fields of the entry|
+
+
+p. The client library generates a ledger entry. None of the fields are modified by the bookies and only the first three fields are interpreted by the bookies. 
+
+p. To add to a ledger, the client generates the entry above using the ledger number. The entry number will be one more than the last entry generated. The _LC_ field contains the last entry that has been successfully recorded by BookKeeper. If the client writes entries one at a time, _LC_ is the last entry id. But, if the client is using asyncAddEntry, there may be many entries in flight. An entry is considered recorded when both of the following conditions are met: 
+
+* the entry has been accepted by a quorum of bookies 
+* all entries with a lower entry id have been accepted by a quorum of bookies 
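+
+p. The two conditions above mean that the last recorded entry is the end of the longest gap-free run of quorum-accepted entries. The Python sketch below illustrates that rule only; it is not the client's actual bookkeeping. The function name @last_confirmed@ is hypothetical, and the sketch assumes that entry ids start at 0 and that -1 means no entry has been recorded yet.
+
```python
def last_confirmed(acked_entry_ids, previous_lc=-1):
    """Return the highest entry id such that it and every lower id is acknowledged.

    acked_entry_ids: ids of entries already accepted by a quorum of bookies.
    previous_lc: last known LC; -1 is an assumed marker for "nothing recorded".
    """
    acked = set(acked_entry_ids)
    lc = previous_lc
    # Advance only while the next consecutive entry has been accepted by a quorum.
    while lc + 1 in acked:
        lc += 1
    return lc


if __name__ == "__main__":
    # Entries 4 and 5 were accepted, but entry 3 is still in flight, so LC stays at 2.
    print(last_confirmed({0, 1, 2, 4, 5}))
```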
+
+
+ _LC_ seems mysterious right now, but it is too early to explain how we use it; just smile and move on. 
+
+p. Once all the other fields have been filled in, the client generates an authentication code with all of the previous fields. The entry is then sent to a quorum of bookies to be recorded. Any failures will result in the entry being sent to a new quorum of bookies. 
+
+p. To read, the client library initially contacts a bookie and starts requesting entries. If an entry is missing or invalid (a bad MAC for example), the client will make a request to a different bookie. By using quorum writes, as long as enough bookies are up we are guaranteed to eventually be able to read an entry. 
+
+h1. Bookkeeper metadata management
+
+p. Some metadata needs to be made available to BookKeeper clients: 
+
+* The available bookies; 
+* The list of ledgers; 
+* The list of bookies that have been used for a given ledger; 
+* The last entry of a ledger; 
+
+
+p. We maintain this information in ZooKeeper. Bookies use ephemeral nodes to indicate their availability. Clients use znodes to track ledger creation and deletion and also to know the end of the ledger and the bookies that were used to store the ledger. Bookies also watch the ledger list so that they can clean up ledgers that get deleted. 
+
+h1. Closing out ledgers
+
+p. The process of closing out the ledger and finding the last entry is difficult due to the durability guarantees of BookKeeper: 
+
+* If an entry has been successfully recorded, it must be readable. 
+* If an entry is read once, it must always be available to be read. 
+
+
+p. If the ledger was closed gracefully, ZooKeeper will have the last entry and everything will work well. But, if the BookKeeper client that was writing the ledger dies, there is some recovery that needs to take place. 
+
+p. The problematic entries are the ones at the end of the ledger. There can be entries in flight when a BookKeeper client dies. If the entry only gets to one bookie, the entry should not be readable since the entry will disappear if that bookie fails. If the entry is only on one bookie, that doesn't mean that the entry has not been recorded successfully; the other bookies that recorded the entry might have failed. 
+
+p. The trick to making everything work is to have a correct idea of a last entry. We do it in roughly three steps: 
+
+# Find the entry with the highest last recorded entry, _LC_ ; 
+# Find the highest consecutively recorded entry, _LR_ ; 
+# Make sure that all entries between _LC_ and _LR_ are on a quorum of bookies; 
+
+h1. Data Management in Bookies
+
+p. This section gives an overview of how a bookie manages its ledger fragments. 
+
+h2. Basic
+
+p. Bookies manage data in a log-structured way, which is implemented using three kinds of files:
+
+* _Journal_ : A journal file contains the BookKeeper transaction logs. Before any update takes place, a bookie ensures that a transaction describing the update is written to non-volatile storage. A new journal file is created once the bookie starts or the older journal file reaches the journal file size threshold.
+* _Entry Log_ : An entry log file manages the written entries received from BookKeeper clients. Entries from different ledgers are aggregated and written sequentially, while their offsets are kept as pointers in _LedgerCache_ for fast lookup. A new entry log file is created once the bookie starts or the older entry log file reaches the entry log size threshold. Old entry log files are removed by the _Garbage Collector Thread_ once they are not associated with any active ledger.
+* _Index File_ : An index file is created for each ledger, which comprises a header and several fixed-length index pages, recording the offsets of data stored in entry log files. 
+
+p. Since updating index files would introduce random disk I/O, for performance consideration, index files are updated lazily by a _Sync Thread_ running in the background. Before index pages are persisted to disk, they are gathered in _LedgerCache_ for lookup.
+
+* _LedgerCache_ : A memory pool that caches ledger index pages, which allows disk head scheduling to be managed more efficiently.
+
+h2. Add Entry
+
+p. When a bookie receives entries from clients to be written, these entries will go through the following steps to be persisted to disk:
+
+# Append the entry in _Entry Log_, return its position { logId , offset } ;
+# Update the index of this entry in _Ledger Cache_ ;
+# Append a transaction corresponding to this entry update in _Journal_ ;
+# Respond to BookKeeper client ;
+
+* For performance reasons, _Entry Log_ buffers entries in memory and commits them in batches, while _Ledger Cache_ holds index pages in memory and flushes them lazily. Data flushing, and how data integrity is ensured, is discussed in the following section, 'Data Flush'.
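+
+p. The four steps above can be pictured with a toy, in-memory model. This is only a sketch: the class @AddEntrySketch@, its fields and its methods are hypothetical names chosen for illustration, and none of the buffering, flushing or file handling of a real bookie is modelled.
+
```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory model of the add-entry path; all names here are hypothetical. */
final class AddEntrySketch {
    /** Position of an entry inside an entry log: { logId, offset }. */
    static final class Position {
        final long logId;
        final long offset;

        Position(long logId, long offset) {
            this.logId = logId;
            this.offset = offset;
        }
    }

    private final long logId = 0;          // this sketch uses a single entry log file
    private long nextOffset = 0;           // append position inside that entry log
    // Stand-in for the ledger cache: "ledgerId:entryId" -> position in the entry log.
    private final Map<String, Position> ledgerCache = new HashMap<>();
    // Stand-in for the journal: one record per entry update.
    private final List<String> journal = new ArrayList<>();

    /** Runs steps 1 to 3; returning to the caller stands in for step 4. */
    Position addEntry(long ledgerId, long entryId, byte[] data) {
        Position pos = new Position(logId, nextOffset);  // 1. append to the entry log
        nextOffset += data.length;
        ledgerCache.put(ledgerId + ":" + entryId, pos);  // 2. update the index
        journal.add("add " + ledgerId + ":" + entryId);  // 3. append a journal transaction
        return pos;                                      // 4. respond to the client
    }

    Position lookup(long ledgerId, long entryId) {
        return ledgerCache.get(ledgerId + ":" + entryId);
    }

    int journalSize() {
        return journal.size();
    }
}
```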
+
+h2. Data Flush
+
+p. Ledger index pages are flushed to index files in the following two cases:
+
+# _LedgerCache_ memory reaches its limit. There is no more space available to hold newer index pages. Dirty index pages will be evicted from _LedgerCache_ and persisted to index files.
+# A background thread _Sync Thread_ is responsible for flushing index pages from _LedgerCache_ to index files periodically.
+
+p. Besides flushing index pages, _Sync Thread_ is responsible for rolling journal files when they use too much disk space. 
+
+p. The data flush flow in _Sync Thread_ is as follows:
+
+# Records a _LastLogMark_ in memory. The _LastLogMark_ contains two parts: first one is _txnLogId_ (file id of a journal) and the second one is _txnLogPos_ (offset in a journal). The _LastLogMark_ indicates that those entries before it have been persisted to both index and entry log files.
+# Flushes dirty index pages from _LedgerCache_ to index file, and flushes entry log files to ensure all buffered entries in entry log files are persisted to disk.
+#* Ideally, a bookie just needs to flush index pages and entry log files that contain entries before _LastLogMark_. There is no such information in _LedgerCache_ and _Entry Log_ mapping to journal files, though. Consequently, the thread flushes _LedgerCache_ and _Entry Log_ entirely here, and may flush entries after the _LastLogMark_. Flushing more is not a problem, though, just redundant.
+# Persists _LastLogMark_ to disk, which means that both the entry data and the index pages of entries added before _LastLogMark_ have been persisted to disk. At this point it is safe to remove journal files created earlier than _txnLogId_.
+#* If the bookie has crashed before persisting _LastLogMark_ to disk, it still has journal files containing entries for which index pages may not have been persisted. Consequently, when this bookie restarts, it inspects journal files to restore those entries; data isn't lost.
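+
+p. The role of _LastLogMark_ can be illustrated with a small sketch. It assumes that journal file ids increase with creation time, so that "created earlier than _txnLogId_" becomes a numeric comparison; the class and method names are hypothetical and not taken from the bookie code.
+
```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model of a LastLogMark; names and id ordering are assumptions. */
final class LastLogMarkSketch {
    final long txnLogId;   // file id of a journal
    final long txnLogPos;  // offset within that journal

    LastLogMarkSketch(long txnLogId, long txnLogPos) {
        this.txnLogId = txnLogId;
        this.txnLogPos = txnLogPos;
    }

    /** Journals created earlier than txnLogId may be removed once the mark is on disk. */
    List<Long> removableJournals(List<Long> journalIds) {
        List<Long> removable = new ArrayList<>();
        for (long id : journalIds) {
            if (id < txnLogId) {
                removable.add(id);
            }
        }
        return removable;
    }

    /** After a restart, journal records at or after the mark may still need to be replayed. */
    boolean needsReplay(long logId, long pos) {
        return logId > txnLogId || (logId == txnLogId && pos >= txnLogPos);
    }
}
```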
+
+p. Using the above data flush mechanism, it is safe for the _Sync Thread_ to skip data flushing when the bookie shuts down. However, _Entry Logger_ uses _BufferedChannel_ to write entries in batches, and there might be data buffered in _BufferedChannel_ upon shutdown. The bookie needs to ensure that _Entry Logger_ flushes its buffered data during shutdown. Otherwise, _Entry Log_ files become corrupted with partial entries.
+
+p. As described above, _EntryLogger#flush_ is invoked in the following two cases:
+* in _Sync Thread_ : used to ensure entries added before _LastLogMark_ are persisted to disk.
+* in _ShutDown_ : used to ensure buffered data is persisted to disk, to avoid data corruption with partial entries.
+
+h2. Data Compaction
+
+p. In bookie server, entries of different ledgers are interleaved in entry log files. A bookie server runs a _Garbage Collector_ thread to delete un-associated entry log files to reclaim disk space. If a given entry log file contains entries from a ledger that has not been deleted, then the entry log file would never be removed and the occupied disk space never reclaimed. In order to avoid such a case, a bookie server compacts entry log files in _Garbage Collector_ thread to reclaim disk space.
+
+p. There are two kinds of compaction running with different frequency: _Minor Compaction_ and _Major Compaction_. The only differences between them are their threshold values and compaction intervals.
+
+# _Threshold_ : Size percentage of an entry log file occupied by those undeleted ledgers. Default minor compaction threshold is 0.2, while major compaction threshold is 0.8.
+# _Interval_ : How often to run the compaction. Default minor compaction interval is 1 hour, while major compaction interval is 1 day.
+
+p. NOTE: if either _Threshold_ or _Interval_ is set to less than or equal to zero, then compaction is disabled.
+
+p. The data compaction flow in _Garbage Collector Thread_ is as follows:
+
+# _Garbage Collector_ thread scans entry log files to get their entry log metadata, which records a list of ledgers comprising an entry log and their corresponding percentages.
+# With the normal garbage collection flow, once the bookie determines that a ledger has been deleted, the ledger will be removed from the entry log metadata and the size of the entry log reduced.
+# If the remaining size of an entry log file reaches a specified threshold, the entries of active ledgers in the entry log will be copied to a new entry log file.
+# Once all valid entries have been copied, the old entry log file is deleted.
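+
+p. The threshold check in this flow can be sketched as follows. The sketch assumes that an entry log becomes a compaction candidate once the fraction still occupied by undeleted ledgers drops below the threshold, and that a log with no live data is left to normal garbage collection. The names @remainingUsage@ and @shouldCompact@ are hypothetical.
+
```java
import java.util.Map;
import java.util.Set;

/** Sketch of the compaction decision for one entry log; names are hypothetical. */
final class CompactionSketch {
    /** Fraction of the entry log still occupied by ledgers that have not been deleted. */
    static double remainingUsage(Map<Long, Long> bytesPerLedger,
                                 Set<Long> deletedLedgers,
                                 long totalBytes) {
        long live = 0;
        for (Map.Entry<Long, Long> e : bytesPerLedger.entrySet()) {
            if (!deletedLedgers.contains(e.getKey())) {
                live += e.getValue();
            }
        }
        return (double) live / totalBytes;
    }

    /** A threshold of zero or less disables compaction, as noted above. */
    static boolean shouldCompact(double usage, double threshold) {
        // usage == 0 means nothing is left to copy; plain garbage collection deletes the file.
        return threshold > 0 && usage > 0 && usage < threshold;
    }
}
```
+
+p. With the default thresholds, an entry log that is 30% live would be skipped by minor compaction (0.2) but picked up by major compaction (0.8).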

Added: bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperProgrammer.textile
URL: http://svn.apache.org/viewvc/bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperProgrammer.textile?rev=1743979&view=auto
==============================================================================
--- bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperProgrammer.textile (added)
+++ bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperProgrammer.textile Sun May 15 21:38:37 2016
@@ -0,0 +1,99 @@
+Title:        BookKeeper Getting Started Guide
+Notice: Licensed under the Apache License, Version 2.0 (the "License");
+        you may not use this file except in compliance with the License. You may
+        obtain a copy of the License at "http://www.apache.org/licenses/LICENSE-2.0":http://www.apache.org/licenses/LICENSE-2.0.
+        .        
+        Unless required by applicable law or agreed to in writing,
+        software distributed under the License is distributed on an "AS IS"
+        BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+        implied. See the License for the specific language governing permissions
+        and limitations under the License.
+        .
+
+h1. Abstract
+
+This guide contains detailed information about using BookKeeper for write ahead logging. It discusses the basic operations BookKeeper supports, and how to create logs and perform basic read and write operations on these logs. The main classes used by BookKeeper client are "BookKeeper":./apidocs/org/apache/bookkeeper/client/BookKeeper.html and "LedgerHandle":./apidocs/org/apache/bookkeeper/client/LedgerHandle.html. 
+
+BookKeeper is the main client used to create, open and delete ledgers. A ledger is a log file in BookKeeper, which contains a sequence of entries. Only the client which creates a ledger can write to it. A LedgerHandle represents the ledger to the client, and allows the client to read and write entries. When the client has finished writing, it can close the LedgerHandle. Once a ledger has been closed, all clients that read from it are guaranteed to read the exact same entries in the exact same order. All methods of BookKeeper and LedgerHandle have synchronous and asynchronous versions. Internally, the synchronous versions are implemented using the asynchronous ones.
+
+h1.  Instantiating BookKeeper
+
+To create a BookKeeper client, you need to create a configuration object and set the address of the ZooKeeper ensemble in use. For example, if you were using @zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181@ as your ensemble, you would create the BookKeeper client as follows.
+
+<pre><code>
+ClientConfiguration conf = new ClientConfiguration();
+conf.setZkServers("zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181"); 
+
+BookKeeper client = new BookKeeper(conf);
+</code></pre>
+
+It is important to close the client once you are finished working with it. The set calls on ClientConfiguration are chainable, so instead of putting each set* call on a new line as above, it is possible to make a number of calls on one line. For example:
+
+<pre><code>
+ClientConfiguration conf = new ClientConfiguration().setZkServers("localhost:2181").setZkTimeout(5000);
+</code></pre>
+
+There is also a useful shortcut constructor which allows you to pass the zookeeper ensemble string directly to BookKeeper.
+<pre><code>
+BookKeeper client = new BookKeeper("localhost:2181");
+</code></pre>
+
+See "BookKeeper":./apidocs/org/apache/bookkeeper/client/BookKeeper.html for the full api.
+
+
+h1.  Creating a ledger
+
+p. Before writing entries to BookKeeper, it is necessary to create a ledger. Before creating the ledger you must decide the ensemble size and the quorum size. 
+
+p. The ensemble size is the number of Bookies over which entries will be striped. The quorum size is the number of bookies which an entry will be written to. Striping is done in a round robin fashion. For example, if you have an ensemble size of 3 (consisting of bk1, bk2 & bk3), and a quorum of 2, entry 1 will be written to bk1 & bk2, entry 2 will be written to bk2 & bk3, entry 3 will be written to bk3 & bk1 and so on.
+
+p. Ledgers are also created with a digest type and password. The digest type is used to generate a checksum so that when reading entries we can ensure that the content is the same as what was written. The password is used as an access control mechanism.
+
+p. To create a ledger, with ensemble size 3, quorum size 2, using a CRC to checksum and "foobar" as the password, do the following:
+
+<pre><code>
+LedgerHandle lh = client.createLedger(3, 2, DigestType.CRC32, "foobar".getBytes()); // the password is passed as a byte array
+</code></pre>
+
+You can now write to this ledger handle. As you probably plan to read the ledger at some stage, now is a good time to store the id of the ledger somewhere. The ledger id is a long, and can be obtained with @lh.getId()@.
+
+h1.  Adding entries to a ledger
+
+p. Once you have obtained a ledger handle, you can start adding entries to it. Entries are simply arrays of bytes. As such, adding entries to the ledger is rather simple.
+
+<pre><code>
+lh.addEntry("Hello World!".getBytes());
+</code></pre>
+
+h1.  Closing a ledger
+
+p. Once a client is done writing, it can close the ledger. Closing the ledger is a very important step in BookKeeper, as once a ledger is closed, all reading clients are guaranteed to read the same sequence of entries in the same order. Closing takes no parameters. 
+
+<pre><code>
+lh.close();
+</code></pre>
+
+h1.  Opening a ledger
+
+To read from a ledger, a client must open it first. To open a ledger you must know its ID, which digest type was used when creating it, and its password. To open the ledger we created above, assuming it has ID 1:
+
+<pre><code>
+LedgerHandle lh2 = client.openLedger(1, DigestType.CRC32, "foobar".getBytes()); // the password is passed as a byte array
+</code></pre>
+
+You can now read entries from the ledger. Any attempt to write to this handle will throw an exception.
+
+bq. NOTE: Opening a ledger that another client already has open for writing will prevent that client from writing any new entries to it. If you do not wish this to happen, you should use the openLedgerNoRecovery method. However, keep in mind that without recovery, you lose the guarantees of what entries are in the ledger. You should only use openLedgerNoRecovery if you know what you are doing.
+
+h1. Reading entries from a ledger
+
+p. Now that you have an open ledger, you can read entries from it. You can use @getLastAddConfirmed@ to get the id of the last entry in the ledger.
+
+<pre><code>
+long lastEntry = lh2.getLastAddConfirmed();
+Enumeration<LedgerEntry> entries = lh2.readEntries(0, lastEntry); // read everything up to the last confirmed entry
+while (entries.hasMoreElements()) {
+	byte[] bytes = entries.nextElement().getEntry();
+	System.out.println(new String(bytes));
+}
+</code></pre>
\ No newline at end of file

Added: bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperProtocol.textile
URL: http://svn.apache.org/viewvc/bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperProtocol.textile?rev=1743979&view=auto
==============================================================================
--- bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperProtocol.textile (added)
+++ bookkeeper/site/trunk/content/docs/r4.4.0/bookkeeperProtocol.textile Sun May 15 21:38:37 2016
@@ -0,0 +1,115 @@
+Title:     BookKeeper Replication Protocol
+Notice:    Licensed to the Apache Software Foundation (ASF) under one
+           or more contributor license agreements.  See the NOTICE file
+           distributed with this work for additional information
+           regarding copyright ownership.  The ASF licenses this file
+           to you under the Apache License, Version 2.0 (the
+           "License"); you may not use this file except in compliance
+           with the License.  You may obtain a copy of the License at
+           .
+             http://www.apache.org/licenses/LICENSE-2.0
+           .
+           Unless required by applicable law or agreed to in writing,
+           software distributed under the License is distributed on an
+           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+           KIND, either express or implied.  See the License for the
+           specific language governing permissions and limitations
+           under the License.
+
+This document describes the bookkeeper replication protocol, and the guarantees it gives. It assumes you have a general idea about leader election and log replication and how you can use these in your system. If not, have a look at the bookkeeper "tutorial":./bookkeeperTutorial.html first.
+
+h1. Ledgers
+
+A ledger is the basic building block in Bookkeeper. All guarantees we provide are on ledgers. A replicated log is composed of an ordered list of ledgers. See "From Ledgers to Logs":./bookkeeperLedgers2Logs.html on how to build a replicated log from ledgers.
+
+Ledgers are composed of metadata and entries. The metadata is stored in a datastore which provides a compare-and-swap operation (generally ZooKeeper). Entries are stored on storage nodes known as bookies.
+
+A ledger has a single writer and multiple readers (SWMR).
+
+A ledger's metadata contains:
+
+* id: a 64bit integer, unique within the system
+* ensemble size (E): the number of nodes the ledger is stored on.
+* write quorum size (Q[~w~]): the number of nodes each entry is written to. In effect, the max replication for an entry.
+* ack quorum size (Q[~a~]): the number of nodes an entry must be acknowledged on. In effect, the min replication for an entry.
+* state: OPEN, CLOSED or IN_RECOVERY
+* last entry: the last entry in the ledger, or NULL if state != CLOSED
+* one or more fragments, which each consist of:
+** First entry of fragment, list of bookies for fragment
+
+When creating the ledger, the following invariant must hold.
+
+   E >= Q[~w~] >= Q[~a~]
+
+h4. Ensembles
+
+When the ledger is created, E bookies are chosen for the entries of that ledger. The bookies are the initial ensemble of the ledger. A ledger can have multiple ensembles, but an entry has only one ensemble. A change in the ensemble involves a new fragment being added to the ledger.
+
+Take the following example. In this ledger, with ensemble size of 3, there are two fragments and thus two ensembles, one starting at entry 0, the second at entry 12. The second ensemble differs from the first only by its first element. This could be because bookie1 has failed and therefore had to be replaced.
+
+table(table table-bordered table-hover).
+|_. FirstEntry |_. Bookies  |
+| 0            | B1, B2, B3 |
+| 12           | B4, B2, B3 |
+
+h4. Write Quorums
+
+Each entry in the log is written to Q[~w~] nodes. This is considered the write quorum for that entry. The write quorum is the subsequence of the ensemble, Q[~w~] in length, and starting at the bookie at index (entryid % E).
+
+For example, in a ledger of E = 4, Q[~w~] = 3 & Q[~a~] = 2, with an ensemble consisting of B1, B2, B3 & B4, the write quorums for the first 6 entries will be.
+
+table(table table-bordered table-hover).
+|_. Entry |_. Write quorum |
+| 0       | B1, B2, B3     |
+| 1       | B2, B3, B4     |
+| 2       | B3, B4, B1     |
+| 3       | B4, B1, B2     |
+| 4       | B1, B2, B3     |
+| 5       | B2, B3, B4     |
+
+There are only E distinct write quorums in any ensemble. If Q[~w~] = E, then there is only one, as every entry is written to every bookie in the ensemble and no striping occurs.  
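+
+The rule above, a subsequence of length Q[~w~] starting at index (entryid % E) and wrapping around the ensemble, can be written down directly. The following is a minimal sketch; the class and method names are hypothetical and bookies are represented by plain strings.
+
```java
import java.util.ArrayList;
import java.util.List;

/** Computes the write quorum of an entry from the ensemble; names are hypothetical. */
final class WriteQuorumSketch {
    static List<String> writeQuorum(List<String> ensemble, int writeQuorumSize, long entryId) {
        int e = ensemble.size();
        if (writeQuorumSize > e) {
            throw new IllegalArgumentException("Qw must not exceed E");
        }
        List<String> quorum = new ArrayList<>();
        int start = (int) (entryId % e);               // index of the first bookie
        for (int i = 0; i < writeQuorumSize; i++) {
            quorum.add(ensemble.get((start + i) % e)); // wrap around the ensemble
        }
        return quorum;
    }
}
```
+
+Calling @writeQuorum@ with the ensemble B1, B2, B3, B4 and a write quorum size of 3 reproduces the rows of the table above, for example B3, B4, B1 for entry 2.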
+
+h4. Ack Quorums
+
+The ack quorum for an entry is any subset of the write quorum of size Q[~a~]. If Q[~a~] bookies acknowledge an entry, it means it has been fully replicated.
+
+h4. Guarantees
+
+The system can tolerate Q[~a~] - 1 failures without data loss.
+
+Bookkeeper guarantees that:
+ 1. all updates to a ledger will be read in the same order as they  were written 
+ 2. all clients will read the same sequence of updates from the ledger
+
+h1. Writing to the ledger
+
+When an entry is written to a ledger, it is assigned an entry id and the write quorum is calculated. As there is only a single writer, ensuring that entry ids are sequential is trivial. A bookie acknowledges a write once it has been persisted to disk and is therefore durable. Once Q[~a~] bookies from the write quorum acknowledge the write, the write is acknowledged to the client, but only if all entries with lower entry ids in the ledger have already been acknowledged to the client.
+
+The entry written contains the ledger id, the entry id, the last add confirmed and the payload. The last add confirmed is the last entry which had been acknowledged to the client when this entry was written. Sending this with the entry speeds up recovery of the ledger in the case that the writer crashes.
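+
+The way the last add confirmed advances when several writes are in flight can be sketched as below. This is an illustration only: it assumes entry ids start at 0, uses -1 to mean "nothing confirmed yet", and counts acknowledgements without tracking which bookie sent them. The class and method names are hypothetical.
+
```java
import java.util.HashMap;
import java.util.Map;

/** Tracks the last add confirmed for a single writer; names are hypothetical. */
final class LastAddConfirmedSketch {
    private final int ackQuorumSize;                              // Qa
    private final Map<Long, Integer> ackCounts = new HashMap<>(); // entry id -> acks so far
    private long lastAddConfirmed = -1;                           // -1: nothing confirmed yet

    LastAddConfirmedSketch(int ackQuorumSize) {
        this.ackQuorumSize = ackQuorumSize;
    }

    /** Records one bookie acknowledgement and returns the current last add confirmed. */
    long onBookieAck(long entryId) {
        ackCounts.merge(entryId, 1, Integer::sum);
        // Advance only over a gap-free run of entries that each reached Qa acks,
        // so an entry is never confirmed before all entries with lower ids.
        while (ackCounts.getOrDefault(lastAddConfirmed + 1, 0) >= ackQuorumSize) {
            lastAddConfirmed++;
        }
        return lastAddConfirmed;
    }
}
```
+
+With Q[~a~] = 2, two acknowledgements for entry 1 leave the value at -1 until entry 0 has also collected two acknowledgements, at which point it jumps straight to 1.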
+
+Another client can also read entries in the ledger up as far as the last add confirmed, as we guarantee that all entries thus far have been replicated on Q[~a~] nodes, and therefore all future readers will be able to also read it. However, to read like this, the ledger should be opened with a non-fencing open. Otherwise, it would kill the writer.
+
+If a node fails to acknowledge a write, the writer will create a new ensemble by replacing the failed node in the current ensemble. It creates a new fragment with this ensemble, starting from the first message that has not been acknowledged to the client. Creating the new fragment involves making a CAS write to the metadata. If the CAS write fails, someone else has modified something in the ledger metadata. This concurrent modification could have been caused by recovery or rereplication[1]. We reread the metadata. If the state of the ledger is no longer OPEN, we send an error to the client for any outstanding writes. Otherwise, we try to replace the failed node again.
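+
+The reread-and-retry loop around the CAS write can be sketched as follows. An @AtomicReference@ stands in for the metadata store purely for illustration; the @Metadata@ class, the @replaceBookie@ method and the early return when the failed bookie is already gone are assumptions of this sketch, not the client's actual code.
+
```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;

/** Sketch of an ensemble change guarded by compare-and-swap; names are hypothetical. */
final class EnsembleChangeSketch {
    /** Snapshot of the metadata fields used here; treated as immutable. */
    static final class Metadata {
        final String state;                          // "OPEN", "CLOSED" or "IN_RECOVERY"
        final TreeMap<Long, List<String>> fragments; // first entry id -> ensemble

        Metadata(String state, TreeMap<Long, List<String>> fragments) {
            this.state = state;
            this.fragments = fragments;
        }
    }

    /** Returns true once the bookie is replaced, false if the ledger is no longer OPEN. */
    static boolean replaceBookie(AtomicReference<Metadata> store, String failedBookie,
                                 String replacement, long firstUnackedEntry) {
        while (true) {
            Metadata current = store.get();          // (re)read the metadata
            if (!current.state.equals("OPEN")) {
                return false;                        // caller fails its outstanding writes
            }
            List<String> ensemble = new ArrayList<>(current.fragments.lastEntry().getValue());
            int idx = ensemble.indexOf(failedBookie);
            if (idx < 0) {
                return true;                         // assumption: someone already replaced it
            }
            ensemble.set(idx, replacement);
            TreeMap<Long, List<String>> fragments = new TreeMap<>(current.fragments);
            fragments.put(firstUnackedEntry, ensemble); // new fragment starts here
            if (store.compareAndSet(current, new Metadata("OPEN", fragments))) {
                return true;                         // CAS write succeeded
            }
            // CAS failed: the metadata changed concurrently, so loop and reread.
        }
    }
}
```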
+
+h1. Closing a ledger as a writer
+
+Closing a ledger is straightforward for a writer. The writer makes a CAS write to the metadata, changing the state to CLOSED, and setting the last entry of the ledger to the last entry which we have acknowledged to the client.
+
+If the CAS write fails, it means someone else has modified the metadata. We reread the metadata, and retry closing as long as the state of the ledger is still OPEN. If the state is IN_RECOVERY we send an error to the client. If the state is CLOSED and the last entry is the same as the last entry we have acknowledged to the client, we complete the close operation successfully. If the last entry is different to what we have acknowledged to the client, we send an error to the client.
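+
+The cases in the paragraph above amount to a small decision table, sketched here. The enum values and the method name are hypothetical; only the mapping from reread metadata to outcome comes from the text.
+
```java
/** Decision a writer takes after its closing CAS write failed; names are hypothetical. */
final class WriterCloseSketch {
    enum Outcome { RETRY_CLOSE, SUCCESS, ERROR }

    /**
     * @param state               ledger state found when rereading the metadata
     * @param lastEntryInMetadata last entry recorded in the reread metadata
     * @param lastEntryAcked      last entry this writer acknowledged to its client
     */
    static Outcome afterFailedCas(String state, long lastEntryInMetadata, long lastEntryAcked) {
        switch (state) {
            case "OPEN":
                return Outcome.RETRY_CLOSE;  // still open: try the close again
            case "IN_RECOVERY":
                return Outcome.ERROR;        // someone else is recovering the ledger
            case "CLOSED":
                // Closed by someone else: fine only if they agree on the last entry.
                return lastEntryInMetadata == lastEntryAcked ? Outcome.SUCCESS : Outcome.ERROR;
            default:
                throw new IllegalArgumentException("unknown state: " + state);
        }
    }
}
```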
+
+h1. Closing a ledger as a reader
+
+A reader can also force a ledger to close. Forcing the ledger to close will prevent any writer from adding new entries to the ledger. This is called *Fencing*. This can occur when a writer has crashed or has become unavailable, and a new writer wants to take over writing to the log. The new writer must ensure that it has seen all updates from the previous writer, and prevent the previous writer from making any new updates before making any updates of its own.
+
+To recover a ledger, we first update the state in the metadata to IN_RECOVERY. We then send a fence message to all the bookies in the last fragment of the ledger. When a bookie receives a fence message for a ledger, the fenced state of the ledger is persisted to disk. Once we receive a response from at least (Q[~w~]-Q[~a~])+1 bookies from each write quorum in the ensemble, the ledger is fenced.
+
+By ensuring we have received a response from at least (Q[~w~]-Q[~a~])+1 bookies in each write quorum, we ensure that, if the old writer is alive and tries to add a new entry, there will be no write quorum in which Q[~a~] bookies will accept the write. If the old writer tries to update the ensemble, it will fail on the CAS metadata write, and then see that the ledger is in IN_RECOVERY state, and that it therefore shouldn't try to write to it.
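+
+The counting argument can be checked mechanically: if (Q[~w~]-Q[~a~])+1 bookies in a write quorum are fenced, at most Q[~a~]-1 unfenced bookies remain in it, which is one short of an ack quorum. The sketch below tests the condition for every write quorum of an ensemble; bookies are identified by their index in the ensemble, and the class and method names are hypothetical.
+
```java
import java.util.Set;

/** Checks whether enough fence responses have arrived; names are hypothetical. */
final class FencingSketch {
    /**
     * @param e         ensemble size E
     * @param qw        write quorum size Qw
     * @param qa        ack quorum size Qa
     * @param responded indices (0..E-1) of bookies that acknowledged the fence message
     */
    static boolean isFenced(int e, int qw, int qa, Set<Integer> responded) {
        int needed = (qw - qa) + 1;
        for (int start = 0; start < e; start++) {   // one write quorum per start index
            int count = 0;
            for (int i = 0; i < qw; i++) {
                if (responded.contains((start + i) % e)) {
                    count++;
                }
            }
            if (count < needed) {
                return false;                       // this quorum could still reach Qa acks
            }
        }
        return true;
    }
}
```
+
+For E = 4, Q[~w~] = 3 and Q[~a~] = 2, responses from the bookies at indices 0, 1 and 2 are enough, whereas responses from indices 1 and 2 alone are not, because the write quorum starting at index 2 would contain only one fenced bookie.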
+
+The old writer will be able to write entries to individual bookies (we can't guarantee that the fence message reaches all bookies), but as it will not be able to reach ack quorum, it will not be able to send a success response to its client. The client will get a LedgerFenced error instead.
+
+It is important to note that when you get a ledger fenced message for an entry, it doesn't mean that the entry has _not_ been written. It means that the entry may or may not have been written, and this can only be determined after the ledger is recovered. In effect, LedgerFenced should be treated like a timeout.
+
+Once the ledger is fenced, recovery can begin. Recovery means finding the last entry of the ledger and closing the ledger. To find the last entry of the ledger, the client asks all bookies for the highest last add confirmed value they have seen. It waits until it has received a response from at least (Q[~w~]-Q[~a~])+1 bookies from each write quorum, and takes the highest response as the entry id to start reading forward from. It then starts reading forward in the ledger, one entry at a time, replicating all entries it sees to the entire write quorum for that entry. Once it can no longer read any more entries, it updates the state in the metadata to CLOSED, and sets the last entry of the ledger to the last entry it wrote. Multiple readers can try to recover a ledger at the same time, but as the metadata write is CAS, they will all converge on the same last entry of the ledger.
+
+fn1. Rereplication is a subsystem that runs in the background on bookies to ensure that ledgers are fully replicated even if one bookie from their ensemble is down
+



