hbase-commits mailing list archives

From bus...@apache.org
Subject [14/15] hbase git commit: HBASE-14066 clean out old docbook docs from branch-1.
Date Tue, 14 Jul 2015 02:49:39 GMT
http://git-wip-us.apache.org/repos/asf/hbase/blob/fdd2692f/src/main/docbkx/book.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/book.xml b/src/main/docbkx/book.xml
deleted file mode 100644
index fa96e17..0000000
--- a/src/main/docbkx/book.xml
+++ /dev/null
@@ -1,6069 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<!--
-/**
- *
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements.  See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership.  The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
--->
-<book
-  version="5.0"
-  xmlns="http://docbook.org/ns/docbook"
-  xmlns:xlink="http://www.w3.org/1999/xlink"
-  xmlns:xi="http://www.w3.org/2001/XInclude"
-  xmlns:svg="http://www.w3.org/2000/svg"
-  xmlns:m="http://www.w3.org/1998/Math/MathML"
-  xmlns:html="http://www.w3.org/1999/xhtml"
-  xmlns:db="http://docbook.org/ns/docbook"
-  xml:id="book">
-  <info>
-
-    <title><link
-        xlink:href="http://www.hbase.org"> The Apache HBase&#8482; Reference Guide </link></title>
-    <subtitle><link
-        xlink:href="http://www.hbase.org">
-        <inlinemediaobject>
-          <imageobject>
-            <imagedata
-              align="center"
-              valign="left"
-              fileref="hbase_logo.png" />
-          </imageobject>
-        </inlinemediaobject>
-        <inlinemediaobject>
-          <imageobject>
-            <imagedata
-              align="center"
-              valign="right"
-              fileref="jumping-orca_rotated_25percent.png" />
-          </imageobject>
-        </inlinemediaobject>
-      </link>
-    </subtitle>
-    <copyright>
-      <year>2014</year>
-      <holder>Apache Software Foundation. All Rights Reserved. Apache Hadoop, Hadoop, MapReduce,
-        HDFS, Zookeeper, HBase, and the HBase project logo are trademarks of the Apache Software
-        Foundation. </holder>
-    </copyright>
-    <abstract>
-      <para>This is the official reference guide of <link
-          xlink:href="http://www.hbase.org">Apache HBase&#8482;</link>, a distributed, versioned, big
-        data store built on top of <link
-          xlink:href="http://hadoop.apache.org/">Apache Hadoop&#8482;</link> and <link
-          xlink:href="http://zookeeper.apache.org/">Apache ZooKeeper&#8482;</link>. </para>
-    </abstract>
-
-    <revhistory>
-      <revision>
-        <revnumber>
-          <?eval ${project.version}?>
-        </revnumber>
-        <date>
-          <?eval ${buildDate}?>
-        </date>
-      </revision>
-    </revhistory>
-  </info>
-
-  <!--XInclude some chapters-->
-  <xi:include
-    xmlns:xi="http://www.w3.org/2001/XInclude"
-    href="preface.xml" />
-  <xi:include
-    xmlns:xi="http://www.w3.org/2001/XInclude"
-    href="getting_started.xml" />
-  <xi:include
-    xmlns:xi="http://www.w3.org/2001/XInclude"
-    href="configuration.xml" />
-  <xi:include
-    xmlns:xi="http://www.w3.org/2001/XInclude"
-    href="upgrading.xml" />
-  <xi:include
-    xmlns:xi="http://www.w3.org/2001/XInclude"
-    href="shell.xml" />
-
-  <chapter
-    xml:id="datamodel">
-    <title>Data Model</title>
-    <para>In HBase, data is stored in tables, which have rows and columns. This terminology
-      overlaps with that of relational databases (RDBMSs), but the analogy is not a helpful one.
-      Instead, think of an HBase table as a multi-dimensional map.</para>
-    <variablelist>
-      <title>HBase Data Model Terminology</title>
-      <varlistentry>
-        <term>Table</term>
-        <listitem>
-          <para>An HBase table consists of multiple rows.</para>
-        </listitem>
-      </varlistentry>
-      <varlistentry>
-        <term>Row</term>
-        <listitem>
-          <para>A row in HBase consists of a row key and one or more columns with values associated
-            with them. Rows are stored sorted lexicographically by row key. For this
-            reason, the design of the row key is very important. The goal is to store data in such a
-            way that related rows are near each other. A common row key pattern is a website domain.
-            If your row keys are domains, you should probably store them in reverse (org.apache.www,
-            org.apache.mail, org.apache.jira). This way, all of the Apache domains are near each
-            other in the table, rather than being spread out based on the first letter of the
-            subdomain.</para>
-        </listitem>
-      </varlistentry>
-      <varlistentry>
-        <term>Column</term>
-        <listitem>
-          <para>A column in HBase consists of a column family and a column qualifier, which are
-            delimited by a <literal>:</literal> (colon) character.</para>
-        </listitem>
-      </varlistentry>
-      <varlistentry>
-        <term>Column Family</term>
-        <listitem>
-          <para>Column families physically colocate a set of columns and their values, often for
-            performance reasons. Each column family has a set of storage properties, such as whether
-            its values should be cached in memory, how its data is compressed or its row keys are
-            encoded, and others. Each row in a table has the same column
-            families, though a given row might not store anything in a given column family.</para>
-          <para>Column families are specified when you create your table, and influence the way your
-            data is stored in the underlying filesystem. Therefore, the column families should be
-            considered carefully during schema design.</para>
-        </listitem>
-      </varlistentry>
-      <varlistentry>
-        <term>Column Qualifier</term>
-        <listitem>
-          <para>A column qualifier is added to a column family to provide the index for a given
-            piece of data. Given a column family <literal>content</literal>, a column qualifier
-            might be <literal>content:html</literal>, and another might be
-            <literal>content:pdf</literal>. Though column families are fixed at table creation,
-            column qualifiers are mutable and may differ greatly between rows.</para>
-        </listitem>
-      </varlistentry>
-      <varlistentry>
-        <term>Cell</term>
-        <listitem>
-          <para>A cell is a combination of row, column family, and column qualifier, and contains a
-            value and a timestamp, which represents the value's version.</para>
-          <para>A cell's value is an uninterpreted array of bytes.</para>
-        </listitem>
-      </varlistentry>
-      <varlistentry>
-        <term>Timestamp</term>
-        <listitem>
-          <para>A timestamp is written alongside each value, and is the identifier for a given
-            version of a value. By default, the timestamp represents the time on the RegionServer
-            when the data was written, but you can specify a different timestamp value when you put
-            data into the cell.</para>
-          <caution>
-            <para>Direct manipulation of timestamps is an advanced feature which is only exposed for
-              special cases that are deeply integrated with HBase, and is discouraged in general.
-              Encoding a timestamp at the application level is the preferred pattern.</para>
-          </caution>
-          <para>You can specify the maximum number of versions of a value that HBase retains, per column
-            family. When the maximum number of versions is reached, the oldest versions are 
-            eventually deleted. By default, only the newest version is kept.</para>
-        </listitem>
-      </varlistentry>
-    </variablelist>
-
-    <section
-      xml:id="conceptual.view">
-      <title>Conceptual View</title>
-      <para>You can read a very understandable explanation of the HBase data model in the blog post <link
-          xlink:href="http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable">Understanding
-          HBase and BigTable</link> by Jim R. Wilson. Another good explanation is available in the
-        PDF <link
-          xlink:href="http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf">Introduction
-          to Basic Schema Design</link> by Amandeep Khurana. It may help to read different
-        perspectives to get a solid understanding of HBase schema design. The linked articles cover
-        the same ground as the information in this section.</para>
-      <para> The following example is a slightly modified form of the one on page 2 of the <link
-          xlink:href="http://research.google.com/archive/bigtable.html">BigTable</link> paper. There
-        is a table called <varname>webtable</varname> that contains two rows
-        (<literal>com.cnn.www</literal>
-          and <literal>com.example.www</literal>) and three column families named
-          <varname>contents</varname>, <varname>anchor</varname>, and <varname>people</varname>. In
-          this example, for the first row (<literal>com.cnn.www</literal>), 
-          <varname>anchor</varname> contains two columns (<varname>anchor:cnnsi.com</varname>,
-          <varname>anchor:my.look.ca</varname>) and <varname>contents</varname> contains one column
-          (<varname>contents:html</varname>). This example contains 5 versions of the row with the
-        row key <literal>com.cnn.www</literal>, and one version of the row with the row key
-        <literal>com.example.www</literal>. The <varname>contents:html</varname> column qualifier contains the entire
-        HTML of a given website. Qualifiers of the <varname>anchor</varname> column family each
-        contain the external site which links to the site represented by the row, along with the
-        text it used in the anchor of its link. The <varname>people</varname> column family represents
-        people associated with the site.
-      </para>
-        <note>
-          <title>Column Names</title>
-        <para> By convention, a column name is made of its column family prefix and a
-            <emphasis>qualifier</emphasis>. For example, the column
-            <emphasis>contents:html</emphasis> is made up of the column family
-            <varname>contents</varname> and the <varname>html</varname> qualifier. The colon
-          character (<literal>:</literal>) delimits the column family from the column family
-            <emphasis>qualifier</emphasis>. </para>
-        </note>
-        <table
-          frame="all">
-          <title>Table <varname>webtable</varname></title>
-          <tgroup
-            cols="5"
-            align="left"
-            colsep="1"
-            rowsep="1">
-            <colspec
-              colname="c1" />
-            <colspec
-              colname="c2" />
-            <colspec
-              colname="c3" />
-            <colspec
-              colname="c4" />
-            <colspec
-              colname="c5" />
-            <thead>
-              <row>
-                <entry>Row Key</entry>
-                <entry>Time Stamp</entry>
-                <entry>ColumnFamily <varname>contents</varname></entry>
-                <entry>ColumnFamily <varname>anchor</varname></entry>
-                <entry>ColumnFamily <varname>people</varname></entry>
-              </row>
-            </thead>
-            <tbody>
-              <row>
-                <entry>"com.cnn.www"</entry>
-                <entry>t9</entry>
-                <entry />
-                <entry><varname>anchor:cnnsi.com</varname> = "CNN"</entry>
-                <entry />
-              </row>
-              <row>
-                <entry>"com.cnn.www"</entry>
-                <entry>t8</entry>
-                <entry />
-                <entry><varname>anchor:my.look.ca</varname> = "CNN.com"</entry>
-                <entry />
-              </row>
-              <row>
-                <entry>"com.cnn.www"</entry>
-                <entry>t6</entry>
-                <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
-                <entry />
-                <entry />
-              </row>
-              <row>
-                <entry>"com.cnn.www"</entry>
-                <entry>t5</entry>
-                <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
-                <entry />
-                <entry />
-              </row>
-              <row>
-                <entry>"com.cnn.www"</entry>
-                <entry>t3</entry>
-                <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
-                <entry />
-                <entry />
-              </row>
-              <row>
-                <entry>"com.example.www"</entry>
-                <entry>t5</entry>
-                <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
-                <entry></entry>
-                <entry>people:author = "John Doe"</entry>
-              </row>
-            </tbody>
-          </tgroup>
-        </table>
-      <para>Cells in this table that appear to be empty do not take space, or in fact exist, in
-        HBase. This is what makes HBase "sparse." A tabular view is not the only possible way to
-        look at data in HBase, or even the most accurate. The following represents the same
-        information as a multi-dimensional map. This is only a mock-up for illustrative
-        purposes and may not be strictly accurate.</para>
-      <programlisting><![CDATA[
-{
-	"com.cnn.www": {
-		contents: {
-			t6: contents:html: "<html>..."
-			t5: contents:html: "<html>..."
-			t3: contents:html: "<html>..."
-		}
-		anchor: {
-			t9: anchor:cnnsi.com = "CNN"
-			t8: anchor:my.look.ca = "CNN.com"
-		}
-		people: {}
-	}
-	"com.example.www": {
-		contents: {
-			t5: contents:html: "<html>..."
-		}
-		anchor: {}
-		people: {
-			t5: people:author: "John Doe"
-		}
-	}
-}        
-        ]]></programlisting>
-
-    </section>
-    <section
-      xml:id="physical.view">
-      <title>Physical View</title>
-      <para> Although at a conceptual level tables may be viewed as a sparse set of rows, they are
-        physically stored by column family. A new column qualifier (column_family:column_qualifier)
-        can be added to an existing column family at any time.</para>
-      <table
-        frame="all">
-        <title>ColumnFamily <varname>anchor</varname></title>
-        <tgroup
-          cols="3"
-          align="left"
-          colsep="1"
-          rowsep="1">
-          <colspec
-            colname="c1" />
-          <colspec
-            colname="c2" />
-          <colspec
-            colname="c3" />
-          <thead>
-            <row>
-              <entry>Row Key</entry>
-              <entry>Time Stamp</entry>
-              <entry>Column Family <varname>anchor</varname></entry>
-            </row>
-          </thead>
-          <tbody>
-            <row>
-              <entry>"com.cnn.www"</entry>
-              <entry>t9</entry>
-              <entry><varname>anchor:cnnsi.com</varname> = "CNN"</entry>
-            </row>
-            <row>
-              <entry>"com.cnn.www"</entry>
-              <entry>t8</entry>
-              <entry><varname>anchor:my.look.ca</varname> = "CNN.com"</entry>
-            </row>
-          </tbody>
-        </tgroup>
-      </table>
-      <table
-        frame="all">
-        <title>ColumnFamily <varname>contents</varname></title>
-        <tgroup
-          cols="3"
-          align="left"
-          colsep="1"
-          rowsep="1">
-          <colspec
-            colname="c1" />
-          <colspec
-            colname="c2" />
-          <colspec
-            colname="c3" />
-          <thead>
-            <row>
-              <entry>Row Key</entry>
-              <entry>Time Stamp</entry>
-              <entry>ColumnFamily <varname>contents</varname></entry>
-            </row>
-          </thead>
-          <tbody>
-            <row>
-              <entry>"com.cnn.www"</entry>
-              <entry>t6</entry>
-              <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
-            </row>
-            <row>
-              <entry>"com.cnn.www"</entry>
-              <entry>t5</entry>
-              <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
-            </row>
-            <row>
-              <entry>"com.cnn.www"</entry>
-              <entry>t3</entry>
-              <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
-            </row>
-          </tbody>
-        </tgroup>
-      </table>
-      <para>The empty cells shown in the
-        conceptual view are not stored at all.
-        Thus a request for the value of the <varname>contents:html</varname> column at time stamp
-          <literal>t8</literal> would return no value. Similarly, a request for an
-          <varname>anchor:my.look.ca</varname> value at time stamp <literal>t9</literal> would
-        return no value. However, if no timestamp is supplied, the most recent value for a
-        particular column would be returned. Given multiple versions, the most recent is also the
-        first one found,  since timestamps
-        are stored in descending order. Thus a request for the values of all columns in the row
-          <varname>com.cnn.www</varname> with no timestamp specified would return: the value of
-          <varname>contents:html</varname> from timestamp <literal>t6</literal>, the value of
-          <varname>anchor:cnnsi.com</varname> from timestamp <literal>t9</literal>, the value of
-          <varname>anchor:my.look.ca</varname> from timestamp <literal>t8</literal>. </para>
-      <para>For more information about the internals of how Apache HBase stores data, see <xref
-          linkend="regions.arch" />. </para>
-    </section>
-
-    <section
-      xml:id="namespace">
-      <title>Namespace</title>
-      <para> A namespace is a logical grouping of tables analogous to a database in relational
-        database systems. This abstraction lays the groundwork for upcoming multi-tenancy related
-        features: <itemizedlist>
-          <listitem>
-            <para>Quota Management (HBASE-8410) - Restrict the amount of resources (i.e., regions,
-              tables) a namespace can consume.</para>
-          </listitem>
-          <listitem>
-            <para>Namespace Security Administration (HBASE-9206) - provide another level of security
-              administration for tenants.</para>
-          </listitem>
-          <listitem>
-            <para>Region server groups (HBASE-6721) - A namespace/table can be pinned onto a subset
-              of RegionServers, thus guaranteeing a coarse level of isolation.</para>
-          </listitem>
-        </itemizedlist>
-      </para>
-      <section
-        xml:id="namespace_creation">
-        <title>Namespace management</title>
-        <para> A namespace can be created, removed or altered. Namespace membership is determined
-          during table creation by specifying a fully-qualified table name of the form:</para>
-
-        <programlisting language="xml"><![CDATA[<table namespace>:<table qualifier>]]></programlisting>
-
-
-        <example>
-          <title>Examples</title>
-
-          <programlisting language="bourne">
-#Create a namespace
-create_namespace 'my_ns'
-            </programlisting>
-          <programlisting language="bourne">
-#create my_table in my_ns namespace
-create 'my_ns:my_table', 'fam'
-          </programlisting>
-          <programlisting language="bourne">
-#drop namespace
-drop_namespace 'my_ns'
-          </programlisting>
-          <programlisting language="bourne">
-#alter namespace
-alter_namespace 'my_ns', {METHOD => 'set', 'PROPERTY_NAME' => 'PROPERTY_VALUE'}
-        </programlisting>
-        </example>
-      </section>
-      <section
-        xml:id="namespace_special">
-        <title>Predefined namespaces</title>
-        <para> There are two predefined special namespaces: </para>
-        <itemizedlist>
-          <listitem>
-            <para>hbase - system namespace, used to contain HBase internal tables</para>
-          </listitem>
-          <listitem>
-            <para>default - tables with no explicitly specified namespace will automatically fall into
-              this namespace.</para>
-          </listitem>
-        </itemizedlist>
-        <example>
-          <title>Examples</title>
-
-          <programlisting language="bourne">
-#namespace=foo and table qualifier=bar
-create 'foo:bar', 'fam'
-
-#namespace=default and table qualifier=bar
-create 'bar', 'fam'
-</programlisting>
-        </example>
-      </section>
-    </section>
-
-    <section
-      xml:id="table">
-      <title>Table</title>
-      <para> Tables are declared up front at schema definition time. </para>
-    </section>
-
-    <section
-      xml:id="row">
-      <title>Row</title>
-      <para>Row keys are uninterpreted bytes. Rows are lexicographically sorted with the lowest
-        order appearing first in a table. The empty byte array is used to denote both the start and
-        end of a table's namespace.</para>
-    </section>
-
-    <section
-      xml:id="columnfamily">
-      <title>Column Family<indexterm><primary>Column Family</primary></indexterm></title>
-      <para> Columns in Apache HBase are grouped into <emphasis>column families</emphasis>. All
-        column members of a column family have the same prefix. For example, the columns
-          <emphasis>courses:history</emphasis> and <emphasis>courses:math</emphasis> are both
-        members of the <emphasis>courses</emphasis> column family. The colon character
-          (<literal>:</literal>) delimits the column family from the <indexterm><primary>column
-            family qualifier</primary><secondary>Column Family Qualifier</secondary></indexterm>.
-        The column family prefix must be composed of <emphasis>printable</emphasis> characters. The
-        qualifying tail, the column family <emphasis>qualifier</emphasis>, can be made of any
-        arbitrary bytes. Column families must be declared up front at schema definition time whereas
-        columns do not need to be defined at schema time but can be conjured on the fly while the
-        table is up and running.</para>
-      <para>Physically, all column family members are stored together on the filesystem. Because
-        tunings and storage specifications are done at the column family level, it is advised that
-        all column family members have the same general access pattern and size
-        characteristics.</para>
-
-    </section>
-    <section
-      xml:id="cells">
-      <title>Cells<indexterm><primary>Cells</primary></indexterm></title>
-      <para>A <emphasis>{row, column, version} </emphasis>tuple exactly specifies a
-          <literal>cell</literal> in HBase. Cell content is uninterpreted bytes.</para>
-    </section>
-    <section
-      xml:id="data_model_operations">
-      <title>Data Model Operations</title>
-      <para>The four primary data model operations are Get, Put, Scan, and Delete. Operations are
-        applied via <link
-          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html">Table</link>
-        instances.
-      </para>
-      <section
-        xml:id="get">
-        <title>Get</title>
-        <para><link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html">Get</link>
-          returns attributes for a specified row. Gets are executed via <link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#get(org.apache.hadoop.hbase.client.Get)">
-            Table.get</link>. </para>
-      </section>
-      <section
-        xml:id="put">
-        <title>Put</title>
-        <para><link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Put.html">Put</link>
-          either adds new rows to a table (if the key is new) or can update existing rows (if the
-          key already exists). Puts are executed via <link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#put(org.apache.hadoop.hbase.client.Put)">
-            Table.put</link> (writeBuffer) or <link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#batch(java.util.List, java.lang.Object[])">
-            Table.batch</link> (non-writeBuffer). </para>
-      </section>
-      <section
-        xml:id="scan">
-        <title>Scans</title>
-        <para><link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scan</link>
-          allows iteration over multiple rows for specified attributes. </para>
-        <para>The following is an example of a Scan on a Table instance. Assume that a table is
-          populated with rows with keys "row1", "row2", "row3", and then another set of rows with
-          the keys "abc1", "abc2", and "abc3". The following example shows how to set a Scan
-          instance to return the rows beginning with "row".</para>
-<programlisting language="java">
-public static final byte[] CF = Bytes.toBytes("cf");
-public static final byte[] ATTR = Bytes.toBytes("attr");
-...
-
-Table table = ...      // instantiate a Table instance
-
-Scan scan = new Scan();
-scan.addColumn(CF, ATTR);
-scan.setRowPrefixFilter(Bytes.toBytes("row"));  // match only keys starting with "row"
-ResultScanner rs = table.getScanner(scan);
-try {
-  for (Result r = rs.next(); r != null; r = rs.next()) {
-    // process result...
-  }
-} finally {
-  rs.close();  // always close the ResultScanner!
-}
-</programlisting>
-        <para>Note that generally the easiest way to specify a specific stop point for a scan is by
-          using the <link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/InclusiveStopFilter.html">InclusiveStopFilter</link>
-          class. </para>
-      </section>
-      <section
-        xml:id="delete">
-        <title>Delete</title>
-        <para><link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Delete.html">Delete</link>
-          removes a row from a table. Deletes are executed via <link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#delete(org.apache.hadoop.hbase.client.Delete)">
-            Table.delete</link>. </para>
-        <para>HBase does not modify data in place, and so deletes are handled by creating new
-          markers called <emphasis>tombstones</emphasis>. These tombstones, along with the dead
-          values, are cleaned up on major compactions. </para>
-        <para>See <xref
-            linkend="version.delete" /> for more information on deleting versions of columns, and
-          see <xref
-            linkend="compaction" /> for more information on compactions. </para>
-
-      </section>
-
-    </section>
-
-
-    <section
-      xml:id="versions">
-      <title>Versions<indexterm><primary>Versions</primary></indexterm></title>
-
-      <para>A <emphasis>{row, column, version} </emphasis>tuple exactly specifies a
-          <literal>cell</literal> in HBase. It's possible to have an unbounded number of cells where
-        the row and column are the same but the cell address differs only in its version
-        dimension.</para>
-
-      <para>While row and column keys are expressed as bytes, the version is specified using a long
-        integer. Typically this long contains time instances such as those returned by
-          <code>java.util.Date.getTime()</code> or <code>System.currentTimeMillis()</code>, that is:
-          <quote>the difference, measured in milliseconds, between the current time and midnight,
-          January 1, 1970 UTC</quote>.</para>
-
-      <para>The HBase version dimension is stored in decreasing order, so that when reading from a
-        store file, the most recent values are found first.</para>
-
-      <para>There is a lot of confusion over the semantics of <literal>cell</literal> versions in
-        HBase. In particular:</para>
-      <itemizedlist>
-        <listitem>
-          <para>If multiple writes to a cell have the same version, only the last written is
-            fetchable.</para>
-        </listitem>
-
-        <listitem>
-          <para>It is OK to write cells in a non-increasing version order.</para>
-        </listitem>
-      </itemizedlist>
-
-      <para>Below we describe how the version dimension in HBase currently works. See <link
-              xlink:href="https://issues.apache.org/jira/browse/HBASE-2406">HBASE-2406</link> for
-            discussion of HBase versions. <link
-              xlink:href="http://outerthought.org/blog/417-ot.html">Bending time in HBase</link>
-            makes for a good read on the version, or time, dimension in HBase. It has more detail on
-            versioning than is provided here. As of this writing, the limitation
-              <emphasis>Overwriting values at existing timestamps</emphasis> mentioned in the
-            article no longer holds in HBase. This section is basically a synopsis of this article
-            by Bruno Dumon.</para>
-      
-      <section xml:id="specify.number.of.versions">
-        <title>Specifying the Number of Versions to Store</title>
-        <para>The maximum number of versions to store for a given column is part of the column
-          schema and is specified at table creation or with an <command>alter</command> command. The
-          default comes from <code>HColumnDescriptor.DEFAULT_VERSIONS</code>. Prior to HBase 0.96, the
-          default number of versions kept was <literal>3</literal>, but in 0.96 and newer it has been
-          changed to <literal>1</literal>.</para>
-        <example>
-          <title>Modify the Maximum Number of Versions for a Column</title>
-          <para>This example uses HBase Shell to keep a maximum of 5 versions of column
-              <code>f1</code>. You could also use <link
-              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html"
-              >HColumnDescriptor</link>.</para>
-          <screen><![CDATA[hbase> alter 't1', NAME => 'f1', VERSIONS => 5]]></screen>
-        </example>
-        <example>
-          <title>Modify the Minimum Number of Versions for a Column</title>
-          <para>You can also specify the minimum number of versions to store. By default, this is
-            set to 0, which means the feature is disabled. The following example sets the minimum
-            number of versions on column <code>f1</code> to <literal>2</literal>, via HBase Shell.
-            You could also use <link
-              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html"
-              >HColumnDescriptor</link>.</para>
-          <screen><![CDATA[hbase> alter 't1', NAME => 'f1', MIN_VERSIONS => 2]]></screen>
-        </example>
-        <para>Starting with HBase 0.98.2, you can specify a global default for the maximum number of
-          versions kept for all newly-created columns, by setting
-            <option>hbase.column.max.version</option> in <filename>hbase-site.xml</filename>. See
-            <xref linkend="hbase.column.max.version"/>.</para>
-      </section>
-
-      <section
-        xml:id="versions.ops">
-        <title>Versions and HBase Operations</title>
-
-        <para>In this section we look at the behavior of the version dimension for each of the core
-          HBase operations.</para>
-
-        <section>
-          <title>Get/Scan</title>
-
-          <para>Gets are implemented on top of Scans. The following discussion of <link
-              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html">Get</link>
-            applies equally to <link
-              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scans</link>.</para>
-
-          <para>By default, i.e. if you specify no explicit version, when doing a
-              <literal>get</literal>, the cell whose version has the largest value is returned
-            (which may or may not be the latest one written, see later). The default behavior can be
-            modified in the following ways:</para>
-
-          <itemizedlist>
-            <listitem>
-              <para>to return more than one version, see <link
-                  xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html#setMaxVersions()">Get.setMaxVersions()</link></para>
-            </listitem>
-
-            <listitem>
-              <para>to return versions other than the latest, see <link
-                  xlink:href="???">Get.setTimeRange()</link></para>
-
-              <para>To retrieve the latest version that is less than or equal to a given value, thus
-                giving the 'latest' state of the record at a certain point in time, just use a range
-                from 0 to the desired version and set the max versions to 1.</para>
-            </listitem>
-          </itemizedlist>
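-          <para>As a minimal sketch of the last rule, the toy class below models the versions of
-            one cell with a sorted map. <code>VersionedCell</code> and its methods are hypothetical
-            names used only for illustration and are not part of the HBase client API. In the real
-            API, the upper bound passed to <code>Get.setTimeRange()</code> is exclusive, so a range
-            that should include version T ends at T + 1.</para>

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy model of the versions of a single cell; not the HBase client API. */
class VersionedCell {
    // version (a long, usually a timestamp) -> value, kept sorted by version
    private final TreeMap<Long, String> versions = new TreeMap<>();

    /** Stores a value at the given version, overwriting any value already there. */
    void put(long version, String value) {
        versions.put(version, value);
    }

    /** Default read: the value whose version is largest, or null if the cell is empty. */
    String getLatest() {
        Map.Entry<Long, String> e = versions.lastEntry();
        return e == null ? null : e.getValue();
    }

    /** State "as of" a version: the value with the largest version <= asOf, or null. */
    String getAsOf(long asOf) {
        Map.Entry<Long, String> e = versions.floorEntry(asOf);
        return e == null ? null : e.getValue();
    }
}
```

-          <para>Note that <code>getLatest()</code> returns the value with the largest version,
-            which is not necessarily the value that was written last.</para>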
-
-        </section>
-        <section
-          xml:id="default_get_example">
-          <title>Default Get Example</title>
-          <para>The following Get will only retrieve the current version of the row.</para>
-          <programlisting language="java">
-public static final byte[] CF = "cf".getBytes();
-public static final byte[] ATTR = "attr".getBytes();
-...
-Get get = new Get(Bytes.toBytes("row1"));
-Result r = table.get(get);
-byte[] b = r.getValue(CF, ATTR);  // returns current version of value
-</programlisting>
-        </section>
-        <section
-          xml:id="versioned_get_example">
-          <title>Versioned Get Example</title>
-          <para>The following Get will return the last 3 versions of the row.</para>
-          <programlisting language="java">
-public static final byte[] CF = "cf".getBytes();
-public static final byte[] ATTR = "attr".getBytes();
-...
-Get get = new Get(Bytes.toBytes("row1"));
-get.setMaxVersions(3);  // will return last 3 versions of row
-Result r = table.get(get);
-byte[] b = r.getValue(CF, ATTR);  // returns current version of value
-List&lt;KeyValue&gt; kv = r.getColumn(CF, ATTR);  // returns all versions of this column
-</programlisting>
-        </section>
-
-        <section>
-          <title>Put</title>
-
-          <para>Doing a put always creates a new version of a <literal>cell</literal>, at a certain
-            timestamp. By default the system uses the server's <literal>currentTimeMillis</literal>,
-            but you can specify the version (= the long integer) yourself, on a per-column level.
-            This means you could assign a time in the past or the future, or use the long value for
-            non-time purposes.</para>
-
-          <para>To overwrite an existing value, do a put at exactly the same row, column, and
-            version as that of the cell you would overshadow.</para>
-          <section
-            xml:id="implicit_version_example">
-            <title>Implicit Version Example</title>
-            <para>The following Put will be implicitly versioned by HBase with the current
-              time.</para>
-            <programlisting language="java">
-public static final byte[] CF = "cf".getBytes();
-public static final byte[] ATTR = "attr".getBytes();
-...
-Put put = new Put(Bytes.toBytes(row));
-put.add(CF, ATTR, Bytes.toBytes(data));
-table.put(put);
-</programlisting>
-          </section>
-          <section
-            xml:id="explicit_version_example">
-            <title>Explicit Version Example</title>
-            <para>The following Put has the version timestamp explicitly set.</para>
-            <programlisting language="java">
-public static final byte[] CF = "cf".getBytes();
-public static final byte[] ATTR = "attr".getBytes();
-...
-Put put = new Put(Bytes.toBytes(row));
-long explicitTimeInMs = 555;  // just an example
-put.add(CF, ATTR, explicitTimeInMs, Bytes.toBytes(data));
-table.put(put);
-</programlisting>
-            <para>Caution: the version timestamp is used internally by HBase for things like
-              time-to-live calculations. It is usually best to avoid setting this timestamp
-              yourself. Prefer using a separate timestamp attribute of the row, or making the
-              timestamp part of the rowkey, or both. </para>
-          </section>
-
-        </section>
-
-        <section
-          xml:id="version.delete">
-          <title>Delete</title>
-
-          <para>There are three different types of internal delete markers. See Lars Hofhansl's blog
-            for a discussion of his attempt to add another, <link
-              xlink:href="http://hadoop-hbase.blogspot.com/2012/01/scanning-in-hbase.html">Scanning
-              in HBase: Prefix Delete Marker</link>. </para>
-          <itemizedlist>
-            <listitem>
-              <para>Delete: for a specific version of a column.</para>
-            </listitem>
-            <listitem>
-              <para>Delete column: for all versions of a column.</para>
-            </listitem>
-            <listitem>
-              <para>Delete family: for all columns of a particular ColumnFamily.</para>
-            </listitem>
-          </itemizedlist>
-          <para>When deleting an entire row, HBase will internally create a tombstone for each
-            ColumnFamily (i.e., not each individual column). </para>
-          <para>Deletes work by creating <emphasis>tombstone</emphasis> markers. For example, let's
-            suppose we want to delete a row. For this you can specify a version, or else by default
-            the <literal>currentTimeMillis</literal> is used. What this means is <quote>delete all
-              cells where the version is less than or equal to this version</quote>. HBase never
-            modifies data in place, so for example a delete will not immediately delete (or mark as
-            deleted) the entries in the storage file that correspond to the delete condition.
-            Rather, a so-called <emphasis>tombstone</emphasis> is written, which will mask the
-            deleted values. When HBase does a major compaction, the tombstones are processed to
-            actually remove the dead values, together with the tombstones themselves. If the version
-            you specified when deleting a row is larger than the version of any value in the row,
-            then you can consider the complete row to be deleted.</para>
-          <para>For an informative discussion on how deletes and versioning interact, see the thread <link
-              xlink:href="http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/28421">Put w/
-              timestamp -> Deleteall -> Put w/ timestamp fails</link> on the user mailing
-            list.</para>
-          <para>Also see <xref
-              linkend="keyvalue" /> for more information on the internal KeyValue format. </para>
-          <para>Delete markers are purged during the next major compaction of the store, unless the
-              <option>KEEP_DELETED_CELLS</option> option is set in the column family. To keep the
-            deletes for a configurable amount of time, you can set the delete TTL via the
-              <option>hbase.hstore.time.to.purge.deletes</option> property in
-              <filename>hbase-site.xml</filename>. If
-              <option>hbase.hstore.time.to.purge.deletes</option> is not set, or set to 0, all
-            delete markers, including those with timestamps in the future, are purged during the
-            next major compaction. Otherwise, a delete marker with a timestamp in the future is kept
-            until the major compaction which occurs after the time represented by the marker's
-            timestamp plus the value of <option>hbase.hstore.time.to.purge.deletes</option>, in
-            milliseconds. </para>
-          <note>
-            <para>This behavior represents a fix for an unexpected change that was introduced in
-              HBase 0.94, and was fixed in <link
-                xlink:href="https://issues.apache.org/jira/browse/HBASE-10118">HBASE-10118</link>.
-              The change has been backported to HBase 0.94 and newer branches.</para>
-          </note>
-        </section>
-      </section>
-
-      <section>
-        <title>Current Limitations</title>
-
-        <section>
-          <title>Deletes mask Puts</title>
-
-          <para>Deletes mask puts, even puts that happened after the delete
-          was entered. See <link xlink:href="https://issues.apache.org/jira/browse/HBASE-2256"
-              >HBASE-2256</link>. Remember that a delete writes a tombstone, which only
-          disappears after the next major compaction has run. Suppose you do
-          a delete of everything &lt;= T. After this you do a new put with a
-          timestamp &lt;= T. This put, even if it happened after the delete,
-          will be masked by the delete tombstone. Performing the put will not
-          fail, but when you do a get you will notice the put had no
-          effect. Puts will start working again after the major compaction has
-          run. These issues should not be a problem if you use
-          always-increasing versions for new puts to a row. But they can occur
-          even if you do not care about time: just do a delete and a put
-          immediately after each other, and there is some chance they happen
-          within the same millisecond.</para>
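-          <para>The masking described above can be sketched with a small in-memory model. This is
-            only an illustration under simplifying assumptions: <code>TombstoneColumn</code> is a
-            hypothetical class, it tracks a single column, and it treats every delete as
-            <quote>delete all cells where the version is less than or equal to this
-            version</quote>. It does not reproduce HBase's storage files or compaction
-            policy.</para>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of one column with tombstones; not HBase code. */
class TombstoneColumn {
    private final TreeMap<Long, String> cells = new TreeMap<>();
    private final List<Long> tombstones = new ArrayList<>();

    void put(long version, String value) {
        cells.put(version, value);
    }

    /** Writes a marker meaning "all cells with version <= this version are deleted". */
    void delete(long version) {
        tombstones.add(version);
    }

    /** A cell is masked if any tombstone has a version at or above the cell's version. */
    private boolean masked(long version) {
        for (long t : tombstones) {
            if (version <= t) {
                return true;
            }
        }
        return false;
    }

    /** Returns the newest value that no tombstone masks, or null if there is none. */
    String get() {
        for (Long v : cells.descendingKeySet()) {
            if (!masked(v)) {
                return cells.get(v);
            }
        }
        return null;
    }

    /** Physically drops masked cells, then drops the tombstones themselves. */
    void majorCompact() {
        cells.keySet().removeIf(this::masked);
        tombstones.clear();
    }
}
```

-          <para>In this model, a put at version 150 that follows a delete at version 200 succeeds
-            but stays invisible to <code>get()</code>. Only a put issued after
-            <code>majorCompact()</code>, or one with a version above 200, becomes visible.</para>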
-        </section>
-
-        <section
-          xml:id="major.compactions.change.query.results">
-          <title>Major compactions change query results</title>
-          
-          <para><quote>...create three cell versions at t1, t2 and t3, with a maximum-versions
-              setting of 2. So when getting all versions, only the values at t2 and t3 will be
-              returned. But if you delete the version at t2 or t3, the one at t1 will appear again.
-              Obviously, once a major compaction has run, such behavior will not be the case
-              anymore...</quote> (See <emphasis>Garbage Collection</emphasis> in <link
-              xlink:href="http://outerthought.org/blog/417-ot.html">Bending time in
-            HBase</link>.)</para>
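-          <para>The quoted scenario can be reproduced with a toy model. The sketch below is a
-            simplification under stated assumptions: <code>MaxVersionsCell</code> is a hypothetical
-            class, reads skip deleted versions before counting up to the maximum, and the modeled
-            compaction removes deleted cells first and then trims the cell to the maximum number of
-            versions. HBase's real compaction logic is more involved.</para>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy model of one cell with a maximum-versions setting; not HBase code. */
class MaxVersionsCell {
    private final int maxVersions;
    private final TreeMap<Long, String> cells = new TreeMap<>();
    private final Set<Long> deletedVersions = new TreeSet<>();

    MaxVersionsCell(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    void put(long version, String value) {
        cells.put(version, value);
    }

    /** Marks exactly one version as deleted (a "Delete" marker for a specific version). */
    void deleteVersion(long version) {
        deletedVersions.add(version);
    }

    /** Returns up to maxVersions live values, newest first. */
    List<String> getVersions() {
        List<String> out = new ArrayList<>();
        for (Long v : cells.descendingKeySet()) {
            if (out.size() == maxVersions) {
                break;
            }
            if (!deletedVersions.contains(v)) {
                out.add(cells.get(v));
            }
        }
        return out;
    }

    /** Drops deleted cells and their markers, then keeps only the newest maxVersions cells. */
    void majorCompact() {
        cells.keySet().removeAll(deletedVersions);
        deletedVersions.clear();
        while (cells.size() > maxVersions) {
            cells.pollFirstEntry();
        }
    }
}
```

-          <para>With a maximum of 2 versions and puts at t1, t2 and t3, deleting t3 before a
-            compaction makes the value at t1 visible again. If the compaction runs first, the value
-            at t1 is gone and the same delete leaves only the value at t2.</para>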
-        </section>
-      </section>
-    </section>
-    <section xml:id="dm.sort">
-      <title>Sort Order</title>
-      <para>All HBase data model operations return data in sorted order: first by row,
-      then by ColumnFamily, followed by column qualifier, and finally timestamp (sorted
-      in reverse, so newest records are returned first).
-      </para>
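-      <para>A minimal sketch of this ordering as a comparator follows. <code>CellKey</code> is a
-        hypothetical class, not HBase's KeyValue, and it compares Java strings for brevity; HBase
-        itself compares the underlying bytes, so treat the string comparison as an
-        assumption made for readability.</para>

```java
import java.util.Comparator;

/** Toy cell coordinates used to illustrate the sort order; not HBase's KeyValue. */
class CellKey {
    final String row;
    final String family;
    final String qualifier;
    final long timestamp;

    CellKey(String row, String family, String qualifier, long timestamp) {
        this.row = row;
        this.family = family;
        this.qualifier = qualifier;
        this.timestamp = timestamp;
    }

    /** Row, then ColumnFamily, then qualifier ascending; timestamp descending (newest first). */
    static final Comparator<CellKey> ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c != 0) {
            return c;
        }
        c = a.family.compareTo(b.family);
        if (c != 0) {
            return c;
        }
        c = a.qualifier.compareTo(b.qualifier);
        if (c != 0) {
            return c;
        }
        // Arguments are swapped so that larger (newer) timestamps sort first.
        return Long.compare(b.timestamp, a.timestamp);
    };
}
```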
-    </section>
-    <section xml:id="dm.column.metadata">
-      <title>Column Metadata</title>
-      <para>There is no store of column metadata outside of the internal KeyValue instances for a ColumnFamily.
-      Thus, while HBase can support not only a large number of columns per row but also a heterogeneous set of columns
-      between rows, it is your responsibility to keep track of the column names.
-      </para>
-      <para>The only way to get a complete set of columns that exist for a ColumnFamily is to process all the rows.
-      For more information about how HBase stores data internally, see <xref linkend="keyvalue" />.
-	  </para>
-    </section>
-    <section xml:id="joins"><title>Joins</title>
-      <para>Whether HBase supports joins is a common question on the mailing list, and there is a simple answer:  it doesn't,
-      at least not in the way that an RDBMS supports them (e.g., with equi-joins or outer-joins in SQL).  As has been illustrated
-      in this chapter, the read data model operations in HBase are Get and Scan.
-      </para>
-      <para>However, that doesn't mean that equivalent join functionality can't be supported in your application, but
-      you have to do it yourself.  The two primary strategies are either to denormalize the data upon writing to HBase,
-      or to have lookup tables and do the join between HBase tables in your application or MapReduce code (and as RDBMSs
-      demonstrate, there are several strategies for this depending on the size of the tables, e.g., nested loops vs.
-      hash-joins).  So which is the best approach?  It depends on what you are trying to do, and as such there isn't a single
-      answer that works for every use case.
-      </para>
-    </section>
-    <section xml:id="acid"><title>ACID</title>
-        <para>See <link xlink:href="http://hbase.apache.org/acid-semantics.html">ACID Semantics</link>.
-            Lars Hofhansl has also written a note on
-            <link xlink:href="http://hadoop-hbase.blogspot.com/2012/03/acid-in-hbase.html">ACID in HBase</link>.</para>
-    </section>
-  </chapter>  <!-- data model -->
-
-  <!--  schema design -->
-  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="schema_design.xml"/>
-
-  <chapter
-    xml:id="mapreduce">
-    <title>HBase and MapReduce</title>
-    <para>Apache MapReduce is a software framework used to analyze large amounts of data, and is
-      the framework used most often with <link
-        xlink:href="http://hadoop.apache.org/">Apache Hadoop</link>. MapReduce itself is out of the
-      scope of this document. A good place to get started with MapReduce is <link
-        xlink:href="http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html" />. MapReduce version
-      2 (MR2) is now part of <link
-        xlink:href="http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/">YARN</link>. </para>
-
-    <para> This chapter discusses specific configuration steps you need to take to use MapReduce on
-      data within HBase. In addition, it discusses other interactions and issues between HBase and
-      MapReduce jobs.
-      <note> 
-      <title>mapred and mapreduce</title>
-      <para>There are two mapreduce packages in HBase, as in MapReduce itself: <filename>org.apache.hadoop.hbase.mapred</filename>
-      and <filename>org.apache.hadoop.hbase.mapreduce</filename>. The former uses the old-style API and the latter
-      the new style.  The latter has more facilities, though you can usually find an equivalent in the older
-      package.  Pick the package that goes with your MapReduce deployment.  When in doubt or starting over, pick
-      <filename>org.apache.hadoop.hbase.mapreduce</filename>.  In the notes below, we refer to
-      o.a.h.h.mapreduce, but replace it with o.a.h.h.mapred if that is what you are using.
-      </para>
-      </note> 
-    </para>
-
-    <section
-      xml:id="hbase.mapreduce.classpath">
-      <title>HBase, MapReduce, and the CLASSPATH</title>
-      <para>By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either
-        the HBase configuration under <envar>$HBASE_CONF_DIR</envar> or the HBase classes.</para>
-      <para>To give the MapReduce jobs the access they need, you could add
-          <filename>hbase-site.xml</filename> to the
-            <filename><replaceable>$HADOOP_HOME</replaceable>/conf/</filename> directory and add the
-        HBase JARs to the <filename><replaceable>$HADOOP_HOME</replaceable>/lib/</filename>
-        directory. You would then need to copy these changes across your cluster, or edit
-          <filename><replaceable>$HADOOP_HOME</replaceable>/conf/hadoop-env.sh</filename> and add
-        them to the <envar>HADOOP_CLASSPATH</envar> variable. However, this approach is not
-        recommended because it will pollute your Hadoop install with HBase references. It also
-        requires you to restart the Hadoop cluster before Hadoop can use the HBase data.</para>
-      <para> Since HBase 0.90.x, HBase adds its dependency JARs to the job configuration itself. The
-        dependencies only need to be available on the local CLASSPATH. The following example runs
-        the bundled HBase <link
-          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
-        MapReduce job against a table named <systemitem>usertable</systemitem>. If you have not set
-        the environment variables expected in the command (the parts prefixed by a
-          <literal>$</literal> sign and enclosed in curly braces), you can use the actual system paths instead.
-        Be sure to use the correct version of the HBase JAR for your system. The backticks
-          (<literal>`</literal> symbols) cause the shell to execute the sub-commands, setting the
-        CLASSPATH as part of the command. This example assumes you use a BASH-compatible shell. </para>
-      <screen language="bourne">$ <userinput>HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounter usertable</userinput></screen>
-      <para>When the command runs, internally, the HBase JAR finds the dependencies it needs, such
-        as zookeeper and guava, on the passed <envar>HADOOP_CLASSPATH</envar>
-        and adds the JARs to the MapReduce job configuration. See the source at
-        TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) for how this is done. </para>
-      <note>
-        <para> The example may not work if you are running HBase from its build directory rather
-          than an installed location. You may see an error like the following:</para>
-        <screen>java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper</screen>
-        <para>If this occurs, try modifying the command as follows, so that it uses the HBase JARs
-          from the <filename>target/</filename> directory within the build environment.</para>
-        <screen language="bourne">$ <userinput>HADOOP_CLASSPATH=${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar rowcounter usertable</userinput></screen>
-      </note>
-      <caution>
-        <title>Notice to MapReduce users of HBase 0.96.1 and above</title>
-        <para>Some MapReduce jobs that use HBase fail to launch. The symptom is an exception similar
-          to the following:</para>
-        <screen>
-Exception in thread "main" java.lang.IllegalAccessError: class
-    com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass
-    com.google.protobuf.LiteralByteString
-    at java.lang.ClassLoader.defineClass1(Native Method)
-    at java.lang.ClassLoader.defineClass(ClassLoader.java:792)
-    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
-    at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
-    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
-    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
-    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
-    at java.security.AccessController.doPrivileged(Native Method)
-    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
-    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
-    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
-    at
-    org.apache.hadoop.hbase.protobuf.ProtobufUtil.toScan(ProtobufUtil.java:818)
-    at
-    org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.convertScanToString(TableMapReduceUtil.java:433)
-    at
-    org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:186)
-    at
-    org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:147)
-    at
-    org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:270)
-    at
-    org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:100)
-...
-</screen>
-        <para>This is caused by an optimization introduced in <link
-            xlink:href="https://issues.apache.org/jira/browse/HBASE-9867">HBASE-9867</link> that
-          inadvertently introduced a classloader dependency. </para>
-        <para>This affects both jobs using the <code>-libjars</code> option and "fat jar" jobs, those
-          which package their runtime dependencies in a nested <code>lib</code> folder.</para>
-        <para>In order to satisfy the new classloader requirements, hbase-protocol.jar must be
-          included in Hadoop's classpath. See <xref
-            linkend="hbase.mapreduce.classpath" /> for current recommendations for resolving
-          classpath errors. The following is included for historical purposes.</para>
-        <para>This can be resolved system-wide by including a reference to the hbase-protocol.jar in
-          Hadoop's lib directory, via a symlink or by copying the jar into the new location.</para>
-        <para>This can also be achieved on a per-job launch basis by including it in the
-            <code>HADOOP_CLASSPATH</code> environment variable at job submission time. When
-          launching jobs that package their dependencies, all three of the following job launching
-          commands satisfy this requirement:</para>
-        <screen language="bourne">
-$ <userinput>HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass</userinput>
-$ <userinput>HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass</userinput>
-$ <userinput>HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass</userinput>
-        </screen>
-        <para>For jars that do not package their dependencies, the following command structure is
-          necessary:</para>
-        <screen language="bourne">
-$ <userinput>HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',')</userinput> ...
-        </screen>
-        <para>See also <link
-            xlink:href="https://issues.apache.org/jira/browse/HBASE-10304">HBASE-10304</link> for
-          further discussion of this issue.</para>
-      </caution>
-    </section>
-
-    <section>
-      <title>MapReduce Scan Caching</title>
-      <para>TableMapReduceUtil now restores the option to set scanner caching (the number of rows
-        which are cached before returning the result to the client) on the Scan object that is
-        passed in. This functionality was lost due to a bug in HBase 0.95 (<link
-          xlink:href="https://issues.apache.org/jira/browse/HBASE-11558">HBASE-11558</link>), which
-        is fixed for HBase 0.98.5 and 0.96.3. The priority order for choosing the scanner caching is
-        as follows:</para>
-      <orderedlist>
-        <listitem>
-          <para>Caching settings which are set on the scan object.</para>
-        </listitem>
-        <listitem>
-          <para>Caching settings which are specified via the configuration option
-              <option>hbase.client.scanner.caching</option>, which can either be set manually in
-              <filename>hbase-site.xml</filename> or via the helper method
-              <code>TableMapReduceUtil.setScannerCaching()</code>.</para>
-        </listitem>
-        <listitem>
-          <para>The default value <code>HConstants.DEFAULT_HBASE_CLIENT_SCANNER_CACHING</code>, which is set to
-            <literal>100</literal>.</para>
-        </listitem>
-      </orderedlist>
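-      <para>The priority order above can be sketched as a small lookup. This is an illustration
-        only: <code>ScannerCachingResolver</code> is a hypothetical helper, not part of HBase, and
-        it assumes that a non-positive value means caching was not set on the scan. The default of
-        <literal>100</literal> is the value stated in the list above.</para>

```java
import java.util.Map;

/** Hypothetical helper that mirrors the documented priority order; not part of HBase. */
class ScannerCachingResolver {
    /** Default value as stated in the text for the HConstants default. */
    static final int DEFAULT_CACHING = 100;
    static final String CACHING_KEY = "hbase.client.scanner.caching";

    /**
     * @param scanCaching caching set on the scan; a value <= 0 is assumed to mean "not set"
     * @param conf        configuration properties, for example loaded from hbase-site.xml
     * @return the caching value that would win under the documented priority order
     */
    static int resolve(int scanCaching, Map<String, String> conf) {
        // 1. A setting on the scan object takes precedence.
        if (scanCaching > 0) {
            return scanCaching;
        }
        // 2. Otherwise use the configuration option, if present.
        String configured = conf.get(CACHING_KEY);
        if (configured != null) {
            return Integer.parseInt(configured.trim());
        }
        // 3. Otherwise fall back to the default.
        return DEFAULT_CACHING;
    }
}
```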
-      <para>Optimizing the caching settings is a balance between the time the client waits for a
-        result and the number of sets of results the client needs to receive. If the caching setting
-        is too large, the client could end up waiting for a long time or the request could even time
-        out. If the setting is too small, the scan needs to return results in several pieces.
-        If you think of the scan as a shovel, a bigger cache setting is analogous to a bigger
-        shovel, and a smaller cache setting is equivalent to more shoveling in order to fill the
-        bucket.</para>
-      <para>The list of priorities mentioned above allows you to set a reasonable default, and
-        override it for specific operations.</para>
-      <para>See the API documentation for <link
-          xlink:href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html"
-          >Scan</link> for more details.</para>
-    </section>
-
-    <section>
-      <title>Bundled HBase MapReduce Jobs</title>
-      <para>The HBase JAR also serves as a Driver for some bundled MapReduce jobs. To learn about
-        the bundled MapReduce jobs, run the following command.</para>
-
-      <screen language="bourne">$ <userinput>${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar</userinput>
-<computeroutput>An example program must be given as the first argument.
-Valid program names are:
-  copytable: Export a table from local cluster to peer cluster
-  completebulkload: Complete a bulk data load.
-  export: Write table data to HDFS.
-  import: Import data written by Export.
-  importtsv: Import data in TSV format.
-  rowcounter: Count rows in HBase table</computeroutput>
-    </screen>
-      <para>Each valid program name is a bundled MapReduce job. To run one of the jobs,
-        model your command after the following example.</para>
-      <screen language="bourne">$ <userinput>${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounter myTable</userinput></screen>
-    </section>
-
-    <section>
-      <title>HBase as a MapReduce Job Data Source and Data Sink</title>
-      <para>HBase can be used as a data source, <link
-          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>,
-        and data sink, <link
-          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link>
-        or <link
-          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/MultiTableOutputFormat.html">MultiTableOutputFormat</link>,
-        for MapReduce jobs. When writing MapReduce jobs that read from or write to HBase, it is advisable to
-        subclass <link
-          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>
-        and/or <link
-          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableReducer.html">TableReducer</link>.
-        See the do-nothing pass-through classes <link
-          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableMapper.html">IdentityTableMapper</link>
-        and <link
-          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableReducer.html">IdentityTableReducer</link>
-        for basic usage. For a more involved example, see <link
-          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
-        or review the <code>org.apache.hadoop.hbase.mapreduce.TestTableMapReduce</code> unit test. </para>
-      <para>If you run MapReduce jobs that use HBase as source or sink, you need to specify the source and
-        sink table and column names in your configuration.</para>
-
-      <para>When you read from HBase, the <code>TableInputFormat</code> requests the list of regions
-        from HBase and makes either one map task per region (<code>map-per-region</code>) or
-          <code>mapreduce.job.maps</code> map tasks, whichever is smaller. If your job only has two maps,
-        raise <code>mapreduce.job.maps</code> to a number greater than the number of regions. Maps
-        will run on the adjacent TaskTracker if you are running a TaskTracker and RegionServer per
-        node. When writing to HBase, it may make sense to avoid the Reduce step and write back into
-        HBase from within your map. This approach works when your job does not need the sort and
-        collation that MapReduce does on the map-emitted data. On insert, HBase 'sorts' so there is
-        no point double-sorting (and shuffling data around your MapReduce cluster) unless you need
-        to. If you do not need the Reduce, your map might emit counts of records processed for
-        reporting at the end of the job, or set the number of Reduces to zero and use
-        TableOutputFormat. If running the Reduce step makes sense in your case, you should typically
-        use multiple reducers so that load is spread across the HBase cluster.</para>
-
-      <para>A new HBase partitioner, the <link
-          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/HRegionPartitioner.html">HRegionPartitioner</link>,
-        can run as many reducers as the number of existing regions. The HRegionPartitioner is suitable
-        when your table is large and your upload will not greatly alter the number of existing
-        regions upon completion. Otherwise use the default partitioner. </para>
-    </section>
-
-    <section>
-      <title>Writing HFiles Directly During Bulk Import</title>
-      <para>If you are importing into a new table, you can bypass the HBase API and write your
-        content directly to the filesystem, formatted into HBase data files (HFiles). Your import
-        will run faster, perhaps an order of magnitude faster. For more on how this mechanism works,
-        see <xref
-          linkend="arch.bulk.load" />.</para>
-    </section>
-
-    <section>
-      <title>RowCounter Example</title>
-      <para>The included <link
-        xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
-        MapReduce job uses <code>TableInputFormat</code> and does a count of all rows in the specified
-        table. To run it, use the following command: </para>
-      <screen language="bourne">$ <userinput>./bin/hadoop jar hbase-X.X.X.jar</userinput></screen> 
-      <para>This will
-        invoke the HBase MapReduce Driver class. Select <literal>rowcounter</literal> from the choice of jobs
-        offered. This will print rowcounter usage advice to standard output. Specify the tablename,
-        column to count, and output
-        directory. If you have classpath errors, see <xref linkend="hbase.mapreduce.classpath" />.</para>
-    </section>
-
-    <section
-      xml:id="splitter">
-      <title>Map-Task Splitting</title>
-      <section
-        xml:id="splitter.default">
-        <title>The Default HBase MapReduce Splitter</title>
-        <para>When <link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>
-          is used to source an HBase table in a MapReduce job, its splitter will make a map task for
-          each region of the table. Thus, if there are 100 regions in the table, there will be 100
-          map-tasks for the job - regardless of how many column families are selected in the
-          Scan.</para>
-      </section>
-      <section
-        xml:id="splitter.custom">
-        <title>Custom Splitters</title>
-        <para>For those interested in implementing custom splitters, see the method
-            <code>getSplits</code> in <link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.html">TableInputFormatBase</link>.
-          That is where the logic for map-task assignment resides. </para>
-      </section>
-    </section>
-    <section
-      xml:id="mapreduce.example">
-      <title>HBase MapReduce Examples</title>
-      <section
-        xml:id="mapreduce.example.read">
-        <title>HBase MapReduce Read Example</title>
-        <para>The following is an example of using HBase as a MapReduce source in a read-only manner.
-          Specifically, there is a Mapper instance but no Reducer, and nothing is being emitted from
-          the Mapper. The job would be defined as follows...</para>
-        <programlisting language="java">
-Configuration config = HBaseConfiguration.create();
-Job job = new Job(config, "ExampleRead");
-job.setJarByClass(MyReadJob.class);     // class that contains mapper
-
-Scan scan = new Scan();
-scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
-scan.setCacheBlocks(false);  // don't set to true for MR jobs
-// set other scan attrs
-...
-
-TableMapReduceUtil.initTableMapperJob(
-  tableName,        // input HBase table name
-  scan,             // Scan instance to control CF and attribute selection
-  MyMapper.class,   // mapper
-  null,             // mapper output key
-  null,             // mapper output value
-  job);
-job.setOutputFormatClass(NullOutputFormat.class);   // because we aren't emitting anything from mapper
-
-boolean b = job.waitForCompletion(true);
-if (!b) {
-  throw new IOException("error with job!");
-}
-  </programlisting>
-        <para>...and the mapper instance would extend <link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>...</para>
-        <programlisting language="java">
-public static class MyMapper extends TableMapper&lt;Text, Text&gt; {
-
-  public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
-    // process data for the row from the Result instance.
-  }
-}
-    </programlisting>
-      </section>
-      <section
-        xml:id="mapreduce.example.readwrite">
-        <title>HBase MapReduce Read/Write Example</title>
-        <para>The following is an example of using HBase both as a source and as a sink with
-          MapReduce. This example will simply copy data from one table to another.</para>
-        <programlisting language="java">
-Configuration config = HBaseConfiguration.create();
-Job job = new Job(config,"ExampleReadWrite");
-job.setJarByClass(MyReadWriteJob.class);    // class that contains mapper
-
-Scan scan = new Scan();
-scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
-scan.setCacheBlocks(false);  // don't set to true for MR jobs
-// set other scan attrs
-
-TableMapReduceUtil.initTableMapperJob(
-	sourceTable,      // input table
-	scan,	          // Scan instance to control CF and attribute selection
-	MyMapper.class,   // mapper class
-	null,	          // mapper output key
-	null,	          // mapper output value
-	job);
-TableMapReduceUtil.initTableReducerJob(
-	targetTable,      // output table
-	null,             // reducer class
-	job);
-job.setNumReduceTasks(0);
-
-boolean b = job.waitForCompletion(true);
-if (!b) {
-    throw new IOException("error with job!");
-}
-    </programlisting>
-        <para>An explanation is required of what <classname>TableMapReduceUtil</classname> is doing,
-          especially with the reducer. <link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link>
-          is being used as the outputFormat class, and several parameters are being set on the
-          config (e.g., TableOutputFormat.OUTPUT_TABLE), as well as setting the reducer output key
-          to <classname>ImmutableBytesWritable</classname> and reducer value to
-            <classname>Writable</classname>. These could be set by the programmer on the job and
-          conf, but <classname>TableMapReduceUtil</classname> tries to make things easier.</para>
-        <para>The following is the example mapper, which will create a <classname>Put</classname>
-          matching the input <classname>Result</classname> and emit it. Note: this is what the
-          CopyTable utility does. </para>
-        <programlisting language="java">
-public static class MyMapper extends TableMapper&lt;ImmutableBytesWritable, Put&gt; {
-
-  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
-    // this example is just copying the data from the source table...
-    context.write(row, resultToPut(row, value));
-  }
-
-  private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException {
-    Put put = new Put(key.get());
-    for (KeyValue kv : result.raw()) {
-      put.add(kv);
-    }
-    return put;
-  }
-}
-    </programlisting>
-        <para>There isn't actually a reducer step, so <classname>TableOutputFormat</classname> takes
-          care of sending the <classname>Put</classname> to the target table. </para>
-        <para>This is just an example; developers could choose not to use
-            <classname>TableOutputFormat</classname> and connect to the target table themselves.
-        </para>
-      </section>
-      <section
-        xml:id="mapreduce.example.readwrite.multi">
-        <title>HBase MapReduce Read/Write Example With Multi-Table Output</title>
-        <para>TODO: example for <classname>MultiTableOutputFormat</classname>. </para>
-      </section>
-      <section
-        xml:id="mapreduce.example.summary">
-        <title>HBase MapReduce Summary to HBase Example</title>
-        <para>The following example uses HBase as a MapReduce source and sink with a summarization
-          step. This example will count the number of distinct instances of a value in a table and
-          write those summarized counts in another table.
-          <programlisting language="java">
-Configuration config = HBaseConfiguration.create();
-Job job = new Job(config,"ExampleSummary");
-job.setJarByClass(MySummaryJob.class);     // class that contains mapper and reducer
-
-Scan scan = new Scan();
-scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
-scan.setCacheBlocks(false);  // don't set to true for MR jobs
-// set other scan attrs
-
-TableMapReduceUtil.initTableMapperJob(
-	sourceTable,        // input table
-	scan,               // Scan instance to control CF and attribute selection
-	MyMapper.class,     // mapper class
-	Text.class,         // mapper output key
-	IntWritable.class,  // mapper output value
-	job);
-TableMapReduceUtil.initTableReducerJob(
-	targetTable,        // output table
-	MyTableReducer.class,    // reducer class
-	job);
-job.setNumReduceTasks(1);   // at least one, adjust as required
-
-boolean b = job.waitForCompletion(true);
-if (!b) {
-	throw new IOException("error with job!");
-}
-    </programlisting>
-          In this example mapper, a column with a String value is chosen as the value to summarize
-          upon. This value is used as the key to emit from the mapper, and an
-            <classname>IntWritable</classname> represents an instance counter.
-          <programlisting language="java">
-public static class MyMapper extends TableMapper&lt;Text, IntWritable&gt; {
-  public static final byte[] CF = "cf".getBytes();
-  public static final byte[] ATTR1 = "attr1".getBytes();
-
-  private final IntWritable ONE = new IntWritable(1);
-  private Text text = new Text();
-
-  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
-    String val = new String(value.getValue(CF, ATTR1));
-    text.set(val);     // we can only emit Writables...
-
-    context.write(text, ONE);
-  }
-}
-    </programlisting>
-          In the reducer, the "ones" are counted (just like any other MR example that does this),
-          and then a <classname>Put</classname> is emitted.
-          <programlisting language="java">
-public static class MyTableReducer extends TableReducer&lt;Text, IntWritable, ImmutableBytesWritable&gt; {
-  public static final byte[] CF = "cf".getBytes();
-  public static final byte[] COUNT = "count".getBytes();
-
-  public void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context) throws IOException, InterruptedException {
-    int i = 0;
-    for (IntWritable val : values) {
-      i += val.get();
-    }
-    Put put = new Put(Bytes.toBytes(key.toString()));
-    put.add(CF, COUNT, Bytes.toBytes(i));
-
-    context.write(null, put);
-  }
-}
-    </programlisting>
-        </para>
-      </section>
-      <section
-        xml:id="mapreduce.example.summary.file">
-        <title>HBase MapReduce Summary to File Example</title>
-        <para>This is very similar to the summary example above, with the exception that this is using
-          HBase as a MapReduce source but HDFS as the sink. The differences are in the job setup and
-          in the reducer. The mapper remains the same. </para>
-        <programlisting language="java">
-Configuration config = HBaseConfiguration.create();
-Job job = new Job(config,"ExampleSummaryToFile");
-job.setJarByClass(MySummaryFileJob.class);     // class that contains mapper and reducer
-
-Scan scan = new Scan();
-scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
-scan.setCacheBlocks(false);  // don't set to true for MR jobs
-// set other scan attrs
-
-TableMapReduceUtil.initTableMapperJob(
-	sourceTable,        // input table
-	scan,               // Scan instance to control CF and attribute selection
-	MyMapper.class,     // mapper class
-	Text.class,         // mapper output key
-	IntWritable.class,  // mapper output value
-	job);
-job.setReducerClass(MyReducer.class);    // reducer class
-job.setNumReduceTasks(1);    // at least one, adjust as required
-FileOutputFormat.setOutputPath(job, new Path("/tmp/mr/mySummaryFile"));  // adjust directories as required
-
-boolean b = job.waitForCompletion(true);
-if (!b) {
-	throw new IOException("error with job!");
-}
-    </programlisting>
-        <para>As stated above, the previous Mapper can run unchanged with this example. As for the
-          Reducer, it is a "generic" Reducer instead of extending TableReducer and emitting
-          Puts.</para>
-        <programlisting language="java">
-public static class MyReducer extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt;  {
-
-	public void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context) throws IOException, InterruptedException {
-		int i = 0;
-		for (IntWritable val : values) {
-			i += val.get();
-		}
-		context.write(key, new IntWritable(i));
-	}
-}
-    </programlisting>
-      </section>
-      <section
-        xml:id="mapreduce.example.summary.noreducer">
-        <title>HBase MapReduce Summary to HBase Without Reducer</title>
-        <para>It is also possible to perform summaries without a reducer - if you use HBase as the
-          reducer. </para>
-        <para>An HBase target table would need to exist for the job summary. The Table method
-            <code>incrementColumnValue</code> would be used to atomically increment values. From a
-          performance perspective, it might make sense to keep a Map of keys with the amounts to
-          be incremented for each map-task, and make one update per key during the
-            <code>cleanup</code> method of the mapper. However, your mileage may vary depending on the
-          number of rows to be processed and unique keys. </para>
-        <para>In the end, the summary results are in HBase. </para>
-      </section>
-      <section
-        xml:id="mapreduce.example.summary.rdbms">
-        <title>HBase MapReduce Summary to RDBMS</title>
-        <para>Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases,
-          it is possible to generate summaries directly to an RDBMS via a custom reducer. The
-            <code>setup</code> method can connect to an RDBMS (the connection information can be
-          passed via custom parameters in the context) and the cleanup method can close the
-          connection. </para>
-        <para>It is critical to understand that the number of reducers for the job affects the
-          summarization implementation, and you'll have to design this into your reducer.
-          Specifically, decide whether it is designed to run as a singleton (one reducer) or as multiple
-          reducers. Neither is right or wrong; it depends on your use-case. Recognize that the more
-          reducers that are assigned to the job, the more simultaneous connections to the RDBMS will
-          be created - this will scale, but only to a point. </para>
-        <programlisting language="java">
-public static class MyRdbmsReducer extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt; {
-
-  private Connection c = null;
-
-  public void setup(Context context) {
-    // create DB connection...
-  }
-
-  public void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context) throws IOException, InterruptedException {
-    // do summarization
-    // in this example the keys are Text, but this is just an example
-  }
-
-  public void cleanup(Context context) {
-    // close db connection
-  }
-
-}
-    </programlisting>
-        <para>In the end, the summary results are written to your RDBMS table(s). </para>
-      </section>
-
-    </section>
-    <!--  mr examples -->
-    <section
-      xml:id="mapreduce.htable.access">
-      <title>Accessing Other HBase Tables in a MapReduce Job</title>
-      <para>Although the framework currently allows one HBase table as input to a MapReduce job,
-        other HBase tables can be accessed as lookup tables, etc., in a MapReduce job by creating
-        a Table instance in the setup method of the Mapper.
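-        <programlisting language="java">public class MyMapper extends TableMapper&lt;Text, LongWritable&gt; {
-  private Connection connection;
-  private Table myOtherTable;
-
-  public void setup(Context context) throws IOException {
-    // Create a Connection to the cluster and keep it so that it can be closed in cleanup
-    connection = ConnectionFactory.createConnection(context.getConfiguration());
-    myOtherTable = connection.getTable(TableName.valueOf("myOtherTable"));
-  }
-
-  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
-    // process Result...
-    // use 'myOtherTable' for lookups
-  }
-
-  public void cleanup(Context context) throws IOException {
-    // release the lookup table and the connection when the task finishes
-    myOtherTable.close();
-    connection.close();
-  }
-}
-  </programlisting>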
-      </para>
-    </section>
-    <section
-      xml:id="mapreduce.specex">
-      <title>Speculative Execution</title>
-      <para>It is generally advisable to turn off speculative execution for MapReduce jobs that use
-        HBase as a source. This can either be done on a per-Job basis through properties, or on the
-        entire cluster. Especially for longer running jobs, speculative execution will create
-        duplicate map-tasks which will double-write your data to HBase; this is probably not what
-        you want. </para>
-      <para>See <xref
-          linkend="spec.ex" /> for more information. </para>
-    </section>
-  </chapter>  <!--  mapreduce -->
-
-  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="security.xml" />
-
-  <chapter xml:id="architecture">
-    <title>Architecture</title>
-	<section xml:id="arch.overview">
-	<title>Overview</title>
-	  <section xml:id="arch.overview.nosql">
-	  <title>NoSQL?</title>
-	  <para>HBase is a type of "NoSQL" database.  "NoSQL" is a general term meaning that the database isn't an RDBMS which
-	  supports SQL as its primary access language, but there are many types of NoSQL databases:  BerkeleyDB is an
-	  example of a local NoSQL database, whereas HBase is very much a distributed database.  Technically speaking,
-	  HBase is really more a "Data Store" than "Data Base" because it lacks many of the features you find in an RDBMS,
-	  such as typed columns, secondary indexes, triggers, and advanced query languages, etc.
-	  </para>
-	  <para>However, HBase has many features which support both linear and modular scaling.  HBase clusters expand
-	  by adding RegionServers that are hosted on commodity class servers. If a cluster expands from 10 to 20
-	  RegionServers, for example, it doubles in terms of both storage and processing capacity.
-	  An RDBMS can scale well, but only up to a point - specifically, the size of a single database server - and for the best
-	  performance requires specialized hardware and storage devices.  HBase features of note are:
-	        <itemizedlist>
-              <listitem><para>Strongly consistent reads/writes:  HBase is not an "eventually consistent" DataStore.  This
-              makes it very suitable for tasks such as high-speed counter aggregation.</para>  </listitem>
-              <listitem><para>Automatic sharding:  HBase tables are distributed on the cluster via regions, and regions are
-              automatically split and re-distributed as your data grows.</para></listitem>
-              <listitem><para>Automatic RegionServer failover</para></listitem>
-              <listitem><para>Hadoop/HDFS Integration:  HBase supports HDFS out of the box as its distributed file system.</para></listitem>
-              <listitem><para>MapReduce:  HBase supports massively parallelized processing via MapReduce for using HBase as both
-              source and sink.</para></listitem>
-              <listitem><para>Java Client API:  HBase supports an easy to use Java API for programmatic access.</para></listitem>
-              <listitem><para>Thrift/REST API:  HBase also supports Thrift and REST for non-Java front-ends.</para></listitem>
-              <listitem><para>Block Cache and Bloom Filters:  HBase supports a Block Cache and Bloom Filters for high volume query optimization.</para></listitem>
-              <listitem><para>Operational Management:  HBase provides built-in web pages for operational insight as well as JMX metrics.</para></listitem>
-            </itemizedlist>
-	  </para>
-      </section>
-
-	  <section xml:id="arch.overview.when">
-	    <title>When Should I Use HBase?</title>
-	    	  <para>HBase isn't suitable for every problem.</para>
-	          <para>First, make sure you have enough data.  If you have hundreds of millions or billions of rows, then
-	            HBase is a good candidate.  If you only have a few thousand/million rows, then using a traditional RDBMS
-	            might be a better choice due to the fact that all of your data might wind up on a single node (or two) and
-	            the rest of the cluster may be sitting idle.
-	          </para>
-	          <para>Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns,
-	          secondary indexes, transactions, advanced query languages, etc.)  An application built against an RDBMS cannot be
-	          "ported" to HBase by simply changing a JDBC driver, for example.  Consider moving from an RDBMS to HBase as a
-	          complete redesign as opposed to a port.
-              </para>
-	          <para>Third, make sure you have enough hardware.  Even HDFS doesn't do well with anything less than
-                5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode.
-                </para>
-                <para>HBase can run quite well stand-alone on a laptop - but this should be considered a development
-                configuration only.
-                </para>
-      </section>
-      <section xml:id="arch.overview.hbasehdfs">
-        <title>What Is The Difference Between HBase and Hadoop/HDFS?</title>
-          <para><link xlink:href="http://hadoop.apache.org/hdfs/">HDFS</link> is a distributed file system that is well suited for the storage of large files.
-          Its documentation states that it is not, however, a general purpose file system, and does not provide fast individual record lookups in files.
-          HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables.
-          This can sometimes be a point of conceptual confusion.  HBase internally puts your data in indexed "StoreFiles" that exist
-          on HDFS for high-speed lookups.  See the <xref linkend="datamodel" /> and the rest of this chapter for more information on how HBase achieves its goals.
-         </para>
-      </section>
-	</section>
-
-    <section
-      xml:id="arch.catalog">
-      <title>Catalog Tables</title>
-      <para>The catalog table <code>hbase:meta</code> exists as an HBase table and is filtered out of the HBase
-        shell's <code>list</code> command, but is in fact a table just like any other. </para>
-      <section
-        xml:id="arch.catalog.root">
-        <title>-ROOT-</title>
-        <note>
-          <para>The <code>-ROOT-</code> table was removed in HBase 0.96.0. Information here should
-            be considered historical.</para>
-        </note>
-        <para>The <code>-ROOT-</code> table kept track of the location of the
-            <code>.META.</code> table (the previous name for the table now called <code>hbase:meta</code>) prior to HBase
-          0.96. The <code>-ROOT-</code> table structure was as follows: </para>
-        <itemizedlist>
-          <title>Key</title>
-          <listitem>
-            <para>.META. region key (<code>.META.,,1</code>)</para>
-          </listitem>
-        </itemizedlist>
-
-        <itemizedlist>
-          <title>Values</title>
-          <listitem>
-            <para><code>info:regioninfo</code> (serialized <link
-                xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HRegionInfo.html">HRegionInfo</link>
-              instance of hbase:meta)</para>
-          </listitem>
-          <listitem>
-            <para><code>info:server</code> (server:port of the RegionServer holding
-              hbase:meta)</para>
-          </listitem>
-          <listitem>
-            <para><code>info:serverstartcode</code> (start-time of the RegionServer process holding
-              hbase:meta)</para>
-          </listitem>
-        </itemizedlist>
-      </section>
-      <section
-        xml:id="arch.catalog.meta">
-        <title>hbase:meta</title>
-        <para>The <code>hbase:meta</code> table (previously called <code>.META.</code>) keeps a list
-          of all regions in the system. The location of <code>hbase:meta</code> was previously
-          tracked within the <code>-ROOT-</code> table, but is now stored in Zookeeper.</para>
-        <para>The <code>hbase:meta</code> table structure is as follows: </para>
-        <itemizedlist>
-          <title>Key</title>
-          <listitem>
-            <para>Region key of the format (<code>[table],[region start key],[region
-              id]</code>)</para>
-          </listitem>
-        </itemizedlist>
-        <itemizedlist>
-          <title>Values</title>
-          <listitem>
-            <para><code>info:regioninfo</code> (serialized <link
-                xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HRegionInfo.html">
-                HRegionInfo</link> instance for this region)</para>
-          </listitem>
-          <listitem>
-            <para><code>info:server</code> (server:port of the RegionServer containing this
-              region)</para>
-          </listitem>
-          <listitem>
-            <para><code>info:serverstartcode</code> (start-time of the RegionServer process
-              containing this region)</para>
-          </listitem>
-        </itemizedlist>
-        <para>When a region is in the process of splitting, two other columns will be created, called
-            <code>info:splitA</code> and <code>info:splitB</code>. These columns represent the two
-          daughter regions. The values for these columns are also serialized HRegionInfo instances.
-          After the region has been split, eventually this row will be deleted. </para>
-        <note>
-          <title>Note on HRegionInfo</title>
-          <para>The empty key is used to denote table start and table end. A region with an empty
-            start key is the first region in a table. If a region has both an empty start and an
-            empty end key, it is the only region in the table.</para>
-        </note>
-        <para>In the (hopefully unlikely) event that programmatic processing of catalog metadata is
-          required, see the <link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Writables.html#getHRegionInfo%28byte[]%29">Writables</link>
-          utility. </para>
-      </section>
-      <section
-        xml:id="arc

<TRUNCATED>
