hbase-commits mailing list archives

From jmhs...@apache.org
Subject git commit: HBASE-11400 [docs] edit, consolidate, and update compression and data encoding docs (Misty Stanley-Jones)
Date Fri, 18 Jul 2014 20:49:38 GMT
Repository: hbase
Updated Branches:
  refs/heads/master a030b17ba -> 209dd6dcf

HBASE-11400 [docs] edit, consolidate, and update compression and data encoding docs (Misty Stanley-Jones)

Project: http://git-wip-us.apache.org/repos/asf/hbase/repo
Commit: http://git-wip-us.apache.org/repos/asf/hbase/commit/209dd6dc
Tree: http://git-wip-us.apache.org/repos/asf/hbase/tree/209dd6dc
Diff: http://git-wip-us.apache.org/repos/asf/hbase/diff/209dd6dc

Branch: refs/heads/master
Commit: 209dd6dcfeb249060df091d651fc2d579aa729b5
Parents: a030b17
Author: Jonathan M Hsieh <jmhsieh@apache.org>
Authored: Fri Jul 18 13:45:57 2014 -0700
Committer: Jonathan M Hsieh <jmhsieh@apache.org>
Committed: Fri Jul 18 13:45:57 2014 -0700

 src/main/docbkx/book.xml                        | 627 +++++++++++++------
 .../images/data_block_diff_encoding.png         | Bin 0 -> 54479 bytes
 .../resources/images/data_block_no_encoding.png | Bin 0 -> 46836 bytes
 .../images/data_block_prefix_encoding.png       | Bin 0 -> 35271 bytes
 4 files changed, 424 insertions(+), 203 deletions(-)

diff --git a/src/main/docbkx/book.xml b/src/main/docbkx/book.xml
index 92c372e..4c06dc6 100644
--- a/src/main/docbkx/book.xml
+++ b/src/main/docbkx/book.xml
@@ -4387,230 +4387,451 @@ This option should not normally be used, and it is not in <code>-fixAll</code>.
-  <appendix xml:id="compression">
+  <appendix
+    xml:id="compression">
-    <title >Compression In HBase<indexterm><primary>Compression</primary></indexterm></title>
+    <title>Compression and Data Block Encoding In
+          HBase<indexterm><primary>Compression</primary><secondary>Data
+          Encoding</secondary><seealso>codecs</seealso></indexterm></title>
-      <para>Codecs mentioned in this section are for encoding and decoding data blocks.
-        information about replication codecs, see <xref
+      <para>Codecs mentioned in this section are for encoding and decoding data blocks or row keys.
+       For information about replication codecs, see <xref
           linkend="cluster.replication.preserving.tags" />.</para>
-    <para>There are a bunch of compression options in HBase.  Some codecs come with java --
-        e.g. gzip -- and so require no additional installations. Others require native
-        libraries.  The native libraries may be available in your hadoop as is the case
-        with lz4 and it is just a matter of making sure the hadoop native .so is available
-        to HBase.  You may have to do extra work to make the codec accessible; for example,
-        if the codec has an apache-incompatible license that makes it so hadoop cannot bundle
-        the library.</para>
-        <para>Below we
-        discuss what is necessary for the common codecs.  Whatever codec you use, be sure
-        to test it is installed properly and is available on all nodes that make up your
-        Add any necessary operational step that will ensure checking the codec present when
-        happen to add new nodes to your cluster. The <xref linkend="compression.test"
-        discussed below can help check the codec is properly install.</para>
-        <para>As to which codec to use, there is some helpful discussion
-        to be found in <link xlink:href="http://search-hadoop.com/m/lL12B1PFVhp1">Documenting Guidance on compression and codecs</link>.
-    </para>
+    <para>Some of the information in this section is pulled from a <link
+        xlink:href="http://search-hadoop.com/m/lL12B1PFVhp1/v=threaded">discussion</link> on the
+      HBase Development mailing list.</para>
+    <para>HBase supports several different compression algorithms which can be enabled on a
+      ColumnFamily. Data block encoding attempts to limit duplication of information in keys, taking
+      advantage of some of the fundamental designs and patterns of HBase, such as sorted row keys
+      and the schema of a given table. Compressors reduce the size of large, opaque byte arrays in
+      cells, and can significantly reduce the storage space needed to store uncompressed
+      data.</para>
+    <para>Compressors and data block encoding can be used together on the same ColumnFamily.</para>
+    <formalpara>
+      <title>Changes Take Effect Upon Compaction</title>
+      <para>If you change compression or encoding for a ColumnFamily, the changes take effect during
+       compaction.</para>
+    </formalpara>
+    <para>Some codecs take advantage of capabilities built into Java, such as GZip compression.
+      Others rely on native libraries. Native libraries may be available as part of Hadoop, such as
+      LZ4. In this case, HBase only needs access to the appropriate shared library. Other codecs,
+      such as Google Snappy, need to be installed first. Some codecs are licensed in ways that
+      conflict with HBase's license and cannot be shipped as part of HBase.</para>
+    <para>This section discusses common codecs that are used and tested with HBase. No matter what
+      codec you use, be sure to test that it is installed correctly and is available on all nodes in
+      your cluster. Extra operational steps may be necessary to be sure that codecs are available on
+      newly-deployed nodes. You can use the <xref
+        linkend="compression.test" /> utility to check that a given codec is correctly
+      installed.</para>
+    <para>To configure HBase to use a compressor, see <xref
+        linkend="compressor.install" />. To enable a compressor for a ColumnFamily, see <xref
+        linkend="changing.compression" />. To enable data block encoding for a ColumnFamily, see
+      <xref linkend="data.block.encoding.enable" />.</para>
+    <itemizedlist>
+      <title>Block Compressors</title>
+      <listitem>
+        <para>none</para>
+      </listitem>
+      <listitem>
+        <para>Snappy</para>
+      </listitem>
+      <listitem>
+        <para>LZO</para>
+      </listitem>
+      <listitem>
+        <para>LZ4</para>
+      </listitem>
+      <listitem>
+        <para>GZ</para>
+      </listitem>
+    </itemizedlist>
-    <section xml:id="compression.test">
-    <title>CompressionTest Tool</title>
-    <para>
-    HBase includes a tool to test compression is set up properly.
-    To run it, type <code>/bin/hbase org.apache.hadoop.hbase.util.CompressionTest</code>.
-    This will emit usage on how to run the tool.
-    </para>
-    <note><title>You need to restart regionserver for it to pick up changes!</title>
-        <para>Be aware that the regionserver caches the result of the compression check it runs
-            ahead of each region open.  This means that you will have to restart the regionserver
-            for it to notice that you have fixed any codec issues; e.g. changed symlinks
-            moved lib locations under HBase.</para>
-    </note>
-    <note xml:id="hbase.native.platform"><title>On the location of native libraries</title>
-        <para>Hadoop looks in <filename>lib/native</filename> for .so files.  HBase looks in
-            <filename>lib/native/PLATFORM</filename>.  See the <command>bin/hbase</command>.
-            View the file and look for <varname>native</varname>.  See how we
-            do the work to find out what platform we are running on running a little java
-            <classname>org.apache.hadoop.util.PlatformName</classname> to figure it out.
-            We'll then add <filename>./lib/native/PLATFORM</filename> to the
-            <varname>LD_LIBRARY_PATH</varname> environment for when the JVM starts.
-            The JVM will look in here (as well as in any other dirs specified on LD_LIBRARY_PATH)
-            for codec native libs.  If you are unable to figure your 'platform', do:
-            <programlisting>$ ./bin/hbase org.apache.hadoop.util.PlatformName</programlisting>.
-            An example platform would be <varname>Linux-amd64-64</varname>.
-            </para>
-    </note>
-    </section>
-    <section xml:id="hbase.regionserver.codecs">
-    <title>
-    <varname>
-    hbase.regionserver.codecs
-    </varname>
-    </title>
-    <para>
-    To have a RegionServer test a set of codecs and fail-to-start if any
-    code is missing or misinstalled, add the configuration
-    <varname>
-    hbase.regionserver.codecs
-    </varname>
-    to your <filename>hbase-site.xml</filename> with a value of
-    codecs to test on startup.  For example if the
-    <varname>
-    hbase.regionserver.codecs
-    </varname> value is <code>lzo,gz</code> and if lzo is not present
-    or improperly installed, the misconfigured RegionServer will fail
-    to start.
-    </para>
-    <para>
-    Administrators might make use of this facility to guard against
-    the case where a new server is added to cluster but the cluster
-    requires install of a particular coded.
-    </para>
-    </section>
+    <itemizedlist>
+      <title>Data Block Encoding Types</title>
+      <listitem>
+        <para>Prefix - Often, keys are very similar. Specifically, keys often share a common prefix
+          and only differ near the end. For instance, one key might be
+            <literal>RowKey:Family:Qualifier0</literal> and the next key might be
+            <literal>RowKey:Family:Qualifier1</literal>. In Prefix encoding, an extra column is
+          added which holds the length of the prefix shared between the current key and the previous
+          key. Assuming the first key here is totally different from the key before, its prefix
+          length is 0. The second key's prefix length is <literal>23</literal>, since they have the
+          first 23 characters in common.</para>
+        <para>Obviously if the keys tend to have nothing in common, Prefix will not provide much
+          benefit.</para>
+        <para>The following image shows a hypothetical ColumnFamily with no data block encoding.</para>
+        <figure>
+          <title>ColumnFamily with No Encoding</title>
+          <mediaobject>
+            <imageobject>
+              <imagedata fileref="data_block_no_encoding.png" width="800"/>
+            </imageobject>
+            <textobject><para></para>
+            </textobject>
+          </mediaobject>
+        </figure>
+        <para>Here is the same data with prefix data encoding.</para>
+        <figure>
+          <title>ColumnFamily with Prefix Encoding</title>
+          <mediaobject>
+            <imageobject>
+              <imagedata fileref="data_block_prefix_encoding.png" width="800"/>
+            </imageobject>
+            <textobject><para></para>
+            </textobject>
+          </mediaobject>
+        </figure>
+      </listitem>
+      <listitem>
+        <para>Diff - Diff encoding expands upon Prefix encoding. Instead of considering the key
+          sequentially as a monolithic series of bytes, each key field is split so that each part of
+          the key can be compressed more efficiently. Two new fields are added: timestamp and type.
+          If the ColumnFamily is the same as the previous row, it is omitted from the current row.
+          If the key length, value length or type are the same as the previous row, the field is
+          omitted. In addition, for increased compression, the timestamp is stored as a Diff from
+          the previous row's timestamp, rather than being stored in full. Given the two row keys in
+          the Prefix example, and given an exact match on timestamp and the same type, neither the
+          value length, or type needs to be stored for the second row, and the timestamp value for
+          the second row is just 0, rather than a full timestamp.</para>
+        <para>Diff encoding is disabled by default because writing and scanning are slower but more
+          data is cached.</para>
+        <para>This image shows the same ColumnFamily from the previous images, with Diff encoding.</para>
+        <figure>
+          <title>ColumnFamily with Diff Encoding</title>
+          <mediaobject>
+            <imageobject>
+              <imagedata fileref="data_block_diff_encoding.png" width="800"/>
+            </imageobject>
+            <textobject><para></para>
+            </textobject>
+          </mediaobject>
+        </figure>
+      </listitem>
+      <listitem>
+        <para>Fast Diff - Fast Diff works similar to Diff, but uses a faster implementation. It also
+          adds another field which stores a single bit to track whether the data itself is the same
+          as the previous row. If it is, the data is not stored again. Fast Diff is the recommended
+          codec to use if you have long keys or many columns. The data format is nearly identical to
+        Diff encoding, so there is not an image to illustrate it.</para>
+      </listitem>
+      <listitem>
+        <para>Prefix Tree encoding was introduced as an experimental feature in HBase 0.96. It
+          provides similar memory savings to the Prefix, Diff, and Fast Diff encoder, but provides
+          faster random access at a cost of slower encoding speed. Prefix Tree may be appropriate
+          for applications that have high block cache hit ratios. It introduces new 'tree' fields
+          for the row and column. The row tree field contains a list of offsets/references
+          corresponding to the cells in that row. This allows for a good deal of compression. For
+          more details about Prefix Tree encoding, see <link
+            xlink:href="https://issues.apache.org/jira/browse/HBASE-4676">HBASE-4676</link>. It is
+          difficult to graphically illustrate a prefix tree, so no image is included. See the
+          Wikipedia article for <link
+            xlink:href="http://en.wikipedia.org/wiki/Trie">Trie</link> for more general information
+          about this data structure.</para>
+      </listitem>
+    </itemizedlist>
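The Prefix encoding described above can be sketched in a few lines of Python. This is only an illustration of the idea, not HBase's on-disk format or its Java implementation; the function names `shared_prefix_len`, `prefix_encode`, and `prefix_decode` are hypothetical, and keys are modelled as plain byte strings rather than full HBase cells.

```python
def shared_prefix_len(prev: bytes, cur: bytes) -> int:
    """Number of leading bytes that cur has in common with prev."""
    n = 0
    for a, b in zip(prev, cur):
        if a != b:
            break
        n += 1
    return n


def prefix_encode(keys):
    """Turn a sorted list of keys into (prefix_length, suffix) pairs."""
    encoded, prev = [], b""
    for key in keys:
        n = shared_prefix_len(prev, key)
        encoded.append((n, key[n:]))  # store only the part that differs
        prev = key
    return encoded


def prefix_decode(encoded):
    """Rebuild the full keys from (prefix_length, suffix) pairs."""
    keys, prev = [], b""
    for n, suffix in encoded:
        key = prev[:n] + suffix
        keys.append(key)
        prev = key
    return keys


# The two example keys from the Prefix description.
keys = [b"RowKey:Family:Qualifier0", b"RowKey:Family:Qualifier1"]
encoded = prefix_encode(keys)
print(encoded)
```

With the two example keys, the first entry keeps the whole key with a prefix length of 0, and the second stores a prefix length of 23 plus the single differing byte. Keys with nothing in common would each be stored in full, which is why Prefix gives little benefit in that case.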
-    <section xml:id="gzip.compression">
-    <title>
-    GZIP
-    </title>
-    <para>
-    GZIP will generally compress better than LZO but it will run slower.
-    For some setups, better compression may be preferred ('cold' data).
-    Java will use java's GZIP unless the native Hadoop libs are
-    available on the CLASSPATH; in this case it will use native
-    compressors instead (If the native libs are NOT present,
-    you will see lots of <emphasis>Got brand-new compressor</emphasis>
-    reports in your logs; see <xref linkend="brand.new.compressor" />).
-    </para>
+    <section>
+      <title>Which Compressor or Data Block Encoder To Use</title>
+      <para>The compression or codec type to use depends on the characteristics of your data.
+        Choosing the wrong type could cause your data to take more space rather than less, and can
+        have performance implications. In general, you need to weigh your options between smaller
+        size and faster compression/decompression. Following are some general guidelines, expanded from a discussion at <link xlink:href="http://search-hadoop.com/m/lL12B1PFVhp1">Documenting Guidance on compression and codecs</link>. </para>
+      <itemizedlist>
+        <listitem>
+          <para>If you have long keys (compared to the values) or many columns, use a prefix
+            encoder. FAST_DIFF is recommended, as more testing is needed for Prefix Tree
+            encoding.</para>
+        </listitem>
+        <listitem>
+          <para>If the values are large (and not precompressed, such as images), use a data block
+            compressor.</para>
+        </listitem>
+        <listitem>
+          <para>Use GZIP for <firstterm>cold data</firstterm>, which is accessed infrequently. GZIP
+            compression uses more CPU resources than Snappy or LZO, but provides a higher
+            compression ratio.</para>
+        </listitem>
+        <listitem>
+          <para>Use Snappy or LZO for <firstterm>hot data</firstterm>, which is accessed
+            frequently. Snappy and LZO use fewer CPU resources than GZIP, but do not provide as high
+          of a compression ratio.</para>
+        </listitem>
+        <listitem>
+          <para>In most cases, enabling Snappy or LZO by default is a good choice, because they have
+            a low performance overhead and provide space savings.</para>
+        </listitem>
+        <listitem>
+          <para>Before Snappy became available by Google in 2011, LZO was the default. Snappy has
+            similar qualities as LZO but has been shown to perform better.</para>
+        </listitem>
+      </itemizedlist>
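One way to get a rough feel for the guideline about large, non-precompressed values is to measure how well a sample of your values compresses before enabling a compressor. The sketch below uses Python's standard `zlib` module (DEFLATE, the algorithm behind GZ) purely as a stand-in; it does not call HBase, and it says nothing about Snappy, LZO, or LZ4 speed. The sample data and the helper name `compression_ratio` are assumptions made for illustration.

```python
import random
import zlib


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Original size divided by DEFLATE-compressed size (higher is better)."""
    return len(data) / len(zlib.compress(data, level))


# Repetitive values, loosely resembling text-like cell contents (made-up sample).
repetitive = b"status=OK;region=us-east;" * 400

# Seeded random bytes as a stand-in for precompressed values such as images.
rng = random.Random(0)
precompressed_like = bytes(rng.getrandbits(8) for _ in range(10_000))

print(f"repetitive:         {compression_ratio(repetitive):.1f}x")
print(f"precompressed-like: {compression_ratio(precompressed_like):.2f}x")
```

A ratio well above 1 suggests a block compressor is likely to pay off; a ratio at or below 1 suggests the values are already dense and compression would mostly cost CPU. Real decisions should be based on measurements against the actual table.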
-    <section xml:id="lz4.compression">
-    <title>
-        LZ4
-    </title>
-    <para>
-        LZ4 is bundled with Hadoop. Make sure the hadoop .so is
-        accessible when you start HBase.  One means of doing this is after figuring your
-        platform, see <xref linkend="hbase.native.platform" />, make a symlink from
-        to the native Hadoop libraries presuming the two software installs are colocated.
-        For example, if my 'platform' is Linux-amd64-64:
-        <programlisting>$ cd $HBASE_HOME
+    <section>
+      <title>Compressor Configuration, Installation, and Use</title>
+      <section
+        xml:id="compressor.install">
+        <title>Configure HBase For Compressors</title>
+        <para>Before HBase can use a given compressor, its libraries need to be available. Due to
+          licensing issues, only GZ compression is available to HBase (via native Java libraries) in
+          a default installation.</para>
+        <section>
+          <title>Compressor Support On the Master</title>
+          <para>A new configuration setting was introduced in HBase 0.95, to check the Master to
+            determine which data block encoders are installed and configured on it, and assume that
+            the entire cluster is configured the same. This option,
+              <code>hbase.master.check.compression</code>, defaults to <literal>true</literal>. This
+            prevents the situation described in <link
+              xlink:href="https://issues.apache.org/jira/browse/HBASE-6370">HBASE-6370</link>, where
+            a table is created or modified to support a codec that a region server does not support,
+            leading to failures that take a long time to occur and are difficult to debug. </para>
+          <para>If <code>hbase.master.check.compression</code> is enabled, libraries for all desired
+            compressors need to be installed and configured on the Master, even if the Master does
+            not run a region server.</para>
+        </section>
+        <section>
+          <title>Install GZ Support Via Native Libraries</title>
+          <para>HBase uses Java's built-in GZip support unless the native Hadoop libraries are
+            available on the CLASSPATH. The recommended way to add libraries to the CLASSPATH is to
+            set the environment variable <envar>HBASE_LIBRARY_PATH</envar> for the user running
+            HBase. If native libraries are not available and Java's GZIP is used, <literal>Got
+              brand-new compressor</literal> reports will be present in the logs. See <xref
+              linkend="brand.new.compressor" />).</para>
+        </section>
+        <section
+          xml:id="lzo.compression">
+          <title>Install LZO Support</title>
+          <para>HBase cannot ship with LZO because of incompatibility between HBase, which uses an
+            Apache Software License (ASL) and LZO, which uses a GPL license. See the <link
+              xlink:href="http://wiki.apache.org/hadoop/UsingLzoCompression">Using LZO
+              Compression</link> wiki page for information on configuring LZO support for HBase. </para>
+          <para>If you depend upon LZO compression, consider configuring your RegionServers to fail
+            to start if LZO is not available. See <xref
+              linkend="hbase.regionserver.codecs" />.</para>
+        </section>
+        <section
+          xml:id="lz4.compression">
+          <title>Configure LZ4 Support</title>
+          <para>LZ4 support is bundled with Hadoop. Make sure the hadoop shared library
+            (libhadoop.so) is accessible when you start
+            HBase. After configuring your platform (see <xref
+              linkend="hbase.native.platform" />), you can make a symbolic link from HBase to the native Hadoop
+            libraries. This assumes the two software installs are colocated. For example, if my
+            'platform' is Linux-amd64-64:
+            <programlisting>$ cd $HBASE_HOME
 $ mkdir lib/native
 $ ln -s $HADOOP_HOME/lib/native lib/native/Linux-amd64-64</programlisting>
-        Use the compression tool to check lz4 installed on all nodes.
-        Start up (or restart) hbase. From here on out you will be able to create
-        and alter tables to enable LZ4 as a compression codec. E.g.:
-        <programlisting>hbase(main):003:0> alter 'TestTable', {NAME => 'info', COMPRESSION => 'LZ4'}</programlisting>
-    </para>
-    </section>
-    <section xml:id="lzo.compression">
-    <title>
-    LZO
-    </title>
-      <para>Unfortunately, HBase cannot ship with LZO because of
-      the licensing issues; HBase is Apache-licensed, LZO is GPL.
-      Therefore LZO install is to be done post-HBase install.
-      See the <link xlink:href="http://wiki.apache.org/hadoop/UsingLzoCompression">Using LZO Compression</link>
-      wiki page for how to make LZO work with HBase.
-      </para>
-      <para>A common problem users run into when using LZO is that while initial
-      setup of the cluster runs smooth, a month goes by and some sysadmin goes to
-      add a machine to the cluster only they'll have forgotten to do the LZO
-      fixup on the new machine.  In versions since HBase 0.90.0, we should
-      fail in a way that makes it plain what the problem is, but maybe not. </para>
-      <para>See <xref linkend="hbase.regionserver.codecs" />
-      for a feature to help protect against failed LZO install.</para>
-    </section>
+            Use the compression tool to check that LZ4 is installed on all nodes. Start up (or restart)
+            HBase. Afterward, you can create and alter tables to enable LZ4 as a
+            compression codec.:
+            <screen>
+hbase(main):003:0> <userinput>alter 'TestTable', {NAME => 'info', COMPRESSION => 'LZ4'}</userinput>
+            </screen>
+          </para>
+        </section>
+        <section
+          xml:id="snappy.compression.installation">
+          <title>Install Snappy Support</title>
+          <para>HBase does not ship with Snappy support because of licensing issues. You can install
+            Snappy binaries (for instance, by using <command>yum install snappy</command> on CentOS)
+            or build Snappy from source. After installing Snappy, search for the shared library,
+            which will be called <filename>libsnappy.so.X</filename> where X is a number. If you
+            built from source, copy the shared library to a known location on your system, such as
+              <filename>/opt/snappy/lib/</filename>.</para>
+          <para>In addition to the Snappy library, HBase also needs access to the Hadoop
+            library, which will be called something like <filename>libhadoop.so.X.Y</filename>,
+            where X and Y are both numbers. Make note of the location of the Hadoop library, or copy
+            it to the same location as the Snappy library.</para>
+          <note>
+            <para>The Snappy and Hadoop libraries need to be available on each node of your cluster.
+              See <xref
+                linkend="compression.test" /> to find out how to test that this is the case.</para>
+            <para>See <xref
+                linkend="hbase.regionserver.codecs" /> to configure your RegionServers to fail to
+              start if a given compressor is not available.</para>
+          </note>
+          <para>Each of these library locations need to be added to the environment variable
+              <envar>HBASE_LIBRARY_PATH</envar> for the operating system user that runs HBase. You
+            need to restart the RegionServer for the changes to take effect.</para>
+        </section>
-    <section xml:id="snappy.compression">
-    <title>
-    </title>
-    <para>
-        If snappy is installed, HBase can make use of it (courtesy of
-        <link xlink:href="http://code.google.com/p/hadoop-snappy/">hadoop-snappy</link>
-        <footnote><para>See <link xlink:href="http://search-hadoop.com/m/Ds8d51c263B1/%2522Hadoop-Snappy+in+synch+with+Hadoop+trunk%2522&amp;subj=Hadoop+Snappy+in+synch+with+Hadoop+trunk">Alejandro's
note</link> up on the list on difference between Snappy in Hadoop
-        and Snappy in HBase</para></footnote>).
-        <orderedlist>
-            <listitem>
-                <para>
-                    Build and install <link xlink:href="http://code.google.com/p/snappy/">snappy</link>
on all nodes
-                    of your cluster (see below).  HBase nor Hadoop cannot include snappy
because of licensing issues (The
-                    hadoop libhadoop.so under its native dir does not include snappy; of
note, the shipped .so
-                    may be for 32-bit architectures -- this fact has tripped up folks in
the past with them thinking
-                    it 64-bit).  The notes below are about installing snappy for HBase use.
 You may want snappy
-                    available in your hadoop context also.  That is not covered here.
-                    HBase and Hadoop find the snappy .so in different locations currently:
Hadoop picks those files in
-                    <filename>./lib</filename> while HBase finds the .so in <filename>./lib/[PLATFORM]</filename>.
-                </para>
-            </listitem>
-            <listitem>
-                <para>
-        Use CompressionTest to verify snappy support is enabled and the libs can be loaded
ON ALL NODES of your cluster:
-        <programlisting>$ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://host/path/to/hbase
-                </para>
-            </listitem>
-            <listitem>
-                <para>
-        Create a column family with snappy compression and verify it in the hbase shell:
-        <programlisting>$ hbase> create 't1', { NAME => 'cf1', COMPRESSION =>
-hbase> describe 't1'</programlisting>
-        In the output of the "describe" command, you need to ensure it lists "COMPRESSION
=> 'SNAPPY'"
-                </para>
-            </listitem>
+        <section
+          xml:id="compression.test">
+          <title>CompressionTest</title>
+          <para>You can use the CompressionTest tool to verify that your compressor is available to
+            HBase:</para>
+          <screen>
+ $ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://<replaceable>host/path/to/hbase</replaceable>
+          </screen>
+        </section>
-        </orderedlist>
-    </para>
-    <section xml:id="snappy.compression.installation">
-    <title>
-    Installation
-    </title>
-    <para>Snappy is used by hbase to compress HFiles on flush and when compacting.
-    </para>
-    <para>
-        You will find the snappy library file under the .libs directory from your Snappy
build (For example
-        /home/hbase/snappy-1.0.5/.libs/). The file is called libsnappy.so.1.x.x where 1.x.x
is the version of the snappy
-        code you are building. You can either copy this file into your hbase lib directory
-- under lib/native/PLATFORM --
-        naming the file as libsnappy.so,
-        or simply create a symbolic link to it (See ./bin/hbase for how it does library path
for native libs).
-    </para>
+        <section
+          xml:id="hbase.regionserver.codecs">
+          <title>Enforce Compression Settings On a RegionServer</title>
+          <para>You can configure a RegionServer so that it will fail to restart if compression is
+            configured incorrectly, by adding the option hbase.regionserver.codecs to the
+              <filename>hbase-site.xml</filename>, and setting its value to a comma-separated list
+            of codecs that need to be available. For example, if you set this property to
+              <literal>lzo,gz</literal>, the RegionServer would fail to start if both compressors
+            were not available. This would prevent a new server from being added to the cluster
+            without having codecs configured properly.</para>
+        </section>
+      </section>
-    <para>
-        The second file you need is the hadoop native library. You will find this file in
your hadoop installation directory
-        under lib/native/Linux-amd64-64/ or lib/native/Linux-i386-32/. The file you are looking
for is libhadoop.so.1.x.x.
-        Again, you can simply copy this file or link to it from under hbase in lib/native/PLATFORM
(e.g. Linux-amd64-64, etc.),
-        using the name libhadoop.so.
-    </para>
+      <section
+        xml:id="changing.compression">
+        <title>Enable Compression On a ColumnFamily</title>
+        <para>To enable compression for a ColumnFamily, use an <code>alter</code> command. You do
+          not need to re-create the table or copy data. If you are changing codecs, be sure the old
+          codec is still available until all the old StoreFiles have been compacted.</para>
+        <example>
+          <title>Enabling Compression on a ColumnFamily of an Existing Table using
+            Shell</title>
+          <screen><![CDATA[
+hbase> disable 'test'
+hbase> alter 'test', {NAME => 'cf', COMPRESSION => 'GZ'}
+hbase> enable 'test']]>
+        </screen>
+        </example>
+        <example>
+          <title>Creating a New Table with Compression On a ColumnFamily</title>
+          <screen><![CDATA[
+hbase> create 'test2', { NAME => 'cf2', COMPRESSION => 'SNAPPY' }         
+          ]]></screen>
+        </example>
+        <example>
+          <title>Verifying a ColumnFamily's Compression Settings</title>
+          <screen><![CDATA[
+hbase> describe 'test'
+DESCRIPTION                                          ENABLED
+ 'test', {NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE false
+ => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'fa
+ lse', BLOCKSIZE => '65536', IN_MEMORY => 'false', B
+ LOCKCACHE => 'true'}
+1 row(s) in 0.1070 seconds
+          ]]></screen>
+        </example>
+      </section>
-    <para>
-        At the end of the installation, you should have both libsnappy.so and libhadoop.so
links or files present into
-        lib/native/Linux-amd64-64 or into lib/native/Linux-i386-32 (where the last part of
the directory path is the
-        PLATFORM you built and rare running the native lib on)
-    </para>
-    <para>To point hbase at snappy support, in hbase-env.sh set
-        <programlisting>export HBASE_LIBRARY_PATH=/pathtoyourhadoop/lib/native/Linux-amd64-64</programlisting>
-        In <filename>/pathtoyourhadoop/lib/native/Linux-amd64-64</filename> you
should have something like:
-        <programlisting>
-        libsnappy.a
-        libsnappy.so
-        libsnappy.so.1
-        libsnappy.so.1.1.2
-    </programlisting>
-    </para>
-    </section>
+      <section>
+        <title>Testing Compression Performance</title>
+        <para>HBase includes a tool called LoadTestTool which provides mechanisms to test your
+          compression performance. You must specify either <literal>-write</literal> or
+          <literal>-update-read</literal> as your first parameter, and if you do not specify another
+        parameter, usage advice is printed for each option.</para>
+        <example>
+          <title><command>LoadTestTool</command> Usage</title>
+          <screen><![CDATA[
+$ bin/hbase org.apache.hadoop.hbase.util.LoadTestTool -h            
+usage: bin/hbase org.apache.hadoop.hbase.util.LoadTestTool <options>
+ -batchupdate                 Whether to use batch as opposed to separate
+                              updates for every column in a row
+ -bloom <arg>                 Bloom filter type, one of [NONE, ROW, ROWCOL]
+ -compression <arg>           Compression type, one of [LZO, GZ, NONE, SNAPPY,
+                              LZ4]
+ -data_block_encoding <arg>   Encoding algorithm (e.g. prefix compression) to
+                              use for data blocks in the test column family, one
+                              of [NONE, PREFIX, DIFF, FAST_DIFF, PREFIX_TREE].
+ -encryption <arg>            Enables transparent encryption on the test table,
+                              one of [AES]
+ -generator <arg>             The class which generates load for the tool. Any
+                              args for this class can be passed as colon
+                              separated after class name
+ -h,--help                    Show usage
+ -in_memory                   Tries to keep the HFiles of the CF inmemory as far
+                              as possible.  Not guaranteed that reads are always
+                              served from inmemory
+ -init_only                   Initialize the test table only, don't do any
+                              loading
+ -key_window <arg>            The 'key window' to maintain between reads and
+                              writes for concurrent write/read workload. The
+                              default is 0.
+ -max_read_errors <arg>       The maximum number of read errors to tolerate
+                              before terminating all reader threads. The default
+                              is 10.
+ -multiput                    Whether to use multi-puts as opposed to separate
+                              puts for every column in a row
+ -num_keys <arg>              The number of keys to read/write
+ -num_tables <arg>            A positive integer number. When a number n is
+                              speicfied, load test tool  will load n table
+                              parallely. -tn parameter value becomes table name
+                              prefix. Each table name is in format
+                              <tn>_1...<tn>_n
+ -read <arg>                  <verify_percent>[:<#threads=20>]
+ -regions_per_server <arg>    A positive integer number. When a number n is
+                              specified, load test tool will create the test
+                              table with n regions per server
+ -skip_init                   Skip the initialization; assume test table already
+                              exists
+ -start_key <arg>             The first key to read/write (a 0-based index). The
+                              default value is 0.
+ -tn <arg>                    The name of the table to read or write
+ -update <arg>                <update_percent>[:<#threads=20>][:<#whether
+                              ignore nonce collisions=0>]
+ -write <arg>                 <avg_cols_per_key>:<avg_data_size>[:<#threads=20>]
+ -zk <arg>                    ZK quorum as comma-separated host names without
+                              port numbers
+ -zk_root <arg>               name of parent znode in zookeeper            
+          ]]></screen>
+        </example>
+        <example>
+          <title>Example Usage of LoadTestTool</title>
+          <screen>
+$ hbase org.apache.hadoop.hbase.util.LoadTestTool -write 1:10:100 -num_keys 1000000
+          -read 100:30 -num_tables 1 -data_block_encoding NONE -tn load_test_tool_NONE
+          </screen>
+        </example>
+      </section>
-    <section xml:id="changing.compression">
-      <title>Changing Compression Schemes</title>
-      <para>A frequent question on the dist-list is how to change compression schemes
for ColumnFamilies.  This is actually quite simple,
-      and can be done via an alter command.  Because the compression scheme is encoded at
the block-level in StoreFiles, the table does
-      <emphasis>not</emphasis> need to be re-created and the data does <emphasis>not</emphasis>
copied somewhere else.  Just make sure
-      the old codec is still available until you are sure that all of the old StoreFiles
have been compacted.
-      </para>
+    <section xml:id="data.block.encoding.enable">
+      <title>Enable Data Block Encoding</title>
+      <para>Codecs are built into HBase so no extra configuration is needed. Codecs are enabled on a
+        table by setting the <code>DATA_BLOCK_ENCODING</code> property. Disable the table before
+        altering its DATA_BLOCK_ENCODING setting. Following is an example using HBase Shell:</para>
+      <example>
+        <title>Enable Data Block Encoding On a Table</title>
+        <screen><![CDATA[
+hbase>  disable 'test'
+hbase> alter 'test', { NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST_DIFF' }
+Updating all regions with the new schema...
+0/1 regions updated.
+1/1 regions updated.
+0 row(s) in 2.2820 seconds
+hbase> enable 'test'
+0 row(s) in 0.1580 seconds          
+          ]]></screen>
+      </example>
+      <example>
+        <title>Verifying a ColumnFamily's Data Block Encoding</title>
+        <screen><![CDATA[
+hbase> describe 'test'
+DESCRIPTION                                          ENABLED
+ 'test', {NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST true
+ > 'false', BLOCKSIZE => '65536', IN_MEMORY => 'fals
+ e', BLOCKCACHE => 'true'}
+1 row(s) in 0.0650 seconds          
+        ]]></screen>
+      </example>
       <title xml:id="ycsb"><link xlink:href="https://github.com/brianfrankcooper/YCSB/">YCSB:
The Yahoo! Cloud Serving Benchmark</link> and HBase</title>
       <para>TODO: Describe how YCSB is poor for putting up a decent cluster load.</para>

diff --git a/src/main/site/resources/images/data_block_diff_encoding.png b/src/main/site/resources/images/data_block_diff_encoding.png
new file mode 100644
index 0000000..0bd03a4
Binary files /dev/null and b/src/main/site/resources/images/data_block_diff_encoding.png differ

diff --git a/src/main/site/resources/images/data_block_no_encoding.png b/src/main/site/resources/images/data_block_no_encoding.png
new file mode 100644
index 0000000..56498b4
Binary files /dev/null and b/src/main/site/resources/images/data_block_no_encoding.png differ

diff --git a/src/main/site/resources/images/data_block_prefix_encoding.png b/src/main/site/resources/images/data_block_prefix_encoding.png
new file mode 100644
index 0000000..4271847
Binary files /dev/null and b/src/main/site/resources/images/data_block_prefix_encoding.png differ
