hbase-commits mailing list archives

From st...@apache.org
Subject svn commit: r1085261 [2/3] - in /hbase/trunk/src/docbkx: book.xml getting_started.xml performance.xml preface.xml
Date Fri, 25 Mar 2011 06:19:19 GMT
Modified: hbase/trunk/src/docbkx/getting_started.xml
URL: http://svn.apache.org/viewvc/hbase/trunk/src/docbkx/getting_started.xml?rev=1085261&r1=1085260&r2=1085261&view=diff
==============================================================================
--- hbase/trunk/src/docbkx/getting_started.xml (original)
+++ hbase/trunk/src/docbkx/getting_started.xml Fri Mar 25 06:19:18 2011
@@ -1,129 +1,125 @@
-<?xml version="1.0"?>
-  <chapter xml:id="getting_started"
-      version="5.0" xmlns="http://docbook.org/ns/docbook"
-      xmlns:xlink="http://www.w3.org/1999/xlink"
-      xmlns:xi="http://www.w3.org/2001/XInclude"
-      xmlns:svg="http://www.w3.org/2000/svg"
-      xmlns:m="http://www.w3.org/1998/Math/MathML"
-      xmlns:html="http://www.w3.org/1999/xhtml"
-      xmlns:db="http://docbook.org/ns/docbook">
-    <title>Getting Started</title>
-    <section >
-      <title>Introduction</title>
-      <para>
-          <link linkend="quickstart">Quick Start</link> will get you up and running
-          on a single-node instance of HBase using the local filesystem.
-          The <link linkend="notsoquick">Not-so-quick Start Guide</link> 
-          describes setup of HBase in distributed mode running on top of HDFS.
-      </para>
-    </section>
+<?xml version="1.0" encoding="UTF-8"?>
+<chapter version="5.0" xml:id="getting_started"
+         xmlns="http://docbook.org/ns/docbook"
+         xmlns:xlink="http://www.w3.org/1999/xlink"
+         xmlns:xi="http://www.w3.org/2001/XInclude"
+         xmlns:svg="http://www.w3.org/2000/svg"
+         xmlns:m="http://www.w3.org/1998/Math/MathML"
+         xmlns:html="http://www.w3.org/1999/xhtml"
+         xmlns:db="http://docbook.org/ns/docbook">
+  <title>Getting Started</title>
+
+  <section>
+    <title>Introduction</title>
+
+    <para><link linkend="quickstart">Quick Start</link> will get you up and
+    running on a single-node instance of HBase using the local filesystem. The
+    <link linkend="notsoquick">Not-so-quick Start Guide</link> describes setup
+    of HBase in distributed mode running on top of HDFS.</para>
+  </section>
 
-    <section xml:id="quickstart">
-      <title>Quick Start</title>
+  <section xml:id="quickstart">
+    <title>Quick Start</title>
 
-          <para>This guide describes setup of a standalone HBase
-              instance that uses the local filesystem.  It leads you
-              through creating a table, inserting rows via the
-          <link linkend="shell">HBase Shell</link>, and then cleaning up and shutting
-          down your standalone HBase instance.
-          The below exercise should take no more than
-          ten minutes (not including download time).
-      </para>
-          
-          <section>
-            <title>Download and unpack the latest stable release.</title>
-
-            <para>Choose a download site from this list of <link
-            xlink:href="http://www.apache.org/dyn/closer.cgi/hbase/">Apache
-            Download Mirrors</link>. Click on suggested top link. This will take you to a
-            mirror of <emphasis>HBase Releases</emphasis>. Click on
-            the folder named <filename>stable</filename> and then download the
-            file that ends in <filename>.tar.gz</filename> to your local filesystem;
-            e.g. <filename>hbase-<?eval ${project.version}?>.tar.gz</filename>.</para>
+    <para>This guide describes setup of a standalone HBase instance that uses
+    the local filesystem. It leads you through creating a table, inserting
+    rows via the <link linkend="shell">HBase Shell</link>, and then cleaning
+    up and shutting down your standalone HBase instance. The exercise below
+    should take no more than ten minutes (not including download time).</para>
+
+    <section>
+      <title>Download and unpack the latest stable release.</title>
+
+      <para>Choose a download site from this list of <link
+      xlink:href="http://www.apache.org/dyn/closer.cgi/hbase/">Apache Download
+      Mirrors</link>. Click on the suggested top link. This will take you to a
+      mirror of <emphasis>HBase Releases</emphasis>. Click on the folder named
+      <filename>stable</filename> and then download the file that ends in
+      <filename>.tar.gz</filename> to your local filesystem; e.g.
+      <filename>hbase-<?eval ${project.version}?>.tar.gz</filename>.</para>
 
-            <para>Decompress and untar your download and then change into the
-            unpacked directory.</para>
+      <para>Decompress and untar your download and then change into the
+      unpacked directory.</para>
 
-            <para><programlisting>$ tar xfz hbase-<?eval ${project.version}?>.tar.gz
+      <para><programlisting>$ tar xfz hbase-<?eval ${project.version}?>.tar.gz
 $ cd hbase-<?eval ${project.version}?>
 </programlisting></para>
 
-<para>
-   At this point, you are ready to start HBase. But before starting it,
-   you might want to edit <filename>conf/hbase-site.xml</filename>
-   and set the directory you want HBase to write to,
-   <varname>hbase.rootdir</varname>.
-   <programlisting>
-<![CDATA[
-<?xml version="1.0"?>
-<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
-<configuration>
-  <property>
-    <name>hbase.rootdir</name>
-    <value>file:///DIRECTORY/hbase</value>
-  </property>
-</configuration>
-]]>
-</programlisting>
-Replace <varname>DIRECTORY</varname> in the above with a path to a directory where you want
-HBase to store its data.  By default, <varname>hbase.rootdir</varname> is
-set to <filename>/tmp/hbase-${user.name}</filename> 
-which means you'll lose all your data whenever your server reboots
-(Most operating systems clear <filename>/tmp</filename> on restart).
-</para>
-</section>
-<section xml:id="start_hbase">
-<title>Start HBase</title>
+      <para>At this point, you are ready to start HBase. But before starting
+      it, you might want to edit <filename>conf/hbase-site.xml</filename> and
+      set the directory you want HBase to write to,
+      <varname>hbase.rootdir</varname>. <programlisting>
+&lt;?xml version="1.0"?&gt;
+&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;
+&lt;configuration&gt;
+  &lt;property&gt;
+    &lt;name&gt;hbase.rootdir&lt;/name&gt;
+    &lt;value&gt;file:///DIRECTORY/hbase&lt;/value&gt;
+  &lt;/property&gt;
+&lt;/configuration&gt;
+</programlisting> Replace <varname>DIRECTORY</varname> in the above with a
+      path to a directory where you want HBase to store its data. By default,
+      <varname>hbase.rootdir</varname> is set to
+      <filename>/tmp/hbase-${user.name}</filename>, which means you'll lose all
+      your data whenever your server reboots (most operating systems clear
+      <filename>/tmp</filename> on restart).</para>
+    </section>
+
+    <section xml:id="start_hbase">
+      <title>Start HBase</title>
 
-            <para>Now start HBase:<programlisting>$ ./bin/start-hbase.sh
+      <para>Now start HBase:<programlisting>$ ./bin/start-hbase.sh
 starting Master, logging to logs/hbase-user-master-example.org.out</programlisting></para>
 
-            <para>You should
-            now have a running standalone HBase instance. In standalone mode, HBase runs
-            all daemons in the the one JVM; i.e. both the HBase and ZooKeeper daemons.
-            HBase logs can be found in the <filename>logs</filename> subdirectory. Check them
-            out especially if HBase had trouble starting.</para>
-
-            <note>
-            <title>Is <application>java</application> installed?</title>
-            <para>All of the above presumes a 1.6 version of Oracle
-            <application>java</application> is installed on your
-            machine and available on your path; i.e. when you type
-            <application>java</application>, you see output that describes the options
-            the java program takes (HBase requires java 6).  If this is
-            not the case, HBase will not start.
-            Install java, edit <filename>conf/hbase-env.sh</filename>, uncommenting the
-            <envar>JAVA_HOME</envar> line pointing it to your java install.  Then,
-            retry the steps above.</para>
-            </note>
-            </section>
-            
+      <para>You should now have a running standalone HBase instance. In
+      standalone mode, HBase runs all daemons in a single JVM; i.e. both
+      the HBase and ZooKeeper daemons. HBase logs can be found in the
+      <filename>logs</filename> subdirectory. Check them out especially if
+      HBase had trouble starting.</para>
+
+      <note>
+        <title>Is <application>java</application> installed?</title>
+
+        <para>All of the above presumes that a 1.6 version of Oracle
+        <application>java</application> is installed on your machine and
+        available on your path; i.e. when you type
+        <application>java</application>, you see output that describes the
+        options the java program takes (HBase requires java 6). If this is not
+        the case, HBase will not start. Install java, then edit
+        <filename>conf/hbase-env.sh</filename>, uncommenting the
+        <envar>JAVA_HOME</envar> line and pointing it at your java install.
+        Then retry the steps above.</para>
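+
+        <para>As a quick check that <application>java</application> is
+        installed and on your path, ask it for its version; the version
+        string below is illustrative only:<programlisting>$ java -version
+java version "1.6.0_24"</programlisting></para>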
+      </note>
+    </section>
 
-      <section xml:id="shell_exercises">
-          <title>Shell Exercises</title>
-            <para>Connect to your running HBase via the 
-          <link linkend="shell">HBase Shell</link>.</para>
+    <section xml:id="shell_exercises">
+      <title>Shell Exercises</title>
 
-            <para><programlisting>$ ./bin/hbase shell
+      <para>Connect to your running HBase via the <link linkend="shell">HBase
+      Shell</link>.</para>
+
+      <para><programlisting>$ ./bin/hbase shell
 HBase Shell; enter 'help&lt;RETURN&gt;' for list of supported commands.
 Type "exit&lt;RETURN&gt;" to leave the HBase Shell
-Version: 0.89.20100924, r1001068, Fri Sep 24 13:55:42 PDT 2010
+Version: 0.90.0, r1001068, Fri Sep 24 13:55:42 PDT 2010
 
 hbase(main):001:0&gt; </programlisting></para>
 
-            <para>Type <command>help</command> and then <command>&lt;RETURN&gt;</command>
-            to see a listing of shell
-            commands and options. Browse at least the paragraphs at the end of
-            the help emission for the gist of how variables and command
-            arguments are entered into the
-            HBase shell; in particular note how table names, rows, and
-            columns, etc., must be quoted.</para>
-
-            <para>Create a table named <varname>test</varname> with a single
-            <link linkend="columnfamily">column family</link> named <varname>cf</varname>.
-            Verify its creation by listing all tables and then insert some
-            values.</para>
-            <para><programlisting>hbase(main):003:0&gt; create 'test', 'cf'
+      <para>Type <command>help</command> and then
+      <command>&lt;RETURN&gt;</command> to see a listing of shell commands and
+      options. Browse at least the paragraphs at the end of the help output
+      for the gist of how variables and command arguments are entered into the
+      HBase shell; in particular, note how table names, rows, columns,
+      etc. must be quoted.</para>
+
+      <para>Create a table named <varname>test</varname> with a single <link
+      linkend="columnfamily">column family</link> named <varname>cf</varname>.
+      Verify its creation by listing all tables and then insert some
+      values.</para>
+
+      <para><programlisting>hbase(main):003:0&gt; create 'test', 'cf'
 0 row(s) in 1.2200 seconds
 hbase(main):003:0&gt; list
 test
@@ -135,314 +131,372 @@ hbase(main):005:0&gt; put 'test', 'row2'
 hbase(main):006:0&gt; put 'test', 'row3', 'cf:c', 'value3'
 0 row(s) in 0.0450 seconds</programlisting></para>
 
-            <para>Above we inserted 3 values, one at a time. The first insert is at
-            <varname>row1</varname>, column <varname>cf:a</varname> with a value of
-            <varname>value1</varname>.
-            Columns in HBase are comprised of a
-            <link linkend="columnfamily">column family</link> prefix
-            -- <varname>cf</varname> in this example -- followed by
-            a colon and then a column qualifier suffix (<varname>a</varname> in this case).
-            </para>
+      <para>Above we inserted 3 values, one at a time. The first insert is at
+      <varname>row1</varname>, column <varname>cf:a</varname>, with a value of
+      <varname>value1</varname>. Columns in HBase are made up of a <link
+      linkend="columnfamily">column family</link> prefix --
+      <varname>cf</varname> in this example -- followed by a colon and then a
+      column qualifier suffix (<varname>a</varname> in this case).</para>
 
-            <para>Verify the data insert.</para>
+      <para>Verify the data insert.</para>
 
-            <para>Run a scan of the table by doing the following</para>
+      <para>Run a scan of the table by doing the following:</para>
 
-            <para><programlisting>hbase(main):007:0&gt; scan 'test'
+      <para><programlisting>hbase(main):007:0&gt; scan 'test'
 ROW        COLUMN+CELL
 row1       column=cf:a, timestamp=1288380727188, value=value1
 row2       column=cf:b, timestamp=1288380738440, value=value2
 row3       column=cf:c, timestamp=1288380747365, value=value3
 3 row(s) in 0.0590 seconds</programlisting></para>
 
-            <para>Get a single row as follows</para>
+      <para>Get a single row as follows:</para>
 
-            <para><programlisting>hbase(main):008:0&gt; get 'test', 'row1'
+      <para><programlisting>hbase(main):008:0&gt; get 'test', 'row1'
 COLUMN      CELL
 cf:a        timestamp=1288380727188, value=value1
 1 row(s) in 0.0400 seconds</programlisting></para>
 
-            <para>Now, disable and drop your table. This will clean up all
-            done above.</para>
+      <para>Now, disable and drop your table. This will clean up everything
+      done above.</para>
 
-            <para><programlisting>hbase(main):012:0&gt; disable 'test'
+      <para><programlisting>hbase(main):012:0&gt; disable 'test'
 0 row(s) in 1.0930 seconds
 hbase(main):013:0&gt; drop 'test'
 0 row(s) in 0.0770 seconds </programlisting></para>
 
-            <para>Exit the shell by typing exit.</para>
+      <para>Exit the shell by typing <command>exit</command>.</para>
 
-            <para><programlisting>hbase(main):014:0&gt; exit</programlisting></para>
-            </section>
+      <para><programlisting>hbase(main):014:0&gt; exit</programlisting></para>
+    </section>
 
-          <section xml:id="stopping">
-          <title>Stopping HBase</title>
-            <para>Stop your hbase instance by running the stop script.</para>
+    <section xml:id="stopping">
+      <title>Stopping HBase</title>
 
-            <para><programlisting>$ ./bin/stop-hbase.sh
-stopping hbase...............</programlisting></para>
-          </section>
+      <para>Stop your HBase instance by running the stop script.</para>
 
-      <section><title>Where to go next
-      </title>
-      <para>The above described standalone setup is good for testing and experiments only.
-      Move on to the next section, the <link linkend="notsoquick">Not-so-quick Start Guide</link>
-      where we'll go into depth on the different HBase run modes, requirements and critical
-      configurations needed setting up a distributed HBase deploy.
-      </para>
-      </section>
+      <para><programlisting>$ ./bin/stop-hbase.sh
+stopping hbase...............</programlisting></para>
     </section>
 
-    <section xml:id="notsoquick">
-      <title>Not-so-quick Start Guide</title>
-      
-      <section xml:id="requirements"><title>Requirements</title>
-      <para>HBase has the following requirements.  Please read the
-      section below carefully and ensure that all requirements have been
-      satisfied.  Failure to do so will cause you (and us) grief debugging
-      strange errors and/or data loss.
-      </para>
-
-  <section xml:id="java"><title>java</title>
-<para>
-  Just like Hadoop, HBase requires java 6 from <link xlink:href="http://www.java.com/download/">Oracle</link>.
-Usually you'll want to use the latest version available except the problematic u18  (u22 is the latest version as of this writing).</para>
-</section>
-
-  <section xml:id="hadoop"><title><link xlink:href="http://hadoop.apache.org">hadoop</link><indexterm><primary>Hadoop</primary></indexterm></title>
-<para>This version of HBase will only run on <link xlink:href="http://hadoop.apache.org/common/releases.html">Hadoop 0.20.x</link>.
-    It will not run on hadoop 0.21.x (nor 0.22.x) as of this writing.
-    HBase will lose data unless it is running on an HDFS that has a
-    durable <code>sync</code>.  Currently only the
-    <link xlink:href="http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/">branch-0.20-append</link>
-    branch has this attribute
-    <footnote>
-    <para>
- See <link xlink:href="http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/CHANGES.txt">CHANGES.txt</link>
- in branch-0.20-append to see list of patches involved adding append on the Hadoop 0.20 branch.
- </para>
- </footnote>.
-    No official releases have been made from this branch up to now
-    so you will have to build your own Hadoop from the tip of this branch.
-    Check it out using this url, <link xlink:href="https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append/">branch-0.20-append</link>.
-    Scroll down in the Hadoop <link xlink:href="http://wiki.apache.org/hadoop/HowToRelease">How To Release</link> to the section
-    <emphasis>Build Requirements</emphasis> for instruction on how to build Hadoop.
-    </para>
-
- <para>
- Or rather than build your own, you could use
- Cloudera's <link xlink:href="http://archive.cloudera.com/docs/">CDH3</link>.
- CDH has the 0.20-append patches needed to add a durable sync (CDH3 is still in beta.
- Either CDH3b2 or CDH3b3 will suffice).
- </para>
-
- <para>Because HBase depends on Hadoop, it bundles an instance of
- the Hadoop jar under its <filename>lib</filename> directory.
- The bundled Hadoop was made from the Apache branch-0.20-append branch
- at the time of this HBase's release.
- It is <emphasis>critical</emphasis> that the version of Hadoop that is
- out on your cluster matches what is Hbase match.  Replace the hadoop
- jar found in the HBase <filename>lib</filename> directory with the
- hadoop jar you are running out on your cluster to avoid version mismatch issues.
- Make sure you replace the jar all over your cluster.
- For example, versions of CDH do not have HDFS-724 whereas
- Hadoops branch-0.20-append branch does have HDFS-724. This
- patch changes the RPC version because protocol was changed.
- Version mismatch issues have various manifestations but often all looks like its hung up.
- </para>
-
- <note><title>Can I just replace the jar in Hadoop 0.20.2 tarball with the <emphasis>sync</emphasis>-supporting Hadoop jar found in HBase?</title>
- <para>
- You could do this.  It works going by a recent posting up on the
- <link xlink:href="http://www.apacheserver.net/Using-Hadoop-bundled-in-lib-directory-HBase-at1136240.htm">mailing list</link>.
- </para>
- </note>
- <note><title>Hadoop Security</title>
-     <para>HBase will run on any Hadoop 0.20.x that incorporates Hadoop security features -- e.g. Y! 0.20S or CDH3B3 -- as long
-         as you do as suggested above and replace the Hadoop jar that ships with HBase with the secure version.
-  </para>
-  </note>
+    <section>
+      <title>Where to go next</title>
 
+      <para>The standalone setup described above is good for testing and
+      experiments only. Move on to the next section, the <link
+      linkend="notsoquick">Not-so-quick Start Guide</link>, where we'll go into
+      depth on the different HBase run modes, requirements, and the critical
+      configurations needed to set up a distributed HBase deployment.</para>
+    </section>
   </section>
-<section xml:id="ssh"> <title>ssh</title>
-<para><command>ssh</command> must be installed and <command>sshd</command> must
-be running to use Hadoop's scripts to manage remote Hadoop and HBase daemons.
-   You must be able to ssh to all nodes, including your local node, using passwordless login (Google "ssh passwordless login").
-  </para>
-</section>
-  <section xml:id="dns"><title>DNS</title>
-    <para>HBase uses the local hostname to self-report it's IP address. Both forward and reverse DNS resolving should work.</para>
-    <para>If your machine has multiple interfaces, HBase will use the interface that the primary hostname resolves to.</para>
-    <para>If this is insufficient, you can set <varname>hbase.regionserver.dns.interface</varname> to indicate the primary interface.
-    This only works if your cluster
-    configuration is consistent and every host has the same network interface configuration.</para>
-    <para>Another alternative is setting <varname>hbase.regionserver.dns.nameserver</varname> to choose a different nameserver than the
-    system wide default.</para>
-</section>
-  <section xml:id="ntp"><title>NTP</title>
-<para>
-    The clocks on cluster members should be in basic alignments. Some skew is tolerable but
-    wild skew could generate odd behaviors. Run <link xlink:href="http://en.wikipedia.org/wiki/Network_Time_Protocol">NTP</link>
-    on your cluster, or an equivalent.
-  </para>
-    <para>If you are having problems querying data, or "weird" cluster operations, check system time!</para>
-</section>
 
+  <section xml:id="notsoquick">
+    <title>Not-so-quick Start Guide</title>
+
+    <section xml:id="requirements">
+      <title>Requirements</title>
+
+      <para>HBase has the following requirements. Please read the section
+      below carefully and ensure that all requirements have been satisfied.
+      Failure to do so will cause you (and us) grief debugging strange errors
+      and/or data loss.</para>
+
+      <section xml:id="java">
+        <title>java</title>
+
+        <para>Just like Hadoop, HBase requires java 6 from <link
+        xlink:href="http://www.java.com/download/">Oracle</link>. Usually
+        you'll want to use the latest version available, except the
+        problematic u18 (u24 is the latest version as of this writing).</para>
+      </section>
+
+      <section xml:id="hadoop">
+        <title><link
+        xlink:href="http://hadoop.apache.org">hadoop</link><indexterm>
+            <primary>Hadoop</primary>
+          </indexterm></title>
+
+        <para>This version of HBase will only run on <link
+        xlink:href="http://hadoop.apache.org/common/releases.html">Hadoop
+        0.20.x</link>. It will not run on Hadoop 0.21.x (nor 0.22.x). HBase
+        will lose data unless it is running on an HDFS that has a durable
+        <code>sync</code>. Currently only the <link
+        xlink:href="http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/">branch-0.20-append</link>
+        branch has this attribute <footnote>
+            <para>See <link
+            xlink:href="http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/CHANGES.txt">CHANGES.txt</link>
+            in branch-0.20-append for the list of patches involved in adding
+            append to the Hadoop 0.20 branch.</para>
+          </footnote>. No official release has been made from this branch to
+        date, so you will have to build your own Hadoop from the tip of this
+        branch. Check it out using this URL: <link
+        xlink:href="https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append/">branch-0.20-append</link>.
+        Scroll down in the Hadoop <link
+        xlink:href="http://wiki.apache.org/hadoop/HowToRelease">How To
+        Release</link> to the section <emphasis>Build Requirements</emphasis>
+        for instructions on how to build Hadoop.</para>
+
+        <para>Or rather than build your own, you could use Cloudera's <link
+        xlink:href="http://archive.cloudera.com/docs/">CDH3</link>. CDH has
+        the 0.20-append patches needed to add a durable sync (CDH3 betas will
+        suffice; b2, b3, or b4).</para>
+
+        <para>Because HBase depends on Hadoop, it bundles an instance of the
+        Hadoop jar under its <filename>lib</filename> directory. The bundled
+        Hadoop was made from the Apache branch-0.20-append branch at the time
+        of this HBase's release. It is <emphasis>critical</emphasis> that the
+        version of Hadoop that is out on your cluster matches the version that
+        HBase was built against. Replace the hadoop jar found in the HBase
+        <filename>lib</filename> directory with the hadoop jar you are running
+        on your cluster to avoid version mismatch issues. Make sure you
+        replace the jar all over your cluster. For example, versions of CDH do
+        not have HDFS-724 whereas Hadoop's branch-0.20-append branch does. This
+        patch changes the RPC version because the protocol was changed. Version
+        mismatch issues have various manifestations, but often everything
+        simply looks hung up.</para>
+
+        <note>
+          <title>Can I just replace the jar in Hadoop 0.20.2 tarball with the
+          <emphasis>sync</emphasis>-supporting Hadoop jar found in
+          HBase?</title>
+
+          <para>You could do this. It works, going by a recent posting on the
+          <link
+          xlink:href="http://www.apacheserver.net/Using-Hadoop-bundled-in-lib-directory-HBase-at1136240.htm">mailing
+          list</link>.</para>
+        </note>
+
+        <note>
+          <title>Hadoop Security</title>
+
+          <para>HBase will run on any Hadoop 0.20.x that incorporates Hadoop
+          security features -- e.g. Y! 0.20S or CDH3B3 -- as long as you do as
+          suggested above and replace the Hadoop jar that ships with HBase
+          with the secure version.</para>
+        </note>
+      </section>
+
+      <section xml:id="ssh">
+        <title>ssh</title>
+
+        <para><command>ssh</command> must be installed and
+        <command>sshd</command> must be running to use Hadoop's scripts to
+        manage remote Hadoop and HBase daemons. You must be able to ssh to all
+        nodes, including your local node, using passwordless login (Google
+        "ssh passwordless login").</para>
+      </section>
+
+      <section xml:id="dns">
+        <title>DNS</title>
+
+        <para>HBase uses the local hostname to self-report its IP address.
+        Both forward and reverse DNS resolving should work.</para>
+
+        <para>If your machine has multiple interfaces, HBase will use the
+        interface that the primary hostname resolves to.</para>
+
+        <para>If this is insufficient, you can set
+        <varname>hbase.regionserver.dns.interface</varname> to indicate the
+        primary interface. This only works if your cluster configuration is
+        consistent and every host has the same network interface
+        configuration.</para>
+
+        <para>Another alternative is setting
+        <varname>hbase.regionserver.dns.nameserver</varname> to choose a
+        different nameserver than the system wide default.</para>
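+
+        <para>For example, to pin the interface and nameserver, you might add
+        properties like the following to
+        <filename>conf/hbase-site.xml</filename> (the interface name and
+        nameserver address below are illustrative only):<programlisting>
+  &lt;property&gt;
+    &lt;name&gt;hbase.regionserver.dns.interface&lt;/name&gt;
+    &lt;value&gt;eth0&lt;/value&gt;
+  &lt;/property&gt;
+  &lt;property&gt;
+    &lt;name&gt;hbase.regionserver.dns.nameserver&lt;/name&gt;
+    &lt;value&gt;192.168.1.1&lt;/value&gt;
+  &lt;/property&gt;
+</programlisting></para>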
+      </section>
+
+      <section xml:id="ntp">
+        <title>NTP</title>
+
+        <para>The clocks on cluster members should be in basic alignment.
+        Some skew is tolerable but wild skew could generate odd behaviors. Run
+        <link
+        xlink:href="http://en.wikipedia.org/wiki/Network_Time_Protocol">NTP</link>
+        on your cluster, or an equivalent.</para>
+
+        <para>If you are having problems querying data, or "weird" cluster
+        operations, check system time!</para>
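+
+        <para>For example, on hosts running the reference NTP daemon you can
+        eyeball the current offsets against your time servers like so (the
+        command may not be available on every platform):<programlisting>$ ntpq -p</programlisting></para>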
+      </section>
 
       <section xml:id="ulimit">
-      <title><varname>ulimit</varname><indexterm><primary>ulimit</primary></indexterm></title>
-      <para>HBase is a database, it uses a lot of files at the same time.
-      The default ulimit -n of 1024 on *nix systems is insufficient.
-      Any significant amount of loading will lead you to 
-      <link xlink:href="http://wiki.apache.org/hadoop/Hbase/FAQ#A6">FAQ: Why do I see "java.io.IOException...(Too many open files)" in my logs?</link>.
-      You may also notice errors such as
-      <programlisting>
+        <title><varname>ulimit</varname><indexterm>
+            <primary>ulimit</primary>
+          </indexterm></title>
+
+        <para>HBase is a database; it uses a lot of files at the same time.
+        The default ulimit -n of 1024 on *nix systems is insufficient. Any
+        significant amount of loading will lead you to <link
+        xlink:href="http://wiki.apache.org/hadoop/Hbase/FAQ#A6">FAQ: Why do I
+        see "java.io.IOException...(Too many open files)" in my logs?</link>.
+        You may also notice errors such as <programlisting>
      2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
       2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6935524980745310745_1391901
-      </programlisting>
-      Do yourself a favor and change the upper bound on the number of file descriptors.
-      Set it to north of 10k.  See the above referenced FAQ for how.</para>
-      <para>To be clear, upping the file descriptors for the user who is
-      running the HBase process is an operating system configuration, not an
-      HBase configuration. Also, a common mistake is that administrators
-      will up the file descriptors for a particular user but for whatever reason,
-      HBase will be running as some one else.  HBase prints in its logs
-      as the first line the ulimit its seeing.  Ensure its correct.
-    <footnote>
-    <para>A useful read setting config on you hadoop cluster is Aaron Kimballs'
-    <link xlink:ref="http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/">Configuration Parameters: What can you just ignore?</link>
-    </para>
-    </footnote>
-      </para>
+      </programlisting> Do yourself a favor and change the upper bound on the
+        number of file descriptors. Set it to north of 10k. See the above
+        referenced FAQ for how.</para>
+
+        <para>To be clear, upping the file descriptors for the user who is
+        running the HBase process is an operating system configuration, not an
+        HBase configuration. Also, a common mistake is that administrators
+        will up the file descriptors for a particular user but, for whatever
+        reason, HBase will be running as someone else. HBase prints the ulimit
+        it is seeing as the first line in its logs. Ensure it is correct.
+        <footnote>
+            <para>A useful read on setting configuration on your Hadoop
+            cluster is Aaron Kimball's <link
+            xlink:href="http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/">Configuration
+            Parameters: What can you just ignore?</link></para>
+          </footnote></para>
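+
+        <para>To see the current file descriptor limit for your shell
+        session, run the following (1024 is a common *nix
+        default):<programlisting>$ ulimit -n
+1024</programlisting></para>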
+
         <section xml:id="ulimit_ubuntu">
           <title><varname>ulimit</varname> on Ubuntu</title>
-        <para>
-          If you are on Ubuntu you will need to make the following changes:</para>
-        <para>
-          In the file <filename>/etc/security/limits.conf</filename> add a line like:
-          <programlisting>hadoop  -       nofile  32768</programlisting>
-          Replace <varname>hadoop</varname>
-          with whatever user is running Hadoop and HBase. If you have
-          separate users, you will need 2 entries, one for each user.
-        </para>
-        <para>
-          In the file <filename>/etc/pam.d/common-session</filename> add as the last line in the file:
-          <programlisting>session required  pam_limits.so</programlisting>
-          Otherwise the changes in <filename>/etc/security/limits.conf</filename> won't be applied.
-        </para>
-        <para>
-          Don't forget to log out and back in again for the changes to take effect!
-        </para>
-          </section>
+
+          <para>If you are on Ubuntu you will need to make the following
+          changes:</para>
+
+          <para>In the file <filename>/etc/security/limits.conf</filename> add
+          a line like: <programlisting>hadoop  -       nofile  32768</programlisting>
+          Replace <varname>hadoop</varname> with whatever user is running
+          Hadoop and HBase. If you have separate users, you will need 2
+          entries, one for each user.</para>
+
+          <para>In the file <filename>/etc/pam.d/common-session</filename> add
+          as the last line in the file: <programlisting>session required  pam_limits.so</programlisting>
+          Otherwise the changes in
+          <filename>/etc/security/limits.conf</filename> won't be
+          applied.</para>
+
+          <para>Don't forget to log out and back in again for the changes to
+          take effect!</para>
+        </section>
       </section>
 
       <section xml:id="dfs.datanode.max.xcievers">
-      <title><varname>dfs.datanode.max.xcievers</varname><indexterm><primary>xcievers</primary></indexterm></title>
-      <para>
-      An Hadoop HDFS datanode has an upper bound on the number of files
-      that it will serve at any one time.
-      The upper bound parameter is called
-      <varname>xcievers</varname> (yes, this is misspelled). Again, before
-      doing any loading, make sure you have configured
-      Hadoop's <filename>conf/hdfs-site.xml</filename>
-      setting the <varname>xceivers</varname> value to at least the following:
-      <programlisting>
+        <title><varname>dfs.datanode.max.xcievers</varname><indexterm>
+            <primary>xcievers</primary>
+          </indexterm></title>
+
+        <para>A Hadoop HDFS datanode has an upper bound on the number of
+        files that it will serve at any one time. The upper bound parameter is
+        called <varname>xcievers</varname> (yes, this is misspelled). Again,
+        before doing any loading, make sure you have configured Hadoop's
+        <filename>conf/hdfs-site.xml</filename>, setting the
+        <varname>xcievers</varname> value to at least the following:
+        <programlisting>
       &lt;property&gt;
         &lt;name&gt;dfs.datanode.max.xcievers&lt;/name&gt;
         &lt;value&gt;4096&lt;/value&gt;
       &lt;/property&gt;
-      </programlisting>
-      </para>
-      <para>Be sure to restart your HDFS after making the above
-      configuration.</para>
-      <para>Not having this configuration in place makes for strange looking
-          failures. Eventually you'll see a complain in the datanode logs
-          complaining about the xcievers exceeded, but on the run up to this
-          one manifestation is complaint about missing blocks.  For example:
-          <code>10/12/08 20:10:31 INFO hdfs.DFSClient: Could not obtain block blk_XXXXXXXXXXXXXXXXXXXXXX_YYYYYYYY from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...</code>
-      </para>
-      </section>
-
-<section xml:id="windows">
-<title>Windows</title>
-<para>
-HBase has been little tested running on windows.
-Running a production install of HBase on top of
-windows is not recommended.
-</para>
-<para>
-If you are running HBase on Windows, you must install
-<link xlink:href="http://cygwin.com/">Cygwin</link>
-to have a *nix-like environment for the shell scripts. The full details
-are explained in the <link xlink:href="http://hbase.apache.org/cygwin.html">Windows Installation</link>
-guide.
-</para>
-</section>
-
-      </section>
-
-      <section xml:id="standalone_dist"><title>HBase run modes: Standalone and Distributed</title>
-          <para>HBase has two run modes: <link linkend="standalone">standalone</link>
-              and <link linkend="distributed">distributed</link>.
-              Out of the box, HBase runs in standalone mode.  To set up a
-              distributed deploy, you will need to configure HBase by editing
-              files in the HBase <filename>conf</filename> directory.</para>
-
-<para>Whatever your mode, you will need to edit <code>conf/hbase-env.sh</code>
-to tell HBase which <command>java</command> to use. In this file
-you set HBase environment variables such as the heapsize and other options
-for the <application>JVM</application>, the preferred location for log files, etc.
-Set <varname>JAVA_HOME</varname> to point at the root of your
-<command>java</command> install.</para>
-
-      <section xml:id="standalone"><title>Standalone HBase</title>
-        <para>This is the default mode. Standalone mode is
-        what is described in the <link linkend="quickstart">quickstart</link>
-        section.  In standalone mode, HBase does not use HDFS -- it uses the local
-        filesystem instead -- and it runs all HBase daemons and a local zookeeper
-        all up in the same JVM.  Zookeeper binds to a well known port so clients may
-        talk to HBase.
-      </para>
-      </section>
-      <section xml:id="distributed"><title>Distributed</title>
-          <para>Distributed mode can be subdivided into distributed but all daemons run on a
-          single node -- a.k.a <emphasis>pseudo-distributed</emphasis>-- and
-          <emphasis>fully-distributed</emphasis> where the daemons 
-          are spread across all nodes in the cluster
-          <footnote><para>The pseudo-distributed vs fully-distributed nomenclature comes from Hadoop.</para></footnote>.</para>
-      <para>
-          Distributed modes require an instance of the
-          <emphasis>Hadoop Distributed File System</emphasis> (HDFS).  See the
-          Hadoop <link xlink:href="http://hadoop.apache.org/common/docs/current/api/overview-summary.html#overview_description">
-          requirements and instructions</link> for how to set up a HDFS.
-          Before proceeding, ensure you have an appropriate, working HDFS.
-      </para>
-      <para>Below we describe the different distributed setups.
-      Starting, verification and exploration of your install, whether a 
-      <emphasis>pseudo-distributed</emphasis> or <emphasis>fully-distributed</emphasis>
-      configuration is described in a section that follows,
-      <link linkend="confirm">Running and Confirming your Installation</link>.
-      The same verification script applies to both deploy types.</para>
-
-      <section xml:id="pseudo"><title>Pseudo-distributed</title>
-<para>A pseudo-distributed mode is simply a distributed mode run on a single host.
-Use this configuration testing and prototyping on HBase.  Do not use this configuration
-for production nor for evaluating HBase performance.
-</para>
-<para>Once you have confirmed your HDFS setup,
-edit <filename>conf/hbase-site.xml</filename>.  This is the file
-into which you add local customizations and overrides for 
-<link linkend="hbase_default_configurations">Default HBase Configurations</link>
-and <link linkend="hdfs_client_conf">HDFS Client Configurations</link>.
-Point HBase at the running Hadoop HDFS instance by setting the
-<varname>hbase.rootdir</varname> property.
-This property points HBase at the Hadoop filesystem instance to use.
-For example, adding the properties below to your
-<filename>hbase-site.xml</filename> says that HBase
-should use the <filename>/hbase</filename> 
-directory in the HDFS whose namenode is at port 9000 on your local machine, and that
-it should run with one replica only (recommended for pseudo-distributed mode):</para>
-<programlisting>
+      </programlisting></para>
+
+        <para>Be sure to restart your HDFS after making the above
+        configuration.</para>
+
+        <para>Not having this configuration in place makes for strange-looking
+        failures. Eventually you'll see a complaint in the datanode logs about
+        the xcievers limit being exceeded, but in the run up to this, one
+        manifestation is complaints about missing blocks. For example:
+        <code>10/12/08 20:10:31 INFO hdfs.DFSClient: Could not obtain block
+        blk_XXXXXXXXXXXXXXXXXXXXXX_YYYYYYYY from any node:
+        java.io.IOException: No live nodes contain current block. Will get new
+        block locations from namenode and retry...</code></para>
+      </section>
+
+      <section xml:id="windows">
+        <title>Windows</title>
+
+        <para>HBase has been little tested running on Windows. Running a
+        production install of HBase on top of Windows is not
+        recommended.</para>
+
+        <para>If you are running HBase on Windows, you must install <link
+        xlink:href="http://cygwin.com/">Cygwin</link> to have a *nix-like
+        environment for the shell scripts. The full details are explained in
+        the <link xlink:href="http://hbase.apache.org/cygwin.html">Windows
+        Installation</link> guide. Also <link
+        xlink:href="http://search-hadoop.com/?q=hbase+windows&amp;fc_project=HBase&amp;fc_type=mail+_hash_+dev">search
+        our user mailing list</link> to pick up the latest fixes figured out
+        by Windows users.</para>
+      </section>
+    </section>
+
+    <section xml:id="standalone_dist">
+      <title>HBase run modes: Standalone and Distributed</title>
+
+      <para>HBase has two run modes: <link
+      linkend="standalone">standalone</link> and <link
+      linkend="distributed">distributed</link>. Out of the box, HBase runs in
+      standalone mode. To set up a distributed deploy, you will need to
+      configure HBase by editing files in the HBase <filename>conf</filename>
+      directory.</para>
+
+      <para>Whatever your mode, you will need to edit
+      <filename>conf/hbase-env.sh</filename> to tell HBase which
+      <command>java</command> to use. In this file you set HBase environment
+      variables such as the heapsize and other options for the
+      <application>JVM</application>, the preferred location for log files,
+      etc. Set <varname>JAVA_HOME</varname> to point at the root of your
+      <command>java</command> install.</para>
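+
+      <para>For example, the relevant lines of
+      <filename>conf/hbase-env.sh</filename> might end up looking something
+      like the following (the java path and heap size here are illustrative
+      only):<programlisting># The java implementation to use.
+export JAVA_HOME=/usr/lib/jvm/java-6-sun
+# The maximum amount of heap to use, in MB. Default is 1000.
+export HBASE_HEAPSIZE=1000</programlisting></para>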
+
+      <section xml:id="standalone">
+        <title>Standalone HBase</title>
+
+        <para>This is the default mode. Standalone mode is what is described
+        in the <link linkend="quickstart">quickstart</link> section. In
+        standalone mode, HBase does not use HDFS -- it uses the local
+        filesystem instead -- and it runs all HBase daemons and a local
+        ZooKeeper all up in the same JVM. ZooKeeper binds to a well-known port
+        so clients may talk to HBase.</para>
+      </section>
+
+      <section xml:id="distributed">
+        <title>Distributed</title>
+
+        <para>Distributed mode can be subdivided into distributed but all
+        daemons run on a single node -- a.k.a.
+        <emphasis>pseudo-distributed</emphasis> -- and
+        <emphasis>fully-distributed</emphasis> where the daemons are spread
+        across all nodes in the cluster <footnote>
+            <para>The pseudo-distributed vs fully-distributed nomenclature
+            comes from Hadoop.</para>
+          </footnote>.</para>
+
+        <para>Distributed modes require an instance of the <emphasis>Hadoop
+        Distributed File System</emphasis> (HDFS). See the Hadoop <link
+        xlink:href="http://hadoop.apache.org/common/docs/current/api/overview-summary.html#overview_description">
+        requirements and instructions</link> for how to set up an HDFS. Before
+        proceeding, ensure you have an appropriate, working HDFS.</para>
+
+        <para>Below we describe the different distributed setups. Starting,
+        verification, and exploration of your install, whether a
+        <emphasis>pseudo-distributed</emphasis> or
+        <emphasis>fully-distributed</emphasis> configuration, is described in a
+        section that follows, <link linkend="confirm">Running and Confirming
+        your Installation</link>. The same verification script applies to both
+        deploy types.</para>
+
+        <section xml:id="pseudo">
+          <title>Pseudo-distributed</title>
+
+          <para>A pseudo-distributed mode is simply a distributed mode run on
+          a single host. Use this configuration for testing and prototyping on
+          HBase. Do not use this configuration for production nor for
+          evaluating HBase performance.</para>
+
+          <para>Once you have confirmed your HDFS setup, edit
+          <filename>conf/hbase-site.xml</filename>. This is the file into
+          which you add local customizations and overrides for <link
+          linkend="hbase_default_configurations">Default HBase
+          Configurations</link> and <link linkend="hdfs_client_conf">HDFS
+          Client Configurations</link>. Point HBase at the running Hadoop HDFS
+          instance by setting the <varname>hbase.rootdir</varname> property.
+          This property points HBase at the Hadoop filesystem instance to use.
+          For example, adding the properties below to your
+          <filename>hbase-site.xml</filename> says that HBase should use the
+          <filename>/hbase</filename> directory in the HDFS whose namenode is
+          at port 9000 on your local machine, and that it should run with one
+          replica only (recommended for pseudo-distributed mode):</para>
+
+          <programlisting>
 &lt;configuration&gt;
   ...
   &lt;property&gt;
@@ -461,45 +515,45 @@ it should run with one replica only (rec
 &lt;/configuration&gt;
 </programlisting>
 
-<note>
-<para>Let HBase create the <varname>hbase.rootdir</varname>
-directory. If you don't, you'll get warning saying HBase
-needs a migration run because the directory is missing files
-expected by HBase (it'll create them if you let it).</para>
-</note>
-
-<note>
-<para>Above we bind to <varname>localhost</varname>.
-This means that a remote client cannot
-connect.  Amend accordingly, if you want to
-connect from a remote location.</para>
-</note>
-
-<para>Now skip to <link linkend="confirm">Running and Confirming your Installation</link>
-for how to start and verify your pseudo-distributed install.
-
-<footnote>
-    <para>See <link xlink:href="http://hbase.apache.org/pseudo-distributed.html">Pseudo-distributed mode extras</link>
-for notes on how to start extra Masters and regionservers when running
-    pseudo-distributed.</para>
-</footnote>
-</para>
-
-</section>
-
-      <section xml:id="fully_dist"><title>Fully-distributed</title>
-
-<para>For running a fully-distributed operation on more than one host, make
-the following configurations.  In <filename>hbase-site.xml</filename>,
-add the property <varname>hbase.cluster.distributed</varname> 
-and set it to <varname>true</varname> and point the HBase
-<varname>hbase.rootdir</varname> at the appropriate
-HDFS NameNode and location in HDFS where you would like
-HBase to write data. For example, if you namenode were running
-at namenode.example.org on port 9000 and you wanted to home
-your HBase in HDFS at <filename>/hbase</filename>,
-make the following configuration.</para>
-<programlisting>
+          <note>
+            <para>Let HBase create the <varname>hbase.rootdir</varname>
+            directory. If you don't, you'll get a warning saying HBase needs a
+            migration run because the directory is missing files expected by
+            HBase (it'll create them if you let it).</para>
+          </note>
+
+          <note>
+            <para>Above we bind to <varname>localhost</varname>. This means
+            that a remote client cannot connect. Amend accordingly, if you
+            want to connect from a remote location.</para>
+          </note>
+
+          <para>Now skip to <link linkend="confirm">Running and Confirming
+          your Installation</link> for how to start and verify your
+          pseudo-distributed install. <footnote>
+              <para>See <link
+              xlink:href="http://hbase.apache.org/pseudo-distributed.html">Pseudo-distributed
+              mode extras</link> for notes on how to start extra Masters and
+              regionservers when running pseudo-distributed.</para>
+            </footnote></para>
+        </section>
+
+        <section xml:id="fully_dist">
+          <title>Fully-distributed</title>
+
+          <para>For running a fully-distributed operation on more than one
+          host, make the following configurations. In
+          <filename>hbase-site.xml</filename>, add the property
+          <varname>hbase.cluster.distributed</varname> and set it to
+          <varname>true</varname>, and point the HBase
+          <varname>hbase.rootdir</varname> at the appropriate HDFS NameNode
+          and location in HDFS where you would like HBase to write data. For
+          example, if your namenode were running at namenode.example.org on
+          port 9000 and you wanted to home your HBase in HDFS at
+          <filename>/hbase</filename>, make the following
+          configuration.</para>
+
+          <programlisting>
 &lt;configuration&gt;
   ...
   &lt;property&gt;
@@ -520,91 +574,97 @@ make the following configuration.</para>
 &lt;/configuration&gt;
 </programlisting>
 
-<section xml:id="regionserver"><title><filename>regionservers</filename></title>
-<para>In addition, a fully-distributed mode requires that you
-modify <filename>conf/regionservers</filename>.
-The <filename><link linkend="regionservrers">regionservers</link></filename> file lists all hosts
-that you would have running <application>HRegionServer</application>s, one host per line
-(This file in HBase is like the Hadoop <filename>slaves</filename> file).  All servers
-listed in this file will be started and stopped when HBase cluster start or stop is run.</para>
-</section>
-
-<section xml:id="zookeeper"><title>ZooKeeper<indexterm><primary>ZooKeeper</primary></indexterm></title>
-<para>A distributed HBase depends on a running ZooKeeper cluster.
-All participating nodes and clients
-need to be able to access the running ZooKeeper ensemble.
-HBase by default manages a ZooKeeper "cluster" for you.
-It will start and stop the ZooKeeper ensemble as part of
-the HBase start/stop process.  You can also manage
-the ZooKeeper ensemble independent of HBase and 
-just point HBase at the cluster it should use.
-To toggle HBase management of ZooKeeper,
-use the <varname>HBASE_MANAGES_ZK</varname> variable in
-<filename>conf/hbase-env.sh</filename>.
-This variable, which defaults to <varname>true</varname>, tells HBase whether to
-start/stop the ZooKeeper ensemble servers as part of HBase start/stop.</para>
-
-<para>When HBase manages the ZooKeeper ensemble, you can specify ZooKeeper configuration
-using its native <filename>zoo.cfg</filename> file, or, the easier option
-is to just specify ZooKeeper options directly in <filename>conf/hbase-site.xml</filename>.
-A ZooKeeper configuration option can be set as a property in the HBase
-<filename>hbase-site.xml</filename>
-XML configuration file by prefacing the ZooKeeper option name with
-<varname>hbase.zookeeper.property</varname>.
-For example, the <varname>clientPort</varname> setting in ZooKeeper can be changed by
-setting the <varname>hbase.zookeeper.property.clientPort</varname> property.
-
-For all default values used by HBase, including ZooKeeper configuration,
-see the section
-<link linkend="hbase_default_configurations">Default HBase Configurations</link>.
-Look for the <varname>hbase.zookeeper.property</varname> prefix
-
-<footnote><para>For the full list of ZooKeeper configurations,
-see ZooKeeper's <filename>zoo.cfg</filename>.
-HBase does not ship with a <filename>zoo.cfg</filename> so you will need to
-browse the <filename>conf</filename> directory in an appropriate ZooKeeper download.
-</para>
-</footnote>
-</para>
-
-
-
-<para>You must at least list the ensemble servers in <filename>hbase-site.xml</filename>
-using the <varname>hbase.zookeeper.quorum</varname> property.
-This property defaults to a single ensemble member at
-<varname>localhost</varname> which is not suitable for a
-fully distributed HBase. (It binds to the local machine only and remote clients
-will not be able to connect).
-<note xml:id="how_many_zks">
-<title>How many ZooKeepers should I run?</title>
-<para>
-You can run a ZooKeeper ensemble that comprises 1 node only but
-in production it is recommended that you run a ZooKeeper ensemble of
-3, 5 or 7 machines; the more members an ensemble has, the more
-tolerant the ensemble is of host failures. Also, run an odd number of machines.
-There can be no quorum if the number of members is an even number.  Give each
-ZooKeeper server around 1GB of RAM, and if possible, its own dedicated disk
-(A dedicated disk is the best thing you can do to ensure a performant ZooKeeper
-ensemble).  For very heavily loaded clusters, run ZooKeeper servers on separate machines from
-RegionServers (DataNodes and TaskTrackers).</para>
-</note>
-</para>
-
-
-<para>For example, to have HBase manage a ZooKeeper quorum on nodes
-<emphasis>rs{1,2,3,4,5}.example.com</emphasis>, bound to port 2222 (the default is 2181)
-ensure <varname>HBASE_MANAGE_ZK</varname> is commented out or set to
-<varname>true</varname> in <filename>conf/hbase-env.sh</filename> and
-then edit <filename>conf/hbase-site.xml</filename> and set 
-<varname>hbase.zookeeper.property.clientPort</varname>
-and
-<varname>hbase.zookeeper.quorum</varname>.  You should also
-set
-<varname>hbase.zookeeper.property.dataDir</varname>
-to other than the default as the default has ZooKeeper persist data under
-<filename>/tmp</filename> which is often cleared on system restart.
-In the example below we have ZooKeeper persist to <filename>/user/local/zookeeper</filename>.
-<programlisting>
+          <section xml:id="regionserver">
+            <title><filename>regionservers</filename></title>
+
+            <para>In addition, a fully-distributed mode requires that you
+            modify <filename>conf/regionservers</filename>. The
+            <filename><link
+            linkend="regionserver">regionservers</link></filename> file
+            lists all hosts that you would have running
+            <application>HRegionServer</application>s, one host per line (this
+            file in HBase is like the Hadoop <filename>slaves</filename>
+            file). All servers listed in this file will be started and stopped
+            when the HBase cluster is started or stopped.</para>
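+
+            <para>For example, a <filename>conf/regionservers</filename> file
+            for a small cluster might look like the following (hostnames are
+            illustrative only):<programlisting>rs1.example.com
+rs2.example.com
+rs3.example.com</programlisting></para>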
+          </section>
+
+          <section xml:id="zookeeper">
+            <title>ZooKeeper<indexterm>
+                <primary>ZooKeeper</primary>
+              </indexterm></title>
+
+            <para>A distributed HBase depends on a running ZooKeeper cluster.
+            All participating nodes and clients need to be able to access the
+            running ZooKeeper ensemble. HBase by default manages a ZooKeeper
+            "cluster" for you. It will start and stop the ZooKeeper ensemble
+            as part of the HBase start/stop process. You can also manage the
+            ZooKeeper ensemble independently of HBase and just point HBase at
+            the cluster it should use. To toggle HBase management of
+            ZooKeeper, use the <varname>HBASE_MANAGES_ZK</varname> variable in
+            <filename>conf/hbase-env.sh</filename>. This variable, which
+            defaults to <varname>true</varname>, tells HBase whether to
+            start/stop the ZooKeeper ensemble servers as part of HBase
+            start/stop.</para>
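+
+            <para>For example, to manage the ZooKeeper ensemble yourself
+            rather than have HBase do it, you would set the following in
+            <filename>conf/hbase-env.sh</filename>:<programlisting>export HBASE_MANAGES_ZK=false</programlisting></para>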
+
+            <para>When HBase manages the ZooKeeper ensemble, you can specify
+            ZooKeeper configuration using its native
+            <filename>zoo.cfg</filename> file, or, the easier option is to
+            just specify ZooKeeper options directly in
+            <filename>conf/hbase-site.xml</filename>. A ZooKeeper
+            configuration option can be set as a property in the HBase
+            <filename>hbase-site.xml</filename> XML configuration file by
+            prefacing the ZooKeeper option name with
+            <varname>hbase.zookeeper.property</varname>. For example, the
+            <varname>clientPort</varname> setting in ZooKeeper can be changed
+            by setting the
+            <varname>hbase.zookeeper.property.clientPort</varname> property.
+            For all default values used by HBase, including ZooKeeper
+            configuration, see the section <link
+            linkend="hbase_default_configurations">Default HBase
+            Configurations</link>. Look for the
+            <varname>hbase.zookeeper.property</varname> prefix <footnote>
+                <para>For the full list of ZooKeeper configurations, see
+                ZooKeeper's <filename>zoo.cfg</filename>. HBase does not ship
+                with a <filename>zoo.cfg</filename> so you will need to browse
+                the <filename>conf</filename> directory in an appropriate
+                ZooKeeper download.</para>
+              </footnote></para>
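+
+            <para>For instance, a sketch of overriding the ZooKeeper
+            <varname>tickTime</varname> through this mechanism (the value
+            shown is illustrative, not a recommendation):</para>
+
+            <programlisting>
+  &lt;property&gt;
+    &lt;name&gt;hbase.zookeeper.property.tickTime&lt;/name&gt;
+    &lt;value&gt;2000&lt;/value&gt;
+  &lt;/property&gt;
+</programlisting>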
+
+            <para>You must at least list the ensemble servers in
+            <filename>hbase-site.xml</filename> using the
+            <varname>hbase.zookeeper.quorum</varname> property. This property
+            defaults to a single ensemble member at
+            <varname>localhost</varname> which is not suitable for a fully
+            distributed HBase. (It binds to the local machine only and remote
+            clients will not be able to connect). <note xml:id="how_many_zks">
+                <title>How many ZooKeepers should I run?</title>
+
+                <para>You can run a ZooKeeper ensemble comprising a single
+                node, but in production it is recommended that you run an
+                ensemble of 3, 5 or 7 machines; the more members an ensemble
+                has, the more tolerant the ensemble is of host failures.
+                Also, run an odd number of machines: an even member count
+                buys no additional failure tolerance over the next lower odd
+                count, since a quorum must be a strict majority. Give each
+                ZooKeeper server around 1GB of RAM, and if possible, its own
+                dedicated disk (a dedicated disk is the best thing you can do
+                to ensure a performant ZooKeeper ensemble). For very heavily
+                loaded clusters, run ZooKeeper servers on separate machines
+                from RegionServers (DataNodes and TaskTrackers).</para>
+              </note></para>
+
+            <para>For example, to have HBase manage a ZooKeeper quorum on
+            nodes <emphasis>rs{1,2,3,4,5}.example.com</emphasis>, bound to
+            port 2222 (the default is 2181), ensure
+            <varname>HBASE_MANAGES_ZK</varname> is commented out or set to
+            <varname>true</varname> in <filename>conf/hbase-env.sh</filename>
+            and then edit <filename>conf/hbase-site.xml</filename> and set
+            <varname>hbase.zookeeper.property.clientPort</varname> and
+            <varname>hbase.zookeeper.quorum</varname>. You should also set
+            <varname>hbase.zookeeper.property.dataDir</varname> to something
+            other than the default, as the default has ZooKeeper persist data
+            under <filename>/tmp</filename>, which is often cleared on system
+            restart. In the example below we have ZooKeeper persist to
+            <filename>/usr/local/zookeeper</filename>. <programlisting>
   &lt;configuration&gt;
     ...
     &lt;property&gt;
@@ -628,183 +688,228 @@ In the example below we have ZooKeeper p
     &lt;property&gt;
       &lt;name&gt;hbase.zookeeper.property.dataDir&lt;/name&gt;
       &lt;value&gt;/usr/local/zookeeper&lt;/value&gt;
-      &lt;description>Property from ZooKeeper's config zoo.cfg.
+      &lt;description&gt;Property from ZooKeeper's config zoo.cfg.
       The directory where the snapshot is stored.
       &lt;/description&gt;
     &lt;/property&gt;
     ...
-  &lt;/configuration&gt;</programlisting>
-</para>
+  &lt;/configuration&gt;</programlisting></para>
 
-<section><title>Using existing ZooKeeper ensemble</title>
-<para>To point HBase at an existing ZooKeeper cluster,
-one that is not managed by HBase,
-set <varname>HBASE_MANAGES_ZK</varname> in 
-<filename>conf/hbase-env.sh</filename> to false
-<programlisting>
+            <section>
+              <title>Using existing ZooKeeper ensemble</title>
+
+              <para>To point HBase at an existing ZooKeeper cluster, one that
+              is not managed by HBase, set <varname>HBASE_MANAGES_ZK</varname>
+              in <filename>conf/hbase-env.sh</filename> to false
+              <programlisting>
   ...
  # Tell HBase whether it should manage its own instance of Zookeeper or not.
-  export HBASE_MANAGES_ZK=false</programlisting>
+  export HBASE_MANAGES_ZK=false</programlisting> Next set ensemble locations
+              and client port, if non-standard, in
+              <filename>hbase-site.xml</filename>, or add a suitably
+              configured <filename>zoo.cfg</filename> to HBase's
+              <filename>CLASSPATH</filename>. HBase will prefer the
+              configuration found in <filename>zoo.cfg</filename> over any
+              settings in <filename>hbase-site.xml</filename>.</para>
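+
+              <para>For example, assuming an existing ensemble on three
+              hypothetical hosts listening on the standard client port, you
+              might add to <filename>hbase-site.xml</filename>:</para>
+
+              <programlisting>
+  &lt;property&gt;
+    &lt;name&gt;hbase.zookeeper.quorum&lt;/name&gt;
+    &lt;value&gt;zk1.example.com,zk2.example.com,zk3.example.com&lt;/value&gt;
+  &lt;/property&gt;
+</programlisting>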
+
+              <para>When HBase manages ZooKeeper, it will start/stop the
+              ZooKeeper servers as a part of the regular start/stop scripts.
+              If you would like to run ZooKeeper yourself, independent of
+              HBase start/stop, you would do the following</para>
 
-Next set ensemble locations and client port, if non-standard,
-in <filename>hbase-site.xml</filename>,
-or add a suitably configured <filename>zoo.cfg</filename> to HBase's <filename>CLASSPATH</filename>.
-HBase will prefer the configuration found in <filename>zoo.cfg</filename>
-over any settings in <filename>hbase-site.xml</filename>.
-</para>
-
-<para>When HBase manages ZooKeeper, it will start/stop the ZooKeeper servers as a part
-of the regular start/stop scripts. If you would like to run ZooKeeper yourself,
-independent of HBase start/stop, you would do the following</para>
-<programlisting>
+              <programlisting>
 ${HBASE_HOME}/bin/hbase-daemons.sh {start,stop} zookeeper
 </programlisting>
 
-<para>Note that you can use HBase in this manner to spin up a ZooKeeper cluster,
-unrelated to HBase. Just make sure to set <varname>HBASE_MANAGES_ZK</varname> to
-<varname>false</varname> if you want it to stay up across HBase restarts
-so that when HBase shuts down, it doesn't take ZooKeeper down with it.</para>
-
-<para>For more information about running a distinct ZooKeeper cluster, see
-the ZooKeeper <link xlink:href="http://hadoop.apache.org/zookeeper/docs/current/zookeeperStarted.html">Getting Started Guide</link>.
-</para>
-</section>
-</section>
-
-<section xml:id="hdfs_client_conf">
-<title>HDFS Client Configuration</title>
-<para>Of note, if you have made <emphasis>HDFS client configuration</emphasis> on your Hadoop cluster
--- i.e. configuration you want HDFS clients to use as opposed to server-side configurations --
-HBase will not see this configuration unless you do one of the following:</para>
-<itemizedlist>
-  <listitem><para>Add a pointer to your <varname>HADOOP_CONF_DIR</varname>
-  to the <varname>HBASE_CLASSPATH</varname> environment variable
-  in <filename>hbase-env.sh</filename>.</para></listitem>
-  <listitem><para>Add a copy of <filename>hdfs-site.xml</filename>
-  (or <filename>hadoop-site.xml</filename>) or, better, symlinks,
-  under
-  <filename>${HBASE_HOME}/conf</filename>, or</para></listitem>
-  <listitem><para>if only a small set of HDFS client
-  configurations, add them to <filename>hbase-site.xml</filename>.</para></listitem>
-</itemizedlist>
-
-<para>An example of such an HDFS client configuration is <varname>dfs.replication</varname>. If for example,
-you want to run with a replication factor of 5, hbase will create files with the default of 3 unless
-you do the above to make the configuration available to HBase.</para>
-</section>
-      </section>
-      </section>
-
-<section xml:id="confirm"><title>Running and Confirming Your Installation</title>
-<para>Make sure HDFS is running first.
-Start and stop the Hadoop HDFS daemons by running <filename>bin/start-hdfs.sh</filename>
-over in the <varname>HADOOP_HOME</varname> directory.
-You can ensure it started properly by testing the <command>put</command> and
-<command>get</command> of files into the Hadoop filesystem.
-HBase does not normally use the mapreduce daemons.  These do not need to be started.</para>
-
-<para><emphasis>If</emphasis> you are managing your own ZooKeeper, start it
-and confirm its running else, HBase will start up ZooKeeper for you as part
-of its start process.</para>
-
-<para>Start HBase with the following command:</para>
-<programlisting>bin/start-hbase.sh</programlisting>
-Run the above from the <varname>HBASE_HOME</varname> directory.
-
-<para>You should now have a running HBase instance.
-HBase logs can be found in the <filename>logs</filename> subdirectory. Check them
-out especially if HBase had trouble starting.</para>
-
-<para>HBase also puts up a UI listing vital attributes. By default its deployed on the Master host
-at port 60010 (HBase RegionServers listen on port 60020 by default and put up an informational
-http server at 60030). If the Master were running on a host named <varname>master.example.org</varname>
-on the default port, to see the Master's homepage you'd point your browser at
-<filename>http://master.example.org:60010</filename>.</para>
-
-<para>Once HBase has started, see the
-<link linkend="shell_exercises">Shell Exercises</link> section for how to
-create tables, add data, scan your insertions, and finally disable and
-drop your tables.
-</para>
-
-<para>To stop HBase after exiting the HBase shell enter
-<programlisting>$ ./bin/stop-hbase.sh
-stopping hbase...............</programlisting>
-Shutdown can take a moment to complete.  It can take longer if your cluster
-is comprised of many machines.  If you are running a distributed operation,
-be sure to wait until HBase has shut down completely
-before stopping the Hadoop daemons.</para>
-
-
-
-</section>
-</section>
+              <para>Note that you can use HBase in this manner to spin up a
+              ZooKeeper cluster, unrelated to HBase. Just make sure to set
+              <varname>HBASE_MANAGES_ZK</varname> to <varname>false</varname>
+              if you want it to stay up across HBase restarts so that when
+              HBase shuts down, it doesn't take ZooKeeper down with it.</para>
+
+              <para>For more information about running a distinct ZooKeeper
+              cluster, see the ZooKeeper <link
+              xlink:href="http://hadoop.apache.org/zookeeper/docs/current/zookeeperStarted.html">Getting
+              Started Guide</link>.</para>
+            </section>
+          </section>
+
+          <section xml:id="hdfs_client_conf">
+            <title>HDFS Client Configuration</title>
+
+            <para>Of note, if you have made <emphasis>HDFS client
+            configuration</emphasis> on your Hadoop cluster -- i.e.
+            configuration you want HDFS clients to use as opposed to
+            server-side configurations -- HBase will not see this
+            configuration unless you do one of the following:</para>
+
+            <itemizedlist>
+              <listitem>
+                <para>Add a pointer to your <varname>HADOOP_CONF_DIR</varname>
+                to the <varname>HBASE_CLASSPATH</varname> environment variable
+                in <filename>hbase-env.sh</filename> (see the sketch after
+                this list).</para>
+              </listitem>
+
+              <listitem>
+                <para>Add a copy of <filename>hdfs-site.xml</filename> (or
+                <filename>hadoop-site.xml</filename>) or, better, symlinks,
+                under <filename>${HBASE_HOME}/conf</filename>, or</para>
+              </listitem>
+
+              <listitem>
+                <para>if you have only a small set of HDFS client
+                configurations, add them to
+                <filename>hbase-site.xml</filename>.</para>
+              </listitem>
+            </itemizedlist>
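+
+            <para>As a sketch of the first option, assuming your Hadoop
+            client configuration lives in the hypothetical location
+            <filename>/etc/hadoop/conf</filename>, you might add to
+            <filename>hbase-env.sh</filename>:</para>
+
+            <programlisting>
+  # Point at your own HADOOP_CONF_DIR
+  export HBASE_CLASSPATH=/etc/hadoop/conf
+</programlisting>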
+
+            <para>An example of such an HDFS client configuration is
+            <varname>dfs.replication</varname>. If, for example, you want to
+            run with a replication factor of 5, HBase will create files with
+            the default replication of 3 unless you do one of the above to
+            make the configuration available to HBase.</para>
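+
+            <para>As a sketch of the third option, setting a replication
+            factor of 5 in <filename>hbase-site.xml</filename>:</para>
+
+            <programlisting>
+  &lt;property&gt;
+    &lt;name&gt;dfs.replication&lt;/name&gt;
+    &lt;value&gt;5&lt;/value&gt;
+  &lt;/property&gt;
+</programlisting>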
+          </section>
+        </section>
+      </section>
+
+      <section xml:id="confirm">
+        <title>Running and Confirming Your Installation</title>
+
+        <para>Make sure HDFS is running first. Start and stop the Hadoop HDFS
+        daemons by running <filename>bin/start-dfs.sh</filename> over in the
+        <varname>HADOOP_HOME</varname> directory. You can ensure it started
+        properly by testing the <command>put</command> and
+        <command>get</command> of files into the Hadoop filesystem. HBase does
+        not normally use the MapReduce daemons; these do not need to be
+        started.</para>
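+
+        <para>For example, a quick smoke test (any small local file and
+        target path will do):</para>
+
+        <programlisting>
+  $ ${HADOOP_HOME}/bin/hadoop fs -put /etc/hosts /hbase-smoke-test
+  $ ${HADOOP_HOME}/bin/hadoop fs -cat /hbase-smoke-test
+</programlisting>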
+
+        <para><emphasis>If</emphasis> you are managing your own ZooKeeper,
+        start it and confirm it's running; otherwise, HBase will start up
+        ZooKeeper for you as part of its start process.</para>
+
+        <para>Start HBase with the following command:</para>
+
+        <programlisting>bin/start-hbase.sh</programlisting>
+
+        <para>Run the above from the <varname>HBASE_HOME</varname>
+        directory.</para>
 
+        <para>You should now have a running HBase instance. HBase logs can be
+        found in the <filename>logs</filename> subdirectory. Check them out
+        especially if HBase had trouble starting.</para>
 
 
+        <para>HBase also puts up a UI listing vital attributes. By default it
+        is deployed on the Master host at port 60010 (HBase RegionServers
+        listen on port 60020 by default and put up an informational HTTP
+        server at 60030). If the Master were running on a host named
+        <varname>master.example.org</varname> on the default port, to see the
+        Master's homepage you'd point your browser at
+        <filename>http://master.example.org:60010</filename>.</para>
 
 
-    <section xml:id="example_config"><title>Example Configurations</title>
-    <section><title>Basic Distributed HBase Install</title>
-    <para>Here is an example basic configuration for a distributed ten node cluster.
-    The nodes are named <varname>example0</varname>, <varname>example1</varname>, etc., through
-node <varname>example9</varname>  in this example.  The HBase Master and the HDFS namenode 
-are running on the node <varname>example0</varname>.  RegionServers run on nodes
-<varname>example1</varname>-<varname>example9</varname>.
-A 3-node ZooKeeper ensemble runs on <varname>example1</varname>,
-<varname>example2</varname>, and <varname>example3</varname> on the
-default ports. ZooKeeper data is persisted to the directory
-<filename>/export/zookeeper</filename>.
-Below we show what the main configuration files
--- <filename>hbase-site.xml</filename>, <filename>regionservers</filename>, and
-<filename>hbase-env.sh</filename> -- found in the HBase
-<filename>conf</filename> directory might look like.
-</para>
-    <section xml:id="hbase_site"><title><filename>hbase-site.xml</filename></title>
-    <programlisting>
-<![CDATA[
-<?xml version="1.0"?>
-<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
-<configuration>
-  <property>
-    <name>hbase.zookeeper.quorum</name>
-    <value>example1,example2,example3</value>
-    <description>The directory shared by region servers.
-    </description>
-  </property>
-  <property>
-    <name>hbase.zookeeper.property.dataDir</name>
-    <value>/export/zookeeper</value>
-    <description>Property from ZooKeeper's config zoo.cfg.
+        <para>Once HBase has started, see the <link
+        linkend="shell_exercises">Shell Exercises</link> section for how to
+        create tables, add data, scan your insertions, and finally disable and
+        drop your tables.</para>
+
+        <para>To stop HBase after exiting the HBase shell, enter
+        <programlisting>$ ./bin/stop-hbase.sh
+stopping hbase...............</programlisting> Shutdown can take a moment to
+        complete. It can take longer if your cluster comprises many machines.
+        If you are running a distributed operation, be sure to wait until
+        HBase has shut down completely before stopping the Hadoop
+        daemons.</para>
+
+      </section>
+    </section>
+
+    <section xml:id="example_config">
+      <title>Example Configurations</title>
+
+      <section>
+        <title>Basic Distributed HBase Install</title>
+
+        <para>Here is an example basic configuration for a distributed ten
+        node cluster. The nodes are named <varname>example0</varname>,
+        <varname>example1</varname>, etc., through node
+        <varname>example9</varname> in this example. The HBase Master and the
+        HDFS namenode are running on the node <varname>example0</varname>.
+        RegionServers run on nodes
+        <varname>example1</varname>-<varname>example9</varname>. A 3-node
+        ZooKeeper ensemble runs on <varname>example1</varname>,
+        <varname>example2</varname>, and <varname>example3</varname> on the
+        default ports. ZooKeeper data is persisted to the directory
+        <filename>/export/zookeeper</filename>. Below we show what the main
+        configuration files -- <filename>hbase-site.xml</filename>,
+        <filename>regionservers</filename>, and
+        <filename>hbase-env.sh</filename> -- found in the HBase
+        <filename>conf</filename> directory might look like.</para>
+
+        <section xml:id="hbase_site">
+          <title><filename>hbase-site.xml</filename></title>
+
+          <programlisting>
+&lt;?xml version="1.0"?&gt;
+&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;
+&lt;configuration&gt;
+  &lt;property&gt;
+    &lt;name&gt;hbase.zookeeper.quorum&lt;/name&gt;
+    &lt;value&gt;example1,example2,example3&lt;/value&gt;
+    &lt;description&gt;Comma separated list of servers in the ZooKeeper Quorum.
+    &lt;/description&gt;
+  &lt;/property&gt;
+  &lt;property&gt;
+    &lt;name&gt;hbase.zookeeper.property.dataDir&lt;/name&gt;
+    &lt;value&gt;/export/zookeeper&lt;/value&gt;
+    &lt;description&gt;Property from ZooKeeper's config zoo.cfg.
     The directory where the snapshot is stored.
-    </description>
-  </property>
-  <property>
-    <name>hbase.rootdir</name>
-    <value>hdfs://example0:9000/hbase</value>
-    <description>The directory shared by region servers.
-    </description>
-  </property>
-  <property>
-    <name>hbase.cluster.distributed</name>
-    <value>true</value>
-    <description>The mode the cluster will be in. Possible values are
+    &lt;/description&gt;
+  &lt;/property&gt;
+  &lt;property&gt;
+    &lt;name&gt;hbase.rootdir&lt;/name&gt;
+    &lt;value&gt;hdfs://example0:9000/hbase&lt;/value&gt;
+    &lt;description&gt;The directory shared by region servers.
+    &lt;/description&gt;
+  &lt;/property&gt;
+  &lt;property&gt;
+    &lt;name&gt;hbase.cluster.distributed&lt;/name&gt;
+    &lt;value&gt;true&lt;/value&gt;
+    &lt;description&gt;The mode the cluster will be in. Possible values are
       false: standalone and pseudo-distributed setups with managed Zookeeper
       true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
-    </description>
-  </property>
-</configuration>
-]]>
+    &lt;/description&gt;
+  &lt;/property&gt;
+&lt;/configuration&gt;
     </programlisting>
-    </section>
+        </section>
+
+        <section xml:id="regionservers">
+          <title><filename>regionservers</filename></title>
 
-    <section xml:id="regionservers"><title><filename>regionservers</filename></title>
-    <para>In this file you list the nodes that will run regionservers.  In
-    our case we run regionservers on all but the head node
-    <varname>example1</varname> which is
-    carrying the HBase Master and the HDFS namenode</para>
-    <programlisting>
+          <para>In this file you list the nodes that will run regionservers.
+          In our case we run regionservers on all but the head node
+          <varname>example0</varname>, which is carrying the HBase Master and
+          the HDFS namenode.</para>
+
+          <programlisting>
     example1
     example2
     example3
     example4
@@ -814,15 +919,18 @@ Below we show what the main configuratio
     example8
     example9
     </programlisting>
-    </section>
+        </section>
 
-    <section xml:id="hbase_env"><title><filename>hbase-env.sh</filename></title>
-    <para>Below we use a <command>diff</command> to show the differences from 
-    default in the <filename>hbase-env.sh</filename> file. Here we are setting
-the HBase heap to be 4G instead of the default 1G.
-    </para>
-    <programlisting>
-    <![CDATA[
+        <section xml:id="hbase_env">
+          <title><filename>hbase-env.sh</filename></title>
+
+          <para>Below we use a <command>diff</command> to show the differences
+          from default in the <filename>hbase-env.sh</filename> file. Here we
+          are setting the HBase heap to be 4G instead of the default
+          1G.</para>
+
+          <programlisting>
 $ git diff hbase-env.sh
 diff --git a/conf/hbase-env.sh b/conf/hbase-env.sh
 index e70ebc6..96f8c27 100644
@@ -837,18 +945,14 @@ index e70ebc6..96f8c27 100644
  
  # Extra Java runtime options.
  # Below are what we set by default.  May only work with SUN JVM.
-]]>
-    </programlisting>
 
-    <para>Use <command>rsync</command> to copy the content of
-    the <filename>conf</filename> directory to
-    all nodes of the cluster.
-    </para>
-    </section>
+    </programlisting>
 
+          <para>Use <command>rsync</command> to copy the content of the
+          <filename>conf</filename> directory to all nodes of the
+          cluster.</para>
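+
+          <para>For example (a sketch; substitute your own node names and
+          HBase install path):</para>
+
+          <programlisting>
+  $ rsync -az ${HBASE_HOME}/conf/ example1:${HBASE_HOME}/conf/
+  # ...repeat for example2 through example9
+</programlisting>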
+        </section>
+      </section>
     </section>
-    
-    </section>
-    </section>
-
-  </chapter>
+  </section>
+</chapter>

Modified: hbase/trunk/src/docbkx/performance.xml
URL: http://svn.apache.org/viewvc/hbase/trunk/src/docbkx/performance.xml?rev=1085261&r1=1085260&r2=1085261&view=diff
==============================================================================
--- hbase/trunk/src/docbkx/performance.xml (original)
+++ hbase/trunk/src/docbkx/performance.xml Fri Mar 25 06:19:18 2011
@@ -1,39 +1,168 @@
-<?xml version="1.0"?>
-<chapter xml:id="performance"
-      version="5.0" xmlns="http://docbook.org/ns/docbook"
-      xmlns:xlink="http://www.w3.org/1999/xlink"
-      xmlns:xi="http://www.w3.org/2001/XInclude"
-      xmlns:svg="http://www.w3.org/2000/svg"
-      xmlns:m="http://www.w3.org/1998/Math/MathML"
-      xmlns:html="http://www.w3.org/1999/xhtml"
-      xmlns:db="http://docbook.org/ns/docbook">
-    
-    <title>Performance Tuning</title>
-    <para>Start with the <link xlink:href="http://wiki.apache.org/hadoop/PerformanceTuning">wiki Performance Tuning</link> page.
-        It has a general discussion of the main factors involved; RAM, compression, JVM settings, etc.
-        Afterward, come back here for more pointers.
-    </para>
-    <section xml:id="jvm">
-        <title>Java</title>
+<?xml version="1.0" encoding="UTF-8"?>
+<chapter version="5.0" xml:id="performance"
+         xmlns="http://docbook.org/ns/docbook"
+         xmlns:xlink="http://www.w3.org/1999/xlink"
+         xmlns:xi="http://www.w3.org/2001/XInclude"
+         xmlns:svg="http://www.w3.org/2000/svg"
+         xmlns:m="http://www.w3.org/1998/Math/MathML"
+         xmlns:html="http://www.w3.org/1999/xhtml"
+         xmlns:db="http://docbook.org/ns/docbook">
+  <title>Performance Tuning</title>
+
+  <para>Start with the <link
+  xlink:href="http://wiki.apache.org/hadoop/PerformanceTuning">wiki
+  Performance Tuning</link> page. It has a general discussion of the main
+  factors involved: RAM, compression, JVM settings, etc. Afterward, come back
+  here for more pointers.</para>
+
+  <section xml:id="jvm">
+    <title>Java</title>
+
     <section xml:id="gc">
-        <title>The Garage Collector and HBase</title>
-        <section xml:id="gcpause">
-            <title>Long GC pauses</title>
-        <para>
-            In his presentation,
-            <link xlink:href="http://www.slideshare.net/cloudera/hbase-hug-presentation">Avoiding Full GCs with MemStore-Local Allocation Buffers</link>,
-            Todd Lipcon describes two cases of stop-the-world garbage collections common in HBase, especially during loading;
-            CMS failure modes and old generation heap fragmentation brought.  To address the first,
-            start the CMS earlier than default by adding <code>-XX:CMSInitiatingOccupancyFraction</code>
-            and setting it down from defaults.  Start at 60 or 70 percent (The lower you bring down
-            the threshold, the more GCing is done, the more CPU used).  To address the second
-            fragmentation issue, Todd added an experimental facility that must be 
-            explicitly enabled in HBase 0.90.x (Its defaulted to be on in 0.92.x HBase).  See
-            <code>hbase.hregion.memstore.mslab.enabled</code> to true in your
-            <classname>Configuration</classname>.  See the cited slides for background and
-            detail.
-        </para>
+      <title>The Garbage Collector and HBase</title>
+
+      <section xml:id="gcpause">
+        <title>Long GC pauses</title>
+
+        <para>In his presentation, <link
+        xlink:href="http://www.slideshare.net/cloudera/hbase-hug-presentation">Avoiding
+        Full GCs with MemStore-Local Allocation Buffers</link>, Todd Lipcon
+        describes two cases of stop-the-world garbage collections common in
+        HBase, especially during loading: CMS failure modes and old generation
+        heap fragmentation. To address the first, start the CMS earlier than
+        default by adding
+        <code>-XX:CMSInitiatingOccupancyFraction</code> and setting it down
+        from defaults. Start at 60 or 70 percent (the lower you bring down the
+        threshold, the more GCing is done, and the more CPU is used). To
+        address the second, fragmentation, Todd added an experimental facility
+        that must be explicitly enabled in HBase 0.90.x (it is on by default
+        in 0.92.x HBase): set
+        <code>hbase.hregion.memstore.mslab.enabled</code> to true in your
+        <classname>Configuration</classname>. See the cited slides for
+        background and detail.</para>
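+
+        <para>As a sketch, the CMS settings might be applied via
+        <varname>HBASE_OPTS</varname> in <filename>hbase-env.sh</filename>
+        (the 70 percent threshold is a starting point, not a
+        recommendation):</para>
+
+        <programlisting>
+  export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC \
+      -XX:CMSInitiatingOccupancyFraction=70 \
+      -XX:+UseCMSInitiatingOccupancyOnly"
+</programlisting>
+
+        <para>And to enable the MSLAB facility on 0.90.x, in
+        <filename>hbase-site.xml</filename>:</para>
+
+        <programlisting>
+  &lt;property&gt;
+    &lt;name&gt;hbase.hregion.memstore.mslab.enabled&lt;/name&gt;
+    &lt;value&gt;true&lt;/value&gt;
+  &lt;/property&gt;
+</programlisting>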
       </section>
     </section>
+  </section>
+
+  <section xml:id="perf.configurations">
+    <title>Configurations</title>
+
+    <para>See the section on <link
+    linkend="recommended_configurations">recommended
+    configurations</link>.</para>
+
+    <section xml:id="perf.number.of.regions">
+      <title>Number of Regions</title>
+
+      <para>The number of regions for an HBase table is driven by the <link
+      linkend="bigger.regions">filesize</link>. Also, see the architecture
+      section on <link linkend="arch.regions.size">region size</link>.</para>
     </section>
-  </chapter>
+
+    <section xml:id="perf.compactions.and.splits">
+      <title>Managing Compactions</title>
+
+      <para>For larger systems, managing <link
+      linkend="disable.splitting">compactions and splits</link> may be
+      something you want to consider.</para>
+    </section>
+
+    <section xml:id="perf.compression">
+      <title>Compression</title>
+
+      <para>Production systems should use compression, such as <link
+      linkend="lzo">LZO</link>, with their column family
+      definitions.</para>
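+
+      <para>For example, from the HBase shell (hypothetical table and column
+      family names, and assuming LZO is installed on the cluster):</para>
+
+      <programlisting>
+hbase&gt; create 'mytable', {NAME =&gt; 'cf', COMPRESSION =&gt; 'LZO'}
+</programlisting>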
+    </section>
+  </section>
+
+  <section xml:id="perf.number.of.cfs">
+    <title>Number of Column Families</title>
+
+    <para>See the section on <link linkend="number.of.cfs">Number of Column
+    Families</link>.</para>
+  </section>
+
+  <section xml:id="perf.one.region">
+    <title>Data Clumping</title>
+
+    <para>If all your data is being written to one region, then re-read the
+    section on processing <link linkend="timeseries">timeseries</link>
+    data.</para>
+  </section>
+
+  <section xml:id="perf.batch.loading">
+    <title>Batch Loading</title>
+
+    <para>See the section on <link linkend="precreate.regions">Pre Creating
+    Regions</link>, as well as bulk loading.</para>
+  </section>
+
+  <section>
+    <title>HBase Client</title>
+
+    <section xml:id="perf.hbase.client.autoflush">
+      <title>AutoFlush</title>
+
+      <para>When performing a lot of Puts, make sure that setAutoFlush is set
+      to false on your <link
+      xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html">HTable</link>
+      instance. Otherwise, the Puts will be sent one at a time to the
+      regionserver. Puts added via... <programlisting>
+htable.put(put);
+</programlisting> ... and ... <programlisting>
+htable.put(putList);   // putList is a List&lt;Put&gt;
+</programlisting> ... wind up in the same write buffer. If autoFlush=false,
+      these messages are not sent until the write-buffer is filled. To
+      explicitly flush the messages, call .flushCommits(). Calling .close() on
+      the HTable instance will invoke flushCommits().</para>
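+
+      <para>A minimal sketch (hypothetical table name;
+      <classname>Configuration</classname> setup elided):</para>
+
+      <programlisting>
+HTable htable = new HTable(conf, "mytable");
+htable.setAutoFlush(false);   // buffer Puts on the client side
+htable.put(put);              // goes into the write buffer
+htable.flushCommits();        // explicitly flush the write buffer
+htable.close();               // close() also invokes flushCommits()
+</programlisting>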
+    </section>
+
+    <section xml:id="perf.hbase.client.caching">
+      <title>Scan Caching</title>
+
+      <para>If HBase is used as an input source for a MapReduce job, for
+      example, make sure that the input <link
+      xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scan</link>
+      instance to the MapReduce job has setCaching set to something greater
+      than the default (which is 1). Using the default value means that the
+      map-task will make a call back to the region-server for every record
+      processed. Setting this value to 500, for example, will transfer 500
+      rows at a time to the client to be processed. There is a cost/benefit to
+      having the cache value be large because it costs more in memory for both
+      client and regionserver, so bigger isn't always better.</para>
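+
+      <para>A short sketch:</para>
+
+      <programlisting>
+Scan scan = new Scan();
+scan.setCaching(500);   // fetch 500 rows per trip instead of the default 1
+</programlisting>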
+    </section>
+
+    <section xml:id="perf.hbase.client.scannerclose">
+      <title>Close ResultScanners</title>
+
+      <para>This isn't so much about improving performance but rather
+      <emphasis>avoiding</emphasis> performance problems. If you forget to
+      close <link
+      xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/ResultScanner.html">ResultScanners</link>
+      you can cause problems on the regionservers. Always have ResultScanner
+      processing enclosed in try/finally blocks... <programlisting>
+Scan scan = new Scan();
+// set attrs...
+ResultScanner rs = htable.getScanner(scan);
+try {
+  for (Result r = rs.next(); r != null; r = rs.next()) {
+    // process result...
+  }
+} finally {
+  rs.close();  // always close the ResultScanner!
+}
+htable.close();
+</programlisting></para>
+    </section>
+
+    <section xml:id="perf.hbase.client.blockcache">
+      <title>Block Cache</title>
+
+      <para><link
+      xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scan</link>
+      instances can be set to use the block cache in the region server via the
+      setCacheBlocks method. For input Scans to MapReduce jobs, this should be
+      false. For frequently accessed rows, it is advisable to use the block
+      cache.</para>
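+
+      <para>For example, a sketch for a MapReduce input Scan:</para>
+
+      <programlisting>
+Scan scan = new Scan();
+scan.setCacheBlocks(false);   // don't churn the regionserver block cache
+</programlisting>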
+    </section>
+  </section>
+</chapter>


