hbase-commits mailing list archives

From st...@apache.org
Subject svn commit: r1204039 [3/3] - in /hbase/branches/0.92/src/docbkx: book.xml configuration.xml developer.xml ops_mgt.xml performance.xml troubleshooting.xml
Date Sat, 19 Nov 2011 18:43:38 GMT
Modified: hbase/branches/0.92/src/docbkx/configuration.xml
URL: http://svn.apache.org/viewvc/hbase/branches/0.92/src/docbkx/configuration.xml?rev=1204039&r1=1204038&r2=1204039&view=diff
--- hbase/branches/0.92/src/docbkx/configuration.xml (original)
+++ hbase/branches/0.92/src/docbkx/configuration.xml Sat Nov 19 18:43:37 2011
@@ -210,42 +210,52 @@ to ensure well-formedness of your docume
               This version of HBase will only run on <link
-        0.20.x</link>. It will not run on hadoop 0.21.x (nor 0.22.x).
+        0.20.x</link>. It will not run on hadoop 0.21.x (but may run on 0.22.x/0.23.x).
         HBase will lose data unless it is running on an HDFS that has a durable
-        <code>sync</code>. Hadoop 0.20.2 and Hadoop DO NOT have this
-        Currently only the <link
-        xlink:href="http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/">branch-0.20-append</link>
-        branch has this a working sync<footnote>
-            <para>See <link
-            xlink:href="http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/CHANGES.txt">CHANGES.txt</link>
-            in branch-0.20-append to see list of patches involved adding
-            append on the Hadoop 0.20 branch.</para>
-          </footnote>. No official releases have been made from the branch-0.20-append
branch up
-        to now so you will have to build your own Hadoop from the tip of this
-        branch.  Michael Noll has written a detailed blog,
+        <code>sync</code>. Hadoop 0.20.2, Hadoop, and Hadoop
+	DO NOT have this attribute.
+        Currently only Hadoop 0.20.205.x and later releases
+        have a durable sync.  You have to enable it explicitly, though, by
+        setting <varname>dfs.support.append</varname> to true on both
+        the client side -- in <filename>hbase-site.xml</filename>, though it should
+        already be on in your <filename>hbase-default.xml</filename> file -- and on the
+        server side in <filename>hdfs-site.xml</filename> (you will have to restart
+        your cluster after setting this configuration).  Ignore the chicken-little
+        comment you'll find in <filename>hdfs-site.xml</filename> in the
+        description for this configuration; it says it is not enabled because there
+        are <quote>... bugs in the 'append code' and is not supported in any production
+        cluster.</quote> That is not true (there are surely bugs, but the
+        append code has been running in production in large-scale deployments and is on
+        by default in the Hadoop offerings of commercial vendors)
+        <footnote><para>Until recently only the
+        <link xlink:href="http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/">branch-0.20-append</link>
+        branch had a working sync but no official release was ever made from this branch.
+        You had to build it yourself. Michael Noll wrote a detailed blog,
         <link xlink:href="http://www.michael-noll.com/blog/2011/04/14/building-an-hadoop-0-20-x-version-for-hbase-0-90-2/">Building
         an Hadoop 0.20.x version for HBase 0.90.2</link>, on how to build an
-    Hadoop from branch-0.20-append.  Recommended <footnote><para>Praveen Kumar
has written
+    Hadoop from branch-0.20-append.  Recommended.</para></footnote>
+    <footnote><para>Praveen Kumar has written
             a complementary article,
             <link xlink:href="http://praveen.kumar.in/2011/06/20/building-hadoop-and-hbase-for-hbase-maven-application-development/">Building Hadoop and HBase for HBase Maven application development</link>.
+</para></footnote><footnote>Cloudera have <varname>dfs.support.append</varname>
set to true by default.</footnote>.</para>
-<para>Or rather than build your own, you could use the
+<para>Or use the
     <link xlink:href="http://www.cloudera.com/">Cloudera</link> or
     <link xlink:href="http://www.mapr.com/">MapR</link> distributions.
     Cloudera's <link xlink:href="http://archive.cloudera.com/docs/">CDH3</link>
-    is Apache Hadoop 0.20.x plus patches including all of the 0.20-append additions
-    needed to add a durable sync. Use the released version of CDH3 at least (They
-    have just posted an update).  MapR includes a commercial, reimplementation of HDFS.
+    is Apache Hadoop 0.20.x plus patches including all of the 
+    <link xlink:href="http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/">branch-0.20-append</link>
+    additions needed to add a durable sync. Use the released, most recent version of CDH3.</para>
+    <para>
+    <link xlink:href="http://www.mapr.com/">MapR</link>
+    includes a commercial reimplementation of HDFS.
     It has a durable sync as well as some other interesting features that are not
     yet in Apache Hadoop.  Their <link xlink:href="http://www.mapr.com/products/mapr-editions/m3-edition">M3</link>
     product is free to use and unlimited.
         <para>Because HBase depends on Hadoop, it bundles an instance of the
-        Hadoop jar under its <filename>lib</filename> directory. The bundled
-        Hadoop was made from the Apache branch-0.20-append branch at the time
-        of the HBase's release.  The bundled jar is ONLY for use in standalone mode.
+        Hadoop jar under its <filename>lib</filename> directory. The bundled jar is ONLY for use in standalone mode.
         In distributed mode, it is <emphasis>critical</emphasis> that the version of Hadoop that is out
         on your cluster match what is under HBase.  Replace the hadoop jar found in the HBase
         <filename>lib</filename> directory with the hadoop jar you are running
@@ -1114,7 +1124,16 @@ of all regions.
+      <section xml:id="other_configuration"><title>Other Configurations</title>
+         <section xml:id="balancer_config"><title>Balancer</title>
+           <para>The balancer is a periodic operation run on the master to redistribute regions on the cluster.  It is configured via
+           <varname>hbase.balancer.period</varname> and defaults to 300000 (5 minutes). </para>
+           <para>See <xref linkend="master.processes.loadbalancer" /> for more information on the LoadBalancer.
+           </para>
+         </section>
+      </section> <!--  important config -->
 	  <section xml:id="config.bloom">
 	    <title>Bloom Filter Configuration</title>
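
For reference, the two settings named in the configuration.xml hunks above could be written as Hadoop-style property entries along the following lines. This is a minimal sketch, not part of the commit: the property names and the 300000 default come from the text above, the `<configuration>`/`<property>` layout is the usual site-file convention, and the millisecond unit is inferred from the "(5 minutes)" gloss.

```xml
<!-- Sketch only: each property goes into the file named in its comment. -->
<configuration>
  <!-- hbase-site.xml (client side) and hdfs-site.xml (server side);
       restart the cluster after changing it -->
  <property>
    <name>dfs.support.append</name>
    <value>true</value>
  </property>
  <!-- hbase-site.xml: balancer period; 300000 is the default quoted above
       (5 minutes, so presumably milliseconds) -->
  <property>
    <name>hbase.balancer.period</name>
    <value>300000</value>
  </property>
</configuration>
```

Restating the default for the balancer is only useful as a starting point for tuning; leaving the property out has the same effect.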

Modified: hbase/branches/0.92/src/docbkx/developer.xml
URL: http://svn.apache.org/viewvc/hbase/branches/0.92/src/docbkx/developer.xml?rev=1204039&r1=1204038&r2=1204039&view=diff
--- hbase/branches/0.92/src/docbkx/developer.xml (original)
+++ hbase/branches/0.92/src/docbkx/developer.xml Sat Nov 19 18:43:37 2011
@@ -26,8 +26,9 @@
  * limitations under the License.
-    <title>Developing HBase</title>
-    <para>This chapter will be of interest only to those developing HBase (i.e., as
opposed to using it).
+    <title>Building and Developing HBase</title>
+    <para>This chapter will be of interest only to those building and developing HBase (i.e., as opposed to
+    just downloading the latest distribution).
     <section xml:id="repos">
       <title>HBase Repositories</title>
@@ -43,7 +44,8 @@ svn co http://svn.apache.org/repos/asf/h
 git clone git://git.apache.org/hbase.git
-    </section>         
+    </section>    
     <section xml:id="ides"> 
         <section xml:id="eclipse">
@@ -115,11 +117,56 @@ Access restriction: The method getLong(O
+        <section xml:id="build">
+       <title>Building HBase</title>
+      <para>This section will be of interest only to those building HBase from source.
+      </para>
+      <section xml:id="build.snappy">
+        <title>Building in snappy compression support</title>
+        <para>Pass <code>-Dsnappy</code> to trigger the snappy maven profile for building
+            snappy native libs into hbase.</para>
+      </section>
+      <section xml:id="mvn_repo">
+        <title>Adding an HBase release to Apache's Maven Repository</title>
+        <para>Follow the instructions at
+        <link xlink:href="http://www.apache.org/dev/publishing-maven-artifacts.html">Publishing Maven Artifacts</link>.
+            The 'trick' to making it all work is answering the questions put to you by the mvn release plugin properly,
+            making sure it is using the actual branch AND, before doing the <command>mvn release:perform</command> step,
+            VERY IMPORTANT, hand edit the release.properties file that was put under <varname>${HBASE_HOME}</varname>
+            by the previous step, <command>release:prepare</command>. You need to edit it to make it point at
+            the right locations in SVN.
+        </para>
+        <para>If you run into the below, it's because you need to edit the version in the pom.xml and add
+        <code>-SNAPSHOT</code> to the version (and commit).
+        <programlisting>[INFO] Scanning for projects...
+[INFO] Searching repository for plugin with prefix: 'release'.
+[INFO] ------------------------------------------------------------------------
+[INFO] Building HBase
+[INFO]    task-segment: [release:prepare] (aggregator-style)
+[INFO] ------------------------------------------------------------------------
+[INFO] [release:prepare {execution: default-cli}]
+[INFO] ------------------------------------------------------------------------
+[INFO] ------------------------------------------------------------------------
+[INFO] You don't have a SNAPSHOT project in the reactor projects list.
+[INFO] ------------------------------------------------------------------------
+[INFO] For more information, run Maven with the -e switch
+[INFO] ------------------------------------------------------------------------
+[INFO] Total time: 3 seconds
+[INFO] Finished at: Sat Mar 26 18:11:07 PDT 2011
+[INFO] Final Memory: 35M/423M
+[INFO] -----------------------------------------------------------------------</programlisting>
+        </para>
+      </section>
+    </section> <!--  build -->
     <section xml:id="maven.build.commands"> 
        <title>Maven Build Commands</title>
        <para>All commands executed from the local HBase project directory.
-       <para>Note:  use Maven 2, not Maven 3.
+       <para>Note: use Maven 3 (Maven 2 may work but we suggest you use Maven 3).
        <section xml:id="maven.build.commands.compile"> 
@@ -139,13 +186,34 @@ mvn test
 mvn test -Dtest=TestXYZ
+       <section xml:id="maven.build.commands.unit2"> 
+          <title>Run a Few Unit Tests</title>
+          <programlisting>
+mvn test -Dtest=TestXYZ,TestABC
+          </programlisting>
+       </section>       
+       <section xml:id="maven.build.commands.unit.package"> 
+          <title>Run all Unit Tests for a Package</title>
+          <programlisting>
+mvn test -Dtest=org.apache.hadoop.hbase.client.*
+          </programlisting>
+       </section>
+       <section xml:id="maven.build.commanas.integration.tests"> 
+          <title>Integration Tests</title>
+          <para>HBase 0.92 added a <varname>verify</varname> maven target. Invoking it will run all the phases up to and including the verify phase via the maven <link xlink:href="http://maven.apache.org/plugins/maven-failsafe-plugin/">failsafe plugin</link>, running all the unit tests as well as the long running unit and integration tests.
+          </para>
+          <programlisting>
+mvn verify
+          </programlisting>
+      </section>
     <section xml:id="getting.involved"> 
         <title>Getting Involved</title>
         <para>HBase gets better only when people contribute!
+        <para>As HBase is an Apache Software Foundation project, see <xref linkend="asf"/> for more information about how the ASF functions.
+        </para>
         <section xml:id="mailing.list">
           <title>Mailing Lists</title>
           <para>Sign up for the dev-list and the user-list.  See the 
@@ -172,7 +240,18 @@ mvn test -Dtest=TestXYZ
-       </section>
+        <section xml:id="submitting.patches.jira.code">
+          <title>Code Blocks in Jira Comments</title>
+          <para>A commonly used macro in Jira is {code}. If you do this in a Jira comment...
+   code snippet
+              ... Jira will format the code snippet like code, instead of a regular comment.
 It improves readability.
+          </para>
+        </section>
+       </section>  <!--  jira -->
       </section>  <!--  getting involved -->
       <section xml:id="developing">
@@ -372,6 +451,9 @@ Bar bar = foo.getBar();     &lt;--- imag
           <para>Larger patches should go through <link xlink:href="http://reviews.apache.org">ReviewBoard</link>.
+          <para>For more information on how to use ReviewBoard, see
+           <link xlink:href="http://www.reviewboard.org/docs/manual/1.5/">the ReviewBoard
+          </para>
         <section xml:id="committing.patches">
           <title>Committing Patches</title>

Modified: hbase/branches/0.92/src/docbkx/ops_mgt.xml
URL: http://svn.apache.org/viewvc/hbase/branches/0.92/src/docbkx/ops_mgt.xml?rev=1204039&r1=1204038&r2=1204039&view=diff
--- hbase/branches/0.92/src/docbkx/ops_mgt.xml (original)
+++ hbase/branches/0.92/src/docbkx/ops_mgt.xml Sat Nov 19 18:43:37 2011
@@ -108,13 +108,26 @@
 --peer.adr=server1,server2,server3:2181:/hbase TestTable</programlisting>
+    <section xml:id="export">
+       <title>Export</title>
+       <para>Export is a utility that will dump the contents of a table to HDFS in a sequence file.  Invoke via:
+<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export &lt;tablename&gt;
&lt;outputdir&gt; [&lt;versions&gt; [&lt;starttime&gt; [&lt;endtime&gt;]]]
+       </para>
+    </section>
+    <section xml:id="import">
+       <title>Import</title>
+       <para>Import is a utility that will load data that has been exported back into HBase.  Invoke via:
+<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.Import &lt;tablename&gt;
+       </para>
+    </section>
     <section xml:id="rowcounter">
        <para>RowCounter is a utility that will count all the rows of a table.  This
is a good utility to use
        as a sanity check to ensure that HBase can read all the blocks of a table if there
are any concerns of metadata inconsistency.
-<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter
+<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter &lt;tablename&gt;
[&lt;column1&gt; &lt;column2&gt;...]
@@ -252,15 +265,163 @@ false
     </section>  <!--  node mgt -->
+  <section xml:id="hbase_metrics">
+  <title>Metrics</title>
+  <section xml:id="metric_setup">
+  <title>Metric Setup</title>
+  <para>See <link xlink:href="http://hbase.apache.org/metrics.html">Metrics</link>
+  for an introduction and how to enable Metrics emission.
+  </para>
+  </section>
+   <section xml:id="rs_metrics">
+   <title>RegionServer Metrics</title>
+          <section xml:id="hbase.regionserver.blockCacheCount"><title><varname>hbase.regionserver.blockCacheCount</varname></title>
+          <para>Block cache item count in memory.  This is the number of blocks of
storefiles (HFiles) in the cache.</para>
+		  </section>
+         <section xml:id="hbase.regionserver.blockCacheFree"><title><varname>hbase.regionserver.blockCacheFree</varname></title>
+          <para>Block cache memory available (bytes).</para>
+		  </section>
+         <section xml:id="hbase.regionserver.blockCacheHitRatio"><title><varname>hbase.regionserver.blockCacheHitRatio</varname></title>
+          <para>Block cache hit ratio (0 to 100).  TODO:  describe impact to ratio
where read requests that have cacheBlocks=false</para>
+		  </section>
+          <section xml:id="hbase.regionserver.blockCacheSize"><title><varname>hbase.regionserver.blockCacheSize</varname></title>
+          <para>Block cache size in memory (bytes).  i.e., memory in use by the BlockCache</para>
+		  </section>
+          <section xml:id="hbase.regionserver.compactionQueueSize"><title><varname>hbase.regionserver.compactionQueueSize</varname></title>
+          <para>Size of the compaction queue.  This is the number of stores in the
region that have been targeted for compaction.</para>
+		  </section>
+          <section xml:id="hbase.regionserver.fsReadLatency_avg_time"><title><varname>hbase.regionserver.fsReadLatency_avg_time</varname></title>
+          <para>Filesystem read latency (ms).  This is the average time to read from
+		  </section>
+          <section xml:id="hbase.regionserver.fsReadLatency_num_ops"><title><varname>hbase.regionserver.fsReadLatency_num_ops</varname></title>
+          <para>TODO</para>
+		  </section>
+          <section xml:id="hbase.regionserver.fsSyncLatency_avg_time"><title><varname>hbase.regionserver.fsSyncLatency_avg_time</varname></title>
+          <para>Filesystem sync latency (ms)</para>
+		  </section>
+          <section xml:id="hbase.regionserver.fsSyncLatency_num_ops"><title><varname>hbase.regionserver.fsSyncLatency_num_ops</varname></title>
+          <para>TODO</para>
+		  </section>
+          <section xml:id="hbase.regionserver.fsWriteLatency_avg_time"><title><varname>hbase.regionserver.fsWriteLatency_avg_time</varname></title>
+          <para>Filesystem write latency (ms)</para>
+		  </section>
+          <section xml:id="hbase.regionserver.fsWriteLatency_num_ops"><title><varname>hbase.regionserver.fsWriteLatency_num_ops</varname></title>
+          <para>TODO</para>
+		  </section>
+          <section xml:id="hbase.regionserver.memstoreSizeMB"><title><varname>hbase.regionserver.memstoreSizeMB</varname></title>
+          <para>Sum of all the memstore sizes in this RegionServer (MB)</para>
+		  </section>
+          <section xml:id="hbase.regionserver.regions"><title><varname>hbase.regionserver.regions</varname></title>
+          <para>Number of regions served by the RegionServer</para>
+		  </section>
+          <section xml:id="hbase.regionserver.requests"><title><varname>hbase.regionserver.requests</varname></title>
+          <para>Total number of read and write requests.  Requests correspond to RegionServer
RPC calls, thus a single Get will result in 1 request, but a Scan with caching set to 1000
will result in 1 request for each 'next' call (i.e., not each row).  A bulk-load request will
constitute 1 request per HFile.</para>
+		  </section>
+          <section xml:id="hbase.regionserver.storeFileIndexSizeMB"><title><varname>hbase.regionserver.storeFileIndexSizeMB</varname></title>
+          <para>Sum of all the storefile index sizes in this RegionServer (MB)</para>
+		  </section>
+          <section xml:id="hbase.regionserver.stores"><title><varname>hbase.regionserver.stores</varname></title>
+          <para>Number of stores open on the RegionServer.  A store corresponds to
a column family.  For example, if a table (which contains the column family) has 3 regions
on a RegionServer, there will be 3 stores open for that column family. </para>
+		  </section>
+          <section xml:id="hbase.regionserver.storeFiles"><title><varname>hbase.regionserver.storeFiles</varname></title>
+          <para>Number of store files open on the RegionServer.  A store may have more than one storefile (HFile).</para>
+		  </section>
+   </section>
+  </section>
   <section xml:id="ops.monitoring">
     <title >HBase Monitoring</title>
+  <section xml:id="cluster_replication">
+    <title>Cluster Replication</title>
+    <para>See <link xlink:href="http://hbase.apache.org/replication.html">Cluster
+    </para>
+  </section>
   <section xml:id="ops.backup">
     <title >HBase Backup</title>
-    <para>See <link xlink:href="http://blog.sematext.com/2011/03/11/hbase-backup-options/">HBase
Backup Options</link> over on the Sematext Blog.
+    <para>There are two broad strategies for performing HBase backups: backing up with
a full cluster shutdown, and backing up on a live cluster. 
+    Each approach has pros and cons.   
+    </para>
+    <para>For additional information, see <link xlink:href="http://blog.sematext.com/2011/03/11/hbase-backup-options/">HBase
Backup Options</link> over on the Sematext Blog.
+    <section xml:id="ops.backup.fullshutdown"><title>Full Shutdown Backup</title>
+      <para>Some environments can tolerate a periodic full shutdown of their HBase cluster, for example if it is being used in a back-end analytic capacity
+      and not serving front-end web-pages.  The benefit is that the NameNode/Master and RegionServers are down, so there is no chance of missing
+      any in-flight changes to either StoreFiles or metadata.  The obvious con is that the cluster is down.  The steps include:
+      </para>
+      <section xml:id="ops.backup.fullshutdown.stop"><title>Stop HBase</title>
+        <para>
+        </para>
+      </section>
+      <section xml:id="ops.backup.fullshutdown.distcp"><title>Distcp</title>
+        <para>Distcp could be used to copy the contents of the HBase directory in HDFS either to another directory on the same cluster or
+        to a different cluster.
+        </para>
+        <para>Note:  Distcp works in this situation because the cluster is down and
there are no in-flight edits to files.  
+        Distcp-ing of files in the HBase directory is not generally recommended on a live
+        </para>
+      </section>
+      <section xml:id="ops.backup.fullshutdown.restore"><title>Restore (if needed)</title>
+        <para>The backup of the hbase directory from HDFS is copied onto the 'real'
hbase directory via distcp.  The act of copying these files 
+        creates new HDFS metadata, which is why a restore of the NameNode edits from the
time of the HBase backup isn't required for this kind of
+        restore, because it's a restore (via distcp) of a specific HDFS directory (i.e.,
the HBase part) not the entire HDFS file-system.
+        </para>
+      </section>
+    </section>
+    <section xml:id="ops.backup.live.replication"><title>Live Cluster Backup
- Replication</title>
+      <para>This approach assumes that there is a second cluster.  
+      See the HBase page on <link xlink:href="http://hbase.apache.org/replication.html">replication</link>
for more information.
+      </para>
+    </section>
+    <section xml:id="ops.backup.live.copytable"><title>Live Cluster Backup -
+      <para>The <xref linkend="copytable" /> utility could either be used to
copy data from one table to another on the 
+      same cluster, or to copy data to another table on another cluster.
+      </para>
+      <para>Since the cluster is up, there is a risk that edits could be missed in
the copy process.
+      </para>
+    </section>
+    <section xml:id="ops.backup.live.export"><title>Live Cluster Backup - Export</title>
+      <para>The <xref linkend="export" /> approach dumps the content of a table
to HDFS on the same cluster.  To restore the data, the
+      <xref linkend="import" /> utility would be used.
+      </para>
+      <para>Since the cluster is up, there is a risk that edits could be missed in
the export process.
+      </para>
+    </section>
+  </section>  <!--  backup -->
+  <section xml:id="ops.capacity"><title>Capacity Planning</title>
+    <section xml:id="ops.capacity.storage"><title>Storage</title>
+      <para>A common question for HBase administrators is how to estimate how much storage will be required for an HBase cluster.
+      There are several aspects to consider, the most important of which is what data to load into the cluster.  Start
+      with a solid understanding of how HBase handles data internally (KeyValue).
+      </para>
+      <section xml:id="ops.capacity.storage.kv"><title>KeyValue</title>
+        <para>HBase storage will be dominated by KeyValues.  See <xref linkend="keyvalue"
/> and <xref linkend="keysize" /> for 
+        how HBase stores data internally.  
+        </para>
+        <para>It is critical to understand that there is a KeyValue instance for every
attribute stored in a row, and the 
+        rowkey-length, ColumnFamily name-length and attribute lengths will drive the size
of the database more than any other
+        factor.
+        </para>
+      </section>
+      <section xml:id="ops.capacity.storage.sf"><title>StoreFiles and Blocks</title>
+        <para>KeyValue instances are aggregated into blocks, and the blocksize is configurable on a per-ColumnFamily basis.
+        Blocks are aggregated into StoreFiles.  See <xref linkend="regions.arch" />.
+        </para>
+      </section>
+      <section xml:id="ops.capacity.storage.hdfs"><title>HDFS Block Replication</title>
+        <para>Because HBase runs on top of HDFS, factor in HDFS block replication into
storage calculations.
+        </para>
+      </section>
+    </section>
+    <section xml:id="ops.capacity.regions"><title>Regions</title>
+      <para>Another common question for HBase administrators is determining the right
number of regions per
+      RegionServer.  This affects both storage and hardware planning. See <xref linkend="perf.number.of.regions"
+      </para>
+    </section>
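
As a hedged illustration of the HDFS block replication point in the Capacity Planning hunk above: the replication factor is normally set through the standard HDFS property `dfs.replication` in hdfs-site.xml. The property name is not mentioned in the commit itself; it is the stock HDFS setting, and 3 is its usual default. A minimal sketch:

```xml
<!-- hdfs-site.xml sketch: every HDFS block (and therefore every StoreFile block)
     is stored this many times, so raw disk needed is roughly the StoreFile
     footprint multiplied by this factor. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```

With this value, a table whose StoreFiles total 1 TB would occupy about 3 TB of raw HDFS capacity, before allowing for compaction headroom.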

Modified: hbase/branches/0.92/src/docbkx/performance.xml
URL: http://svn.apache.org/viewvc/hbase/branches/0.92/src/docbkx/performance.xml?rev=1204039&r1=1204038&r2=1204039&view=diff
--- hbase/branches/0.92/src/docbkx/performance.xml (original)
+++ hbase/branches/0.92/src/docbkx/performance.xml Sat Nov 19 18:43:37 2011
@@ -140,6 +140,14 @@
       <para>The number of regions for an HBase table is driven by the <xref
               linkend="bigger.regions" />. Also, see the architecture
           section on <xref linkend="arch.regions.size" /></para>
+       <para>A lower number of regions is preferred, generally in the range of 20 to
+       per RegionServer.  Adjust the regionsize as appropriate to achieve this number. 
+       </para>
+       <para>For the 0.90.x codebase, the upper-bound of regionsize is about 4 GB.
+       For the 0.92.x codebase, due to the HFile v2 change, much larger regionsizes can be supported (e.g., 20 GB).
+       </para>
+       <para>You may need to experiment with this setting based on your hardware configuration
and application needs.
+       </para>
     <section xml:id="perf.compactions.and.splits">
@@ -150,12 +158,6 @@
       something you want to consider.</para>
-    <section xml:id="perf.compression">
-      <title>Compression</title>
-      <para>Production systems should use compression with their column family definitions.
 See <xref linkend="compression" /> for more information.
-      </para>
-    </section>
     <section xml:id="perf.handlers">
         <para>See <xref linkend="hbase.regionserver.handler.count"/>. 
@@ -213,7 +215,52 @@
       <title>Key and Attribute Lengths</title>
       <para>See <xref linkend="keysize" />.</para>
-  </section>
+    <section xml:id="schema.regionsize"><title>Table RegionSize</title>
+    <para>The regionsize can be set on a per-table basis via <code>setMaxFileSize</code> on
+    <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html">HTableDescriptor</link> in the
+    event where certain tables require different regionsizes than the configured default.
+    </para>
+    <para>See <xref linkend="perf.number.of.regions"/> for more information.
+    </para>
+    </section>
+    <section xml:id="schema.bloom">
+    <title>Bloom Filters</title>
+    <para>Bloom Filters can be enabled per-ColumnFamily.
+        Use <code>HColumnDescriptor.setBloomFilterType(NONE | ROW |
+        ROWCOL)</code> to enable blooms per Column Family. Default =
+        <varname>NONE</varname> for no bloom filters. If
+        <varname>ROW</varname>, the hash of the row will be added to the bloom
+        on each insert. If <varname>ROWCOL</varname>, the hash of the row +
+        column family + column family qualifier will be added to the bloom on
+        each key insert.</para>
+    <para>See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link>
+    and <xref linkend="blooms"/> for more information.
+    </para>
+    </section>
+    <section xml:id="schema.cf.blocksize"><title>ColumnFamily BlockSize</title>
+    <para>The blocksize can be configured for each ColumnFamily in a table, and this
defaults to 64k.  Larger cell values require larger blocksizes. 
+    There is an inverse relationship between blocksize and the resulting StoreFile indexes
(i.e., if the blocksize is doubled then the resulting
+    indexes should be roughly halved).
+    </para>
+    <para>See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link>
+    and <xref linkend="store"/> for more information.
+    </para>
+    </section>
+    <section xml:id="cf.in.memory">
+    <title>In-Memory ColumnFamilies</title>
+    <para>ColumnFamilies can optionally be defined as in-memory.  Data is still persisted
to disk, just like any other ColumnFamily.  
+    In-memory blocks have the highest priority in the <xref linkend="block.cache" />,
but it is not a guarantee that the entire table
+    will be in memory.
+    </para>
+    <para>See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link>
for more information.
+    </para>
+    </section>
+    <section xml:id="perf.compression">
+      <title>Compression</title>
+      <para>Production systems should use compression with their ColumnFamily definitions.
 See <xref linkend="compression" /> for more information.
+      </para>
+    </section>
+  </section>  <!--  perf schema -->
   <section xml:id="perf.writing">
     <title>Writing to HBase</title>
@@ -348,6 +395,18 @@ Deferred log flush can be configured on 
       rows at a time to the client to be processed. There is a cost/benefit to
       have the cache value be large because it costs more in memory for both
       client and RegionServer, so bigger isn't always better.</para>
+      <section xml:id="perf.hbase.client.caching.mr">
+        <title>Scan Caching in MapReduce Jobs</title>
+        <para>Scan settings in MapReduce jobs deserve special attention.  Timeouts can result (e.g., UnknownScannerException)
+        in Map tasks if processing a batch of records takes too long before the client goes back to the RegionServer for the
+        next set of data.  This problem can occur because there is non-trivial processing occurring per row.  If you process
+        rows quickly, set caching higher.  If you process rows more slowly (e.g., lots of transformations per row, writes),
+        then set caching lower.
+        </para>
+        <para>Timeouts can also happen in a non-MapReduce use case (i.e., single threaded
HBase client doing a Scan), but the
+        processing that is often performed in MapReduce jobs tends to exacerbate this issue.
+        </para>
+      </section>
     <section xml:id="perf.hbase.client.selection">
       <title>Scan Attribute Selection</title>
@@ -431,4 +490,35 @@ htable.close();</programlisting></para>
   </section>  <!--  deleting -->
+  <section xml:id="perf.hdfs"><title>HDFS</title>
+   <para>Because HBase runs on <xref linkend="arch.hdfs" /> it is important to
understand how it works and how it affects
+   HBase.
+   </para>
+    <section xml:id="perf.hdfs.curr"><title>Current Issues With Low-Latency Reads</title>
+      <para>The original use-case for HDFS was batch processing.  As such, low-latency reads were historically not a priority.
+      With the increased adoption of HBase this is changing, and several improvements are already in development.
+      See the 
+      <link xlink:href="https://issues.apache.org/jira/browse/HDFS-1599">Umbrella Jira
Ticket for HDFS Improvements for HBase</link>.
+      </para>
+    </section>
+    <section xml:id="perf.hdfs.comp"><title>Performance Comparisons of HBase
vs. HDFS</title>
+     <para>A fairly common question on the dist-list is why HBase isn't as performant
as HDFS files in a batch context (e.g., as 
+     a MapReduce source or sink).  The short answer is that HBase is doing a lot more than
HDFS (e.g., reading the KeyValues, 
+     returning the most current row or specified timestamps, etc.), and as such HBase is
4-5 times slower than HDFS in this 
+     processing context.  Not that there isn't room for improvement (and this gap will, over
time, be reduced), but HDFS
+      will always be faster in this use-case.
+     </para>
+    </section>
+  </section>
+  <section xml:id="perf.ec2"><title>Amazon EC2</title>
+   <para>Performance questions are common for Amazon EC2 environments because EC2 is a shared environment.  You will
+   not see the same throughput as on a dedicated server.  When running tests on EC2, run them several times for the same
+   reason (i.e., it's a shared environment and you don't know what else is happening on the
+   </para>
+   <para>If you are running on EC2 and post performance questions on the dist-list, please state this fact up-front
+    because EC2 issues are practically a separate class of performance issues.
+   </para>
+  </section>
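
The scan caching guidance in the performance.xml changes above comes down to simple arithmetic: the client makes one fetch RPC per `caching` rows, and the time spent processing one cached batch has to stay under the scan timeout. The following is a minimal Java sketch of that reasoning, not part of the commit and not using the HBase API. The class and method names (`ScanCachingEstimate`, `rpcCalls`, `maxSafeCaching`), the 60-second timeout, the per-row costs, and the 50% safety margin are all illustrative assumptions; substitute the values measured on your own cluster.

```java
public class ScanCachingEstimate {

    /** Number of fetch RPCs needed to scan totalRows when each RPC returns up to `caching` rows. */
    public static long rpcCalls(long totalRows, int caching) {
        if (caching <= 0) {
            throw new IllegalArgumentException("caching must be positive");
        }
        return (totalRows + caching - 1) / caching; // ceiling division
    }

    /**
     * Largest caching value for which one batch is processed within a fraction
     * (safetyFactor) of the scan timeout, given an average per-row processing cost.
     */
    public static int maxSafeCaching(long timeoutMs, double perRowMs, double safetyFactor) {
        if (perRowMs <= 0 || safetyFactor <= 0 || safetyFactor > 1) {
            throw new IllegalArgumentException("perRowMs must be > 0 and safetyFactor in (0, 1]");
        }
        double budgetMs = timeoutMs * safetyFactor;       // time allowed per batch
        long rows = (long) Math.floor(budgetMs / perRowMs);
        return (int) Math.max(1, Math.min(rows, Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        long timeoutMs = 60_000;   // assumed scan timeout, for illustration only
        double slowRowMs = 200.0;  // hypothetical heavy per-row work (transformations, writes)
        double fastRowMs = 1.0;    // hypothetical light per-row work

        System.out.println("RPCs for 10000 rows at caching=500: " + rpcCalls(10_000, 500));
        System.out.println("Caching ceiling, slow rows: " + maxSafeCaching(timeoutMs, slowRowMs, 0.5));
        System.out.println("Caching ceiling, fast rows: " + maxSafeCaching(timeoutMs, fastRowMs, 0.5));
    }
}
```

The value returned by `maxSafeCaching` is only an upper bound derived from the timeout; memory use on the client and RegionServer, as noted in the surrounding text, may justify a smaller setting.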

Modified: hbase/branches/0.92/src/docbkx/troubleshooting.xml
URL: http://svn.apache.org/viewvc/hbase/branches/0.92/src/docbkx/troubleshooting.xml?rev=1204039&r1=1204038&r2=1204039&view=diff
--- hbase/branches/0.92/src/docbkx/troubleshooting.xml (original)
+++ hbase/branches/0.92/src/docbkx/troubleshooting.xml Sat Nov 19 18:43:37 2011
@@ -461,13 +461,17 @@ hadoop   17789  155 35.2 9067824 8604364
     <section xml:id="trouble.client">
+       <para>For more information on the HBase client, see <xref linkend="client"/>.

+       </para>
        <section xml:id="trouble.client.scantimeout">
-            <title>ScannerTimeoutException</title>
+            <title>ScannerTimeoutException or UnknownScannerException</title>
             <para>This is thrown if the time between RPC calls from the client to RegionServer
exceeds the scan timeout.  
             For example, if <code>Scan.setCaching</code> is set to 500, then
there will be an RPC call to fetch the next batch of rows every 500 <code>.next()</code>
calls on the ResultScanner
             because data is being transferred in blocks of 500 rows to the client.  Reducing
the setCaching value may be an option, but setting this value too low makes for inefficient
             processing of large numbers of rows.
+            <para>See <xref linkend="perf.hbase.client.caching"/>.
+            </para>
        <section xml:id="trouble.client.scarylogs">
             <title>Shell or client application throws lots of scary exceptions during
normal operation</title>
@@ -523,12 +527,16 @@ hadoop   17789  155 35.2 9067824 8604364
     <section xml:id="trouble.namenode">
+       <para>For more information on the NameNode, see <xref linkend="arch.hdfs"/>.

+       </para>
        <section xml:id="trouble.namenode.disk">
             <title>HDFS Utilization of Tables and Regions</title>
             <para>To determine how much space HBase is using on HDFS use the <code>hadoop</code>
shell commands from the NameNode.  For example... </para>
             <para><programlisting>hadoop fs -dus /hbase/</programlisting>
...returns the summarized disk utilization for all HBase objects.  </para>
             <para><programlisting>hadoop fs -dus /hbase/myTable</programlisting>
...returns the summarized disk utilization for the HBase table 'myTable'. </para>
             <para><programlisting>hadoop fs -du /hbase/myTable</programlisting>
...returns a list of the regions under the HBase table 'myTable' and their disk utilization.
+            <para>For more information on HDFS shell commands, see the <link xlink:href="http://hadoop.apache.org/common/docs/current/file_system_shell.html">HDFS
FileSystem Shell documentation</link>.
+            </para>
        <section xml:id="trouble.namenode.hbase.objects">
             <title>Browsing HDFS for HBase Objects</title>
@@ -552,6 +560,9 @@ hadoop   17789  155 35.2 9067824 8604364
                <filename>/&lt;HLog&gt;</filename>           (WAL HLog
files for the RegionServer)
+		    <para>See the <link xlink:href="http://hadoop.apache.org/common/docs/current/hdfs_user_guide.html">HDFS
User Guide</link> for other non-shell diagnostic 
+		    utilities like <code>fsck</code>. 
+            </para>
           <section xml:id="trouble.namenode.uncompaction">
             <title>Use Cases</title>
               <para>One common use-case for querying HDFS for HBase objects is to research
the degree of uncompaction of a table.  If there are a large number of StoreFiles for each
ColumnFamily it could 
@@ -565,6 +576,8 @@ hadoop   17789  155 35.2 9067824 8604364
     <section xml:id="trouble.rs">
+        <para>For more information on the RegionServers, see <xref linkend="regionserver.arch"/>.

+       </para>
       <section xml:id="trouble.rs.startup">
         <title>Startup Errors</title>
           <section xml:id="trouble.rs.startup.master-no-region">
@@ -747,6 +760,8 @@ ERROR org.apache.hadoop.hbase.regionserv
     <section xml:id="trouble.master">
+       <para>For more information on the Master, see <xref linkend="master"/>.

+       </para>
       <section xml:id="trouble.master.startup">
         <title>Startup Errors</title>
           <section xml:id="trouble.master.startup.migration">
@@ -812,6 +827,13 @@ ERROR org.apache.hadoop.hbase.regionserv
              <para>Questions on HBase and Amazon EC2 come up frequently on the HBase
dist-list. Search for old threads using <link xlink:href="http://search-hadoop.com/">Search
+          <section xml:id="trouble.ec2.connection">
+             <title>Remote Java Connection into EC2 Cluster Not Working</title>
+             <para>
+             See Andrew's answer here, up on the user list: <link xlink:href="http://search-hadoop.com/m/sPdqNFAwyg2">Remote
Java client connection into EC2 instance</link>.
+             </para>
+          </section>
