hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Trivial Update of "Hbase/FAQ" by stack
Date Fri, 28 Dec 2007 18:40:49 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by stack:
http://wiki.apache.org/lucene-hadoop/Hbase/FAQ

The comment on the change is:
New performance faq addition

------------------------------------------------------------------------------
+ == Questions ==
- [[Anchor(1)]]
- [[Anchor(2)]]
- [[Anchor(3)]]
  
- '''1. [#1 Can someone give an example of basic API-usage going against hbase?]'''
+  1. [#1 Can someone give an example of basic API usage against hbase?]
+  1. [#2 What other hbase-like applications are out there?]
+  1. [#3 Can I fix !OutOfMemoryExceptions in hbase?]
+  1. [#4 How do I enable hbase DEBUG-level logging?]
+  1. [#5 Why do I see "java.io.IOException...(Too many open files)" in my logs?]
+  1. [#6 What can I do to improve hbase performance?]
+ 
+ == Answers ==
+ 
+ '''1. [[Anchor(1)]] Can someone give an example of basic API usage against hbase?'''
  
  The two main client-side entry points are [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/HBaseAdmin.html
HBaseAdmin] and [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/HTable.html
HTable].  Use !HBaseAdmin to create, drop, list, enable and disable tables, and to add and drop table column families.  For adding, updating and deleting data, use HTable.  Here is some pseudo code, absent error checking, imports, etc., that creates a table, adds data, fetches the just-added data, and then deletes the table.
  
@@ -39, +46 @@
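+ 
+ As a rough, illustrative sketch of such a client -- assuming the circa-0.15 hbase client API, with a made-up ''test'' table, ''content:'' column family, and row/value; class and method names may differ in your version:
+ {{{
+ import org.apache.hadoop.hbase.HBaseAdmin;
+ import org.apache.hadoop.hbase.HBaseConfiguration;
+ import org.apache.hadoop.hbase.HColumnDescriptor;
+ import org.apache.hadoop.hbase.HTable;
+ import org.apache.hadoop.hbase.HTableDescriptor;
+ import org.apache.hadoop.io.Text;
+ 
+ public class BasicClient {
+   public static void main(String[] args) throws Exception {
+     HBaseConfiguration conf = new HBaseConfiguration();
+ 
+     // Create a table with a single column family using HBaseAdmin.
+     HBaseAdmin admin = new HBaseAdmin(conf);
+     HTableDescriptor desc = new HTableDescriptor("test");
+     desc.addFamily(new HColumnDescriptor("content:"));
+     admin.createTable(desc);
+ 
+     // Add a row, then fetch the just-added data, using HTable.
+     HTable table = new HTable(conf, new Text("test"));
+     long lockid = table.startUpdate(new Text("row1"));
+     table.put(lockid, new Text("content:"), "hello".getBytes());
+     table.commit(lockid);
+     byte[] value = table.get(new Text("row1"), new Text("content:"));
+     System.out.println(new String(value));
+ 
+     // Drop the table; it must be disabled first.
+     admin.disableTable(new Text("test"));
+     admin.deleteTable(new Text("test"));
+   }
+ }
+ }}}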

  
  For further examples, check out the hbase unit tests.  These are probably your best source
for sample code.  Start with the code in org.apache.hadoop.hbase.!TestHBaseCluster.  It does
a general table setup and then performs various client operations on the created table: loading,
scanning, deleting, etc.
  
+ Don't forget your client will need a running hbase instance to connect to (see the ''Getting
Started'' section toward the end of this
+ [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/package-summary.html#package_description
Hbase Package Summary] page).
  
- '''2. [#2 What other hbase-like applications are there out there?]'''
+ '''2. [[Anchor(2)]] What other hbase-like applications are out there?'''
  
  Apart from Google's bigtable, here are ones we know of:
   * [wiki:Hbase/PNUTS PNUTS], a Platform for Nimble Universal Table Storage, being developed
internally at Yahoo!
   * [http://www.amazon.com/gp/browse.html?node=342335011 Amazon SimpleDB] is a web service
for running queries on structured data in real time.
  
- '''3. [#3 Can I fix O!utOfMemoryExceptions in hbase?]'''
+ '''3. [[Anchor(3)]] Can I fix !OutOfMemoryExceptions in hbase?'''
  Out-of-the-box, hbase uses the default JVM heap size.  Set the ''HBASE_HEAPSIZE'' environment variable in ''${HBASE_HOME}/conf/hbase-env.sh'' if your install needs to run with a larger heap.  ''HBASE_HEAPSIZE'' is like ''HADOOP_HEAPSIZE'' in that its value is the desired heap size in MB.  The surrounding ''-Xmx'' prefix and ''m'' suffix that make up the java maximum-heap option are added by the hbase start script (see how ''HBASE_HEAPSIZE'' is used in the ''${HBASE_HOME}/bin/hbase'' script for clarification).
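+ 
+ For example, to run with a 1GB heap (the value here is illustrative):
+ {{{
+ # ${HBASE_HOME}/conf/hbase-env.sh
+ export HBASE_HEAPSIZE=1000   # desired heap in MB; the start script expands this to -Xmx1000m
+ }}}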
  
- '''4. [#4 How do I enable hbase DEBUG-level logging?]'''
+ '''4. [[Anchor(4)]] How do I enable hbase DEBUG-level logging?'''
  
  Either add the following line to your log4j.properties file -- ''log4j.logger.org.apache.hadoop.hbase=DEBUG'' -- and restart your cluster, or, if running a post-0.15.x version, set DEBUG via the UI by clicking the 'Log Level' link.
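+ 
+ That is, assuming the stock layout, in your hbase conf directory:
+ {{{
+ # conf/log4j.properties
+ log4j.logger.org.apache.hadoop.hbase=DEBUG
+ }}}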
  
- '''5. [#5 Why do I see "java.io.IOException...(Too many open files)" in my logs?]'''
+ '''5. [[Anchor(5)]] Why do I see "java.io.IOException...(Too many open files)" in my logs?'''
  
  Running an hbase cluster loaded with more than a few regions, it's possible to blow past the operating system's file handle limit for the user running the process.  Running out of file handles is like an OOME: things start to fail in strange ways.  To increase the user's file handle limit, edit '''/etc/security/limits.conf''' on all nodes and restart your cluster.
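+ 
+ For example, assuming your daemons run as a user named ''hadoop'' (the user name and limit below are illustrative):
+ {{{
+ # /etc/security/limits.conf
+ hadoop  soft  nofile  32768
+ hadoop  hard  nofile  32768
+ }}}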
  
+ '''6. [[Anchor(6)]] What can I do to improve hbase performance?'''
+ 
+ To improve random-read performance, if you can, try making the hdfs block size smaller (as is suggested in the bigtable paper).  By default it's 64MB; try setting it to 8MB.  On every random read, hbase has to fetch from hdfs the blocks that contain the wanted row.  If your rows are small -- much smaller than the hdfs block size -- then we'll be fetching a lot of data only to discard most of it.  Meanwhile, the big block fetches and processing consume CPU, network, etc., in the datanodes and the hbase client.
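+ 
+ For example, in your hadoop-site.xml (''dfs.block.size'' is the hadoop-era property name assumed here; 8388608 bytes = 8MB):
+ {{{
+ <property>
+   <name>dfs.block.size</name>
+   <value>8388608</value>
+ </property>
+ }}}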
+ 
+ Another configuration that can help random reads, at some cost in memory, is making '''hbase.io.index.interval''' smaller.  By default, when hbase writes store files, it adds an entry to the mapfile index on every 32nd addition (for hadoop, the default is every 128th addition).  Adding entries more frequently -- every 16th or every 8th addition -- means less seeking around looking for the wanted entry, but at the cost of hbase carrying a larger index (indices are read into memory on mapfile open; by default there are one to five or so mapfiles per column family per region loaded into a regionserver).
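+ 
+ For example, in your hbase-site.xml (the interval value below is illustrative):
+ {{{
+ <property>
+   <name>hbase.io.index.interval</name>
+   <value>16</value>
+ </property>
+ }}}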
+ 
+ Some basic tests making '''io.bytes.per.checksum''' larger -- checksumming every 4096 bytes instead of every 512 bytes -- seem to have no discernible effect on performance.
+ 
