From: Apache Wiki
To: hadoop-commits@lucene.apache.org
Reply-To: hadoop-dev@lucene.apache.org
Date: Sat, 30 Jun 2007 17:00:28 -0000
Message-ID: <20070630170028.19878.89991@eos.apache.org>
Subject: [Lucene-hadoop Wiki] Update of "Hbase/HbaseArchitecture" by JimKellerman

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by JimKellerman:
http://wiki.apache.org/lucene-hadoop/Hbase/HbaseArchitecture

------------------------------------------------------------------------------
  [[Anchor(status)]]

  = Current Status =

- As of this writing (2007/05/30), there are approximately 11,500 lines of code in
+ As of this writing (2007/06/30), there are approximately 16,500 lines of code in
  the "src/contrib/hbase/src/java/org/apache/hadoop/hbase/" directory on the Hadoop SVN trunk.

- There are also about 2800 lines of test cases.
+ There are also about 4,000 lines of test cases.

  All of the single-machine operations (safe-committing, merging, splitting,
  versioning, flushing, compacting, log-recovery) are complete, have been
  tested, and seem to work well.

  The multi-machine components (the HMaster, the !HRegionServer, and the
- HClient) are in the process of being debugged.
+ HClient) are actively being enhanced and debugged.

  Other related features and TODOs:

- 1. We need easy interfaces to !MapReduce jobs, so they can scan tables. We have been contacted by Vuk Ercegovac [[MailTo(vercego AT SPAMFREE us DOT ibm DOT com)]] of IBM Almaden Research, who expressed an interest in working on an HBase interface to Hadoop map/reduce.
- 1. Vuk Ercegovac also pointed out that keeping HBase HRegion edit logs in HDFS is currently flawed. HBase writes edits to logs and to a memcache. The 'atomic' write to the log is meant to serve as insurance against abnormal !RegionServer exit: on startup, the log is rerun to reconstruct an HRegion's last wholesome state. But files in HDFS do not 'exist' until they are cleanly closed -- something that will not happen if !RegionServer exits without running its 'close'.
+ 1. Vuk Ercegovac [[MailTo(vercego AT SPAMFREE us DOT ibm DOT com)]] of IBM Almaden Research pointed out that keeping HBase HRegion edit logs in HDFS is currently flawed. HBase writes edits to logs and to a memcache. The 'atomic' write to the log is meant to serve as insurance against an abnormal !RegionServer exit: on startup, the log is replayed to reconstruct an HRegion's last consistent state. But files in HDFS do not 'exist' until they are cleanly closed -- something that will not happen if the !RegionServer exits without running its 'close'. (A sketch of this log-then-memcache write path appears below.)
  1. The HMemcache lookup structure is relatively inefficient.
  1. File compaction is relatively slow; we should have a more conservative algorithm for deciding when to apply compaction, and likewise for region splits. (A sketch of one possible trigger appears below.)
  1. For the getFull() operation, use of Bloom filters would speed things up; see [https://issues.apache.org/jira/browse/HADOOP-1415 HADOOP-1415]. (A sketch appears below.)
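
The first TODO above describes both the write path and the recovery path. The following is a minimal, hypothetical sketch of that pattern only; the class and method names (EditLog, put, replay) and the TreeMap standing in for HMemcache are illustrative assumptions, not HBase's actual code.

{{{
// Illustrative sketch only -- not HBase's real classes. Every edit is appended
// to a log file in HDFS first, then applied to an in-memory cache; on restart
// the log is replayed to rebuild that cache.
import java.io.EOFException;
import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EditLog {
  private final FileSystem fs;
  private final Path logPath;
  private final FSDataOutputStream out;

  // In-memory image of recent edits (stand-in for the real HMemcache).
  private final TreeMap<String, String> memcache = new TreeMap<String, String>();

  public EditLog(Configuration conf, Path logPath) throws IOException {
    this.fs = FileSystem.get(conf);
    this.logPath = logPath;
    this.out = fs.create(logPath);
  }

  /** Log first, then update the memcache. */
  public void put(String row, String value) throws IOException {
    out.writeUTF(row);         // 1. append the edit to the log
    out.writeUTF(value);
    out.flush();               // NOTE: until the file is closed, HDFS does not
                               // guarantee the bytes are visible to readers --
                               // exactly the flaw described in the TODO above.
    memcache.put(row, value);  // 2. apply the edit in memory
  }

  /** On startup, rebuild the memcache from whatever the log contains. */
  public void replay() throws IOException {
    if (!fs.exists(logPath)) {
      return;
    }
    FSDataInputStream in = fs.open(logPath);
    try {
      while (true) {
        String row = in.readUTF();
        String value = in.readUTF();
        memcache.put(row, value);
      }
    } catch (EOFException expected) {
      // reached the end of the recovered log
    } finally {
      in.close();
    }
  }
}
}}}

The flush() in put() is where the reported flaw bites: if the !RegionServer dies before close() is called on the HDFS file, a reader replaying the log after the crash may not see any of the appended edits.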
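
For the compaction TODO, one conservative policy (an assumption for illustration, not the project's chosen algorithm) is to compact only when a region has accumulated several small store files, leaving already-large files alone:

{{{
// Hypothetical compaction trigger: compact only when enough small store files
// have piled up, rather than after every memcache flush.
import java.util.List;

public class CompactionPolicy {
  private final int minFilesToCompact;      // e.g. 4
  private final long maxFileSizeToCompact;  // e.g. 64 MB; bigger files are left alone

  public CompactionPolicy(int minFilesToCompact, long maxFileSizeToCompact) {
    this.minFilesToCompact = minFilesToCompact;
    this.maxFileSizeToCompact = maxFileSizeToCompact;
  }

  /** Decide whether a region's store files are worth compacting now. */
  public boolean shouldCompact(List<Long> storeFileSizes) {
    int smallFiles = 0;
    for (long size : storeFileSizes) {
      if (size <= maxFileSizeToCompact) {
        smallFiles++;
      }
    }
    return smallFiles >= minFilesToCompact;
  }
}
}}}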
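
For the getFull() TODO, the idea behind HADOOP-1415 is to consult a per-store-file Bloom filter before reading the file, so files that definitely do not contain the requested row key are skipped. The class below is a self-contained illustration; the name RowKeyBloomFilter and the hash scheme are assumptions, not the HADOOP-1415 implementation.

{{{
// Minimal Bloom filter over row keys, for illustration only.
import java.util.BitSet;

public class RowKeyBloomFilter {
  private final BitSet bits;
  private final int numBits;
  private final int numHashes;

  public RowKeyBloomFilter(int numBits, int numHashes) {
    this.numBits = numBits;
    this.numHashes = numHashes;
    this.bits = new BitSet(numBits);
  }

  /** Called while writing a store file, once per row key it contains. */
  public void add(String rowKey) {
    for (int i = 0; i < numHashes; i++) {
      bits.set(indexFor(rowKey, i));
    }
  }

  /** False means the row is definitely absent; true means "maybe present". */
  public boolean mightContain(String rowKey) {
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(indexFor(rowKey, i))) {
        return false;
      }
    }
    return true;
  }

  // Derive the i-th probe position from two cheap hashes of the key.
  private int indexFor(String rowKey, int i) {
    int h1 = rowKey.hashCode();
    int h2 = h1 * 31 + 17;  // crude second hash, good enough for a sketch
    int idx = (h1 + i * h2) % numBits;
    return idx < 0 ? idx + numBits : idx;
  }
}
}}}

getFull() would then call mightContain(row) for each store file and open only the files that return true; false positives cost an extra read, but there are never false negatives.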