From: Apache Wiki
To: hadoop-commits@lucene.apache.org
Reply-To: hadoop-dev@lucene.apache.org
Date: Sat, 30 Jun 2007 17:00:28 -0000
Message-ID: <20070630170028.19878.89991@eos.apache.org>
Subject: [Lucene-hadoop Wiki] Update of "Hbase/HbaseArchitecture" by JimKellerman

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by JimKellerman:
http://wiki.apache.org/lucene-hadoop/Hbase/HbaseArchitecture

------------------------------------------------------------------------------
  [[Anchor(status)]]

  = Current Status =

- As of this writing (2007/05/30), there are approximately 11,500 lines of code in
+ As of this writing (2007/06/30), there are approximately 16,500 lines of code in
  the "src/contrib/hbase/src/java/org/apache/hadoop/hbase/" directory on the Hadoop SVN trunk.

- There are also about 2800 lines of test cases.
+ There are also about 4,000 lines of test cases.

  All of the single-machine operations (safe-committing, merging, splitting,
  versioning, flushing, compacting, log-recovery) are complete, have been
  tested, and seem to work well.

  The multi-machine components (the HMaster, the !HRegionServer, and the
- HClient) are in the process of being debugged.
+ HClient) are actively being enhanced and debugged.

  Other related features and TODOs:

- 1. We need easy interfaces to !MapReduce jobs, so they can scan tables. We have been contacted by Vuk Ercegovac [[MailTo(vercego AT SPAMFREE us DOT ibm DOT com)]] of IBM Almaden Research, who expressed an interest in working on an HBase interface to Hadoop map/reduce.
- 1. Vuk Ercegovac also pointed out that keeping HBase HRegion edit logs in HDFS is currently flawed. HBase writes edits to logs and to a memcache. The 'atomic' write to the log is meant to serve as insurance against abnormal !RegionServer exit: on startup, the log is rerun to reconstruct an HRegion's last wholesome state. But files in HDFS do not 'exist' until they are cleanly closed -- something that will not happen if !RegionServer exits without running its 'close'.
+ 1. Vuk Ercegovac [[MailTo(vercego AT SPAMFREE us DOT ibm DOT com)]] of IBM Almaden Research pointed out that keeping HBase HRegion edit logs in HDFS is currently flawed. HBase writes edits to logs and to a memcache. The 'atomic' write to the log is meant to serve as insurance against an abnormal !RegionServer exit: on startup, the log is replayed to reconstruct an HRegion's last consistent state. But files in HDFS do not 'exist' until they are cleanly closed -- something that will not happen if the !RegionServer exits without running its 'close'. (A sketch of this log-then-memcache write path appears below.)
  1. The HMemcache lookup structure is relatively inefficient.
  1. File compaction is relatively slow; we should have a more conservative algorithm for deciding when to apply compaction, and likewise for region splits. (A sketch of one possible trigger appears below.)
  1. For the getFull() operation, use of Bloom filters would speed things up; see [https://issues.apache.org/jira/browse/HADOOP-1415 HADOOP-1415]. (A sketch appears below.)
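
The first TODO above describes both the write path and the recovery path. The following is a minimal, hypothetical sketch of that pattern only; the class and method names (EditLog, put, replay) and the TreeMap standing in for HMemcache are illustrative assumptions, not HBase's actual code.

{{{
// Illustrative sketch only -- not HBase's real classes. Every edit is appended
// to a log file in HDFS first, then applied to an in-memory cache; on restart
// the log is replayed to rebuild that cache.
import java.io.EOFException;
import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EditLog {
  private final FileSystem fs;
  private final Path logPath;
  private final FSDataOutputStream out;

  // In-memory image of recent edits (stand-in for the real HMemcache).
  private final TreeMap<String, String> memcache = new TreeMap<String, String>();

  public EditLog(Configuration conf, Path logPath) throws IOException {
    this.fs = FileSystem.get(conf);
    this.logPath = logPath;
    this.out = fs.create(logPath);
  }

  /** Log first, then update the memcache. */
  public void put(String row, String value) throws IOException {
    out.writeUTF(row);         // 1. append the edit to the log
    out.writeUTF(value);
    out.flush();               // NOTE: until the file is closed, HDFS does not
                               // guarantee the bytes are visible to readers --
                               // exactly the flaw described in the TODO above.
    memcache.put(row, value);  // 2. apply the edit in memory
  }

  /** On startup, rebuild the memcache from whatever the log contains. */
  public void replay() throws IOException {
    if (!fs.exists(logPath)) {
      return;
    }
    FSDataInputStream in = fs.open(logPath);
    try {
      while (true) {
        String row = in.readUTF();
        String value = in.readUTF();
        memcache.put(row, value);
      }
    } catch (EOFException expected) {
      // reached the end of the recovered log
    } finally {
      in.close();
    }
  }
}
}}}

The flush() in put() is where the reported flaw bites: if the !RegionServer dies before close() is called on the HDFS file, a reader replaying the log after the crash may not see any of the appended edits.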
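
For the compaction TODO, one conservative policy (an assumption for illustration, not the project's chosen algorithm) is to compact only when a region has accumulated several small store files, leaving already-large files alone:

{{{
// Hypothetical compaction trigger: compact only when enough small store files
// have piled up, rather than after every memcache flush.
import java.util.List;

public class CompactionPolicy {
  private final int minFilesToCompact;      // e.g. 4
  private final long maxFileSizeToCompact;  // e.g. 64 MB; bigger files are left alone

  public CompactionPolicy(int minFilesToCompact, long maxFileSizeToCompact) {
    this.minFilesToCompact = minFilesToCompact;
    this.maxFileSizeToCompact = maxFileSizeToCompact;
  }

  /** Decide whether a region's store files are worth compacting now. */
  public boolean shouldCompact(List<Long> storeFileSizes) {
    int smallFiles = 0;
    for (long size : storeFileSizes) {
      if (size <= maxFileSizeToCompact) {
        smallFiles++;
      }
    }
    return smallFiles >= minFilesToCompact;
  }
}
}}}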
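
For the getFull() TODO, the idea behind HADOOP-1415 is to consult a per-store-file Bloom filter before reading the file, so files that definitely do not contain the requested row key are skipped. The class below is a self-contained illustration; the name RowKeyBloomFilter and the hash scheme are assumptions, not the HADOOP-1415 implementation.

{{{
// Minimal Bloom filter over row keys, for illustration only.
import java.util.BitSet;

public class RowKeyBloomFilter {
  private final BitSet bits;
  private final int numBits;
  private final int numHashes;

  public RowKeyBloomFilter(int numBits, int numHashes) {
    this.numBits = numBits;
    this.numHashes = numHashes;
    this.bits = new BitSet(numBits);
  }

  /** Called while writing a store file, once per row key it contains. */
  public void add(String rowKey) {
    for (int i = 0; i < numHashes; i++) {
      bits.set(indexFor(rowKey, i));
    }
  }

  /** False means the row is definitely absent; true means "maybe present". */
  public boolean mightContain(String rowKey) {
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(indexFor(rowKey, i))) {
        return false;
      }
    }
    return true;
  }

  // Derive the i-th probe position from two cheap hashes of the key.
  private int indexFor(String rowKey, int i) {
    int h1 = rowKey.hashCode();
    int h2 = h1 * 31 + 17;  // crude second hash, good enough for a sketch
    int idx = (h1 + i * h2) % numBits;
    return idx < 0 ? idx + numBits : idx;
  }
}
}}}

getFull() would then call mightContain(row) for each store file and open only the files that return true; false positives cost an extra read, but there are never false negatives.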