Return-Path:
Delivered-To: apmail-hadoop-hbase-dev-archive@minotaur.apache.org
Received: (qmail 68823 invoked from network); 6 Jan 2010 19:37:18 -0000
Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3)
    by minotaur.apache.org with SMTP; 6 Jan 2010 19:37:18 -0000
Received: (qmail 76172 invoked by uid 500); 6 Jan 2010 19:37:17 -0000
Delivered-To: apmail-hadoop-hbase-dev-archive@hadoop.apache.org
Received: (qmail 76100 invoked by uid 500); 6 Jan 2010 19:37:17 -0000
Mailing-List: contact hbase-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: hbase-dev@hadoop.apache.org
Delivered-To: mailing list hbase-dev@hadoop.apache.org
Received: (qmail 75909 invoked by uid 99); 6 Jan 2010 19:37:17 -0000
Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136)
    by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 06 Jan 2010 19:37:17 +0000
X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED
X-Spam-Check-By: apache.org
Received: from [140.211.11.140] (HELO brutus.apache.org) (140.211.11.140)
    by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 06 Jan 2010 19:37:16 +0000
Received: from brutus.apache.org (localhost [127.0.0.1])
    by brutus.apache.org (Postfix) with ESMTP id 578D1234C1F2
    for ; Wed, 6 Jan 2010 11:36:56 -0800 (PST)
Message-ID: <1717744375.75441262806616357.JavaMail.jira@brutus.apache.org>
Date: Wed, 6 Jan 2010 19:36:56 +0000 (UTC)
From: "stack (JIRA)"
To: hbase-dev@hadoop.apache.org
Subject: [jira] Commented: (HBASE-2037) Alternate indexed hbase implementation; speeds scans by adding indexes to regions rather secondary tables
In-Reply-To: <156634733.1260473298295.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

    [ https://issues.apache.org/jira/browse/HBASE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797266#action_12797266 ]

stack commented on HBASE-2037:
------------------------------

I made hbase-2092 a blocker on 0.20.3.

> Alternate indexed hbase implementation; speeds scans by adding indexes to regions rather secondary tables
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2037
>                 URL: https://issues.apache.org/jira/browse/HBASE-2037
>             Project: Hadoop HBase
>          Issue Type: New Feature
>            Reporter: stack
>             Fix For: 0.20.3
>
>         Attachments: idx-hbase2.patch, idx-hbase3.patch, index.html
>
>
> Purpose
> The goal of the indexed HBase contrib is to speed up scans by indexing HBase columns. Indexed HBase (IHBase) differs from the indexed tables in transactional HBase (ITHBase): while the indexes in ITHBase are in fact HBase tables that use the indexed column's values as row keys, IHBase creates indexes at the region level. The differences are summarized below.
> + Global ordering
> ITHBase: yes
> IHBase: no
> Comment: IHBase has an index for each region. The flip side of not having global ordering is compatibility with the good old HRegion: results come back in row order (and not in value order as in ITHBase).
> + Full table scan?
> ITHBase: no
> IHBase: no
> Comment: ITHBase does a partial scan on the index table. IHBase supports specifying start/end rows to limit the number of scanned regions.
> + Multiple index usage
> ITHBase: no
> IHBase: yes
> Comment: IHBase can take advantage of multiple indexes in the same scan.
> The IHBase IdxScan object accepts an Expression that allows the intersection/union of several indexed column criteria.
> + Extra disk storage
> ITHBase: yes
> IHBase: no
> Comment: IHBase indexes are created when the region starts/flushes and do not require any extra storage.
> + Extra RAM
> ITHBase: yes
> IHBase: yes
> Comment: IHBase indexes are held in memory and hence increase the memory overhead. ITHBase indexes increase the number of regions each region server has to support, which costs memory too.
> + Parallel scanning support
> ITHBase: no
> IHBase: yes
> Comment: In ITHBase the index table needs to be consulted and then GETs are issued for each matching row. The behavior of IHBase (as perceived by the client) is no different from a regular scan and hence supports parallel scanning seamlessly. Parallel GETs could be implemented to speed up ITHBase scans.
> Why IHBase should outperform ITHBase
> 1. More flexible:
>    a. Supports range queries and multi-index queries
>    b. Supports different types, not only byte arrays
> 2. Less overhead: ITHBase pays at least two 'table roundtrips', one for the index table and the other for the main table
> 3. Quicker index expression evaluation: IHBase uses dedicated index data structures, while ITHBase uses the regular HRegion scan facilities
> Implementation notes
> • Only Storefiles are indexed. Every index scan performs a full memstore scan. Indexing the memstore will be implemented only if scanning the memstore proves to be a performance bottleneck.
> • Index expression evaluation is performed using bit sets. There are two types of bitsets: compressed and expanded. An index will typically store a compressed bitset, while an expression evaluator will most probably use an expanded bitset.
> + TODO
> This patch changes some of hbase core so that it can instantiate something other than the default HRegion. It fixes bugs in filter too.
> Would like to add this as a contrib package on the 0.20 branch in time for 0.20.3 if possible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
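
For readers of the quoted description: the implementation notes say index expression evaluation is performed using bit sets, and the comparison says one scan can intersect or union several indexed column criteria within a region. The following is only a rough sketch of that idea, not the code in the attached patches. The names build_index and matching_rows, the sample rows, and the use of plain Python integers as an "expanded" bitset are all assumptions made for illustration; the compressed bitset form is not shown.

```python
from typing import Dict, List


def build_index(column_values: List[str]) -> Dict[str, int]:
    """Hypothetical per-region index: value -> bitmask of row ordinals.

    Bit i is set when the i-th row of the region (in row-key order)
    holds that value in the indexed column.
    """
    index: Dict[str, int] = {}
    for ordinal, value in enumerate(column_values):
        index[value] = index.get(value, 0) | (1 << ordinal)
    return index


def matching_rows(rows: List[str], bits: int) -> List[str]:
    """Return the rows whose bit is set, preserving row-key order."""
    return [row for ordinal, row in enumerate(rows) if (bits >> ordinal) & 1]


# Made-up region contents: row keys in sorted order plus two indexed columns.
ROWS = ["row-a", "row-b", "row-c", "row-d"]
color_index = build_index(["red", "blue", "red", "red"])
size_index = build_index(["L", "L", "S", "L"])

# Intersection of two criteria: color == red AND size == L.
red_and_large = color_index["red"] & size_index["L"]

# Union of two criteria: color == blue OR size == S.
blue_or_small = color_index["blue"] | size_index["S"]

print(matching_rows(ROWS, red_and_large))
print(matching_rows(ROWS, blue_or_small))
```

Because each bit position is a row ordinal inside the region, the matches come back in row order, which is consistent with the "no global ordering" entry in the comparison.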