hbase-dev mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2037) Alternate indexed hbase implementation; speeds scans by adding indexes to regions rather secondary tables
Date Wed, 06 Jan 2010 18:35:54 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797229#action_12797229 ]

stack commented on HBASE-2037:
------------------------------

I'm looking into it... If I can't fix this, I think we should back out HBASE-2037.

> Alternate indexed hbase implementation; speeds scans by adding indexes to regions rather
secondary tables
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2037
>                 URL: https://issues.apache.org/jira/browse/HBASE-2037
>             Project: Hadoop HBase
>          Issue Type: New Feature
>            Reporter: stack
>             Fix For: 0.20.3
>
>         Attachments: idx-hbase2.patch, idx-hbase3.patch, index.html
>
>
> Purpose
> The goal of the indexed HBase contrib is to speed up scans by indexing HBase columns.
Indexed HBase (IHBase) differs from the indexed tables in transactional HBase (ITHBase):
while the indexes in ITHBase are, in fact, HBase tables that use the indexed column's values
as row keys, IHBase creates indexes at the region level. The differences are summarized
below.
> + global ordering
> ITHBase: yes
> IHBase: no
> Comment: IHBase has an index for each region. The flip side of not having global ordering
is compatibility with the good old HRegion: results come back in row order (and not
value order as in ITHBase)
> + Full table scan?
> ITHBase: no
> IHBase: no
> Comment: ITHbase does a partial scan on the index table. IHbase supports specifying start/end
rows to limit the number of scanned regions
> + Multiple Index Usage
> ITHBase: no
> IHBase: yes
> Comment: IHBase can take advantage of multiple indexes in the same scan. The IHBase IdxScan
object accepts an Expression, which allows intersection/union of several indexed
column criteria
> + Extra disk storage
> ITHBase: yes
> IHBase: no
> Comment: IHbase indexes are created when the region starts/flushes and do not require
any extra storage
> + Extra RAM
> ITHBase: yes
> IHBase: yes
> Comment: IHBase indexes are in memory and hence increase the memory overhead. ITHBase
indexes increase the number of regions each region server has to support, which costs memory
too
> + Parallel scanning support
> ITHBase: no
> IHBase: yes
> Comment: In ITHBase the index table needs to be consulted and then GETs are issued for each matching
row. The behavior of IHBase (as perceived by the client) is no different from a regular scan
and hence supports parallel scanning seamlessly. Parallel GETs could be implemented to speed up
ITHBase scans
> Why IHbase should outperform ITHBase
> 1. More flexible: a. Supports range queries and multi-index queries b. Supports different
types - not only byte arrays
> 2. Less overhead: ITHbase pays at least two 'table roundtrips' - one for the index table
and the other for the main table
> 3. Quicker index expression evaluation: IHBase uses dedicated index data structures
while ITHBase uses the regular HRegion scan facilities
> Implementation notes
> • Only store files are indexed. Every index scan performs a full memstore scan. Indexing the
memstore will be implemented only if scanning the memstore proves to be a performance
bottleneck
> • Index expression evaluation is performed using bit sets. There are two types of bit sets:
compressed and expanded. An index will typically store a compressed bit set while an expression
evaluator will most probably use an expanded bit set
> + TODO
> This patch changes some of the HBase core so that a region implementation other than the default HRegion can be instantiated.
 It also fixes bugs in filters.
> Would like to add this as a contrib package on the 0.20 branch in time for 0.20.3 if possible.
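
The bit-set evaluation mentioned in the implementation notes can be illustrated with a small sketch. This is not the IHBase code or its API: it assumes each index lookup yields a java.util.BitSet whose set bits are the positions of matching rows within one region, and the class name, helper methods and sample criteria are all hypothetical. It only shows how intersection and union of several indexed-column criteria reduce to bitwise operations.

```java
import java.util.BitSet;

// Hypothetical sketch, not the IHBase API. Each bit position stands for a
// row within a single region; a set bit means the row matches a criterion.
public class BitSetIndexSketch {

    /** Rows matching both criteria (intersection). */
    static BitSet and(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone(); // copy so the input bit set is left untouched
        result.and(b);
        return result;
    }

    /** Rows matching either criterion (union). */
    static BitSet or(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.or(b);
        return result;
    }

    /** Builds a bit set with the given row positions set. */
    static BitSet of(int... rows) {
        BitSet bits = new BitSet();
        for (int row : rows) {
            bits.set(row);
        }
        return bits;
    }

    public static void main(String[] args) {
        // Made-up results of two index lookups on the same region.
        BitSet priceBelow100 = of(0, 2, 3, 7);
        BitSet inStock = of(2, 3, 5);

        System.out.println(and(priceBelow100, inStock)); // {2, 3}
        System.out.println(or(priceBelow100, inStock));  // {0, 2, 3, 5, 7}
    }
}
```

Because bit positions follow row order within the region, iterating over the resulting bit set visits matching rows in row order rather than value order, which is consistent with the "global ordering" comparison above.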

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

