hbase-issues mailing list archives

From "Clint Morgan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2426) [Transactional Contrib] Introduce quick scanning row-based secondary indexes
Date Thu, 15 Apr 2010 18:28:52 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857469#action_12857469
] 

Clint Morgan commented on HBASE-2426:
-------------------------------------

Hey George, thanks for the patch.

I have a question about how this improves performance over an
index layout similar to the SimpleIndexKeyGenerator. I have the same
requirements you mention above: namely I'd like to quickly find all
rows in table A which have a value for COL1 of 'X'.

I build my index keys like <col1-value><sep><base-row-id> where <sep>
is a special byte sequence that does not occur in column values or row
keys. (Actually it can occur, if so I just escape it in the
index-row). Let's say <sep> is '__' in the example below.

So if I have base rows:
ROW | COL_A
aaa | foo
bbb | bar
ccc | foo
ddd | zoo

Then my index would look like (just the rows are shown):
bar__bbb
foo__aaa
foo__ccc
zoo__ddd
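
A minimal sketch of the key layout described above, in Python for brevity.
The escaping scheme here is an assumption (the comment does not say how the
separator is escaped): a backslash is used as a hypothetical escape byte so
that the escaped column value never contains '_' and the first '__' in an
index key is always the separator. The helper names are illustrative only.

```python
SEP = b"__"  # separator from the example above


def escape(value: bytes) -> bytes:
    # Hypothetical escaping: '\' -> '\\' and '_' -> '\u', so the escaped
    # value contains no '_' at all and therefore never contains SEP.
    return value.replace(b"\\", b"\\\\").replace(b"_", b"\\u")


def unescape(escaped: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(escaped):
        if escaped[i:i + 1] == b"\\":
            nxt = escaped[i + 1:i + 2]
            out += b"_" if nxt == b"u" else nxt
            i += 2
        else:
            out += escaped[i:i + 1]
            i += 1
    return bytes(out)


def index_key(col_value: bytes, base_row: bytes) -> bytes:
    # <col-value><sep><base-row-id>; only the column value needs escaping.
    return escape(col_value) + SEP + base_row


def split_index_key(key: bytes):
    # The first SEP is the real separator because the escaped value has no '_'.
    escaped, base_row = key.split(SEP, 1)
    return unescape(escaped), base_row
```

With this scheme, index_key(b"foo", b"aaa") gives b"foo__aaa", matching the
index rows listed above. Note that escaping changes the sort order of values
that contain the escaped bytes.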

So for the query "find all rows where COL_A == foo", I do an index scan
starting at 'foo__' and ending (exclusive) at 'foo_*' (where * is the
byte after '_').
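
The range scan described above can be sketched as follows. This is only a
model: a sorted Python list stands in for the index table, and bisect stands
in for the scanner. It assumes keys sort as unsigned bytes in lexicographic
order, and that the stop key is exclusive. The function names are
hypothetical.

```python
import bisect


def prefix_stop(prefix: bytes) -> bytes:
    # Stop key for a prefix scan: bump the last byte by one.
    # Assumes the last byte is not 0xFF (true for '_').
    return prefix[:-1] + bytes([prefix[-1] + 1])


def scan(sorted_keys, start: bytes, stop: bytes):
    # Return all keys k with start <= k < stop.
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


index_rows = sorted([b"bar__bbb", b"foo__aaa", b"foo__ccc", b"zoo__ddd"])

start = b"foo__"
stop = prefix_stop(start)  # the byte after '_' (0x5F) is '`' (0x60)
matches = scan(index_rows, start, stop)

# Strip the '<value><sep>' prefix to recover the base row ids.
matching_base_rows = [key[len(start):] for key in matches]
```

Here matches holds only the two 'foo' index rows, and matching_base_rows is
the list of base row ids aaa and ccc.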

This will scan through only the two index rows I wanted. Looks like
your patch will make it so that, rather than scanning two rows with one
cell each, I scan one row with two cells. I'm not 100% sure on the
specifics, but I think these two queries would generally be of the
same order of performance.
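
To make the comparison concrete, here is a small sketch of the two layouts
built from the example base rows. It only models the shape of the data (rows
versus cells) and says nothing about real HBase performance. The dict-based
"tables" and the function names are illustrative assumptions, and the
startswith filter merely stands in for the range scan.

```python
from collections import defaultdict

example_base = {b"aaa": b"foo", b"bbb": b"bar", b"ccc": b"foo", b"ddd": b"zoo"}

# Layout 1: one index row per (value, base row) pair, one cell each.
per_pair = sorted(value + b"__" + row for row, value in example_base.items())

# Layout 2: one index row per value, with base row keys stored as
# column qualifiers on that row.
per_value = defaultdict(list)
for row, value in sorted(example_base.items()):
    per_value[value].append(row)


def lookup_per_pair(value: bytes):
    prefix = value + b"__"
    return [key[len(prefix):] for key in per_pair if key.startswith(prefix)]


def lookup_per_value(value: bytes):
    return list(per_value.get(value, []))
```

Both lookups return the same base rows for 'foo'; the first touches two index
rows with one cell each, the second touches one index row with two cells.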

Do I understand things correctly? Is there a reason you could not use
the existing index mechanism for your needs?

I think we could do some work to make this pattern more obvious and
usable with the current infrastructure, but I'm a bit hesitant to add yet
another region/regionserver extension.

George, what do you think?

Slightly aside: When I read about AppEngine's indexes (a year ago or so), they said that they
maintain N index rows for a single base row (1 per column being indexed). I've been wanting
to rework this framework to support that as well, but it has not been a high priority as it
would require a rewrite of our query code that uses the current indexing layer. The approach
you take is the opposite: 1 index row for N base rows. Not sure that really says anything,
but ...

> [Transactional Contrib] Introduce quick scanning row-based secondary indexes
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-2426
>                 URL: https://issues.apache.org/jira/browse/HBASE-2426
>             Project: Hadoop HBase
>          Issue Type: New Feature
>          Components: contrib
>            Reporter: George P. Stathis
>            Priority: Minor
>             Fix For: 0.20.5, 0.21.0
>
>         Attachments: hbase-2426-0.20-branch.patch
>
>
> RowBasedIndexSpecification is a specialized IndexSpecification class for creating row-based
secondary index tables. Base table rows with the same indexed column value have their row
keys stored as column qualifiers on the same secondary index table row. The key for that row
is the indexed column value from the base table. This allows to avoid expensive secondary
index table scans and provides faster access for applications such as foreign key indexing
or queries such as "find all table A rows whose familyA:columnB value is X". RowBasedIndexSpecification
indices can be scanned using the API on RowBasedIndexedTable. The metadata for RowBasedIndexSpecification
differ from IndexSpecification in that:
> - Only a single base table column can be indexed per RowBasedIndexSpecification. No additional
columns are put in the index table.
> and 
> - RowBasedIndexKeyGenerator, which constructs the index-row-key from the indexed column
value in the original column, is always used.
> For a simple RowBasedIndexSpecification example, look at the TestRowBasedIndexedTable
unit test in org.apache.hadoop.hbase.client.tableIndexed.
> To enable RowBasedIndexSpecification indexing, modify hbase-site.xml to turn on the
> IndexedRegionServer.  This is done by setting
> - hbase.regionserver.class to org.apache.hadoop.hbase.ipc.IndexedRegionInterface and
> - hbase.regionserver.impl to org.apache.hadoop.hbase.regionserver.tableindexed.RowBasedIndexedRegionServer

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
