hbase-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-4443) optimize/avoid seeking to "previous" block when key you are interested in is the first one of a block
Date Fri, 11 May 2012 17:58:51 GMT

    [ https://issues.apache.org/jira/browse/HBASE-4443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273459#comment-13273459 ]

Todd Lipcon commented on HBASE-4443:

Hey Mikhail. I had exactly the same thought last night when I read this JIRA.

But I thought that Liyin brought up some other advantages of indexing by the last key instead
of the first -- namely that we can easily skip to the next block without having to go back
to the index lookup to find the next block's first key.

Is the assumption that we would use the "as far left as possible" index in our current HFile
version, and then increment the version later to change to index by last key?
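
To illustrate the last-key point above, here is a minimal sketch, not HBase code. It assumes a toy model in which a key is a (row, col, -ts) tuple, so that newer timestamps sort first, and a block is a sorted list of such keys. The names key, BLOCKS, LAST_KEYS, seek_block and stays_in_block are all hypothetical.

```python
from bisect import bisect_left

LONG_MAX = 2**63 - 1  # stands in for Java's Long.MAX_VALUE


def key(row, col, ts):
    # Toy key: newer timestamps sort first, so the timestamp is negated.
    return (row, col, -ts)


# Hypothetical file with three blocks; block 1 plays the role of "block X".
BLOCKS = [
    [key("a", "col", 9), key("a", "col", 3)],
    [key("b", "col", 5), key("b", "col", 2)],
    [key("c", "col", 7)],
]

# Index built from the LAST key of each block (an assumption of this sketch).
LAST_KEYS = [block[-1] for block in BLOCKS]


def seek_block(target):
    # First block whose last key is >= target; len(BLOCKS) means "past the end".
    return bisect_left(LAST_KEYS, target)


def stays_in_block(current, target):
    # With a last-key index, the entry for the current block alone tells us
    # whether the target can still be inside it.
    return target <= LAST_KEYS[current]
```

In this model, seek_block(key("b", "col", LONG_MAX)) goes straight to block 1, and stays_in_block answers the "do I need the next block?" question without looking up the next block's first key.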
> optimize/avoid seeking to "previous" block when key you are interested in is the first one of a block
> -----------------------------------------------------------------------------------------------------
>                 Key: HBASE-4443
>                 URL: https://issues.apache.org/jira/browse/HBASE-4443
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
> This issue primarily affects cases where you are storing large blobs, i.e. when blocks contain a small number of keys, so the chances that the key you are looking for is the first key of a block are higher.
> Say you are looking for "row/col", and "row/col/ts=5" is the latest version of the key in the HFile and is at the beginning of block X.
> The search for the key is done by looking for "row/col/ts=Long.MAX_VALUE", but this will land us in block X-1 (because ts=Long.MAX_VALUE sorts ahead of ts=5), only to find that there is no matching "row/col" in block X-1; we then advance to block X to return the value.
> Seems like we should be able to optimize this somehow.
> Some possibilities:
> 1) Suppose we track that the file contains no deletes. If the CF setting has MAX_VERSIONS=1, we can then know for sure that block X-1 does not contain any relevant data, and directly position the seek to block X. [This will also require the memstore flusher to remove extra versions if MAX_VERSIONS=1 and not allow the file to contain duplicate entries for the same ROW/COL.]
 Tracking deletes will also avoid in many cases, the seek to the top of the row to look for
> 2) Have a dense index (1 entry per KV in the index; this might be ok for the large-object case, since the index vs. data ratio will still be low).
> 3) Have the index contain the last KV of each block in addition to the first KV. This doubles the size of the index, though.
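
The extra block read described in the issue, and the effect of possibility 3, can be sketched as follows. This is a toy model, not HBase code: it assumes keys are (row, col, -ts) tuples so that newer timestamps sort first, and key, BLOCKS, FIRST_KEYS, LAST_KEYS and seek are hypothetical names.

```python
from bisect import bisect_right

LONG_MAX = 2**63 - 1  # stands in for Java's Long.MAX_VALUE


def key(row, col, ts):
    # Toy key: newer timestamps sort first, so the timestamp is negated.
    return (row, col, -ts)


# Hypothetical file: block 0 is "block X-1", block 1 is "block X".
BLOCKS = [
    [key("a", "col", 9), key("a", "col", 3)],
    [key("b", "col", 5), key("b", "col", 2)],
    [key("c", "col", 7)],
]
FIRST_KEYS = [block[0] for block in BLOCKS]
LAST_KEYS = [block[-1] for block in BLOCKS]


def seek(target, use_last_keys=False):
    """Return (first key >= target, number of blocks read)."""
    # First-key index: pick the last block whose first key is <= target.
    i = max(bisect_right(FIRST_KEYS, target) - 1, 0)
    # Possibility 3: if the index also stores last keys, skip a block
    # that cannot contain the target.
    if use_last_keys and target > LAST_KEYS[i]:
        i += 1
    blocks_read = 0
    while i < len(BLOCKS):
        blocks_read += 1
        for k in BLOCKS[i]:
            if k >= target:
                return k, blocks_read
        i += 1
    return None, blocks_read
```

In this model, seeking to key("b", "col", LONG_MAX) with only the first-key index reads block 0 and then block 1, while passing use_last_keys=True reads block 1 alone.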
