accumulo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Adam Fuchs (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-473) Support binary search within RFile blocks
Date Sat, 30 Jun 2012 19:01:42 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404607#comment-13404607
] 

Adam Fuchs commented on ACCUMULO-473:
-------------------------------------

To throw out a competing idea, what if we were to build a reverse skip list online as we serialize
the block the first time? Then we could support binary search in the block and even scan backwards
relatively efficiently. I'm not exactly sure how to quickly evaluate the relative key encoding,
but maybe we can resolve those pointers as we pull the block into the block cache.

There would be a small amount of extra storage required, but we could bound this to as small
as we want using grouping and encoding techniques. Write speed shouldn't be impacted much,
and there would be no extra complexity in caching/managing the blocks for query.
                
> Support binary search within RFile blocks
> -----------------------------------------
>
>                 Key: ACCUMULO-473
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-473
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Keith Turner
>            Assignee: Keith Turner
>             Fix For: 1.5.0
>
>
> RFiles store blocks of key values using relative encoding.  By default these blocks are
small (100k).  To find a key in an RFile the index is used to find the block, then the block
is scanned for the key.  It would be nice for iterators that do alot of seeking if a binary
search were done instead of a scan.  
> Relative encoding is a form of compression that serializes a key relative to the last
key.  For example if the row in a key is the same as the previous key, then the row is not
stored again.  This works well with sorted data.   The current on disk format does not support
random access within a block, because to read entry N all previous entries must be read. 
 One option would be to deserialize the block into a form that supports random access and
then do binary search.  However this will consume memory and CPU.  If the relative encoding
format could be modified to support random access, then binary search could be supported in
a memory and CPU efficient way.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message