hbase-issues mailing list archives

From "jiraposter@reviews.apache.org (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-4465) Lazy-seek optimization for StoreFile scanners
Date Wed, 05 Oct 2011 18:00:34 GMT

    [ https://issues.apache.org/jira/browse/HBASE-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121292#comment-13121292 ]

jiraposter@reviews.apache.org commented on HBASE-4465:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2180/
-----------------------------------------------------------

(Updated 2011-10-05 18:00:03.234737)


Review request for hbase.


Changes
-------

Updating testing done.


Summary
-------

Previously, if a region had several StoreFiles for a column family, we would seek in each
of them and only then merge the results, even though the row/column we are looking for
might only be in the most recent (and smallest) file. Now we prioritize reads from those
files so that the most recent file is checked first. This is done with a "lazy seek", which
pretends that the next value in the StoreFile is (seekRow, seekColumn, lastTimestampInStoreFile).
Because KVs with the same row and column are ordered newest-first, this fake KV sorts no
later than anything for that row/column that might actually occur in the file. So if we
don't find the result in the files checked earlier (the more recent ones), the fake KV will
bubble up to the top of the KV heap and a real seek will be done. This is expected to
significantly reduce the amount of disk IO (as of 09/22/2011 we are doing dark launch
testing and measurement).
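
To make the mechanism concrete, here is a minimal, self-contained sketch of the lazy-seek idea. It is not the code from this patch: the names LazySeekSketch, FileScanner, lazySeek, enforceSeek and getLatest are hypothetical, a TreeSet stands in for an on-disk StoreFile, and the key is reduced to (row, column, timestamp). It only illustrates how fake KVs carrying each file's highest timestamp let the heap defer real seeks.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeSet;

class LazySeekSketch {
    /** Simplified key: row, column, timestamp (no type or value). */
    static final class KV {
        final String row, col;
        final long ts;
        KV(String row, String col, long ts) { this.row = row; this.col = col; this.ts = ts; }
    }

    /** KV order: row, then column, then timestamp DESCENDING (newest first). */
    static final Comparator<KV> KV_ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c == 0) c = a.col.compareTo(b.col);
        if (c == 0) c = Long.compare(b.ts, a.ts);
        return c;
    };

    /** Hypothetical stand-in for a StoreFile scanner; the TreeSet plays the on-disk file. */
    static final class FileScanner {
        final TreeSet<KV> kvs = new TreeSet<>(KV_ORDER);
        long maxTs = Long.MIN_VALUE; // highest timestamp present in this file
        KV current;                  // fake KV after lazySeek, real KV after enforceSeek
        boolean realSeekDone;
        int realSeeks;               // counts simulated disk seeks

        FileScanner(KV... contents) {
            for (KV kv : contents) { kvs.add(kv); maxTs = Math.max(maxTs, kv.ts); }
        }

        /** No IO: pretend the next KV is (row, col, maxTs). */
        void lazySeek(String row, String col) {
            current = new KV(row, col, maxTs);
            realSeekDone = false;
        }

        /** The real seek: move to the first stored KV at or after the fake one. */
        void enforceSeek() {
            current = kvs.ceiling(current);
            realSeekDone = true;
            realSeeks++;
        }
    }

    /** Returns the newest KV for (row, col), or null if no file contains it. */
    static KV getLatest(List<FileScanner> scanners, String row, String col) {
        PriorityQueue<FileScanner> heap =
            new PriorityQueue<>(Comparator.comparing((FileScanner s) -> s.current, KV_ORDER));
        for (FileScanner s : scanners) { s.lazySeek(row, col); heap.add(s); }
        while (!heap.isEmpty()) {
            FileScanner top = heap.poll();
            if (!top.realSeekDone) {
                // A fake KV reached the top of the heap: only now pay for the seek.
                top.enforceSeek();
                if (top.current != null) heap.add(top);
                continue;
            }
            // A real KV on top sorts before every remaining fake or real KV.
            KV kv = top.current;
            return (kv.row.equals(row) && kv.col.equals(col)) ? kv : null;
        }
        return null;
    }

    public static void main(String[] args) {
        FileScanner newest = new FileScanner(new KV("r2", "c1", 100));
        FileScanner middle = new FileScanner(new KV("r1", "c1", 50), new KV("r3", "c1", 60));
        FileScanner oldest = new FileScanner(new KV("r1", "c1", 20), new KV("r1", "c2", 30));
        KV hit = getLatest(Arrays.asList(newest, middle, oldest), "r1", "c1");
        System.out.println(hit.row + "/" + hit.col + "@" + hit.ts + " seeks: "
            + newest.realSeeks + "," + middle.realSeeks + "," + oldest.realSeeks);
    }
}
```

In this toy run the newest file does not contain r1/c1, so its fake KV is replaced by a real one for a later row. The middle file then yields the answer, and the oldest file is never really seeked because its fake KV (timestamp 30) sorts after the real hit (timestamp 50).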

This is joint work with Liyin Tang – huge thanks to him for many helpful discussions on
this and the idea of putting fake KVs with the highest timestamp of the StoreFile in the scanner
priority queue.


This addresses bug HBASE-4465.
    https://issues.apache.org/jira/browse/HBASE-4465


Diffs
-----

  src/main/java/org/apache/hadoop/hbase/KeyValue.java aa34006 
  src/main/java/org/apache/hadoop/hbase/regionserver/AbstractKeyValueScanner.java 94ddce7
  src/main/java/org/apache/hadoop/hbase/regionserver/ColumnCount.java 1be0280 
  src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueHeap.java b8d33e8 
  src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueScanner.java fbcd276 
  src/main/java/org/apache/hadoop/hbase/regionserver/MemStore.java 035f765 
  src/main/java/org/apache/hadoop/hbase/regionserver/NonLazyKeyValueScanner.java PRE-CREATION
  src/main/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java dad278a 
  src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java abb5931 
  src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java 31bfea7 
  src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java 64a6e3e 
  src/main/java/org/apache/hadoop/hbase/util/CollectionBackedScanner.java 8ad5aab 
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksRead.java b3beabb 
  src/test/java/org/apache/hadoop/hbase/regionserver/TestMemStore.java 9d2b2a7 

Diff: https://reviews.apache.org/r/2180/diff


Testing (updated)
-------

All unit tests should be passing now. Will rebase and re-run them just in case.


Thanks,

Mikhail


                
> Lazy-seek optimization for StoreFile scanners
> ---------------------------------------------
>
>                 Key: HBASE-4465
>                 URL: https://issues.apache.org/jira/browse/HBASE-4465
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Mikhail Bautin
>            Assignee: Mikhail Bautin
>              Labels: optimization, seek
>             Fix For: 0.89.20100924, 0.94.0
