hbase-issues mailing list archives

From "jiraposter@reviews.apache.org (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-4465) Lazy-seek optimization for StoreFile scanners
Date Wed, 05 Oct 2011 17:56:31 GMT

    [ https://issues.apache.org/jira/browse/HBASE-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121283#comment-13121283 ]

jiraposter@reviews.apache.org commented on HBASE-4465:
------------------------------------------------------



bq.  On 2011-10-04 23:56:22, Ted Yu wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueHeap.java, line 306
bq.  > <https://reviews.apache.org/r/2180/diff/1/?file=47924#file47924line306>
bq.  >
bq.  >     Should be "lazily-sought"

Somehow "sought" does not sound right to me -- "seek" is a very specific computer science
term here. Replaced with "has done a seek operation" here and below.


bq.  On 2011-10-04 23:56:22, Ted Yu wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java, line 246
bq.  > <https://reviews.apache.org/r/2180/diff/1/?file=47930#file47930line246>
bq.  >
bq.  >     Should realSeekDone be set before returning ?

realSeekDone is set to true by enforceSeek() when requestSeek() decides that a real seek
is needed:

      realSeekDone = false;  // <-- setting this by default in case lazy seek 
                             // takes effect or enforceSeek() fails
      . . .
      if (seekTimestamp > maxTimestampInFile) {
        // Create a fake key that is not greater than the real next key.
        // (Lower timestamps correspond to higher KVs.)
        // To understand this better, consider that we are asked to seek to
        // a higher timestamp than the max timestamp in this file. We know that
        // the next point when we have to consider this file again is when we
        // pass the max timestamp of this file (with the same row/column).
        cur = kv.createFirstOnRowColTS(maxTimestampInFile);
      } else {
        // This will be the case e.g. when we need to seek to the next
        // row/column, and we don't know exactly what they are, so we set the
        // seek key's timestamp to OLDEST_TIMESTAMP to skip the rest of this
        // row/column.
        enforceSeek();  // <-- this sets realSeekDone
      }
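
To make the flag handling above easier to follow, here is a minimal, self-contained sketch of the same control flow. It is only a model under stated assumptions, not the StoreFileScanner code from the patch: the class name LazySeekSketch, the string row/column key, the getters and the realSeekCount counter (which stands in for actual disk seeks) are all hypothetical. Only realSeekDone, maxTimestampInFile, requestSeek() and enforceSeek() are taken from the snippet.

```java
// Simplified model of the requestSeek()/enforceSeek() interplay.
// Hypothetical names: LazySeekSketch, curRowCol, curTimestamp, realSeekCount.
final class LazySeekSketch {
    private final long maxTimestampInFile;
    private String curRowCol;       // row/column of the current (possibly fake) key
    private long curTimestamp;      // timestamp of the current (possibly fake) key
    private boolean realSeekDone = true;
    private int realSeekCount = 0;  // stands in for actual disk seeks

    LazySeekSketch(long maxTimestampInFile) {
        this.maxTimestampInFile = maxTimestampInFile;
    }

    void requestSeek(String rowCol, long seekTimestamp) {
        realSeekDone = false;  // default: the real seek is still pending
        curRowCol = rowCol;
        if (seekTimestamp > maxTimestampInFile) {
            // Lazy path: park a fake key at the file's max timestamp.
            curTimestamp = maxTimestampInFile;
        } else {
            // Eager path: the file may hold the key, so seek right away.
            curTimestamp = seekTimestamp;
            enforceSeek();  // sets realSeekDone
        }
    }

    void enforceSeek() {
        if (realSeekDone) {
            return;  // nothing pending
        }
        realSeekCount++;  // a real scanner would seek the file reader here
        realSeekDone = true;
    }

    boolean isRealSeekDone() { return realSeekDone; }
    long currentTimestamp() { return curTimestamp; }
    String currentRowCol() { return curRowCol; }
    int realSeekCount() { return realSeekCount; }
}
```

In this model, a request with a timestamp above maxTimestampInFile leaves realSeekDone false until somebody calls enforceSeek(), while any other request returns with realSeekDone already true.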


bq.  On 2011-10-04 23:56:22, Ted Yu wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueHeap.java, line 371
bq.  > <https://reviews.apache.org/r/2180/diff/1/?file=47924#file47924line371>
bq.  >
bq.  >     Should be 'real-sought' and 'lazily-sought'

Done.


bq.  On 2011-10-04 23:56:22, Ted Yu wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueScanner.java, line 101
bq.  > <https://reviews.apache.org/r/2180/diff/1/?file=47925#file47925line101>
bq.  >
bq.  >     Should be 'is sought'

Done.


- Mikhail


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2180/#review2332
-----------------------------------------------------------


On 2011-10-04 22:10:40, Mikhail Bautin wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2180/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-10-04 22:10:40)
bq.  
bq.  
bq.  Review request for hbase.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Previously, if we had several StoreFiles for a column family in a region, we would seek
bq.  in each of them and only then merge the results, even though the row/column we are looking
bq.  for might only be in the most recent (and the smallest) file. Now we prioritize our reads
bq.  from those files so that we check the most recent file first. This is done by doing a "lazy
bq.  seek" which pretends that the next value in the StoreFile is (seekRow, seekColumn, lastTimestampInStoreFile),
bq.  which is earlier in the KV order than anything that might actually occur in the file. So if
bq.  we don't find the result in earlier files, that fake KV will bubble up to the top of the KV
bq.  heap and a real seek will be done. This is expected to significantly reduce the amount of
bq.  disk IO (as of 09/22/2011 we are doing dark launch testing and measurement).
bq.  
bq.  This is joint work with Liyin Tang -- huge thanks to him for many helpful discussions
bq.  on this and the idea of putting fake KVs with the highest timestamp of the StoreFile in the
bq.  scanner priority queue.
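
The summary above can be illustrated with a small, self-contained sketch of a scanner heap that holds fake keys. This is a toy model under stated assumptions, not the KeyValueHeap code from the patch: the names LazyHeapSketch, KV, FileScanner and seekAndPeek are hypothetical, row and column are collapsed into one string, a "file" is an in-memory sorted list, and the tie-break that prefers an already-sought scanner over a lazy one is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class LazyHeapSketch {
    /** Hypothetical stand-in for a KeyValue. */
    static final class KV {
        final String rowCol;
        final long ts;
        KV(String rowCol, long ts) { this.rowCol = rowCol; this.ts = ts; }
    }

    /** KV order: row/column ascending, then timestamp descending (newer first). */
    static int compare(KV a, KV b) {
        int c = a.rowCol.compareTo(b.rowCol);
        return c != 0 ? c : Long.compare(b.ts, a.ts);
    }

    /** Toy scanner over one "file"; expects at least one KV. */
    static final class FileScanner {
        final List<KV> sorted;
        final long maxTs;
        KV cur;
        boolean realSeekDone = true;
        int realSeeks = 0;  // stands in for disk seeks

        FileScanner(KV... kvs) {
            sorted = new ArrayList<>(Arrays.asList(kvs));
            sorted.sort(LazyHeapSketch::compare);
            long max = Long.MIN_VALUE;
            for (KV kv : sorted) max = Math.max(max, kv.ts);
            maxTs = max;
        }

        void requestSeek(KV key) {
            realSeekDone = false;
            if (key.ts > maxTs) {
                cur = new KV(key.rowCol, maxTs);  // fake key, no I/O yet
            } else {
                cur = key;
                enforceSeek();
            }
        }

        void enforceSeek() {
            if (realSeekDone) return;
            realSeeks++;
            KV target = cur;
            cur = null;  // stays null if the file has nothing at or after target
            for (KV kv : sorted) {
                if (compare(kv, target) >= 0) { cur = kv; break; }
            }
            realSeekDone = true;
        }
    }

    /** Seeks all scanners lazily and returns the first real KV, or null. */
    static KV seekAndPeek(List<FileScanner> scanners, KV seekKey) {
        Comparator<FileScanner> order = (a, b) -> {
            int c = compare(a.cur, b.cur);
            // Assumption: on equal keys, a really-sought scanner goes first.
            return c != 0 ? c : Boolean.compare(!a.realSeekDone, !b.realSeekDone);
        };
        PriorityQueue<FileScanner> heap = new PriorityQueue<>(order);
        for (FileScanner s : scanners) {
            s.requestSeek(seekKey);
            if (s.cur != null) heap.add(s);
        }
        while (!heap.isEmpty()) {
            FileScanner top = heap.poll();
            if (top.realSeekDone) return top.cur;  // a real KV reached the top
            top.enforceSeek();                     // the fake KV bubbled up
            if (top.cur != null) heap.add(top);
        }
        return null;
    }
}
```

With one newer file holding ("r1/c1", 300) and one older file holding ("r1/c1", 100), a seek to ("r1/c1", Long.MAX_VALUE) is answered from the newer file after a single real seek, and the older file's fake KV never reaches the top of the heap.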
bq.  
bq.  
bq.  This addresses bug HBASE-4465.
bq.      https://issues.apache.org/jira/browse/HBASE-4465
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    src/main/java/org/apache/hadoop/hbase/KeyValue.java aa34006 
bq.    src/main/java/org/apache/hadoop/hbase/regionserver/AbstractKeyValueScanner.java 94ddce7
bq.    src/main/java/org/apache/hadoop/hbase/regionserver/ColumnCount.java 1be0280 
bq.    src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueHeap.java b8d33e8 
bq.    src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueScanner.java fbcd276 
bq.    src/main/java/org/apache/hadoop/hbase/regionserver/MemStore.java 035f765 
bq.    src/main/java/org/apache/hadoop/hbase/regionserver/NonLazyKeyValueScanner.java PRE-CREATION
bq.    src/main/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java dad278a 
bq.    src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java abb5931 
bq.    src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java 31bfea7 
bq.    src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java 64a6e3e 
bq.    src/main/java/org/apache/hadoop/hbase/util/CollectionBackedScanner.java 8ad5aab 
bq.    src/test/java/org/apache/hadoop/hbase/regionserver/TestMemStore.java 9d2b2a7 
bq.  
bq.  Diff: https://reviews.apache.org/r/2180/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Running unit tests -- please do not commit yet.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Mikhail
bq.  
bq.


                
> Lazy-seek optimization for StoreFile scanners
> ---------------------------------------------
>
>                 Key: HBASE-4465
>                 URL: https://issues.apache.org/jira/browse/HBASE-4465
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Mikhail Bautin
>            Assignee: Mikhail Bautin
>              Labels: optimization, seek
>             Fix For: 0.89.20100924, 0.94.0

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
