hbase-issues mailing list archives

From "HBase Review Board (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2265) HFile and Memstore should maintain minimum and maximum timestamps
Date Mon, 05 Jul 2010 19:16:50 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885306#action_12885306 ]

HBase Review Board commented on HBASE-2265:

Message from: "Pranav Khaitan" <pranavkhaitan@facebook.com>

This is an automatically generated e-mail. To reply, visit:

(Updated 2010-07-05 12:04:54.536554)

Review request for hbase, Jonathan Gray, Karthik Ranganathan, and Kannan Muthukkaruppan.


Every memstore and store file will have a minimum and maximum timestamp associated with it.
If the range of timestamps a query is searching for does not overlap with the range for a
particular file, we can skip searching that file and save time.

This should significantly improve the performance of timestamp range queries. It is particularly
useful when most reads are for recent entries, so that the older files can be safely skipped.
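
As an illustration of the overlap check described above, here is a minimal sketch. It is not the
code from the patch: the class name TimestampRangeSketch and its methods are hypothetical, and the
query range is assumed to be half-open, [queryMin, queryMax).

```java
// Hypothetical sketch of per-file timestamp range tracking.
// Names and the half-open query range are assumptions, not the patch's API.
final class TimestampRangeSketch {
    // Start with an empty range: min > max means nothing has been recorded yet.
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;

    // Widen the tracked range to cover the timestamp of a newly written KeyValue.
    void include(long timestamp) {
        if (timestamp < min) {
            min = timestamp;
        }
        if (timestamp > max) {
            max = timestamp;
        }
    }

    boolean isEmpty() {
        return min > max;
    }

    // Returns false only when no tracked timestamp can fall inside
    // [queryMin, queryMax); in that case the file or memstore can be skipped.
    boolean mayContain(long queryMin, long queryMax) {
        return !isEmpty() && min < queryMax && max >= queryMin;
    }
}
```

A true result only means the file cannot be ruled out; it still has to be searched to find out
whether a matching entry actually exists.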

Addresses HBASE-2265 JIRA. 

This diff also fixes some minor bugs. For example, KeyValueHeap used to throw an uncaught
exception when the size of the scanner set was zero.

This addresses bug HBASE-2265.


  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueHeap.java 959782 
  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueSkipListSet.java 959782

  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/MemStore.java 959782 
  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java 959782 
  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java 959782 
  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java 959782 
  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java 959782 
  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/TimeRangeTracker.java PRE-CREATION

  trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestMemStore.java 960082 
  trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestStore.java 959782 
  trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java 959782 

Diff: http://review.hbase.org/r/257/diff


All existing JUnit tests run successfully. More JUnit tests for Memstore, StoreFile and Store
were added to test correctness with multiple timestamps.

A test was conducted to measure the extra time required to keep track of min and max timestamps
while writing KeyValues. The comparison was done by entering 1 million KeyValues into the memstore
ten times with timestamp tracking and ten times without, and then taking the average time for each
case. The WAL was disabled and no flushing was done during this test, to minimize other overheads.
The average time taken to enter 1M KeyValues into the memstore without timestamp tracking
was 13.44 seconds, while the average time with timestamp tracking was 13.45 seconds.
This suggests that timestamp tracking adds no significant overhead.



> HFile and Memstore should maintain minimum and maximum timestamps
> -----------------------------------------------------------------
>                 Key: HBASE-2265
>                 URL: https://issues.apache.org/jira/browse/HBASE-2265
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>            Reporter: Todd Lipcon
>            Assignee: Pranav Khaitan
> In order to fix HBASE-1485 and HBASE-29, it would be very helpful to have HFile and Memstore
> track their maximum and minimum timestamps. This has the following nice properties:
> - for a straight Get, if an entry has already been found with timestamp X, and
>   X >= HFile.maxTimestamp, the HFile doesn't need to be checked. Thus, the current fast behavior
>   of get can be maintained for those who use strictly increasing timestamps, but "correct" behavior
>   for those who sometimes write out-of-order.
> - for a scan, the "latest timestamp" of the storage can be used to decide which cell
>   wins, even if the timestamp of the cells is equal. In essence, rather than comparing timestamps,
>   instead you are able to compare tuples of (row timestamp, storage.max_timestamp)
> - in general, min_timestamp(storage A) >= max_timestamp(storage B) if storage A was
>   flushed after storage B.
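
The first property in the quoted description (skipping an HFile during a Get) can be sketched
roughly as follows. This is only an illustration under stated assumptions: GetSkipSketch,
StoreFileStub and newestTimestamp are hypothetical names, each stub stands in for a store file
that holds at most one matching entry, and files are visited in whatever order the caller supplies.

```java
import java.util.List;

// Hypothetical sketch of the Get short-circuit; not HBase's actual classes.
final class GetSkipSketch {
    static final long NOT_FOUND = Long.MIN_VALUE;

    // Stand-in for a store file: its tracked max timestamp, plus the timestamp
    // of its matching entry for the requested key (or NOT_FOUND).
    static final class StoreFileStub {
        final long maxTimestamp;
        final long matchTimestamp;
        int reads = 0; // counts how often this file was actually consulted

        StoreFileStub(long maxTimestamp, long matchTimestamp) {
            this.maxTimestamp = maxTimestamp;
            this.matchTimestamp = matchTimestamp;
        }

        long lookup() {
            reads++;
            return matchTimestamp;
        }
    }

    // Returns the newest matching timestamp, skipping any file whose max
    // timestamp cannot beat the entry already found (X >= maxTimestamp).
    static long newestTimestamp(List<StoreFileStub> files) {
        long best = NOT_FOUND;
        for (StoreFileStub file : files) {
            if (best != NOT_FOUND && best >= file.maxTimestamp) {
                continue; // nothing in this file can be newer than what we have
            }
            long ts = file.lookup();
            if (ts > best) {
                best = ts;
            }
        }
        return best;
    }
}
```

With strictly increasing timestamps the first file consulted usually settles the answer and the
rest are skipped; with out-of-order writes the later files are still read, which is the "correct"
behavior the description asks for.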

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
