hbase-issues mailing list archives

From "Lars Hofhansl (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HBASE-12311) Version stats in HFiles?
Date Sat, 28 Feb 2015 06:04:05 GMT

    [ https://issues.apache.org/jira/browse/HBASE-12311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341363#comment-14341363 ]

Lars Hofhansl edited comment on HBASE-12311 at 2/28/15 6:03 AM:
----------------------------------------------------------------

I've thought of another approach. StoreFileScanners have the notion of a "next indexed key",
that is, the next known key to seek to (i.e. the beginning of a block). What if we took the next
indexed key of the scanner that is on top of the (StoreFileScanner/MemstoreScanner) heap and only
issued a seek if we would seek past that key? It's only a heuristic and that check would not
come free, but assuming it is likely that chunks of Cells will come from the same file, we'd
have a fairly good indicator of whether the seek will help. I have a 0.98 patch for that, and
it improves things. As an example I've used a scan with a timerange. If the range is before
all Cells (except one, so that the file isn't ruled out) it takes about 3.1s (we SKIP in
that case); if the timerange falls after all Cells (again except one) it takes 10.2s (we're
seeking this time - see SQM.match).

With the patch the first case is unchanged (3.1s), but the 2nd case is reduced to 4.5s, since
we can avoid the unnecessary seek in many cases.
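
A rough sketch of the check described above, not the actual 0.98 patch. Everything here is
hypothetical: keys are plain Strings compared lexicographically rather than HBase Cells, and
SeekHintSketch, Action and choose() are illustrative names, not HBase APIs. Two assumptions
are mine: a missing hint falls back to a seek, and a target equal to the next indexed key
counts as not "past" it.

```java
/** Hypothetical sketch of the "next indexed key" heuristic; not HBase code. */
final class SeekHintSketch {
    enum Action { SEEK, SKIP }

    /**
     * @param seekTarget     key the matcher would like to jump to
     * @param nextIndexedKey first key of the next block of the scanner on top
     *                       of the heap, or null if no such hint is available
     */
    static Action choose(String seekTarget, String nextIndexedKey) {
        if (nextIndexedKey == null) {
            // Assumption: without a hint we keep the original behaviour and seek.
            return Action.SEEK;
        }
        // Seek only if the target lies strictly past the next block boundary;
        // otherwise the target is inside the current block and next() is cheaper.
        return seekTarget.compareTo(nextIndexedKey) > 0 ? Action.SEEK : Action.SKIP;
    }

    public static void main(String[] args) {
        System.out.println(choose("row3", "row5")); // target inside the current block
        System.out.println(choose("row9", "row5")); // target beyond the next block start
    }
}
```

The comparison is the part that "would not come free": it runs on every candidate seek, so it
only pays off when consecutive Cells tend to come from the same file.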



was (Author: lhofhansl):
I've thought of another approach. StorefileScanners have the notion of the "next indexed key",
that is next known key to seek to (i.e. beginning of a block). What if we took the next indexed
key of the scanner that is on top of the heap and only issue a seek if we would seek past
that key? It's only a heuristic and that check would not come free, but assuming it likely
that chunks of the Cells will come from the same file, we'd have a fairly good indicator whether
the seek will help. I have a 0.98 patch for that, and it improves things. As an example I've
used a range with the timerange. If the range is before all Cells (except one so that the
files isn't ruled out) it's takes about 3.1s (we SKIP in that case) if the timerange fall
after all Cells (again except one) it 10.2s (we're seeking this time). 

With the patch the first case is unchanged (3.1s), but the 2nd case it reduced to 4.5s, since
can avoid the unnecessary in many cases.


> Version stats in HFiles?
> ------------------------
>
>                 Key: HBASE-12311
>                 URL: https://issues.apache.org/jira/browse/HBASE-12311
>             Project: HBase
>          Issue Type: Brainstorming
>            Reporter: Lars Hofhansl
>         Attachments: 12311-indexed-0.98.txt, 12311-v2.txt, 12311-v3.txt, 12311.txt, CellStatTracker.java
>
>
> In HBASE-9778 I basically punted to the user the decision on whether to do repeated scanner.next()
calls instead of issuing (re)seeks.
> I think we can do better.
> One way to do that is to maintain simple stats on the maximum number of versions we've
seen for any row/col combination and store these in the HFile's metadata (just like the timerange,
oldest Put, etc).
> Then we can estimate fairly accurately whether we have to expect lots of versions (i.e. seeking
between columns is better) or not (in which case we'd issue repeated next()'s).
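
The version-stats idea in the quoted description could be sketched roughly as follows. This is
only an illustration under stated assumptions, not the attached CellStatTracker.java: the class
and method names are hypothetical, cells are represented by a "row/column" String and are
assumed to arrive in sorted order (as they would when a file is written), and the threshold
passed to preferSeek() is an arbitrary tuning knob that the issue does not specify.

```java
/** Hypothetical sketch of tracking the max version count per row/col; not HBase code. */
final class VersionStatsSketch {
    private String currentColumn; // row/column key of the run being counted
    private int runLength;        // versions seen so far for currentColumn
    private int maxVersions;      // largest run seen; would go into file metadata

    /** Feed one cell, identified by its row/column key, in sorted order. */
    void track(String rowAndColumn) {
        if (rowAndColumn.equals(currentColumn)) {
            runLength++;
        } else {
            currentColumn = rowAndColumn;
            runLength = 1;
        }
        maxVersions = Math.max(maxVersions, runLength);
    }

    int maxVersions() {
        return maxVersions;
    }

    /** Many versions expected: seek between columns; few: repeated next(). */
    static boolean preferSeek(int maxVersionsSeen, int threshold) {
        return maxVersionsSeen > threshold;
    }

    public static void main(String[] args) {
        VersionStatsSketch stats = new VersionStatsSketch();
        for (String key : new String[] {"r1/a", "r1/a", "r1/a", "r1/b", "r2/a", "r2/a"}) {
            stats.track(key);
        }
        System.out.println(stats.maxVersions());
        System.out.println(preferSeek(stats.maxVersions(), 10));
    }
}
```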



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
