hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Matt Corgan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-9969) Improve KeyValueHeap using loser tree
Date Fri, 15 Nov 2013 06:11:22 GMT

    [ https://issues.apache.org/jira/browse/HBASE-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13823336#comment-13823336

Matt Corgan commented on HBASE-9969:

 private List<KeyValue> randomKeyValues() {
    List<KeyValue> kvs = Lists.newArrayListWithExpectedSize(NUM_KEYS_PER_SCANNER);
    for (int i = 0; i < NUM_KEYS_PER_SCANNER; i++) {
      byte[] row = Bytes.toBytes(random.nextLong());
      KeyValue kv = new KeyValue(row, FAMILY, null);
    return kvs;

I wonder if you should make ~10 cols/row, and NUM_KEYS_PER_SCANNER/10 rows.  This should exacerbate
the problem i mentioned above.  Right now it's testing the ideal scenario which almost never
exists.  In fact, with time series data you have the exact opposite scenario where every Cell
should go straight to the front of the queue.

{quote}I don't quite understand this. The one I understood to improve is to compare against
the previous second top value (i.e. tree[1]) before going through leaf to root comparisions.
But this will add one more comparison on the worst case.{quote}Could a slightly customized
version of the algorithm eliminate the extra comparison for the worst case scenario?  Even
if not, I think the best case scenario happens 10x+ as often as worst case, so maybe it's
a good trade-off.

> Improve KeyValueHeap using loser tree
> -------------------------------------
>                 Key: HBASE-9969
>                 URL: https://issues.apache.org/jira/browse/HBASE-9969
>             Project: HBase
>          Issue Type: Improvement
>          Components: Performance, regionserver
>            Reporter: Chao Shi
>            Assignee: Chao Shi
>             Fix For: 0.98.0, 0.96.1
>         Attachments: hbase-9969-v2.patch, hbase-9969.patch, hbase-9969.patch, kvheap-benchmark.png,
> LoserTree is the better data structure than binary heap. It saves half of the comparisons
on each next(), though the time complexity is on O(logN).
> Currently A scan or get will go through two KeyValueHeaps, one is merging KVs read from
multiple HFiles in a single store, the other is merging results from multiple stores. This
patch should improve the both cases whenever CPU is the bottleneck (e.g. scan with filter
over cached blocks, HBASE-9811).
> All of the optimization work is done in KeyValueHeap and does not change its public interfaces.
The new code looks more cleaner and simpler to understand.

This message was sent by Atlassian JIRA

View raw message