accumulo-notifications mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jonathan Park (JIRA)" <>
Subject [jira] [Updated] (ACCUMULO-2827) HeapIterator optimization
Date Tue, 24 Jun 2014 19:42:24 GMT


Jonathan Park updated ACCUMULO-2827:

    Attachment: ACCUMULO-2827-compaction-performance-test.patch.2

I decided to re-run the tests but this time running each case 10 times in order to get rid
of some of the noise. I threw away the first few test results + averaging rates across the
10 runs. These #s look more promising and seems to better correlate with [~kturner]'s earlier

||files||rows||cols||rate w/o patch||rate w/ patch PQ||PQ speedup||

[~dlmarion] I'm not entirely sure what was going on in my earlier tests as we shouldn't have
seen that large of a hit. The single column case is a close approximation of the worst case
for this optimization since it should cause high % of interleaving across iterators which
goes against our assumption in the optimization. 

I've posted an updated patch (from keith's original) that includes the test harness used for
this test.

> HeapIterator optimization
> -------------------------
>                 Key: ACCUMULO-2827
>                 URL:
>             Project: Accumulo
>          Issue Type: Improvement
>    Affects Versions: 1.5.1, 1.6.0
>            Reporter: Jonathan Park
>            Assignee: Jonathan Park
>            Priority: Minor
>             Fix For: 1.5.2, 1.6.1, 1.7.0
>         Attachments: ACCUMULO-2827-compaction-performance-test.patch, ACCUMULO-2827-compaction-performance-test.patch.2,
ACCUMULO-2827.0.patch.txt, ACCUMULO-2827.1.patch.txt, ACCUMULO-2827.2.patch.txt, ACCUMULO-2827.3.patch.txt,, accumulo-2827.raw_data, new_heapiter.png, old_heapiter.png, together.png
> We've been running a few performance tests of our iterator stack and noticed a decent
amount of time spent in the HeapIterator specifically related to add/removal into the heap.
> This may not be a general enough optimization but we thought we'd see what people thought.
Our assumption is that it's more probable that the current "top iterator" will supply the
next value in the iteration than not. The current implementation takes the other assumption
by always removing + inserting the minimum iterator back into the heap. With the implementation
of a binary heap that we're using, this can get costly if our assumption is wrong because
we pay the log penalty of percolating up the iterator in the heap upon insertion and again
when percolating down upon removal.
> We believe our assumption is a fair one to hold given that as major compactions create
a log distribution of file sizes, it's likely that we may see a long chain of consecutive
entries coming from 1 iterator. Understandably, taking this assumption comes at an additional
cost in the case that we're wrong. Therefore, we've run a few benchmarking tests to see how
much of a cost we pay as well as what kind of benefit we see. I've attached a potential patch
(which includes a test harness) + image that captures the results of our tests. The x-axis
represents # of repeated keys before switching to another iterator. The y-axis represents
iteration time. The sets of blue + red lines varies in # of iterators present in the heap.

This message was sent by Atlassian JIRA

View raw message