accumulo-notifications mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Will Murnane (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-4468) accumulo.core.data.Key.equals(Key, PartialKey) improvement
Date Wed, 21 Sep 2016 21:00:24 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511154#comment-15511154
] 

Will Murnane commented on ACCUMULO-4468:
----------------------------------------

Responding to [~ctubbsii]'s points first:

# Most things in Accumulo are dealt with in sorted order, so I think it's reasonable to optimize
for the case where things are sorted. Whether this is true in the general case is a function
of how one calls the function, but it's not a change to the functionality of the code, just
its performance. All the places I can see it used in Accumulo, the keys that are being compared
are coming from SortedKeyValueIterator, so it's reasonable that we're comparing keys which
are sorted, and the new behavior will be an improvement.
# I ran some tests with JMH to see what the difference is like. I can post the whole project,
but here are some selected results. There are three functions being compared. I implemented
a subclass of Key which has new methods: my proposed equals() function is called customWill,
the original equals() copied to the new class is called customVanilla, and the original equals()
in the original Key class is called standardEquals. The numbers to compare are really the
two custom* ones; the standardEquals value is given just to show that it's in the same ballpark.
\\ The test data set is generated before the benchmark begins, and contains 1m keys to go
through, calling previous.equals(next, ROW_COLFAM). This is a worst-case scenario for sorted
input, because the most that it can manage to avoid doing (as compared to the current implementation)
is comparing the row values.
## I generated rows whose rowID are all equal, and the column family changes every key.
{noformat}
Benchmark                    Mode  Cnt   Score   Error  Units
MyBenchmark.customVanilla   thrpt   30  46.320 ± 3.803  ops/s
MyBenchmark.customWill      thrpt   30  88.349 ± 2.723  ops/s
MyBenchmark.standardEquals  thrpt   30  36.736 ± 0.883  ops/s
{noformat}
## I generated rows whose rowID are all equal, and the column family changes every 3 keys.
{noformat}
Benchmark                    Mode  Cnt   Score   Error  Units
MyBenchmark.customVanilla   thrpt   30  30.684 ± 1.258  ops/s
MyBenchmark.customWill      thrpt   30  34.292 ± 1.339  ops/s
MyBenchmark.standardEquals  thrpt   30  27.277 ± 0.984  ops/s
{noformat}
## I generated keys whose rowID are all equal, and the column family changes every 5 keys.

{noformat}
Benchmark                    Mode  Cnt   Score   Error  Units
MyBenchmark.customVanilla   thrpt   30  27.195 ± 0.895  ops/s
MyBenchmark.customWill      thrpt   30  30.048 ± 0.838  ops/s
MyBenchmark.standardEquals  thrpt   30  25.044 ± 0.731  ops/s
{noformat}
## Finally, I generated keys whose rowID are all equal, and the column family changes every
1000 keys.
{noformat}
Benchmark                    Mode  Cnt   Score   Error  Units
MyBenchmark.customVanilla   thrpt   30  23.447 ± 1.010  ops/s
MyBenchmark.customWill      thrpt   30  23.427 ± 0.371  ops/s
MyBenchmark.standardEquals  thrpt   30  22.356 ± 0.192  ops/s
{noformat}
# As a user of the API, I'd rather not have to think about equalsForward() versus equalsBackward().
Maybe add an optional flag to specify direction of comparison, for 
# I agree in general, but I would argue that the code of the current equals() method is messier
and harder to read. It repeats the same code quite a bit. I wrote a comment in the patch remarking
on the fact that there's fallthrough and it's used intentionally, to try to prevent future
confusion.

[~elserj] I agree that any change should start with prejudice against it. However, I think
the numbers above prove my case: when keys are presented in sorted order, which happens often
in Accumulo, the proposed method of comparing is slightly but noticeably faster. The degree
of improvement depends on the data, but it doesn't perform worse than the current solution
in any case that I tested.

> accumulo.core.data.Key.equals(Key, PartialKey) improvement
> ----------------------------------------------------------
>
>                 Key: ACCUMULO-4468
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4468
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.8.0
>            Reporter: Will Murnane
>            Priority: Trivial
>              Labels: newbie, performance
>         Attachments: key_comparison.patch
>
>
> In the Key.equals(Key, PartialKey) overload, the current method compares starting at
the beginning of the key, and works its way toward the end. This functions correctly, of course,
but one of the typical uses of this method is to compare adjacent rows to break them into
larger chunks. For example, accumulo.core.iterators.Combiner repeatedly calls this method
with subsequent pairs of keys.
> I have a patch which reverses the comparison order. That is, if the method is called
with ROW_COLFAM_COLQUAL_COLVIS, it will compare visibility, cq, cf, and finally row. This
(marginally) improves the speed of comparisons in the relatively common case where only the
last part is changing, with less complex code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message