commons-issues mailing list archives

From "Alex D Herbert (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (RNG-87) MultiplyWithCarry256
Date Fri, 29 Mar 2019 11:42:00 GMT

    [ https://issues.apache.org/jira/browse/RNG-87?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16804852#comment-16804852
] 

Alex D Herbert commented on RNG-87:
-----------------------------------

To investigate the overhead that JMH adds to a test, I have been through the JMH examples.
Here are some points highlighted in the examples (from [JMH 1.21|http://hg.openjdk.java.net/code-tools/jmh/file/f25ae8584db1/jmh-samples/src/main/java/org/openjdk/jmh/samples]):
 - JMH Sample 8 DeadCode - Set a baseline for the JMH infrastructure calling a method.
 - JMH Sample 8 DeadCode - Ensure no dead code elimination: the method must have side effects
or return a value.
 - JMH Sample 10 Constant Fold - Do not work on final variables where the result is predictable
and can be optimised outside the benchmark, e.g. a baseline RNG should be unpredictable (have
internal state).
 - JMH Sample 11 Loops - Don't overuse loops inside the benchmark method. Loops can be heavily
unrolled/pipelined.
 - JMH Sample 16 CompilerControl - Allows control over inlining (so we can test inlining versus
not inlining a method).
 - JMH Sample 21 ConsumeCPU - States that time should scale linearly with the work payload.
 - JMH Sample 34 SafeLooping - Can avoid the use of a Blackhole by using a sink method that
is not inlined.
 - JMH Blackhole source for consumeCPU(long) - Loops should count down, not up.
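As a minimal sketch of the sink-method idea from Sample 34 (the class and field names here are my own, not from JMH): the value must be kept "live" so the JIT cannot eliminate the computation that produced it. In a real JMH benchmark the sink would be an empty method annotated with {{@CompilerControl(CompilerControl.Mode.DONT_INLINE)}}; writing to a volatile field, as below, is a plain-Java stand-in that achieves the same effect:

```java
public class SinkExample {
    // A volatile write keeps a computed value "live". JMH sample 34 instead
    // uses an empty method annotated with
    // @CompilerControl(CompilerControl.Mode.DONT_INLINE) so the JIT cannot
    // inline it and prove the argument is unused.
    static volatile int sinkField;

    static void sink(int v) {
        sinkField = v;
    }

    public static void main(String[] args) {
        int sum = 0;
        for (int i = 10; i > 0; i--) { // count down, per the Blackhole advice
            sum += i;
            sink(sum);                 // prevents dead-code elimination of sum
        }
        System.out.println(sinkField);
    }
}
```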

So, to start investigating run time, we should test with a loop and check that it scales linearly,
and we add a baseline to measure the impact of the JMH testing overhead. The test is fairly
standard, but this time we count down:
{code:java}
@Benchmark
public void nextInt(Sources sources, Blackhole bh) {
    for (int i = numValues; i > 0; i--) {
        bh.consume(sources.getGenerator().nextInt());
    }
}
{code}
To measure the overhead we can use a stateful pseudo-RNG:
{code:java}
private static class IntValue extends IntProvider {
    /**
     * The fixed value to return.
     *
     * <p><strong>DON'T</strong> make this final!
     * This must be viewed by the JVM as something that cannot be optimised away.
     */
    private int value;

    @Override
    public int next() {
        // This will not be optimised away since the source value is not final.
        return value;
    }
}

private static class IntCounter extends IntProvider {
    int state;

    @Override
    public int next() {
        return ++state;
    }
}
{code}
And we measure for small loops (5 - 25):

!small.jpg!

And large (500,000 - 2,500,000):

!big.jpg!

Notice that the MWC_256C is slower for small loops. I think this is related to the fact that
the code only refills the state every 256 calls. The JIT optimiser is probably able to work
with this and only call the refill once and the rest of the time just extracts the pre-computed
state. When the loops are big then there are so many calls this optimisation is not possible.

Conclusions:
 - A baseline shows that the JMH test has an overhead.
 - Even a simple {{int}} counter can be seen to be slower than returning the same value.
 - The scaling is linear for the generators.
 - The scaling is linear for the baseline (so it is not being optimised away).

Note: Coefficient of determination (R^2) is > 0.9995 for all lines.

How to measure relative speed:

Per timepoint:
{noformat}
time = [payload time] - [baseline time]
relative time = time1 / time2
{noformat}
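As a sketch, the per-timepoint calculation above can be coded as follows (the payload scores are the 5-value row from the small-loop table below; the baseline value of 1.0 is hypothetical, for illustration only):

```java
public class RelativeTime {
    /**
     * Relative time: subtract the JMH baseline from each payload time,
     * then take the ratio of the corrected times.
     */
    static double relative(double payload1, double payload2, double baseline) {
        double time1 = payload1 - baseline;
        double time2 = payload2 - baseline;
        return time1 / time2;
    }

    public static void main(String[] args) {
        // MWC_256C vs MWC_256 at 5 values; baseline of 1.0 is hypothetical.
        System.out.printf("%.2f%n", relative(12.17, 11.29, 1.0));
    }
}
```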
Small:
|Values|MWC_256|MWC_256C|Relative|
|5|11.29|12.17|1.08|
|10|23.00|26.15|1.14|
|15|35.55|37.17|1.05|
|20|46.82|49.29|1.05|
|25|59.48|63.05|1.06|

Big:
|Values|MWC_256|MWC_256C|Relative|
|500000|953,039.65|817,954.66|0.86|
|1000000|1,989,904.18|1,666,570.88|0.84|
|1500000|2,991,155.15|2,521,373.00|0.84|
|2000000|3,873,176.59|3,316,201.06|0.86|
|2500000|4,817,472.08|4,131,018.54|0.86|

Given that the results scale linearly we can skip running all the different loop sizes and let
JMH do it. The code then uses an implicit Blackhole by returning the value. Note we add in
a baseline for calling a method and for calling a method that returns an {{int}}:
{code:java}
@Benchmark
public void baselineMethodCall() {
    // do nothing, this is a baseline for JMH running a method
}

private int intValue;

@Benchmark
public int baselineConsumeInt() {
    return intValue;
}

@Benchmark
public int nextInt1(Sources sources) {
    return sources.getGenerator().nextInt();
}
{code}
 And we get:
|randomSourceName|Method|Score|Time|Relative|
| |baselineMethodCall|0.268| | |
| |baselineConsumeInt|2.048| | |
|IntValue|nextInt1|2.368| | |
|MWC_256B|nextInt1|3.982|1.614|0.933|
|MWC_256C|nextInt1|3.997|1.629|0.942|
|MWC_256|nextInt1|4.098|1.730|1.000|

Here the relative scores of the new methods are not quite as high as we saw in the big-loop
test, but not as low as in the small-loop test. Judging by the standard deviations (not shown),
the differences are probably not statistically significant, although I have not tested that.

In this test we have offloaded all the repeat testing to JMH, so avoiding any pitfalls that
we did not consider in our loop (but the JMH authors will have).

Note also that the combined time of the baselines for an empty method call and for a method
call returning an int is very close to the baseline {{IntValue}} provider (0.26 + 2.04 =
2.31 versus 2.36). So we need not include these and can trust that the dumb {{IntValue}}
provider is a good baseline reference.

 Let us revisit the difference in {{long}} generation with a benchmark for that:
|randomSourceName|Method|Score|Time|Relative|
|IntValue|nextLong1|2.570| | |
|MWC_256C|nextLong1|5.700|3.130|0.900|
|MWC_256B|nextLong1|5.709|3.139|0.902|
|MWC_256|nextLong1|6.048|3.478|1.000|

Here we again see no difference between the two new methods, although both are faster than the original.

So in conclusion either new method is better than the original. For readability I would vote
for method C:
{code:java}
@Override
public int next() {
    // Produce an index in the range 0-255
    index &= 0xff;
    final long t = A * (state[index] & 0xffffffffL) + carry;
    carry = (int) (t >> 32);
    return state[index++] = (int) t;
}
{code}
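A quick sanity check on the masking trick in method C: {{index &= 0xff}} wraps a monotonically increasing counter into the range 0-255, which is what lets the code drop the {{if (index == Q_SIZE)}} reset. The class below is illustrative only:

```java
public class IndexMask {
    public static void main(String[] args) {
        int index = 0;
        for (int call = 0; call < 600; call++) {
            index &= 0xff;               // wrap into [0, 255], like the reset
            if (index != call % 256) {
                throw new AssertionError("mask and modulo disagree at " + call);
            }
            index++;                     // post-increment, as in next()
        }
        // Unlike %, the mask stays non-negative even after integer overflow.
        System.out.println(Integer.MIN_VALUE & 0xff);
    }
}
```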

> MultiplyWithCarry256
> --------------------
>
>                 Key: RNG-87
>                 URL: https://issues.apache.org/jira/browse/RNG-87
>             Project: Commons RNG
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.3
>            Reporter: Alex D Herbert
>            Assignee: Alex D Herbert
>            Priority: Minor
>              Labels: performance
>         Attachments: MWC_256.jpg, MWC_256_2.jpg, MWC_256_3.jpg, MWC_256_3.png, Shuffle.jpg,
big.jpg, small.jpg
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The {{MultiplyWithCarry256}} currently uses a length 256 array for internal state. This
is cycled through continuously. The current implementation refills the state every 256 calls
to next:
> {code:java}
> public int next() {
>     if (index == Q_SIZE) { // Whole state used up.
>         // Refill.
>         for (int i = 0; i < Q_SIZE; i++) {
>             final long t = A * (state[i] & 0xffffffffL) + carry;
>             carry = (int) (t >> 32);
>             state[i] = (int) t;
>         }
>         // Reset current index.
>         index = 0;
>     }
>     return state[index++];
> }
> {code}
> This can be changed to continuously update the state for only a single index on each
call to next:
> {code:java}
> public int next() {
>     // Produce an index in the range 0-255
>     final int i = index++ & 0xFF;
>     final long t = A * (state[i] & 0xffffffffL) + carry;
>     carry = (int) (t >> 32);
>     return state[i] = (int) t;
> }
> {code}
> This avoids an {{if}} statement check and *marginally* improves performance. It has the
advantage of not advancing the state further ahead than necessary.
> MWC_256B is the new version. JMH results from the GenerationPerformance benchmark.
> {noformat}
> openjdk version "1.8.0_191"
> OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.16.04.1-b12)
> OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
> {noformat}
> ||numValues||randomSourceName||Method||Score||Error||Median||
> |1|SPLIT_MIX_64|nextInt|0.00478|4.45e-05|0.00477|
> |1|MWC_256|nextInt|0.00521|1.69e-05|0.00521|
> |1|MWC_256B|nextInt|0.00519|0.000321|0.00512|
> |100|SPLIT_MIX_64|nextInt|0.421|0.0131|0.418|
> |100|MWC_256|nextInt|0.452|0.000968|0.452|
> |100|MWC_256B|nextInt|0.443|0.00183|0.442|
> |10000|SPLIT_MIX_64|nextInt|41.7|0.0725|41.7|
> |10000|MWC_256|nextInt|44.5|2.25|43.9|
> |10000|MWC_256B|nextInt|42.6|0.037|42.6|
> |1000000|SPLIT_MIX_64|nextInt|3.77e+03|21.2|3.77e+03|
> |1000000|MWC_256|nextInt|4.16e+03|12.8|4.16e+03|
> |1000000|MWC_256B|nextInt|3.98e+03|120|3.96e+03|
> {noformat}
> java version "11.0.2" 2019-01-15 LTS
> Java(TM) SE Runtime Environment 18.9 (build 11.0.2+9-LTS)
> Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.2+9-LTS, mixed mode)
> {noformat}
> ||numValues||randomSourceName||Method||Score||Error||Median||
> |1|SPLIT_MIX_64|nextInt|0.00469|4.98e-06|0.00469|
> |1|MWC_256|nextInt|0.00553|2.09e-05|0.00553|
> |1|MWC_256B|nextInt|0.00493|4.32e-05|0.00492|
> |100|SPLIT_MIX_64|nextInt|0.435|0.000624|0.435|
> |100|MWC_256|nextInt|0.467|0.00284|0.466|
> |100|MWC_256B|nextInt|0.455|0.00105|0.455|
> |10000|SPLIT_MIX_64|nextInt|43|4.94|41.9|
> |10000|MWC_256|nextInt|45.8|0.132|45.8|
> |10000|MWC_256B|nextInt|43|0.033|43|
> |1000000|SPLIT_MIX_64|nextInt|3.7e+03|7.88|3.7e+03|
> |1000000|MWC_256|nextInt|4.17e+03|45.3|4.15e+03|
> |1000000|MWC_256B|nextInt|4.12e+03|4.76|4.12e+03|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
