commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex D Herbert (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (RNG-87) MultiplyWithCarry256
Date Wed, 27 Mar 2019 00:26:00 GMT

    [ https://issues.apache.org/jira/browse/RNG-87?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802299#comment-16802299
] 

Alex D Herbert commented on RNG-87:
-----------------------------------

I have investigated the overhead involved in the test by repeat calling the RNG inside the
loop. The return value is used in an addition to prevent it being optimised away. Here's an
example for 3 calls:
{code:java}
public void nextInt3(Sources sources, Blackhole bh) {
    int value = 0;
    UniformRandomProvider rng = sources.getGenerator();
    for (int i = numValues; i-- != 0; ) {
        value += rng.nextInt();
        value += rng.nextInt();
        value += rng.nextInt();
    }
    bh.consume(value);
}
{code}
A picture tells a 1000 words...

!MWC_256.jpg!

There is something strange happening with the start-up cost or test overhead of the original
MWC_256 algorithm. The slope is perhaps slightly steeper than the others but the difference
at 1 call to RNG is so big it appears to be twice as slow. When 7 calls are made the difference
is 43% more than MWC_256C.

I would say that this still looks like a victory for MWC_256C. The gradient is less than for
MWC_256B.

This seems to indicate that the current benchmarking method of using a loop and consuming
the output value from the RNG inside the loop is not suitable. The overhead in the loop construct
and the Blackhole consume method *relative* to the speed that the RNG can produce numbers
is noticable. This can be reduced by running a second loop inside the first.

In the example above I manually unrolled the loop. I've also tried an inner code loop with
a constant for the loop limit and this produces a similar result although the two new methods
are now very close:

!MWC_256_2.jpg!

In this case the inner loop may be preventing some optimisations with MWC_256. It does not
scale close to the other methods.

To add fuel to the fire I let JMH set the number of values for the inner loop:
{code:java}
public void nextInt(Sources sources, Blackhole bh) {
    int value = 0;
    UniformRandomProvider rng = sources.getGenerator();
    for (int i = numValues; i-- > 0;) {
        for (int j = numValues2; j-- > 0;) {
            value += rng.nextInt();
        }
    }
    bh.consume(value);
}
{code}
!MWC_256_3.jpg!

So:

3 methods for using an inner loop to make the RNG do extra work to investigate scaling:
 * Manually unroll a loop. The original MWC_256 is a bit worse. There is some noticeable start-up
cost for when the inner iteration is 1 for the MWC_256 method which makes it appears worse
than it is relative to the others.
 * Code an inner loop with a constant (hopefully it will be unrolled although this is not
certain). The original MWC_256 is a lot worse.
 * Code an inner loop with a limit set using a JMH controlled variable. The original MWC_256
is worse.

It is not clear cut which of MWC_256B or C is better as each attempt to run a benchmark had
different winners. It would be a 1 win, 1 draw and 1 lose for C vs B in the order listed above.

The original is worse in all scenarios. Whichever test method is most applicable to real world
scenarios where more work is done with the random number that is generated I cannot say. So
let's do a test of some less trivial work:
{code:java}
public void shuffle(Sources sources, Blackhole bh) {
    UniformRandomProvider rng = sources.getGenerator();
    // Work array
    int[] array = new int[numValues2];
    for (int j = numValues2; j-- > 0;) {
        array[j] = j;
    }
    for (int i = numValues; i-- > 0;) {
        for (int j = numValues2; j-- > 0;) {
            // Fischer-Yates shuffle
            int k = rng.nextInt(j + 1);
            int tmp = array[j];
            array[j] = array[k];
            array[k] = tmp;
        }
    }
    bh.consume(array[0]);
}
{code}
!Shuffle.jpg!

So the new methods (B & C) are preferable here with method C the marginal winner.

 

> MultiplyWithCarry256
> --------------------
>
>                 Key: RNG-87
>                 URL: https://issues.apache.org/jira/browse/RNG-87
>             Project: Commons RNG
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.3
>            Reporter: Alex D Herbert
>            Assignee: Alex D Herbert
>            Priority: Minor
>              Labels: performance
>         Attachments: MWC_256.jpg, MWC_256_2.jpg, MWC_256_3.jpg, MWC_256_3.png, Shuffle.jpg
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The {{MultiplyWithCarry256}} currently uses a length 256 array for internal state. This
is cycled through continuously. The current implementation refills the state every 256 calls
to next:
> {code:java}
> public int next() {
>     if (index == Q_SIZE) { // Whole state used up.
>         // Refill.
>         for (int i = 0; i < Q_SIZE; i++) {
>             final long t = A * (state[i] & 0xffffffffL) + carry;
>             carry = (int) (t >> 32);
>             state[i] = (int) t;
>         }
>         // Reset current index.
>         index = 0;
>     }
>     return state[index++];
> }
> {code}
> This can be changed to continuously update the state for only a single index on each
call to next:
> {code:java}
> public int next() {
>     // Produce an index in the range 0-255
>     final int i = index++ & 0xFF;
>     final long t = A * (state[i] & 0xffffffffL) + carry;
>     carry = (int) (t >> 32);
>     return state[i] = (int) t;
> }
> {code}
> This avoids an {{if}} statement check and *marginally* improves performance. It has the
advantage of not advancing the state further ahead than necessary.
> MWC_256B is the new version. JMH results from the GenerationPerformance benchmark.
> {noformat}
> openjdk version "1.8.0_191"
> OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.16.04.1-b12)
> OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
> {noformat}
> ||numValues||randomSourceName||Method||Score||Error||Median||
> |1|SPLIT_MIX_64|nextInt|0.00478|4.45e-05|0.00477|
> |1|MWC_256|nextInt|0.00521|1.69e-05|0.00521|
> |1|MWC_256B|nextInt|0.00519|0.000321|0.00512|
> |100|SPLIT_MIX_64|nextInt|0.421|0.0131|0.418|
> |100|MWC_256|nextInt|0.452|0.000968|0.452|
> |100|MWC_256B|nextInt|0.443|0.00183|0.442|
> |10000|SPLIT_MIX_64|nextInt|41.7|0.0725|41.7|
> |10000|MWC_256|nextInt|44.5|2.25|43.9|
> |10000|MWC_256B|nextInt|42.6|0.037|42.6|
> |1000000|SPLIT_MIX_64|nextInt|3.77e+03|21.2|3.77e+03|
> |1000000|MWC_256|nextInt|4.16e+03|12.8|4.16e+03|
> |1000000|MWC_256B|nextInt|3.98e+03|120|3.96e+03|
> {noformat}
> java version "11.0.2" 2019-01-15 LTS
> Java(TM) SE Runtime Environment 18.9 (build 11.0.2+9-LTS)
> Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.2+9-LTS, mixed mode)
> {noformat}
> ||numValues||randomSourceName||Method||Score||Error||Median||
> |1|SPLIT_MIX_64|nextInt|0.00469|4.98e-06|0.00469|
> |1|MWC_256|nextInt|0.00553|2.09e-05|0.00553|
> |1|MWC_256B|nextInt|0.00493|4.32e-05|0.00492|
> |100|SPLIT_MIX_64|nextInt|0.435|0.000624|0.435|
> |100|MWC_256|nextInt|0.467|0.00284|0.466|
> |100|MWC_256B|nextInt|0.455|0.00105|0.455|
> |10000|SPLIT_MIX_64|nextInt|43|4.94|41.9|
> |10000|MWC_256|nextInt|45.8|0.132|45.8|
> |10000|MWC_256B|nextInt|43|0.033|43|
> |1000000|SPLIT_MIX_64|nextInt|3.7e+03|7.88|3.7e+03|
> |1000000|MWC_256|nextInt|4.17e+03|45.3|4.15e+03|
> |1000000|MWC_256B|nextInt|4.12e+03|4.76|4.12e+03|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message