commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex D Herbert (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (RNG-54) StringSampler
Date Tue, 25 Sep 2018 13:09:00 GMT

    [ https://issues.apache.org/jira/browse/RNG-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16627330#comment-16627330
] 

Alex D Herbert commented on RNG-54:
-----------------------------------

{quote}if radix is restricted to small powers of 2, I'd suggest defining an enum
{quote}
I can see merrit in this. If you use a non-power of two then you have to call {{nextInt(int)}}
to get the random radix. So it is out of the scope of fast algorithms that rely on random
bytes.

Here's the problem formulated using {{int}} as the random source of bytes but it is easily
converted for {{long}}.
{noformat}
int bits = The radix
Samples per int = 32 / bits
{noformat}
||Bits||Samples per int||Remainder||As Fraction||Optimum||Name||
|1|32|0|32|32 per int|Binary|
|2|16|0|16|16 per int|-|
|3|10|2|10 2/32|32 per 3 int|Octal|
|4|8|0|8|8 per int|Hex|
|5|6|2|6 2/32|32 per 5 int|-|
|6|5|2|5 2/32|16 per 3 int|Base64 ASCII armoured encoding|
|7|4|4|4 4/32|32 per 7 int|-|

Note: 8 bits is just the current method {{nextBytes(byte[])}} so I've left that out.

There are two versions for this:
 - The first version is to extract the whole samples from each {{int}} and discard the remainder.

 - The second version is to collect the remainders until the sum can be factorised by the
number of bits. They can then be converted into more samples.

Out-of-interest I've tried both for Base64. For Base64 strings you need 3 {{int}} samples
and then you can get an extra Base64 character for 'free'. JMH could measure a performance
benefit, even though the 'free' sample provides 16 rather than 15, i.e. 6% more for a fair
bit of extra code.

I've not implemented both methods for the other bases. Just the simple method to discard the
fraction.

Keeping the fraction makes the code more complicated. It is also heading towards binary encoding
as per Commons codec where all the bits are vital. Although for that the sequence of the bytes
is also important. Here if the remainders are collected and used later the sequence is destroyed.

It may be a better idea to keep it simple and just discard the remainder. This will be easier
to support for {{long}} data sources too. In this case the table just doubles:
||Bits||Samples per long||Remainder||As Fraction||Optimum||Name||
|1|64|0|64|64 per long|Binary|
|2|32|0|32|32 per long|-|
|3|21|1|21 1/64|64 per 3 long|Octal|
|4|16|0|16|16 per long|Hex|
|5|12|4|12 4/64|64 per 5 long|-|
|6|10|4|10 4/64|32 per 3 long|Base64 ASCII armoured encoding|
|7|9|1|9 1/64|64 per 7 long|-|

> StringSampler
> -------------
>
>                 Key: RNG-54
>                 URL: https://issues.apache.org/jira/browse/RNG-54
>             Project: Commons RNG
>          Issue Type: Improvement
>          Components: sampling
>    Affects Versions: 1.1
>            Reporter: Alex D Herbert
>            Priority: Minor
>
> There is currently no equivalent for the function {{org.apache.commons.math3.random.RandomDataGenerator.nextHexString(int)}}.
> Here is the original version adapted to use the {{UniformRandomProvider:}}
> {code:java}
> public String nextHexStringOriginal(UniformRandomProvider ran, int len) {
>     // Initialize output buffer
>     StringBuilder outBuffer = new StringBuilder();
>     // Get int(len/2)+1 random bytes
>     byte[] randomBytes = new byte[(len / 2) + 1];
>     ran.nextBytes(randomBytes);
>     // Convert each byte to 2 hex digits
>     for (int i = 0; i < randomBytes.length; i++) {
>         Integer c = Integer.valueOf(randomBytes[i]);
>         /*
>          * Add 128 to byte value to make interval 0-255 before doing hex conversion.
>          * This guarantees <= 2 hex digits from toHexString() toHexString would
>          * otherwise add 2^32 to negative arguments.
>          */
>         String hex = Integer.toHexString(c.intValue() + 128);
>         // Make sure we add 2 hex digits for each byte
>         if (hex.length() == 1) {
>             hex = "0" + hex;
>         }
>         outBuffer.append(hex);
>     }
>     return outBuffer.toString().substring(0, len);
> }
> {code}
> Note: I removed the length check to make the speed test (see below) fair.
> This makes use of {{StringBuider}} and is not very efficient. I have created a version
based on the Hex encoding within {{org.apache.commons.codec.digest.DigestUtils}} and {{org.apache.commons.codec.binary.Hex}}.
This uses a direct look-up of the hex character using the index from successive 4 bits of
a byte array to form an index from 0-15.
> Here's the function without details of how the {{byte[]}} is correctly sized:
> {code:java}
> private static final char[] DIGITS_LOWER = { 
>     '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f' };
> private static String nextHexString(UniformRandomProvider rng, byte[] bytes, int length)
{
>     rng.nextBytes(bytes);
>     // Use the upper and lower 4 bits of each byte as an
>     // index in the range 0-15 for each hex character.
>     final char[] out = new char[length];
>     // Run the loop without checking index j by producing characters
>     // up to the size below the desired length.
>     final int loopLimit = length / 2;
>     int i = 0, j = 0;
>     while (i < loopLimit) {
>         final byte b = bytes[i];
>         // 0x0F == 0x01 | 0x02 | 0x04 | 0x08
>         out[j++] = DIGITS_LOWER[(b >>> 4) & 0x0F];
>         out[j++] = DIGITS_LOWER[b & 0x0F];
>         i++;
>     }
>     // The final character
>     if (j < length)
>         out[j++] = DIGITS_LOWER[(bytes[i] >>> 4) & 0x0F];
>     return new String(out);
> }
> {code}
> I've compared this to the original function and a modified one below that computes the
exact same strings:
> {code:java}
> public String nextHexStringModified(UniformRandomProvider ran, int len) {
>     // Initialize output buffer
>     StringBuilder outBuffer = new StringBuilder();
>     // byte[] randomBytes = new byte[(len/2) + 1]; // ORIGINAL
>     byte[] randomBytes = new byte[(len + 1) / 2];
>     ran.nextBytes(randomBytes);
>     // Convert each byte to 2 hex digits
>     for (int i = 0; i < randomBytes.length; i++) {
>         // ORIGINAL
>         // Integer c = Integer.valueOf(randomBytes[i]);
>         // String hex = Integer.toHexString(c.intValue() + 128);
>         String hex = Integer.toHexString(randomBytes[i] & 0xff);
>         // Make sure we add 2 hex digits for each byte
>         if (hex.length() == 1) {
>             outBuffer.append('0');
>         }
>         outBuffer.append(hex);
>     }
>     return outBuffer.toString().substring(0, len);
> }
> {code}
> The timings are:
>  
> ||Name||Time||Relative||
> |StringSampler|316103|0.073|
> |nextHexStringModified|3708104|0.853|
> |nextHexStringOriginal|4348063|1.000|
> This is not using JMH but the results show the method performs better.
> The full {{StringSampler}} class supports a radix of 2, 8, and 16 for binary, octal and
hex strings.
> JUnit tests show: the sampler computes the same values as {{nextHexStringModified(int);}} edges
cases are handled with exceptions; and the output strings are uniform for each of the supported
character sets (using a Chi Squared test).
> Can I create a PR for a {{org.apache.commons.rng.sampling.StringSampler}}?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message