commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex D Herbert (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (RNG-54) StringSampler
Date Thu, 16 Aug 2018 16:36:00 GMT

    [ https://issues.apache.org/jira/browse/RNG-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16582795#comment-16582795
] 

Alex D Herbert commented on RNG-54:
-----------------------------------

I just had a look at commons Text and that currently has no dependencies on commons RNG.

However the class {{org.apache.commons.text.RandomStringGenerator}} can create a random string
using any source of randomness that can create nextInt(int max). The Javadoc recommends using commons
RNG to provide this function. However it will fall back to {{java.util.concurrent.ThreadLocalRandom}}
if missing.

The class actually provides random functionality for strings with any character set (passed
as a list), or using a range of character code points.

There is also {{org.apache.commons.lang3.RandomStringUtils}} which also supports picking from
an array of char. So I've added these to a benchmark using JMH.

There is also the options of using Commons Codec to encode the bytes array into something.
So I've added this to the test.

All methods were called with the same RNG (wrapped if appropriate) and restored to the same
sate before each run. The table shows the relative speed using the average of 10 runs generating
10000 random strings of the given length:
||Benchmark||Benchmark||Length||Relative||Score||Score Error (99.9%)||Source||
|Codec|NextBytesAndHexString|32|0.070|655.422|134.936|MWC_256|
|Math|RandomDataGenerator_nextHexStringModified|32|0.493|4580.968|593.928|MWC_256|
|Math|RandomDataGenerator_nextHexStringOriginal|32|0.644|5989.369|1157.043|MWC_256|
|Text|RandomStringGeneratorWith16CodePoints|32|0.391|3635.981|492.750|MWC_256|
|Text|RandomStringGeneratorWithCharArray|32|1.000|9300.704|1807.697|MWC_256|
|Lang|RandomStringUtilsWith16CodePoints|32|0.242|2247.133|352.087|MWC_256|
|Lang|RandomStringUtilsWithCharArray|32|0.217|2015.916|424.406|MWC_256|
|new|StringSampler|32|0.064|595.156|3.724|MWC_256|
|Codec|NextBytesAndHexString|1024|0.063|17660.668|2972.286|MWC_256|
|Math|RandomDataGenerator_nextHexStringModified|1024|0.575|162216.015|42357.048|MWC_256|
|Math|RandomDataGenerator_nextHexStringOriginal|1024|0.725|204565.773|27611.786|MWC_256|
|Text|RandomStringGeneratorWith16CodePoints|1024|0.396|111745.938|8382.560|MWC_256|
|Text|RandomStringGeneratorWithCharArray|1024|1.000|282045.389|51805.233|MWC_256|
|Lang|RandomStringUtilsWith16CodePoints|1024|0.268|75699.276|10016.061|MWC_256|
|Lang|RandomStringUtilsWithCharArray|1024|0.198|55732.681|195.122|MWC_256|
|new|StringSampler|1024|0.062|17350.684|3769.534|MWC_256|

So:
 * RandomStringGenerator with a {{char[]}} is slower than the other implementations in other
commons packages. This is because the char is generated by conversion of a character to a
string then to a code point. The method that uses a range of code points directly is much
faster.
 * The old Commons Math implementation is the next slowest. It converts to a Hex character
using {{Integer.toHexString}} on each byte which generates 1 or 2 hex characters.
 * Commons Lang RandomStringUtils is fastest of the current implementations that support flexible
charsets.
 * StringSampler is fastest. 2x faster than Commons Lang for small strings and 4x faster
than Commons Lang for long strings.
 * However using Commons Codec to encode the byte array is basically the same speed.

All the current implementations use a {{StringBuilder}} to create the string and use {{nextInt(int)}} once
per character.

{{StringSampler}} directly writes to a {{char[]}} and uses the {{nextBytes(byte[] bytes)}}
method of {{UniformRandomProvider}}. The implementation is actually a reverse of the packing
of int or long values into a byte array performed by the base {{org.apache.commons.rng.core.source32.IntProvider}}
and {{org.apache.commons.rng.core.source64.LongProvider}}.

This is basically matched by using Commons Codec and so if anything the nextHexString method
should probably be added to that library.

 

> StringSampler
> -------------
>
>                 Key: RNG-54
>                 URL: https://issues.apache.org/jira/browse/RNG-54
>             Project: Commons RNG
>          Issue Type: Improvement
>          Components: sampling
>    Affects Versions: 1.1
>            Reporter: Alex D Herbert
>            Priority: Minor
>
> There is currently no equivalent for the function {{org.apache.commons.math3.random.RandomDataGenerator.nextHexString(int)}}.
> Here is the original version adapted to use the {{UniformRandomProvider:}}
> {code:java}
> public String nextHexStringOriginal(UniformRandomProvider ran, int len) {
>     // Initialize output buffer
>     StringBuilder outBuffer = new StringBuilder();
>     // Get int(len/2)+1 random bytes
>     byte[] randomBytes = new byte[(len / 2) + 1];
>     ran.nextBytes(randomBytes);
>     // Convert each byte to 2 hex digits
>     for (int i = 0; i < randomBytes.length; i++) {
>         Integer c = Integer.valueOf(randomBytes[i]);
>         /*
>          * Add 128 to byte value to make interval 0-255 before doing hex conversion.
>          * This guarantees <= 2 hex digits from toHexString() toHexString would
>          * otherwise add 2^32 to negative arguments.
>          */
>         String hex = Integer.toHexString(c.intValue() + 128);
>         // Make sure we add 2 hex digits for each byte
>         if (hex.length() == 1) {
>             hex = "0" + hex;
>         }
>         outBuffer.append(hex);
>     }
>     return outBuffer.toString().substring(0, len);
> }
> {code}
> Note: I removed the length check to make the speed test (see below) fair.
> This makes use of {{StringBuider}} and is not very efficient. I have created a version
based on the Hex encoding within {{org.apache.commons.codec.digest.DigestUtils}} and {{org.apache.commons.codec.binary.Hex}}.
This uses a direct look-up of the hex character using the index from successive 4 bits of
a byte array to form an index from 0-15.
> Here's the function without details of how the {{byte[]}} is correctly sized:
> {code:java}
> private static final char[] DIGITS_LOWER = { 
>     '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f' };
> private static String nextHexString(UniformRandomProvider rng, byte[] bytes, int length)
{
>     rng.nextBytes(bytes);
>     // Use the upper and lower 4 bits of each byte as an
>     // index in the range 0-15 for each hex character.
>     final char[] out = new char[length];
>     // Run the loop without checking index j by producing characters
>     // up to the size below the desired length.
>     final int loopLimit = length / 2;
>     int i = 0, j = 0;
>     while (i < loopLimit) {
>         final byte b = bytes[i];
>         // 0x0F == 0x01 | 0x02 | 0x04 | 0x08
>         out[j++] = DIGITS_LOWER[(b >>> 4) & 0x0F];
>         out[j++] = DIGITS_LOWER[b & 0x0F];
>         i++;
>     }
>     // The final character
>     if (j < length)
>         out[j++] = DIGITS_LOWER[(bytes[i] >>> 4) & 0x0F];
>     return new String(out);
> }
> {code}
> I've compared this to the original function and a modified one below that computes the
exact same strings:
> {code:java}
> public String nextHexStringModified(UniformRandomProvider ran, int len) {
>     // Initialize output buffer
>     StringBuilder outBuffer = new StringBuilder();
>     // byte[] randomBytes = new byte[(len/2) + 1]; // ORIGINAL
>     byte[] randomBytes = new byte[(len + 1) / 2];
>     ran.nextBytes(randomBytes);
>     // Convert each byte to 2 hex digits
>     for (int i = 0; i < randomBytes.length; i++) {
>         // ORIGINAL
>         // Integer c = Integer.valueOf(randomBytes[i]);
>         // String hex = Integer.toHexString(c.intValue() + 128);
>         String hex = Integer.toHexString(randomBytes[i] & 0xff);
>         // Make sure we add 2 hex digits for each byte
>         if (hex.length() == 1) {
>             outBuffer.append('0');
>         }
>         outBuffer.append(hex);
>     }
>     return outBuffer.toString().substring(0, len);
> }
> {code}
> The timings are:
>  
> ||Name||Time||Relative||
> |StringSampler|316103|0.073|
> |nextHexStringModified|3708104|0.853|
> |nextHexStringOriginal|4348063|1.000|
> This is not using JMH but the results show the method performs better.
> The full {{StringSampler}} class supports a radix of 2, 8, and 16 for binary, octal and
hex strings.
> JUnit tests show: the sampler computes the same values as {{nextHexStringModified(int);}} edges
cases are handled with exceptions; and the output strings are uniform for each of the supported
character sets (using a Chi Squared test).
> Can I create a PR for a {{org.apache.commons.rng.sampling.StringSampler}}?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message