cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Daniel Cranford (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-12744) Randomness of stress distributions is not good
Date Thu, 05 Oct 2017 12:50:00 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192820#comment-16192820
] 

Daniel Cranford commented on CASSANDRA-12744:
---------------------------------------------

Some more thoughts:

The generation of partition keys has been "broken" since CASSANDRA-7519

The Linear Congruential Generators (LCGs) used in java.util.Random and by extension JDKRandomGenerator
generate good random number sequences, but similar seeds result in similar sequences. Using
the lcg update function {{lcg\(x) = a*x + c}} like

random ~1~ = lcg(1)
random ~2~ = lcg(2)
random ~3~ = lcg(3)
...
random ~n~ = lcg\(n)

does not generate a good random sequence, this is a misuse of the LCG. LCGs are supposed to
be used like

random ~1~ = lcg(1)
random ~2~ = lcg(lcg(1))
random ~3~ = lcg(lcg(lcg(1)))
...
random ~n~ = lcg ^n^ (1)

I say "broken" in quotes because the misuse of LCGs ends up not mattering. {{new java.util.Random(seed).nextDouble()}}
will always differ from {{new java.util.Random(seed + 1).nextDouble()}} by more than 1/100,000,000,000
Thus with the default partition key population (=UNIFORM(1..100B)), seeds that differ by 1
will generate distinct partition keys.

The only thing that matters about partition keys is how many distinct values there are (and
how large their lexical value is). The number of partition key components doesn't matter.
The cardinality of each partition key component doesn't matter. The distribution of values
in the lexical partition key space doesn't matter.

At the end of the day, all the partition key components get concatenated and the resulting
bit vector is hashed resulting in a uniformly distributed 64 bit token that determines where
the data will be stored.

The easiest "fix" is to not use the partition key population to define the number of partition
keys. Take advantage of the fact that the only thing that matters from a performance standpoint
is the number of distinct partitions. Leave the partition key distribution at uniform(1..100B),
and use the n= parameter to define the number of partitions.

An ideal fix would update the way partition keys are generated to use the LCG generator properly.
However, this seems difficult since LCGs don't support random access (i.e., the only way to
calculate the nth item in an LCG sequence is to first calculate the n-1 preceding items),
and all three seed generation modes rely on the ability to randomly jump around in the seed
sequence. This could be worked around by using a PRNG that supports random access to the n'th
item in the sequence (e.g. something like JDK 1.8's SpittableRandom could be easily extended
to support this).

A more workable fix is to spread the generated seeds (typically drawn from a smallish range
of integers) around in the 2 ^64^ values a long can take before seeding the LCG. An additional
caveat to whatever function is used for spreading the seeds needs to be invertable since LookbackableWriteGenerator's
implementation relies on the properties of the sequence it generates to perform internal bookeeping.

Multiplication by an odd integer happens to be an invertable function (although integer division
is NOT the inverse operation, multiplication by the modular inverse is). So the initial implementation
(although broken) is not actually that bad an idea. I propose fixing things by picking a static
integer as the multiplier and using multiplication by it's modular inverse to invert it for
LookbackableWriteGenerator


> Randomness of stress distributions is not good
> ----------------------------------------------
>
>                 Key: CASSANDRA-12744
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12744
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: T Jake Luciani
>            Assignee: Ben Slater
>            Priority: Minor
>              Labels: stress
>             Fix For: 4.0
>
>         Attachments: CASSANDRA_12744_SeedManager_changes-trunk.patch
>
>
> The randomness of our distributions is pretty bad.  We are using the JDKRandomGenerator()
but in testing of uniform(1..3) we see for 100 iterations it's only outputting 3.  If you
bump it to 10k it hits all 3 values. 
> I made a change to just use the default commons math random generator and now see all
3 values for n=10



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org


Mime
View raw message