cassandra-commits mailing list archives

From "Ben Slater (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-12744) Randomness of stress distributions is not good
Date Sun, 28 May 2017 08:05:04 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16027722#comment-16027722
] 

Ben Slater edited comment on CASSANDRA-12744 at 5/28/17 8:04 AM:
-----------------------------------------------------------------

I took a look into this, with the following findings:
1) The dtest is broken because it assumes that when you run c*-stress with n=10000 you will
end up with 10,000 rows inserted, when I think the actual functional guarantee is that it will
run 10,000 insert operations.
2) However, with the JDKRandomGenerator this assumption holds up to a few hundred thousand records.
Even with n=1M you end up with 999,999 records in the table. For some reason, changing to the
library default Well19937c generator means not only is the assumption broken at n=10k but it seems
to get proportionally worse as n increases.

So, on those findings, I don't think changing the generator is a good idea.

So, I tried to dig a bit deeper into what was causing the issue. As part of this, I wrote
some code to generate values directly from the distributions in various ways and the results
all seemed as expected (i.e. reasonably aligned with the distribution type).

After a bit more digging, and to cut a long story short, I found that the actual issue is related
to the -pop setting. I'm still a bit hazy on this but it seems -pop is the distribution of
all possible keys. So, if I have a -pop of dist(1..10) I can only have 10 possible key values
(i.e. combinations across all columns) no matter what ranges are specified for the key columns
in the YAML file. The default for -pop is UNIFORM(1..n) where n is specified, or 1..1,000,000
where no n is specified. I think this all produces somewhat counter-intuitive results, particularly
with multi-part keys.
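
To illustrate how -pop would cap the number of distinct keys, here is a toy model in Python.
It is not the cassandra-stress implementation: the function name simulate_distinct_keys and
the way a seed is mapped to column values (a random generator seeded per key) are assumptions
made purely for illustration.

```python
import random

def simulate_distinct_keys(pop_size, n_ops, partition_range, cluster_range, rng_seed=0):
    """Toy model (hypothetical, not cassandra-stress code): each operation
    draws a seed from 1..pop_size, and the (partition_key, cluster_key) pair
    is a deterministic function of that seed."""
    seed_rng = random.Random(rng_seed)
    keys = set()
    for _ in range(n_ops):
        seed = seed_rng.randint(1, pop_size)
        # The same seed always yields the same column values.
        col_rng = random.Random(seed)
        pk = col_rng.randint(1, partition_range)
        ck = col_rng.randint(1, cluster_range)
        keys.add((pk, ck))
    return keys

# A population of 10 seeds, even though the columns allow 1M x 100 combinations.
keys = simulate_distinct_keys(pop_size=10, n_ops=10_000,
                              partition_range=1_000_000, cluster_range=100)
print(len(keys))  # never more than pop_size
```

In this model the column ranges only decide which values the seeds map to; the number of
distinct rows is bounded by the population size.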

So, I think the actual answer here is to change the rules for the default -pop for YAML runs
so that the population size equals the product of the population sizes of each key as specified
in the YAML. For example, if I have two columns:
partition_key UNIFORM(1..1M)
cluster_key UNIFORM(1..100)

then the default population should be 1..100M. I think this is already implied by the YAML
and what people would expect (certainly what I expected). I've done a few tests manually setting
the pop and it seems to do what's expected.
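
The arithmetic behind the proposed default can be sketched as a product over the key columns'
population sizes. The helper name default_population and the representation of each column as
an inclusive (low, high) range are hypothetical; this is a sketch of the calculation, not
cassandra-stress code.

```python
from math import prod

def default_population(key_ranges):
    """Return the proposed default population size: the product of the number
    of values in each key column's inclusive (low, high) range.
    Hypothetical helper, for illustration only."""
    return prod(high - low + 1 for low, high in key_ranges)

# partition_key UNIFORM(1..1M), cluster_key UNIFORM(1..100)
size = default_population([(1, 1_000_000), (1, 100)])
print(size)  # 100000000, i.e. a default population of 1..100M
```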

I don't think this change will be too hard to make, but I'm interested to hear if anyone has an
opinion before I jump into it.


> Randomness of stress distributions is not good
> ----------------------------------------------
>
>                 Key: CASSANDRA-12744
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12744
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: T Jake Luciani
>            Assignee: Ben Slater
>            Priority: Minor
>              Labels: stress
>             Fix For: 4.0
>
>
> The randomness of our distributions is pretty bad.  We are using the JDKRandomGenerator()
> but in testing of uniform(1..3) we see for 100 iterations it's only outputting 3.  If you
> bump it to 10k it hits all 3 values.
> I made a change to just use the default commons math random generator and now see all
> 3 values for n=10



