cassandra-commits mailing list archives

From "Ben Slater (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-12490) Add sequence distribution type to cassandra stress
Date Tue, 11 Oct 2016 01:52:20 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15564148#comment-15564148 ]

Ben Slater commented on CASSANDRA-12490:
----------------------------------------

Yes, you're right: resetting the counter to zero on setSeed() does result in the same row being
generated over and over again (which makes me wonder how stress respects the distribution
for the PK value, but I didn't investigate that at this point). However, that is easily fixed
by having setSeed() set the counter to the supplied seed value. I think that once we do this, SEQ
behaves very similarly to the other distributions.
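
To illustrate the fix described above, here is a minimal sketch and not the code from the attached
patch. The class name SeqSketch, its fields, and the wrap-around rule are assumptions made for
illustration; only the idea that setSeed() sets the counter to the supplied seed comes from the
comment itself.

```java
// Hypothetical sketch of a sequence "distribution"; not the class in 12490-trunk.patch.
final class SeqSketch {
    private final long min;
    private final long size;   // number of values in the inclusive range [min, max]
    private long counter;

    SeqSketch(long min, long max) {
        if (max < min) {
            throw new IllegalArgumentException("max must be >= min");
        }
        this.min = min;
        this.size = max - min + 1;
        this.counter = 0;
    }

    // Start the sequence at the supplied seed instead of resetting it to zero,
    // so different seeds start at different points in the range.
    void setSeed(long seed) {
        this.counter = seed;
    }

    // Return the next value in the range, wrapping when the end is reached.
    long next() {
        long value = min + Math.floorMod(counter, size);
        counter++;
        return value;
    }
}
```

With a reset-to-zero setSeed(), every seed would restart at min and produce the same first value;
with the version sketched here, the starting point depends on the seed.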

I don't think it's correct that stress generates every value if the number of unique values
it can generate is <= the number of values it is being asked to generate for a partition.
That would only respect the distribution in the case of a uniform distribution; however, even
then I don't think n samples from a 1..n distribution are guaranteed to be completely uniform
(and thus to generate all values). You probably need many * n samples to get very close
to uniform, and it certainly doesn't seem to behave this way in testing. For, say, a normal
distribution you'd need several * n samples to cover all the possible values and come close to
a normal distribution.
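
As a rough check of the coverage argument above, the sketch below computes the expected number of
distinct values when drawing uniformly with replacement from 1..n, using the standard occupancy
formula n * (1 - (1 - 1/n)^samples), and compares it with a small simulation. The class and method
names are hypothetical and this is independent of how cassandra-stress samples values. For n = 100
and 100 samples the formula gives about 63 distinct values, so roughly a third of the range is
missed.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical helper for illustration; not part of cassandra-stress.
final class CoverageSketch {

    // Expected number of distinct values seen in `samples` uniform draws from 1..n.
    static double expectedDistinct(int n, int samples) {
        return n * (1.0 - Math.pow(1.0 - 1.0 / n, samples));
    }

    // Count the distinct values actually seen in one simulated run.
    static int simulateDistinct(int n, int samples, long seed) {
        Random random = new Random(seed);
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < samples; i++) {
            seen.add(1 + random.nextInt(n)); // uniform over 1..n
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int n = 100;
        for (int multiple : new int[] {1, 2, 5, 10}) {
            int samples = multiple * n;
            System.out.printf("samples=%d expected=%.1f simulated=%d%n",
                    samples, expectedDistinct(n, samples), simulateDistinct(n, samples, 42L));
        }
    }
}
```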

I'm afraid I don't really understand why you think this is abusing the notion of distributions
when (a) there was already a sequence distribution type in the "legacy" distribution sets
(presumably for just this purpose) and (b) to me, one way of describing this is as a uniform
distribution with minimal chance of collisions (i.e. it's just another way of selecting values
from a range).

Finally, it's not quite correct to say I'm trying to populate all possible values for a column;
rather, I'm trying to generate as many unique values as possible (within the specified ranges)
for a given sample size (to minimise overwriting).

> Add sequence distribution type to cassandra stress
> --------------------------------------------------
>
>                 Key: CASSANDRA-12490
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12490
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Ben Slater
>            Assignee: Ben Slater
>            Priority: Minor
>             Fix For: 3.10
>
>         Attachments: 12490-trunk.patch, 12490.yaml, cqlstress-seq-example.yaml
>
>
> When using the write command, cassandra stress sequentially generates seeds. This ensures
> generated values don't overlap (unless the sequence wraps), providing a more predictable number
> of inserted records (and generating a base set of data without wasted writes).
> When using a yaml stress spec there is no sequenced distribution available. I think
> it would be useful to have this for doing an initial load of data for testing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
