cassandra-commits mailing list archives

From "Benedict (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-12490) Add sequence distribution type to cassandra stress
Date Thu, 13 Oct 2016 11:10:22 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15571614#comment-15571614
] 

Benedict edited comment on CASSANDRA-12490 at 10/13/16 11:09 AM:
-----------------------------------------------------------------

_partition_ keys are a distinct beast, and if your population distribution for these is tiny
then yes you will get overwrites, and I'm really not sure there's anything we can _reliably_
do about that.  Mostly I've been talking about behaviour _within_ a partition (except when
pointing out some breakages).

The command line "-pop" property specifies the population of unique partition _seeds_.  Each
seed first has to be translated, via the partition key population distribution(s), into
partition key values, which between them uniquely determine the partition's contents (different
seeds hitting the same PK will produce exactly the same partition).  The problem is that the
unique seed set can be gigantic (we let n run into the billions, and it is often necessary to
run with datasets this large), so enumerating all of these unique seeds and determining their
value in the partition key column population distributions would be prohibitively expensive.
So we just accept that users should sensibly ensure their partition key population distribution
is large enough to accommodate enough random samples to fulfil their seed population.
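
To make the overwrite risk concrete, here is a toy Python model. It is not the stress tool's
actual code: the function name, the use of `random.Random` and the uniform draw are all
assumptions made for illustration. Each seed deterministically picks one key from a key
population of a given size, and the function counts how many distinct keys come out.

```python
import random


def distinct_partition_keys(seed_count: int, key_population: int) -> int:
    """Toy model: count distinct keys produced by seeds 0..seed_count-1.

    Each seed drives its own generator, so the same seed always maps to the
    same key, but two different seeds may still land on the same key.
    """
    keys = set()
    for seed in range(seed_count):
        # Hypothetical stand-in for sampling the partition key distribution.
        keys.add(random.Random(seed).randrange(key_population))
    return len(keys)


if __name__ == "__main__":
    # Tiny key population: most seeds collide, so most writes are overwrites.
    print(distinct_partition_keys(1000, 10))
    # Key population far larger than the seed population: collisions are rare.
    print(distinct_partition_keys(1000, 10**12))
```

With a key population of 10, a thousand seeds can never yield more than 10 partitions, which
is the overwrite case described above.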

Now, for small populations we *could* mode-switch.  But I'm not sure it ever makes sense to
so materially constrain your partition key population distribution.  It might even make sense
for stress to forbid constraining this distribution too much, as doing so has essentially no
impact on the behaviour profile of the cluster.

If you want to visit a single partition many times, there are better ways to do that, i.e.
specifying that the seed population is small but that you want to run many operations.  This
will give you an identically constrained population, without any risk of weirdness, as well
as permitting the same yaml to be used for different scales of test.  Ideally you want each
visit, in such a scenario, to use the more advanced features of stress anyway (such as partial
visitation of the whole generated (presumably huge) partition, or incremental visitation).
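
As a rough sketch of that alternative, the toy Python model below runs many operations over
a small seed population, with each visit touching only a sample of the partition's rows. It
does not reflect how stress implements partial visitation, and every name and parameter in it
is hypothetical.

```python
import random


def visit_partitions(seed_population, operations, rows_per_partition,
                     rows_per_visit, rng_seed=0):
    """Toy model: run many operations over a small seed population.

    Returns a dict mapping each visited seed to the set of row indices
    touched so far. All parameter names are hypothetical.
    """
    rng = random.Random(rng_seed)
    touched = {}
    for _ in range(operations):
        seed = rng.randint(1, seed_population)  # which partition to visit
        # Partial visitation: touch only a sample of the partition's rows.
        rows = rng.sample(range(rows_per_partition), rows_per_visit)
        touched.setdefault(seed, set()).update(rows)
    return touched


if __name__ == "__main__":
    result = visit_partitions(seed_population=1, operations=10_000,
                              rows_per_partition=1_000, rows_per_visit=10)
    for seed, rows in result.items():
        print(seed, len(rows))
```

The number of partitions is bounded by the seed population, however many operations are run,
so the same settings scale by changing only the operation count.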


was (Author: benedict):
_partition_ keys are a distinct beast, and if your population distribution for these is tiny
then yes you will get overwrites, and I'm really not sure there's anything we can _reliably_
do about that.  Mostly I've been talking about behaviour _within_ a partition (except when
pointing out some breakages).

The command line "-pop" property specifies the population of unique partition _seeds_.  These
have to be translated into the partition key population distribution(s) first, which then
between them identify the partition's contents.  The problem is that the size of the unique
seed set could be gigantic (we let n be billions in size, and it is often necessary to run
with datasets this large), so enumerating all of these unique seeds and determining their
value in the partition key column population distributions would be prohibitively expensive.
So we just accept that users should sensibly ensure their partition key population distribution
is large enough to accommodate enough random samples to fulfil their seed population.

Now, for small populations we *could* mode-switch.  But I'm not sure it ever makes sense to
so materially constrain your partition key population distribution.  It might even make sense
for stress to forbid constraining this distribution too much, as doing so has essentially no
impact on the behaviour profile of the cluster.

If you want to visit a single partition many times, there are better ways to do that, i.e.
specifying that the seed population is small but that you want to run many operations.  This
will give you an identically constrained population, without any risk of weirdness, as well
as permitting the same yaml to be used for different scales of test.  Ideally you want each
visit, in such a scenario, to use the more advanced features of stress anyway (such as partial
visitation of the whole generated (presumably huge) partition, or incremental visitation).

> Add sequence distribution type to cassandra stress
> --------------------------------------------------
>
>                 Key: CASSANDRA-12490
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12490
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Ben Slater
>            Assignee: Ben Slater
>            Priority: Minor
>             Fix For: 3.10
>
>         Attachments: 12490-trunk.patch, 12490.yaml, cqlstress-seq-example.yaml
>
>
> When using the write command, cassandra stress sequentially generates seeds. This ensures
> generated values don't overlap (unless the sequence wraps), providing a more predictable number
> of inserted records (and generating a base set of data without wasted writes).
> When using a yaml stress spec there is no sequenced distribution available. I think
> it would be useful to have this for doing initial load of data for testing.
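
The sequential behaviour described in the issue can be sketched as a small Python generator.
This is only an illustration of "count up, wrap at the end", not the code in the attached
patch, and the name `sequence_seeds` is hypothetical.

```python
from itertools import islice


def sequence_seeds(low: int, high: int):
    """Yield low..high in order, forever, wrapping back to low after high."""
    span = high - low + 1
    if span <= 0:
        raise ValueError("high must be >= low")
    i = 0
    while True:
        yield low + (i % span)
        i += 1


if __name__ == "__main__":
    # Seven draws from the range 1..5: values repeat only after the wrap.
    print(list(islice(sequence_seeds(1, 5), 7)))  # [1, 2, 3, 4, 5, 1, 2]
```

Until the sequence wraps, every draw is a new value, so the number of distinct records written
equals the number of operations.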



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
