cassandra-commits mailing list archives

From "Ben Slater (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-12490) Add sequence distribution type to cassandra stress
Date Thu, 13 Oct 2016 10:58:20 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15571589#comment-15571589 ]

Ben Slater edited comment on CASSANDRA-12490 at 10/13/16 10:57 AM:
-------------------------------------------------------------------

OK, so I did some more investigation this evening to try to better understand this and found
a few interesting things. I suspect there is at least one bug here but I'll be interested to
see what you think.

I set up a simple spec to test what was going on:
{code}
table: test4
table_definition: |
  CREATE TABLE test4 (
        pk text,
        val text,
        PRIMARY KEY (pk)
  ) 
columnspec:
  - name: pk
    size: fixed(32) 
    population: uniform(1..50)
{code}

When I run this with `ops(insert=1) n=50` the end result is 1 row added to the table. When
I run it with n=500 I get 3 rows. Some other observations:
a) tracing this through, it seems that's because the small number of seed values from the population
(due to the small n=) results in very low variation in the values returned from `delegate.sample()`
in `DistributionBoundApache.next()`. They were all 37<x<38, so they get rounded to 37 by `bound()`
b) increasing n to 500 increases the number of rows to 3
c) session.execute() gets called n times despite the overlap (so it looks to me like it is
overwriting)
d) using exp() instead of uniform() also produces the same number of rows (but different values)

e) using seq() (new implementation) produces 50 rows with n=50
f) if I change the implementation of setSeed() in DistributionBoundApache to a null operation
(as a quick test, not the right fix) I get 31 rows with n=50 and 50 rows with n=500 which
is the behaviour I would have expected
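
The effect described in (a) and (f) can be sketched outside of cassandra-stress. This is a minimal
illustration, not the stress tool's code: it assumes the delegate's generator behaves like
`java.util.Random`, that seeds are simply 0..n-1, and that `bound()` truncates towards zero. The
class and method names (`ReseedDemo`, `firstSample`, and so on) are hypothetical.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class ReseedDemo {
    // Hypothetical stand-in for "setSeed() then next()": a fresh seed per operation,
    // then one uniform draw over [min, max + 1), truncated and clamped to [min, max].
    static long firstSample(long seed, long min, long max) {
        Random rng = new Random(seed); // same effect as reseeding before every draw
        double x = min + rng.nextDouble() * (max + 1 - min);
        return Math.max(min, Math.min(max, (long) Math.floor(x)));
    }

    // Distinct values produced when every draw is preceded by a reseed.
    // Assumption: seeds are the small consecutive integers 0, 1, ..., n - 1.
    static int distinctWithReseed(int n, long min, long max) {
        Set<Long> seen = new HashSet<>();
        for (long seed = 0; seed < n; seed++) {
            seen.add(firstSample(seed, min, max));
        }
        return seen.size();
    }

    // Distinct values when the generator is seeded once and never reseeded,
    // loosely analogous to making setSeed() a no-op as in (f).
    static int distinctWithoutReseed(int n, long min, long max, long seed) {
        Random rng = new Random(seed);
        Set<Long> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            double x = min + rng.nextDouble() * (max + 1 - min);
            seen.add(Math.max(min, Math.min(max, (long) Math.floor(x))));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("reseed, n=50:     " + distinctWithReseed(50, 1, 50));
        System.out.println("reseed, n=500:    " + distinctWithReseed(500, 1, 50));
        System.out.println("no reseed, n=50:  " + distinctWithoutReseed(50, 1, 50, 42L));
    }
}
```

Under these assumptions the first draw after each reseed barely moves between nearby seeds, so the
reseeded runs should collapse to 1 distinct value for n=50 and 3 for n=500, while the run without
reseeding spreads across the range. Whether the real delegate shares this weakness depends on which
generator it uses, which this sketch does not confirm.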

I know that the small numbers aren't necessarily representative when we're talking about statistical
distributions but it seems the behaviour is far enough from what is expected to be indicative
of an issue (and I suspect this is actually the root of what caused me to create seq() in
the first place).

Feels like this is morphing into a different jira but I guess it makes sense to work out what
that is here before opening something new.

Be very interested to hear what you think.


> Add sequence distribution type to cassandra stress
> --------------------------------------------------
>
>                 Key: CASSANDRA-12490
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12490
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Ben Slater
>            Assignee: Ben Slater
>            Priority: Minor
>             Fix For: 3.10
>
>         Attachments: 12490-trunk.patch, 12490.yaml, cqlstress-seq-example.yaml
>
>
> When using the write command, cassandra stress sequentially generates seeds. This ensures
> generated values don't overlap (unless the sequence wraps), providing a more predictable number
> of inserted records (and generating a base set of data without wasted writes).
> When using a yaml stress spec there is no sequenced distribution available. I think
> it would be useful to have this for doing the initial load of data for testing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
