cassandra-commits mailing list archives

From "Daniel Cranford (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-13932) Stress write order and seed order should be different
Date Tue, 03 Oct 2017 20:27:00 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-13932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Cranford updated CASSANDRA-13932:
----------------------------------------
    Summary: Stress write order and seed order should be different  (was: Write order and seed order should be different)

> Stress write order and seed order should be different
> -----------------------------------------------------
>
>                 Key: CASSANDRA-13932
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13932
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Daniel Cranford
>              Labels: stress
>         Attachments: 0001-Initial-implementation-cassandra-3.11.patch, vmtouch-after.txt, vmtouch-before.txt
>
>
> Read tests get an unrealistic boost in performance because they read data from a set
of partitions that was written sequentially.
> I ran into this while running a timed read test against a large data set (250 million
partition keys): {noformat}cassandra-stress read duration=30m{noformat}
While the test was running, I noticed one node was performing zero IO after an initial period.
> I discovered each node in the cluster had blocks from only a single SSTable loaded in
the FS cache: {noformat}vmtouch -v /path/to/sstables{noformat}
> For the node that was performing zero IO, the SSTable in question was small enough to
fit into the FS cache.
> I realized that when a read test is run for a duration or until rate convergence, the
default population for the seeds is a GAUSSIAN distribution over the first million seeds.
Because of the way compaction works, partitions that are written sequentially will (with high
probability) live in the same SSTable. That means that while the first million seeds
will generate partition keys that are randomly distributed in the token space, they will
most likely all live in the same SSTable. When this SSTable is small enough to fit into the
FS cache, you get unbelievably good results for a read test. With size-tiered compaction
(the default), SSTable sizes form roughly doubling tiers, so a dataset 4x the size of the
FS cache will have almost 1/2 its data in SSTables that are individually small enough to fit
into the FS cache (see the sketch below).
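> A back-of-the-envelope sketch of that claim, assuming a doubling size-tiered layout (the
tier sizes below are illustrative assumptions, not measurements from this cluster):
> {noformat}
> public class CacheableFraction {
>     public static void main(String[] args) {
>         // Hypothetical SSTable sizes, in units of one FS cache, forming
>         // roughly doubling size-tiered generations; total = 4 caches.
>         double[] sstables = {2.0, 1.0, 0.5, 0.25, 0.25};
>         double cache = 1.0, total = 0, cacheable = 0;
>         for (double s : sstables) {
>             total += s;
>             if (s <= cache) cacheable += s; // fits entirely in the FS cache
>         }
>         // Prints 0.50: half the data sits in individually cacheable SSTables.
>         System.out.printf("cacheable fraction = %.2f%n", cacheable / total);
>     }
> }
> {noformat}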
> Adjusting the population of seeds used during the read test to be the entire 250 million
seeds used to load the cluster does not fix the problem: {noformat}cassandra-stress read duration=30m -pop dist=gaussian(1..250M){noformat}
> or (same population, larger sample) {noformat}cassandra-stress read n=250M{noformat}
> Any non-uniform distribution concentrates its mass around one or more modes, and those
modes cluster reads around a narrow seed range, which corresponds to a set of sequential
writes and hence (with high probability) to a single SSTable.
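> A quick simulation makes the clustering concrete (the mean/stdev parameterisation below is
an assumption for illustration, not necessarily the exact gaussian cassandra-stress uses):
> {noformat}
> import java.util.Random;
>
> public class GaussianSeedClustering {
>     public static void main(String[] args) {
>         // Sample 1M reads from a gaussian over seeds 1..250M, mean at the
>         // midpoint, stdev = range/6 (assumed parameters, for illustration).
>         long min = 1, max = 250_000_000L;
>         double mean = (min + max) / 2.0, stdev = (max - min) / 6.0;
>         Random rng = new Random(42);
>         long n = 1_000_000, hits = 0;
>         for (long i = 0; i < n; i++) {
>             double seed = mean + stdev * rng.nextGaussian();
>             if (Math.abs(seed - mean) <= stdev) hits++;
>         }
>         // Prints ~0.68: two thirds of all reads land in the middle third of
>         // the seed range, one contiguous band of sequentially written data.
>         System.out.printf("fraction within 1 stdev: %.2f%n", (double) hits / n);
>     }
> }
> {noformat}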
> My patch against cassandra-3.11 fixes this by shuffling the sequence of generated seeds.
Each seed value will still be generated once and only once. The old behavior of sequential
seed generation (i.e. seed(n+1) = seed(n) + 1) may be selected with the no-shuffle flag,
e.g. {noformat}cassandra-stress read duration=30m -pop no-shuffle{noformat}
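> For flavour, one stateless way to get such a once-and-only-once shuffle is a multiplicative
bijection over the seed range (a sketch of the general technique; the attached patch may
permute seeds differently):
> {noformat}
> public class SeedShuffle {
>     public static void main(String[] args) {
>         // Map sequential indices 0..range-1 onto a permuted seed order.
>         // range = 250M = 2^7 * 5^9, and 154435761 is odd and not divisible
>         // by 5, so the multiplier is coprime to range and the map is a
>         // bijection: every seed still comes up exactly once, but adjacent
>         // indices land far apart in write order.
>         long range = 250_000_000L;
>         long multiplier = 154_435_761L;
>         for (long i = 0; i < 5; i++) {
>             long seed = (i * multiplier) % range; // product < 2^63, no overflow
>             System.out.println("index " + i + " -> seed " + seed);
>         }
>     }
> }
> {noformat}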
> Results: in [^vmtouch-before.txt] only pages from a single SSTable are present in the
FS cache, while in [^vmtouch-after.txt] roughly equal proportions of all SSTables are
present in the FS cache.


