cassandra-commits mailing list archives

From "Jonathan Shook (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-10742) Real world DateTieredCompaction tests
Date Tue, 24 Nov 2015 13:48:10 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15024539#comment-15024539 ]

Jonathan Shook commented on CASSANDRA-10742:
--------------------------------------------

[~krummas],

Some notes on test setup, and some observations from data models we've seen. We can try to
get some additional details from willing users if this doesn't get us close enough.

The baseline test I use is high-ingest, read-most-recent, with some cold reads mixed in. The
idea is to simulate the typical access patterns of time-series telemetry with roll-up processing,
plus the occasional historic query or reprocessing of old data. I use a 90/10/1 ratio for
write/recent-read/cold-read as a starting point. I usually back the ingest rate off from a
saturating load in order to find a stable steady-state reference point. This is still a much
higher per-node load than you would often see in production, but it provides good contrast
for trade-offs like compaction load. In production you are usually accumulating data over a
longer period of time, so ingest rates that approach a reasonable saturating load are closer
to stress tests than real-world workloads. As such, they are still good tests: if you can run
a node at 10x to 1000x the data rates you would expect in production, then 1) you can complete
the test in a reasonable amount of time and 2) you don't have to worry much about the margin
of error.

The data model I use is essentially ((datasource, timebucket), parametername, timestamp) ->
value, although future testing will likely drop the timebucket component and rely instead
on the time-based layout of sstables as a simplification. (This still needs supporting data
from tests.) parametername is just a variable name associated with a type of measurement,
selected from a fixed set, as is often the case in the wild. The value can vary in type and
size according to the type of data logging; I use a range from 1k to 5k bytes, depending on
the type of test. In the simplest cases, a value is an int or float, but it can also be a
log line from a stack trace.
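As a concrete sketch, the model above might look like the following in CQL (table and column
names here are illustrative, not the actual schema from our tests):

```sql
CREATE TABLE telemetry (
    datasource    text,
    timebucket    text,
    parametername text,       -- measurement name, drawn from a fixed set
    ts            timestamp,
    value         blob,       -- 1k-5k payload; an int/float or a log line depending on the test
    PRIMARY KEY ((datasource, timebucket), parametername, ts)
) WITH compaction = {'class': 'DateTieredCompactionStrategy'};
```

Dropping the timebucket component would leave ((datasource), parametername, ts) and lean on
the time-based sstable layout instead.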

The write/read-most-recent/read-cold model can cover a lot of ground in terms of time-series
workloads. The ratios can be varied, and the number of partitions per node should be varied
in conjunction with the number of parameters. In some cases in the wild, time-series partitions
hold a single series. In other cases, they hold hundreds of related series, clustered by name.
In some cases, the parameters associated with a data source are distributed across partitions
to spread load over the cluster for responsive reads of significant amounts of data. To cover
this, simply move the closing parenthesis of the partition key right by one term in the model
above.
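In CQL terms (using illustrative names for the terms in the model above), moving that
parenthesis means pulling parametername into the partition key:

```sql
-- one partition per (datasource, timebucket, parameter), rather than
-- all parameters of a source sharing a partition
PRIMARY KEY ((datasource, timebucket, parametername), ts)
```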

If you cover some of the permutations above for op ratios, clustering structure, partition
grain, and payload size, you'll cover a lot of the space we see in practice.


> Real world DateTieredCompaction tests
> -------------------------------------
>
>                 Key: CASSANDRA-10742
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10742
>             Project: Cassandra
>          Issue Type: Test
>            Reporter: Marcus Eriksson
>
> So, to be able to actually evaluate DTCS (or TWCS) we need stress profiles that are similar to something that could be found in real production systems.
> We should then run these profiles for _weeks_, and do regular operational tasks on the cluster - like bootstrap, decom, repair etc.
> [~jjirsa] [~jshook] (or anyone): could you describe any write/read patterns you have seen people use with DTCS in production?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
