cassandra-commits mailing list archives

From "Marcus Eriksson (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-9644) DTCS configuration proposals for handling consequences of repairs
Date Wed, 24 Jun 2015 09:20:05 GMT


Marcus Eriksson commented on CASSANDRA-9644:

Thanks for the report, and yes, there are a bunch of things we should do. Digesting your post,
I think these are the most important ones:

bq. Maximum length of the time window inside which the files are assigned to a certain bucket (not currently defined). This restricts the expansion of the time window length: when the limit is reached, the window size stays the same all the way back in the history (e.g. one week).
yes, we should make it possible to put a cap on how big the windows can get
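For illustration, a minimal sketch of such a cap; `max_window_seconds` is a hypothetical knob, not an existing DTCS option, and `base`/`multiplier` only mirror the spirit of base_time_seconds / min_threshold:

```python
def window_size_seconds(base, multiplier, tier, max_window_seconds):
    """Grow the DTCS time window geometrically per tier, but never past a cap.

    base/multiplier mirror the existing base_time_seconds / min_threshold
    behaviour; max_window_seconds is a hypothetical new parameter.
    """
    size = base * multiplier ** tier
    return min(size, max_window_seconds)

# base 1 hour, multiplier 4, cap at one week: windows stop growing at the cap
assert window_size_seconds(3600, 4, 2, 604_800) == 57_600
assert window_size_seconds(3600, 4, 10, 604_800) == 604_800
```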

bq. Inside the bucket, when the number of files is limited by max_threshold, the compaction would select the small files first instead of one huge file and a bunch of small files.
yes, we should do STCS within the windows when we have more than max_threshold files
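A sketch of that selection, assuming sstables are simple (name, size) pairs; picking the smallest files first keeps the one huge sstable out of every compaction of the small repair-produced files:

```python
def pick_for_compaction(sstables, max_threshold=32):
    """Within one DTCS window, choose up to max_threshold sstables,
    smallest first, so a single huge file is not dragged into every
    compaction of the small repair-produced files."""
    by_size = sorted(sstables, key=lambda s: s[1])
    return by_size[:max_threshold]

# one 46 GB file plus 40 tiny repair sstables in the same window
window = [("big", 46_000)] + [("tiny%d" % i, 5) for i in range(40)]
chosen = pick_for_compaction(window)
assert len(chosen) == 32
assert ("big", 46_000) not in chosen   # the huge file is left alone
```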

bq. Optional strategies to select the most interesting bucket
we should probably prioritize compaction in the windows that actually serve reads, something
similar to STCS.mostInterestingBucket
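Roughly, such a prioritization could score each window and compact the highest-scoring one first; this is only a hedged sketch, and the read-rate input is purely illustrative:

```python
def most_interesting_window(windows):
    """Pick the window to compact first. Each window is a
    (file_count, reads_per_sec) pair; favour windows that both have
    many small sstables to merge and actually serve reads."""
    def score(window):
        file_count, reads_per_sec = window
        return file_count * (1.0 + reads_per_sec)
    return max(windows, key=score)

windows = [(40, 0.0),    # old window: many repair sstables, never read
           (10, 50.0)]   # recent window: fewer files, hot for reads
assert most_interesting_window(windows) == (10, 50.0)
```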

I should also note that using incremental repair makes sure we don't stream a bunch of data into the old windows.

> DTCS configuration proposals for handling consequences of repairs
> -----------------------------------------------------------------
>                 Key: CASSANDRA-9644
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Antti Nissinen
>             Fix For: 3.x, 2.1.x
>         Attachments: node0_20150621_1646_time_graph.txt, node0_20150621_2320_time_graph.txt,
node0_20150623_1526_time_graph.txt, node1_20150621_1646_time_graph.txt, node1_20150621_2320_time_graph.txt,
node1_20150623_1526_time_graph.txt, node2_20150621_1646_time_graph.txt, node2_20150621_2320_time_graph.txt,
node2_20150623_1526_time_graph.txt, nodetool status infos.txt, sstable_compaction_trace.txt,
sstable_compaction_trace_snipped.txt, sstable_counts.jpg
> This document brings up some issues seen when DTCS is used to compact time series data in a three node cluster. DTCS is currently configured with a few parameters that keep the configuration fairly simple, but they might cause problems in certain special cases, like recovering from the flood of small SSTables due to a repair operation. We suggest some ideas that might be a starting point for further discussion. The following sections contain:
> - Description of the cassandra setup
> - Feeding process of the data
> - Failure testing
> - Issues caused by the repair operations for the DTCS
> - Proposal for the DTCS configuration parameters
> Attachments are included to support the discussion, and a separate section explains them.
> Cassandra setup and data model
> - The cluster is composed of three nodes running Cassandra 2.1.2. The replication factor is two, and both read and write consistency levels are ONE.
> - The data is time series data, saved so that one row contains a certain time span of data for a given metric (20 days in this case). The row key contains the start time of the time span and the metric name. The column name gives the offset from the beginning of the time span. The column timestamp is set by adding together the timestamp from the row key and the offset (the actual timestamp of the data point). The data model is analogous to the KairosDB implementation.
> - Average sampling rate is 10 seconds varying significantly from metric to metric.
> - 100 000 metrics are fed to Cassandra.
> - max_sstable_age_days is set to 5 days (the objective is to keep SSTable files at a manageable size, around 50 GB).
> - TTL is not in use in the test.
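The row/column layout described above can be sketched like this; the names and the key-encoding details are illustrative, not the actual schema:

```python
ROW_SPAN_SECONDS = 20 * 24 * 3600  # one row holds 20 days of samples

def row_key(metric, ts):
    """Row key = metric name plus the start of the 20-day span containing ts."""
    span_start = ts - ts % ROW_SPAN_SECONDS
    return (metric, span_start)

def column_name(ts):
    """Column name = offset of the sample from the start of its row's span."""
    return ts % ROW_SPAN_SECONDS

# The actual sample timestamp is recovered as span_start + offset,
# which is also what the (faked) column timestamp is set to.
metric, ts = "pump.pressure", 1_700_000_123
key = row_key(metric, ts)
assert key[1] + column_name(ts) == ts
```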
> Procedure for the failure test.
> - Data is first dumped to Cassandra for 11 days, and the dumping is then stopped so that DTCS has a chance to finish all compactions. Data is dumped with "fake timestamps", i.e. the column timestamp is set explicitly when data is written to Cassandra.
> - One of the nodes is taken down and new data covering a couple of hours is dumped on top of the earlier data (faked timestamps).
> - Dumping is stopped and the node is kept down for a few hours.
> - The node is brought back up and "nodetool repair" is run on the node that was down.
> Consequences
> - The repair operation leads to a massive number of new SSTables far back in the history. The new SSTables cover time spans similar to those of the files that DTCS created before the shutdown of the node.
> - To be able to compact the small files, max_sstable_age_days would have to be increased so that compaction can handle them. However, in a practical case the time window then grows so large that the generated files become huge, which is not desirable. The compaction also combines one very large file with a bunch of small files in several phases, which is inefficient. Generating really large files may also lead to out-of-disk-space problems.
> - See the list of time graphs later in the document.
> Improvement proposals for the DTCS configuration
> Below is a list of desired properties for the configuration. Current parameters are mentioned where available.
> - Initial window size (currently: base_time_seconds)
> - The number of similar-size windows for the bucketing (currently: min_threshold)
> - The multiplier for the window size when it is increased (currently: min_threshold). We would like this to be independent of the min_threshold parameter, so that you could actually control how fast the window size increases.
> - Maximum length of the time window inside which the files are assigned to a certain bucket (not currently defined). This restricts the expansion of the time window length: when the limit is reached, the window size stays the same all the way back in the history (e.g. one week).
> - The maximum horizon in which SStables are candidates for buckets (currently: max_sstable_age_days)
> - Maximum file size of an SSTable allowed in a set of files to be compacted (not currently possible). This prevents out-of-disk-space situations.
> - Optional strategies to select the most interesting bucket:
>     - Minimum number of SSTables in the time window before it is a candidate for the most interesting bucket (currently: min_threshold for the most recent window, otherwise two). Being able to set this value independently would allow putting most of the effort on those areas where a large number of small files should be compacted together, instead of on a few new files.
>     - Optionally, the criterion for the most interesting bucket could be configurable: e.g. select the window with the most files to be compacted.
>     - Inside the bucket, when the number of files is limited by max_threshold, the compaction would select the small files first instead of one huge file and a bunch of small files.
> The above set of parameters makes it possible to recover from repair operations that produce a large number of small SSTables.
> - A maximum length for the compaction time window would keep the compacted SSTable size in a reasonable range and would allow extending the horizon far back into the history.
> - Combining small files together, instead of combining one huge file with e.g. 31 small files again and again, is more disk efficient.
> In addition to the previous advantages the above parameters would also allow:
> - Dumping more data into the history (e.g. new metrics) by assigning the correct timestamp to the column (fake timestamp), with proper compaction of the new and existing SSTables.
> - Expiring reasonable-size SSTables with TTL even if compactions are executed only intermittently far back in the history. In this case the new data has to be fed with a TTL calculated accordingly.
> - Note: Being able to give an absolute timestamp for the column expiry time would be beneficial when data is dumped back in the history. This is the case when you move data from some legacy system to Cassandra with faked timestamps and would like to keep the data only for a certain time period. Currently the absolute expiry timestamp is calculated by Cassandra from the system time and the given TTL, so the TTL has to be calculated dynamically based on the current time and the desired expiry moment, which makes things more complex.
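The dynamic TTL computation the note refers to looks roughly like this on the client side; an absolute expiry timestamp per column would remove this arithmetic entirely (the function name is illustrative):

```python
import time

def ttl_for_expiry(expiry_epoch_seconds, now=None):
    """Cassandra only accepts a relative TTL, so when back-loading data
    with faked timestamps the client must derive the TTL from the desired
    absolute expiry moment and the current wall clock."""
    now = time.time() if now is None else now
    ttl = int(expiry_epoch_seconds - now)
    if ttl <= 0:
        raise ValueError("expiry is already in the past; skip the write")
    return ttl

# data should expire at now + 7 days, regardless of its faked timestamp
now = 1_700_000_000
assert ttl_for_expiry(now + 7 * 86400, now=now) == 604_800
```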
> One interesting question is why those duplicate SSTable files are created. The duplication problem could not be reproduced when the data was dumped with the following spec:
> - 100 metrics
> - 20 days of data in one row
> - one year of data
> - max_sstable_age_days = 15
>  - memtable_offheap_space_in_mb was decreased so that small SSTables were created (to have something to compact)
>  - One node was taken down and one more day of data was dumped on top of the earlier data
> - "nodetool repair -pr" was executed on each node => duplicates were checked in each
step => no duplicates
> - "nodetool repair" was executed on a node that was down => no duplicates were generated
> --------------------------------------------------------------------------------------------------------------------
> Time graphs of content of SSTables from different phases of the test run:
> *************************************************************
> Fields in the time graphs below are the following:
> - Order number from the SSTable file name
> - Minimum column timestamp in the SSTable file
> - Timespan represented graphically
> - Maximum column timestamp in the SSTable
> - The size of the SSTable in megabytes
> Time graphs after dumping the 11 days of data and letting all compactions run through
> node0_20150621_1646_time_graph.txt
> node1_20150621_1646_time_graph.txt
> node2_20150621_1646_time_graph.txt (error: same file as for node1, but the behavior is the same)
> Time graphs after taking one node down (node2) and dumping a couple of hours of more data
> node0_20150621_2320_time_graph.txt
> node1_20150621_2320_time_graph.txt
> node2_20150621_2320_time_graph.txt 
> Time graphs when the repair operation has finished and compactions are done. Compactions will naturally handle only the files inside the max_sstable_age_days range.
> ==> Now there is a large number of small files covering pretty much the same areas as the original SSTables
> node0_20150623_1526_time_graph.txt
> node1_20150623_1526_time_graph.txt
> node2_20150623_1526_time_graph.txt
> -----------
> Trend of the SSTable count as a function of time on each node.
> sstable_counts.jpg
> Vertical lines:
> 1) Clearing the database and dumping the 11 days worth of data
> 2) Stopping the dumping and letting compactions run
> 3) Taking one node down (the bottom one in the figure) and dumping a few hours of new data on top of the earlier data
> 4) Starting the repair operation
> 5) Repair operation finished
> --------
> Nodetool status prints before and after repair operation
> nodetool status infos.txt
> --------------
> Tracing compactions
> Log files were parsed to demonstrate the creation of new small SSTables and the combination of one large file with a bunch of small ones. This is done for the time range that max_sstable_age_days is able to reach (in this case 5 days). The hierarchy of the files is shown in the file "sstable_compaction_trace.txt". The first flushed file can be found on line 10.
> Each line represents either a flushed SSTable or an SSTable created by DTCS. For flushed files the timestamps indicate the time period the file represents. For compacted files (marked with C) the first timestamp represents the moment when the compaction was done (wall clock). The timestamps are faked when written to the database. The size of the file is the last field. The first field, a number in parentheses, shows the level of the file. Top-level files, marked with (0), are those that don't have any predecessors and should also be found on disk.
> SSTables that are created by the repair operation are not mentioned in the log files, so they are handled as phantom files. The existence of such a file can be concluded from the predecessor list of a compacted SSTable. Those are marked with None,None in the timestamps.
> The file "sstable_compaction_trace_snipped.txt" contains one portion that shows the compaction hierarchy for the small files originating from the repair operation. max_threshold is at its default value of 32. In each step, 31 tiny files are compacted together with a 46 GB file.
> sstable_compaction_trace.txt
> sstable_compaction_trace_snipped.txt

This message was sent by Atlassian JIRA
