cassandra-commits mailing list archives

From "Antti Nissinen (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-10280) Make DTCS work well with old data
Date Wed, 23 Sep 2015 12:53:04 GMT


Antti Nissinen commented on CASSANDRA-10280:

I am also voting for discarding max_sstable_age_days and limiting the compaction window
size in DTCS. If DTCS is going to get major modifications, then adopting some of the ideas
from TWCS would be beneficial, as would taking into account the practical viewpoints
presented in several Jira items:

- limiting the window size in DTCS (this item, CASSANDRA-10280)

- using STCS in the newest window, or whenever the number of files exceeds max_threshold (CASSANDRA-10276, CASSANDRA-9666)

- while compacting a large number of files, starting from the small ones and progressing towards
the larger ones (especially in the case of small sstables originating from repair operations) (CASSANDRA-9597)

- setting a limit on the number of files compacted in one shot based on the sum of the file sizes,
so that we do not try to compact several large files at once and run out of disk space during the
operation (CASSANDRA-10195)

- a round-robin approach for selecting the compaction window inside which the next compaction
will be executed, the target being to get rid of small files as soon as possible. At the moment TWCS
and DTCS work on the newest window and only progress towards history when finished with the current
one (CASSANDRA-10195). A sketch combining these ideas follows this list.
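To make the list above concrete, here is a minimal, hypothetical sketch (Python pseudocode, not
actual Cassandra code) of how capped fixed-size windows, smallest-first selection, the max_threshold
file-count cap, a total-size cap, and round-robin window selection could fit together. All constants
and names here are assumptions for illustration only:

{code:python}
from collections import defaultdict

# All constants are assumptions for illustration, not Cassandra defaults.
WINDOW_SECONDS = 24 * 3600        # capped window size (1 day)
MAX_THRESHOLD = 32                # max number of files per compaction
MAX_COMPACTION_BYTES = 100 << 30  # stop adding files beyond ~100 GiB

def window_start(max_timestamp_s):
    """Map an sstable's max timestamp to its fixed-size time window."""
    return max_timestamp_s - (max_timestamp_s % WINDOW_SECONDS)

def bucket(sstables):
    """sstables: iterable of (name, max_timestamp_s, size_bytes)."""
    buckets = defaultdict(list)
    for sst in sstables:
        buckets[window_start(sst[1])].append(sst)
    return buckets

class RoundRobinPicker:
    """Rotate over all windows that still have work, instead of finishing
    the newest window before ever touching the older ones."""

    def __init__(self):
        self.last_window = None

    def next_compaction(self, buckets):
        pending = sorted(w for w, ssts in buckets.items() if len(ssts) > 1)
        if not pending:
            return None
        if self.last_window in pending:
            i = (pending.index(self.last_window) + 1) % len(pending)
        else:
            i = 0
        self.last_window = pending[i]
        # Start from the small files and respect both the file-count
        # and the total-size limits.
        chosen, total = [], 0
        for sst in sorted(buckets[self.last_window], key=lambda s: s[2]):
            if len(chosen) == MAX_THRESHOLD or total + sst[2] > MAX_COMPACTION_BYTES:
                break
            chosen.append(sst)
            total += sst[2]
        return chosen
{code}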

Should we actually create a Jira item where we would collect the ideas for the "ultimate time
series compaction strategy" for more detailed discussion? At the moment these ideas are scattered
around different items, and the above list is probably missing many relevant points.

Another important goal (our wish) for a time series database is to be able to wipe off data
effectively, so that disk space is released as soon as possible. I tried to describe
those ideas in CASSANDRA-10306, but
there are no comments on that item yet. The main idea was to have the possibility to split SSTables
along a certain timeline on all nodes, so that SSTables could be dropped (like with TTL in
DTCS and TWCS) or archived on different media from which they can be dug up some day if really
needed. Deleting data efficiently on demand is presently one of the biggest obstacles to
using C* in closed environments with fairly limited hardware resources for time series data
collection. TTL is a working solution when you can predict data collection demands well beforehand
and have additional resources available if the predictions don't match reality.

What are the biggest obstacles in the present architecture for the scenario below (sketched in code after the list)?
- Decide a timestamp for the data deletion / archiving.
- All existing SSTables on each node would be split into two files along that timeline if the
SSTable covers data on both sides of it.
- SSTables falling behind the timeline would be deactivated from the SSTable set (no longer
participating in compactions or serving data to queries).
- You can then decide whether to copy the files somewhere else or simply delete them.
- This tool could be used through nodetool with an external script.
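As a rough sketch of the classification step in that scenario (again Python pseudocode; the
(min_ts, max_ts) metadata tuple and the cutoff parameter are assumptions, not an existing
Cassandra API):

{code:python}
def plan_wipe(sstables, cutoff_s):
    """Classify sstables against a deletion/archiving timeline.

    sstables: iterable of (name, min_timestamp_s, max_timestamp_s).
    Returns (behind, ahead, straddling): files entirely behind the
    timeline can be deactivated and then archived or deleted, files
    entirely ahead of it stay active, and straddling files must first
    be split into two sstables along the cutoff.
    """
    behind, ahead, straddling = [], [], []
    for name, min_ts, max_ts in sstables:
        if max_ts < cutoff_s:
            behind.append(name)
        elif min_ts >= cutoff_s:
            ahead.append(name)
        else:
            straddling.append(name)
    return behind, ahead, straddling
{code}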

> Make DTCS work well with old data
> ---------------------------------
>                 Key: CASSANDRA-10280
>                 URL:
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Marcus Eriksson
>            Assignee: Marcus Eriksson
>             Fix For: 3.x, 2.1.x, 2.2.x
> Operational tasks become incredibly expensive if you keep around a long timespan of data
with DTCS - with default settings and 1 year of data, the oldest window covers about 180 days.
Bootstrapping a node with vnodes with this data layout will force Cassandra to compact a very
large number of sstables in this window.
> We should probably put a cap on how big the biggest windows can get. We could probably
default this to something sane based on max_sstable_age (ie, say we can reasonably handle
1000 sstables per node, then we can calculate how big the windows should be to allow that)
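
For reference, the 180-day figure in the description can be reproduced with a back-of-the-envelope
calculation, assuming the DTCS defaults of base_time_seconds=3600 and min_threshold=4 (window sizes
grow by a factor of min_threshold per tier; the per-tier window counts are ignored here):

{code:python}
def largest_window_hours(total_span_hours, base_hours=1, factor=4):
    """Largest DTCS-style window that still fits inside the data span."""
    window = base_hours
    while window * factor <= total_span_hours:
        window *= factor
    return window

# One year of data with the assumed defaults:
print(largest_window_hours(365 * 24))  # -> 4096 hours, i.e. ~170 days,
                                       # matching the "about 180 days" above.
{code}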
