cassandra-user mailing list archives

From jerome <jeromefroel...@hotmail.com>
Subject Understanding when Cassandra drops expired time series data
Date Fri, 17 Jun 2016 18:52:23 GMT
Hello! Recently I have been trying to familiarize myself with Cassandra, but I don't quite understand
when data is removed from disk after it has been deleted. The use case I'm particularly interested
in is expiring time series data with DTCS. As an example, I created the following table:

CREATE TABLE metrics (
  metric_id text,
  time timestamp,
  value double,
  PRIMARY KEY (metric_id, time),
) WITH CLUSTERING ORDER BY (time DESC) AND
     default_time_to_live = 86400 AND
     gc_grace_seconds = 3600 AND
     compaction = {
      'class': 'DateTieredCompactionStrategy',
      'timestamp_resolution':'MICROSECONDS',
      'base_time_seconds':'3600',
      'max_sstable_age_days':'365',
      'min_threshold':'4'
     };
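
As a minimal sketch of how this table would be exercised (the metric_id and the values below are
made up for illustration), a row written without an explicit TTL should pick up the table default,
and the ttl() and writetime() functions show its remaining lifetime and write timestamp:

```cql
-- Hypothetical sample row; no USING TTL clause, so default_time_to_live (86400) applies.
INSERT INTO metrics (metric_id, time, value)
VALUES ('cpu.load', '2016-06-17 18:00:00+0000', 0.42);

-- ttl(value) is the number of seconds left before the cell expires;
-- writetime(value) is the write timestamp in microseconds since the epoch.
SELECT time, value, ttl(value), writetime(value)
FROM metrics
WHERE metric_id = 'cpu.load';
```

This only shows when a cell is due to expire; it says nothing about when the expired data is
actually removed from disk, which is the question below.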


I understand that Cassandra will create a tombstone for all rows inserted into this table
24 hours after they are inserted (86400 seconds). These tombstones will first be written to
an in-memory Memtable and then flushed to disk as an SSTable when the Memtable reaches a certain
size. My question is when will the data that is now expired be removed from disk? Is it the
next time the SSTable which contains the data gets compacted? So, with DTCS and min_threshold
set to four, we would wait until at least three other SSTables are in the same time window
as the expired data, and then those SSTables will be compacted into an SSTable without the
expired data. Is it only during this compaction that the data will be removed? It seems to
me that this would require Cassandra to maintain some metadata on which rows have been deleted
since the newer tombstones would likely not be in the older SSTables that are being compacted.
Also, I'm aware that Cassandra can drop entire SSTables if they contain only expired data
but I'm unsure of what qualifies as expired data (is it just SSTables whose maximum timestamp
is past the default TTL for the table?) and when such SSTables are dropped.

Alternatively, do the SSTables which contain the tombstones have to be compacted with the
SSTables which contain the expired data for the data to be removed? It seems to me that this
could result in Cassandra holding the expired data long after it has expired since it's waiting
for the new tombstones to be compacted with the older expired data.

Finally, I was also unsure when the tombstones themselves are removed. I know Cassandra does
not delete them until after gc_grace_seconds, but it can't delete the tombstones until it's
sure the expired data has been deleted, right? Otherwise it would see the expired data as being
valid. Consequently, it seems to me that the question of when tombstones are deleted is intimately
tied to the questions above.

Thanks in advance! If it helps, I've been experimenting with version 2.0.15.
