cassandra-commits mailing list archives

From "Constance Eustace (JIRA)" <>
Subject [jira] [Updated] (CASSANDRA-14279) Row Tombstones in separate sstables / separate compaction path
Date Tue, 27 Feb 2018 19:45:01 GMT


Constance Eustace updated CASSANDRA-14279:
    Component/s:     (was: Lifecycle)

> Row Tombstones in separate sstables / separate compaction path
> --------------------------------------------------------------
>                 Key: CASSANDRA-14279
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Compaction, Local Write-Read Paths, Repair
>            Reporter: Constance Eustace
>            Priority: Major
> In my experience, if data is not well organized into time-windowed sstables, Cassandra
has enormous difficulty actually deleting data that has a "medium term" lifetime.
For example, you might have an active working set and be archiving "unused" data to other
tables or clusters. Or you may be purging data. Or you may be migrating/sharding data. Whatever
the case, you want that disk space back.
> In STCS and LCS, row tombstones are intermingled with column data and column tombstones.
But a row tombstone represents a big event: it marks large amounts of data in an sstable as
"droppable", and can even let a read skip data in other sstables.
> I am wondering whether isolating row tombstones in their own sstables, separately
compacted and merged, might enable compaction to work more efficiently:
> reads can prioritize bloom filter lookups that indicate a row tombstone, getting the
timestamp of the deletion first, then use that timestamp in the data sstables to filter data or
short-circuit the lookup if the row data carries an overall "most recent data timestamp".
> compaction could be forced to reference all the row tombstone sstables, such that every
time two or more "data" sstables are compacted, they must reference the row tombstones to
purge data. 
> In LCS, this would be particularly useful in getting data out of the upper levels without
having to wait for data to trickle up the tree. The row tombstones, being read-only inputs
into the data sstable compactions, can be referenced in each of the LCS levels' parallel compactors. 
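> As a rough illustration of the read and compaction ideas above, here is a minimal Python
sketch. It is not Cassandra code: `Cell`, `compact`, and `read_row` are hypothetical names, an
"sstable" is just a list of cells, and the row tombstones are modelled as a plain dict of
row key to deletion timestamp. It assumes a cell is shadowed when its timestamp is less than
or equal to the deletion timestamp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    row_key: str
    column: str
    value: str
    timestamp: int  # write timestamp of this cell


def is_shadowed(cell, row_tombstones):
    """True if a row tombstone deletes this cell (assumed rule: ts <= deletion ts)."""
    deleted_at = row_tombstones.get(cell.row_key)
    return deleted_at is not None and cell.timestamp <= deleted_at


def compact(data_sstables, row_tombstones):
    """Merge data sstables, using the row tombstones as a read-only input.

    Shadowed cells are purged; for each (row, column) the newest cell survives.
    """
    newest = {}
    for sstable in data_sstables:
        for cell in sstable:
            if is_shadowed(cell, row_tombstones):
                continue
            key = (cell.row_key, cell.column)
            if key not in newest or cell.timestamp > newest[key].timestamp:
                newest[key] = cell
    return sorted(newest.values(), key=lambda c: (c.row_key, c.column))


def read_row(row_key, data_sstables, row_tombstones):
    """Look up the deletion timestamp first, then filter or skip data sstables."""
    deleted_at = row_tombstones.get(row_key)
    result = {}
    for sstable in data_sstables:
        cells = [c for c in sstable if c.row_key == row_key]
        if not cells:
            continue
        # Short-circuit: the newest data for this row in this sstable
        # is no newer than the deletion, so nothing here can be live.
        most_recent = max(c.timestamp for c in cells)
        if deleted_at is not None and most_recent <= deleted_at:
            continue
        for c in cells:
            if is_shadowed(c, row_tombstones):
                continue
            if c.column not in result or c.timestamp > result[c.column].timestamp:
                result[c.column] = c
    return {column: c.value for column, c in result.items()}


old = [Cell("k1", "a", "1", 10), Cell("k2", "a", "x", 10)]
new = [Cell("k1", "a", "2", 30), Cell("k1", "b", "3", 15)]
tombstones = {"k1": 20}  # row k1 was deleted at timestamp 20

compacted = compact([old, new], tombstones)
live_k1 = read_row("k1", [old, new], tombstones)
```

Here the two cells of `k1` written before timestamp 20 are purged during compaction, while
the cell written at timestamp 30 and the untouched row `k2` survive. A real implementation
would also have to respect tombstone grace periods and repair, which this sketch ignores.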
> Based on discussions in the dev list, this would appear to require some sort of customization
to the memtable->sstable flushing process, and perhaps a different set of bloom filters. 
> Since the row tombstone sstables contain only <rowkey>,<tombstone timestamp> entries,
they should be comparatively small and take less time to compact. They could be aggressively
compacted on a different schedule than "data" sstables.
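> To show why compacting such sstables could be cheap, here is a small Python sketch of
merging them. It assumes each row tombstone sstable is a list of (rowkey, tombstone timestamp)
pairs sorted by row key, and that the newest deletion per row key is the only one worth
keeping. The function name `merge_tombstone_sstables` is hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_tombstone_sstables(*sstables):
    """Merge sorted (row_key, timestamp) runs, keeping the newest deletion per key."""
    merged = heapq.merge(*sstables)  # streaming k-way merge of sorted inputs
    return [
        (row_key, max(ts for _, ts in group))
        for row_key, group in groupby(merged, key=itemgetter(0))
    ]


a = [("k1", 20), ("k3", 5)]
b = [("k1", 35), ("k2", 12)]
merged = merge_tombstone_sstables(a, b)
```

With these inputs, `merged` holds one entry per row key, and `k1` keeps the later deletion
timestamp of 35. The merge is a single sequential pass over small fixed-size entries.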
> In addition, it may be easier to repair/synchronize row tombstones across the cluster
if they have already been separated into their own sstables.
> Column/range tombstones may also benefit from a similar separation, but my guess is those
are so numerous, large, and fine-grained that they might as well coexist with the
