cassandra-commits mailing list archives

From "Alexander Dejanovski (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-13418) Allow TWCS to ignore overlaps when dropping fully expired sstables
Date Wed, 26 Apr 2017 10:04:04 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-13418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984514#comment-15984514
] 

Alexander Dejanovski commented on CASSANDRA-13418:
--------------------------------------------------

[~iksaif], I have a similar patch waiting to be tested on my laptop actually :)

The option name in my version was deliberately scary so that people would (hopefully)
think hard before using it: unsafe_expired_sstable_deletion

I fully agree that this option (whatever the name) should be available for TWCS (and why not
DTCS) because the typical use case is to use TTL as a deletion mechanism and not explicit
DELETE statements, which should be done in some rare cases only. If a cluster is a bit unhealthy
for whatever reason, it is painful to see read repair forcing tens of GB of data to stay on
disk because of timestamp overlaps.

The only possible zombie data in a correct TWCS use case (all data is written with TTLs) would
be if the tombstone and the data it shadows are written in the same time window (and of course
the data is missing on one node).

If the data and the tombstone live in different buckets, we'll be in the following scenario:
- data is written in bucket 1 with a TTL but the write fails on one node
- the tombstone is written in bucket 2 on all nodes : data and tombstone will then never be
compacted together since they live in different buckets. 
- In bucket 3 there is a read repair that replicates the data to the node that missed it,
where it should have been in bucket 1. It's written with the same timestamp/TTL and will expire
at the same time as on all other nodes, even if the tombstone is collected before (which won't
happen until the TTL expires).
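
To make the last point concrete, here is a minimal sketch of why the read-repaired copy expires with the others. It is a simplified model, not Cassandra code: TtlCell, readRepairedCopy and windowIndex are hypothetical names, and the only assumption is the one stated above, that the repaired copy keeps the original timestamp and TTL.

```java
/** Simplified model of a TTL'd cell; hypothetical, not a Cassandra class. */
final class TtlCell {
    final long writeTimestampSeconds; // timestamp assigned at the original write
    final long ttlSeconds;

    TtlCell(long writeTimestampSeconds, long ttlSeconds) {
        this.writeTimestampSeconds = writeTimestampSeconds;
        this.ttlSeconds = ttlSeconds;
    }

    /** Expiry depends only on the original write timestamp and TTL. */
    long expiresAtSeconds() {
        return writeTimestampSeconds + ttlSeconds;
    }

    /** Assumption from the comment: read repair copies timestamp and TTL unchanged. */
    TtlCell readRepairedCopy() {
        return new TtlCell(writeTimestampSeconds, ttlSeconds);
    }

    /** Index of the time window (bucket) in which an sstable flushed at this time lands. */
    static long windowIndex(long flushTimeSeconds, long windowSizeSeconds) {
        return Math.floorDiv(flushTimeSeconds, windowSizeSeconds);
    }
}
```

The repaired copy lands in a later window than the original write, but expiresAtSeconds() is identical on both, so it never outlives the copies on the other nodes.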

If the tombstone and the data it shadows live in the same bucket, and the TTL is longer than
gc_grace_seconds, then it's indeed possible to have reappearing data, but even then I'm not
sure it could happen: during the bucket's major compaction, the data and tombstone would
most likely be merged and only the tombstone would survive, which prevents a subsequent
read repair from replicating the data in later time windows.
[~jjirsa] [~krummas] : I may be wrong here in the way compaction actually merges tombstones
and data before gc_grace_seconds, so please correct me if necessary.

IMHO it is worth enduring a slight chance of reappearing data in a TTL workload, by choice,
in order to allow optimal space savings.

After looking at your patch, it could be interesting performance-wise to skip calling
getOverlappingSSTables() entirely, in order to avoid searching for and storing overlaps only to
discard them afterwards, by modifying this line instead: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/compaction/TimeWindowCompactionStrategy.java#L107
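
A rough sketch of what I mean, as a self-contained model rather than the actual Cassandra code. ExpiredSketch, SSTable, OverlapLookup and the ignoreOverlaps flag are all hypothetical stand-ins, and the expiry check is a simplification (it ignores memtables and currently compacting sstables):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

final class ExpiredSketch {
    /** Minimal stand-in for sstable metadata; not Cassandra's SSTableReader. */
    static final class SSTable {
        final String name;
        final long minTimestamp;
        final long maxTimestamp;
        final int maxLocalDeletionTime;

        SSTable(String name, long minTimestamp, long maxTimestamp, int maxLocalDeletionTime) {
            this.name = name;
            this.minTimestamp = minTimestamp;
            this.maxTimestamp = maxTimestamp;
            this.maxLocalDeletionTime = maxLocalDeletionTime;
        }
    }

    /** Hypothetical stand-in for the (potentially expensive) overlap search. */
    interface OverlapLookup {
        Collection<SSTable> overlapsOf(Collection<SSTable> candidates);
    }

    static List<SSTable> fullyExpired(Collection<SSTable> candidates, OverlapLookup lookup,
                                      int gcBefore, boolean ignoreOverlaps) {
        // When overlaps are ignored, never run the lookup at all instead of
        // computing the overlaps and throwing them away afterwards.
        Collection<SSTable> overlapping = ignoreOverlaps
                ? Collections.<SSTable>emptyList()
                : lookup.overlapsOf(candidates);

        // Oldest timestamp held by any overlapping sstable.
        long minOverlapTimestamp = Long.MAX_VALUE;
        for (SSTable o : overlapping)
            minOverlapTimestamp = Math.min(minOverlapTimestamp, o.minTimestamp);

        List<SSTable> expired = new ArrayList<>();
        for (SSTable s : candidates) {
            boolean allDataExpired = s.maxLocalDeletionTime < gcBefore;
            boolean cannotShadowOlderData = s.maxTimestamp < minOverlapTimestamp;
            if (allDataExpired && cannotShadowOlderData)
                expired.add(s);
        }
        return expired;
    }
}
```

With the flag off, a single overlapping sstable holding older timestamps keeps an otherwise expired candidate on disk; with the flag on, the candidate is dropped and the lookup is never called.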

To sum up: big +1, it'll help operators who are fighting low disk space and don't understand
why expired SSTables don't get deleted.

> Allow TWCS to ignore overlaps when dropping fully expired sstables
> ------------------------------------------------------------------
>
>                 Key: CASSANDRA-13418
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13418
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Compaction
>            Reporter: Corentin Chary
>              Labels: twcs
>
> http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html explains it well. If you really
> want read-repairs you're going to have sstables blocking the expiration of other fully expired
> SSTables because they overlap.
> You can set unchecked_tombstone_compaction = true or tombstone_threshold to a very low
> value and that will purge the blockers of old data that should already have expired, thus
> removing the overlaps and allowing the other SSTables to expire.
> The thing is that this is rather CPU intensive and not optimal. If you have time series,
> you might not care if all your data doesn't exactly expire at the right time, or if data re-appears
> for some time, as long as it gets deleted as soon as it can. And in this situation I believe
> it would be really beneficial to allow users to simply ignore overlapping SSTables when looking
> for fully expired ones.
> To the question: why would you need read-repairs?
> - Full repairs basically take longer than the TTL of the data on my dataset, so this
> isn't really effective.
> - Even with a 10% chance of doing a repair, we found out that this would be enough to
> greatly reduce entropy of the most used data (and if you have timeseries, you're likely to
> have a dashboard doing the same important queries over and over again).
> - LOCAL_QUORUM is too expensive (need >3 replicas), QUORUM is too slow.
> I'll try to come up with a patch demonstrating how this would work, try it on our system
> and report the effects.
> cc: [~adejanovski], [~rgerard] as I know you worked on similar issues already.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
