In a distributed system such as Cassandra, things can happen (a node going down, a stop-the-world GC pause, a hardware issue, ...) that desynchronize replicas. Isn't repair therefore also needed to keep replicas up to date, at least once a week or once a month? It is a strong and reliable process for keeping things in sync, isn't it?

I know that read repair and hinted handoff are also there to handle this kind of issue, but they can fail: I have seen a lot of errors in the logs about hints not being delivered (some people even disable hints), and read repair is often configured to trigger on only 10% of the reads.

2014-01-28 14:53 GMT+01:00 Sylvain Lebresne <sylvain@datastax.com>:

I have actually set up one of our application streams such that the same key is only overwritten with a monotonically increasing ttl.

For example, a breaking news item might have an initial ttl of 60 seconds, followed in 45 seconds by an update with a ttl of 3000 seconds, followed by an 'ignore me' update in 600 seconds with a ttl of 30 days (our maximum ttl) when the article is published.

My understanding is that this case fits the criteria and no 'periodic repair' is needed.

That's correct. The real criterion for not needing repair, if you do no deletes but only use TTLs, is "update only with monotonically increasing (not necessarily strictly) ttl". Always setting the same TTL is just a special case of that, but I think it's the most commonly used one, so I tend to simplify it to that case.

I guess another thing I would point out that is easy to miss or forget (if you are a newish user like me), is that ttl's are fine-grained, by column. So we are talking 'fixed' or 'variable' by individual column, not by table. Which means, in my case, that ttl's can vary widely across a table, but as long as I constrain them by key value to be fixed or monotonically increasing, it fits the criteria.

We're talking monotonically increasing ttl "for a given primary key" if we're talking about CQL, and "for a given column" if we're talking about thrift. Not "by table".
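To make the criterion concrete, here is a minimal Python sketch that checks a log of writes against it. It is only an illustration: the function name, the `(key, ttl_seconds)` input format, and the choice to treat a write without a TTL as an infinite TTL are all assumptions made for this example, not anything Cassandra provides.

```python
def keys_violating_monotonic_ttl(writes):
    """Return the keys whose TTL ever decreases between successive writes.

    `writes` is an iterable of (key, ttl_seconds) pairs in write order,
    where `key` stands for the CQL primary key (or the thrift column).
    Assumption: a write with no TTL is passed as None and is treated as
    an infinite TTL.
    """
    last_ttl = {}
    violations = set()
    for key, ttl in writes:
        ttl = float("inf") if ttl is None else ttl
        # Equal TTLs are allowed: the criterion is "not necessarily strictly".
        if key in last_ttl and ttl < last_ttl[key]:
            violations.add(key)
        last_ttl[key] = ttl
    return violations


if __name__ == "__main__":
    writes = [
        ("breaking-news", 60),
        ("breaking-news", 3000),
        ("breaking-news", 30 * 86400),  # 30 days
        ("other-item", 600),
        ("other-item", 60),             # shorter TTL overwrites a longer one
    ]
    print(keys_violating_monotonic_ttl(writes))
```

With the sample writes, only the hypothetical "other-item" key breaks the rule, so only the data behind that key would still call for periodic repair.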





On Tue, Jan 28, 2014 at 4:18 AM, Sylvain Lebresne <sylvain@datastax.com> wrote:
On Tue, Jan 28, 2014 at 1:05 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
If you have only ttl columns, and you never update the column I would not think you need a repair.

Right, no deletes and no updates is Michael's case 1., on which I think we all agree that 'periodic repair to avoid resurrected columns' is not required.

Repair cures lost deletes. If all your writes have a ttl, a lost write should not matter, since the column was never written to the node and thus could never be resurrected on said node.

I'm sure we're all in agreement here, but for the record, this is only true if you have no updates (overwrites) and/or if all writes have the *same* ttl. In the general case, a column with a relatively short TTL is basically very close to a delete, while a column with a long TTL is very close to one that has no TTL. If the former column (with the short TTL) overwrites the latter one (with the long TTL), and if one node misses the overwrite, that node could resurrect the column with the longer TTL (until that column expires, that is). Hence the separation of case 2. (fixed ttl, no repair needed) and 2.a. (variable ttl, repair may be needed).
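The scenario can be sketched with a toy model in Python. This is not Cassandra code: `Cell`, `Replica` and `simulate` are hypothetical names, each replica holds a single column, conflicts are resolved by last-write-wins on the write time, and tombstones, gc_grace, compaction and consistency levels are deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Cell:
    value: str
    write_time: int  # seconds; used for last-write-wins
    ttl: int         # seconds

    def is_live(self, now: int) -> bool:
        return now < self.write_time + self.ttl


class Replica:
    """Toy replica storing one column, resolved by last-write-wins."""

    def __init__(self) -> None:
        self.cell: Optional[Cell] = None

    def write(self, cell: Cell) -> None:
        if self.cell is None or cell.write_time >= self.cell.write_time:
            self.cell = cell

    def read(self, now: int) -> Optional[str]:
        if self.cell is not None and self.cell.is_live(now):
            return self.cell.value
        return None


def simulate() -> Tuple[Optional[str], Optional[str]]:
    a, b = Replica(), Replica()
    long_lived = Cell("old", write_time=0, ttl=3000)
    short_lived = Cell("new", write_time=10, ttl=60)
    for replica in (a, b):
        replica.write(long_lived)
    a.write(short_lived)  # replica b misses this overwrite
    now = 100             # "new" has expired, "old" has not
    return a.read(now), b.read(now)


if __name__ == "__main__":
    print(simulate())
```

In this model replica a has nothing left to return at time 100, while replica b still serves "old": the long-TTL column has effectively come back on the node that missed the overwrite.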


Unless I am missing something.

On Monday, January 27, 2014, Laing, Michael <michael.laing@nytimes.com> wrote:
> Thanks Sylvain,
> Your assumption is correct!
> So I think I actually have 4 classes:
> 1.    Regular values, no deletes, no overwrites, write heavy, variable ttl's to manage size
> 2.    Regular values, no deletes, some overwrites, read heavy (10 to 1), fixed ttl's to manage size
> 2.a. Regular values, no deletes, some overwrites, read heavy (10 to 1), variable ttl's to manage size
> 3.    Counter values, no deletes, update heavy, rotation/truncation to manage size
> Only 2.a. above requires me to do 'periodic repair'.
> What I will actually do is change my schema and applications slightly to eliminate the need for overwrites on the only table I have in that category.
> And I will set gc_grace_seconds to 0 for the tables in the updated schema and drop 'periodic repair' from the schedule.
> Cheers,
> Michael
> On Mon, Jan 27, 2014 at 4:22 AM, Sylvain Lebresne <sylvain@datastax.com> wrote:
>> By periodic repair, I'll assume you mean "having to run repair every gc_grace period to make sure no deleted entries resurrect". With that assumption:
>>> 1. Regular values, no deletes, no overwrites, write heavy, ttl's to manage size
>> Since 'repair within gc_grace' is about preventing values that have been deleted from being resurrected, if you do no deletes and no overwrites, you're at no risk of that (and don't need to 'repair within gc_grace').
>>> 2. Regular values, no deletes, some overwrites, read heavy (10 to 1), ttl's to manage size
>> It depends a bit. In general, if you always set the exact same TTL on every insert (implying you always set a TTL), then you have nothing to worry about. If the TTL varies (or if you only set a TTL some of the time), then you might still need some periodic repairs. That being said, if there are no deletes but only TTLs, then the TTL lengthens the period within which you need to repair: instead of needing to repair within gc_grace, you only need to repair every gc_grace + min(TTL) (where min(TTL) is the smallest TTL you set on columns).
>>> 3. Counter values, no deletes, update heavy, rotation/truncation to manage size
>> No deletes and no TTL implies that you're fine (as in, there is no need for 'repair within gc_grace').
>> --
>> Sylvain
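
The interval from the quoted message above, gc_grace + min(TTL) for tables with no deletes but varying TTLs, is simple arithmetic. A minimal Python sketch follows; the function name and its inputs are hypothetical, and it only restates that formula rather than anything Cassandra computes for you.

```python
def max_repair_interval_seconds(gc_grace_seconds, ttls_seconds):
    """Longest gap between repairs for a no-delete, TTL-only table.

    Restates the rule quoted above: gc_grace + min(TTL), where min(TTL)
    is the smallest TTL ever set on the table's columns.
    """
    ttls = list(ttls_seconds)
    if not ttls:
        raise ValueError("at least one TTL is required")
    return gc_grace_seconds + min(ttls)


if __name__ == "__main__":
    # 864000 s (10 days) is the default gc_grace_seconds; the TTLs are the
    # ones from the breaking-news example: 60 s, 3000 s and 30 days.
    print(max_repair_interval_seconds(864000, [60, 3000, 30 * 86400]))
```

With a very short minimum TTL, as here, the result is barely longer than gc_grace itself, so the extra slack only matters when even the smallest TTL is large.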
