incubator-cassandra-user mailing list archives

From Alain RODRIGUEZ <>
Subject Re: (unofficial) Community Poll for Production Operators : Repair
Date Thu, 16 May 2013 11:49:08 GMT
@Rob: Thanks for the feedback.

However, I still see unexplained behavior with repair. Are counters
supposed to be "repaired" too? When reading at CL.ONE, I can get
different values depending on which node answers, even after a read
repair or a full repair. Shouldn't a repair fix these discrepancies?

The only way I have found to always get the same count is to read at
CL.QUORUM, but this is only a workaround, since the data itself remains
wrong on some nodes.
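
One possible reading of the observation above is the usual quorum-overlap
argument. The sketch below is a toy model only: it assumes 3 replicas, a
write acknowledged by 2 of them, and plain newest-timestamp-wins
reconciliation. Cassandra's real counter reconciliation is more involved,
and the node names and values here are made up for illustration.

```python
from itertools import combinations

# Toy model (an assumption, not Cassandra's counter implementation):
# each replica stores a (timestamp, value) pair and the newest timestamp wins.
replicas = {
    "node1": (2, 42),  # received the latest write
    "node2": (2, 42),  # received the latest write
    "node3": (1, 41),  # missed the latest write
}

def read(contacted):
    """Return the value with the newest timestamp among the contacted replicas."""
    return max(replicas[name] for name in contacted)[1]

# CL.ONE: a single replica answers, so the result depends on which one.
one_results = {read([name]) for name in replicas}

# CL.QUORUM with 3 replicas: any 2 replicas answer, and every such pair
# includes at least one of the 2 replicas that acknowledged the write.
quorum_results = {read(pair) for pair in combinations(replicas, 2)}

print(sorted(one_results))     # [41, 42]
print(sorted(quorum_results))  # [42]
```

In this model the QUORUM read is stable only because the stale replica is
always outvoted; node3 still holds the old value, which matches the
"workaround" description above.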

Any clue about this?


2013/5/15 Edward Capriolo <>

> Introduced Active Anti-Entropy. Riak now has active anti-entropy. In
> distributed systems, inconsistencies can arise between replicas due to
> failure modes, concurrent updates, and physical data loss or corruption.
> Pre-1.3 Riak already had several features for repairing this “entropy”, but
> they all required some form of user intervention. Riak 1.3 introduces
> automatic, self-healing properties that repair entropy on an ongoing basis.
> On Wed, May 15, 2013 at 5:32 PM, Robert Coli <> wrote:
>> On Wed, May 15, 2013 at 1:27 AM, Alain RODRIGUEZ <>
>> wrote:
>> > Rob, I was wondering something. Are you a committer working on improving
>> > the repair or something similar?
>> I am not a committer [1], but I have an active interest in potential
>> improvements to the best practices for repair. The specific change
>> that I am considering is a modification to the default
>> gc_grace_seconds value, which seems picked out of a hat at 10 days. My
>> view is that the current implementation of repair has such negative
>> performance consequences that I do not believe that holding onto
>> tombstones for longer than 10 days could possibly be as bad as the
>> fixed cost of running repair once every 10 days. I believe that this
>> value is too low for a default (it also does not map cleanly to the
>> work week!) and likely should be increased to 14, 21 or 28 days.
>> > Anyway, if a committer (or any other expert) could give us some feedback
>> > on our comments (Are we doing well or not, whether things we observe are
>> > normal or unexplained, what is going to be improved in the future about
>> > repair...)
>> 1) you are doing things according to best practice
>> 2) unfortunately your experience with significantly degraded
>> performance, including a blocked go-live due to repair bloat, is pretty
>> typical
>> 3) the things you are experiencing are part of the current
>> implementation of repair and are also typical, however I do not
>> believe they are fully "explained" [2]
>> 4) as has been mentioned further down thread, there are discussions
>> regarding (and some already committed) improvements to both the
>> current repair paradigm and an evolution to a new paradigm
>> Thanks to all for the responses so far, please keep them coming! :D
>> =Rob
>> [1] hence the (unofficial) tag for this thread. I do have minor
>> patches accepted to the codebase, but always merged by an actual
>> committer. :)
>> [2] driftx@#cassandra feels that these things are explained/understood
>> by core team, and points to
>> as a useful
>> approach to minimize same.
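
The gc_grace_seconds arithmetic in the quoted message can be sketched as
follows. The setting is expressed in seconds, so the 10-day default
corresponds to 864000. The helper name below is hypothetical, and the
"whole weeks" check is just one way to read the remark about the value not
mapping cleanly to the work week.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400

def days_to_gc_grace_seconds(days):
    """Convert a tombstone grace period in days to seconds (hypothetical helper)."""
    return days * SECONDS_PER_DAY

# Current default (10 days) and the alternatives proposed in the thread.
candidates = {days: days_to_gc_grace_seconds(days) for days in (10, 14, 21, 28)}

for days, seconds in candidates.items():
    # A period that is a whole number of weeks lets repair run on the same
    # weekday each cycle.
    whole_weeks = days % 7 == 0
    print(f"{days} days = {seconds} seconds, whole weeks: {whole_weeks}")
```

Only the 14, 21 and 28 day values are multiples of 7, so a weekly repair
schedule fits inside them with a fixed amount of slack.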
