From: Peter Schuller <scode@scode.org>
To: user@cassandra.apache.org
Cc: mcasandra, cassandra-user@incubator.apache.org
Date: Tue, 29 Mar 2011 21:37:11 +0200
Subject: Re: How to determine if repair need to be run

First some specifics:

> I think my problem is that I don't want to remember to run read repair.

You are not expected to remember to do so manually. Typically periodic
repairs would be automated in some fashion, such as by having a cron job
on each node that starts the repair. Some kind of logic is typically
applied to avoid running repair on all nodes at the same time.

> I want to know from cassandra that I "need" to run repair "now". This
> seems like an important piece of functionality that needs to be there.
> I don't really want to find out the hard way that I forgot to run
> "repair" :)

See further below.

> Say Node A, B, C. Now A is inconsistent and needs repair. Now Node B
> goes down. Even with Quorum this will fail read and write.

With writes and reads at QUORUM, a read following a write is guaranteed
to see the write. If enough nodes are down such that QUORUM is not
satisfied, the read operation will fail. Node B going down above is not
a problem: if your RF is 3, a write would have been required to succeed
on A and B, or B and C, or A and C. Since reads have the same
requirement, there is by definition always overlap between the read set
and the write set. This is the fundamental point of using QUORUM.
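To make the counting argument concrete, here is a tiny illustration
(plain Python, not Cassandra code; the replica names and RF are just for
the example):

# Illustration only: why any QUORUM write and any QUORUM read must
# overlap when RF = 3. Not Cassandra code, just the counting argument.
from itertools import combinations

RF = 3
replicas = {"A", "B", "C"}
quorum = RF // 2 + 1          # 2 out of 3

for write_set in combinations(replicas, quorum):
    for read_set in combinations(replicas, quorum):
        # 2 + 2 > 3, so the two sets can never be disjoint.
        assert set(write_set) & set(read_set)

print("every possible QUORUM read overlaps every possible QUORUM write")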
> There could be other scenarios. Looks like repair is a critical
> command that is expected to be run, but "when"? Saying once within
> GCGraceSeconds might be ok for some but not for everyone where we want
> to bring all nodes in sync ASAP.

Let me try to put it in a different light. The reasons to use 'nodetool
repair' seem to fall roughly into two categories:

(a) Ensuring that 'nodetool repair' has been run within GCGraceSeconds.

(b) Helping to increase the 'average' level of consistency as observed
by the application.

These two cases are very, very different.

Cassandra makes certain promises on data consistency that clients can
control in part by consistency levels. If (a) fails, such that a
'nodetool repair' was not run in time, the cluster will behave
*incorrectly*. It will fail to satisfy the guarantees that it supposedly
promises. This is essentially a binary condition; either you run
nodetool repair as often as is required for correct functioning, or you
don't. It is a *hard* requirement, but it is entirely irrelevant until
you actually reach the limit imposed by GCGraceSeconds. There is no need
to run 'repair' as soon as possible (for some definition of "as soon as
possible") in order to satisfy (a). You're 100% fine until you're not,
at which point you've caused Cassandra to violate its guarantees. So -
it's *important* to run repair due to (a), but it is not *urgent* to do
so.

(b), on the other hand, is very different. Assuming your application and
cluster are such that you want to run repair more often than
GCGraceSeconds for whatever reason (for example, for performance you
want to use CL.ONE and turn off read repair, but your data set is such
that it's practical to use pretty frequent repairs to keep
inconsistencies down), it may be beneficial to do so. But this is
essentially a soft 'preference' for how often repairs should be run;
there is no magic limit at which something breaks where it did not break
before. It becomes a matter of setting a reasonable repair frequency for
your use case, and an individual node failing a repair once for some
obscure reason is not an issue.

For (b), you should be fine just triggering repair sufficiently often as
appropriate, with no need to even have strict monitoring or demands.
Almost by definition the requirements are not strict; if they were
stricter, you should be using QUORUM, or maybe ONE plus read repair. So
in this case "remembering" is not a problem - you just install your cron
job that does it often enough, approximately, and don't worry about it.

For (a), there is the hard requirement. This is where you *really* want
the repair to complete, and preferably want some kind of
alarm/notification if a repair doesn't run in time.

Note that for (b), it doesn't help to know the moment a write didn't get
replicated fully. That's bound to happen often (every time a node is
restarted, there is some short hiccup, etc.). A single write failing to
replicate is an almost irrelevant event.

For (a), on the other hand, it *is* helpful and required to keep track
of the time of the last successful repair. Cassandra could be better at
making this easier, I think, but it is an entirely different problem
than detecting that "somewhere in the cluster, a non-zero amount of
writes may possibly have failed to replicate". The former is directly
relevant and important; the latter is almost always completely
irrelevant to the problem at hand.

Sorry to be harping on the same issue, but I really think it's worth
trying to be clear about this from the start :) If you do have a
use-case that somehow truly is not consistent with the above, it would
however be interesting to hear what it is.
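For what it's worth, here is a minimal sketch of what the cron-driven
automation plus "last successful repair" bookkeeping could look like.
This is only an illustration: the node count, stamp file path and
staggering scheme are made up, and you would adapt the nodetool
invocation to your own setup.

#!/usr/bin/env python
# Hypothetical per-node repair wrapper, invoked by a daily cron job.
# It staggers nodes so they do not all repair on the same day, and
# records the time of the last successful repair so that monitoring can
# alarm well before GCGraceSeconds is exceeded.
import socket
import subprocess
import time
import zlib

NODES_IN_RING = 3  # assumption: set to the number of nodes in your ring
STAMP_FILE = "/var/run/cassandra/last_successful_repair"  # made-up path

def my_slot():
    # Stable hash of the hostname spreads nodes over the repair cycle.
    return zlib.crc32(socket.gethostname().encode()) % NODES_IN_RING

def main():
    today = int(time.time() // 86400) % NODES_IN_RING
    if today != my_slot():
        return  # not this node's turn today

    # Run the repair; a non-zero exit raises CalledProcessError, so the
    # stamp file is only updated on success.
    subprocess.check_call(["nodetool", "repair"])

    with open(STAMP_FILE, "w") as f:
        f.write(str(int(time.time())))

if __name__ == "__main__":
    main()

A separate check (nagios, another cron'd script, whatever you use) can
then compare the age of that timestamp against, say, half of
GCGraceSeconds and alert if it gets too old - which covers (a) - while
the cron job on its own is all that (b) ever really needs.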
Is the above clear? I'm thinking maybe it's worth adding to the FAQ
unless it's more confusing than helpful.

--
/ Peter Schuller