Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Date: Tue, 20 Dec 2011 12:19:32 +0000 (UTC)
From: "Sylvain Lebresne (Commented) (JIRA)" <jira@apache.org>
To: commits@cassandra.apache.org
Message-ID: 
 <1341796140.30501.1324383572450.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: 
 <189759327.16831.1323979890602.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (CASSANDRA-3641) inconsistent/corrupt counters
 w/ broken shards never converge
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/CASSANDRA-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173138#comment-13173138 ] 

Sylvain Lebresne commented on CASSANDRA-3641:
---------------------------------------------

Let's open a separate ticket to discuss that. So far we've used the log only for recording errors, so let's keep it that way for this ticket.
                
> inconsistent/corrupt counters w/ broken shards never converge
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-3641
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3641
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>         Attachments: 3641-0.8-internal-not-for-inclusion.txt, 3641-trunk.txt
>
>
> We ran into a case (which MIGHT be related to CASSANDRA-3070) whereby we had counters that were corrupt (hopefully due to CASSANDRA-3178). The corruption was that there would exist shards with the *same* node_id, *same* clock id, but *different* counts.
> The counter column diffing and reconciliation code assumes that this never happens, and ignores the count. The problem with this is that if there is an inconsistency, the result of a reconciliation will depend on the order of the shards.
> In our case for example, we would see the value of the counter randomly fluctuating on a CL.ALL read, but we would get a consistent value (whatever the node had) on CL.ONE (submitted to one of the nodes in the replica set for the key).
> In addition, read repair would not work despite digest mismatches because the diffing algorithm also did not care about the counts when determining the differences to send.
> I'm attaching patches that fix this. The first patch is against our 0.8 branch, which is not terribly useful to people, but I include it because it is the well-tested version that we have used on the production cluster which was subject to this corruption.
> The other patch is against trunk, and contains the same change.
> What the patch does is:
> * On diffing, treat as DISJOINT if there is a count discrepancy.
> * On reconciliation, look at the count and *deterministically* pick the higher one, and:
> ** log the fact that we detected a corrupt counter
> ** increment a JMX observable counter for monitoring purposes
> A cluster that is subject to such corruption and has this patch will fix itself with an AES + compact (or just repeated compactions, assuming replicate-on-compact is able to deliver correctly).
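
For readers following along, the behaviour described in the issue can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in the attached patches: the Shard class, the Relationship enum, the per-shard diff/reconcile signatures, and the static corruptShardsDetected field (standing in for the JMX-observable counter) are all hypothetical and much simpler than Cassandra's actual counter context handling.

```java
public class ShardReconcileSketch {

    /** Hypothetical, simplified shard; not Cassandra's real on-disk layout. */
    public static final class Shard {
        public final String nodeId;
        public final long clock;
        public final long count;

        public Shard(String nodeId, long clock, long count) {
            this.nodeId = nodeId;
            this.clock = clock;
            this.count = count;
        }
    }

    public enum Relationship { EQUAL, GREATER_THAN, LESS_THAN, DISJOINT }

    /** Stand-in for the JMX-observable counter mentioned in the issue (assumption). */
    public static long corruptShardsDetected = 0;

    private static void requireSameNode(Shard a, Shard b) {
        if (!a.nodeId.equals(b.nodeId)) {
            throw new IllegalArgumentException("shards belong to different nodes");
        }
    }

    /** Compare two shards of the same node; a count discrepancy is DISJOINT. */
    public static Relationship diff(Shard left, Shard right) {
        requireSameNode(left, right);
        if (left.clock > right.clock) return Relationship.GREATER_THAN;
        if (left.clock < right.clock) return Relationship.LESS_THAN;
        // Same node and same clock: the counts must agree, otherwise one side
        // is corrupt and both must be treated as differing.
        return left.count == right.count ? Relationship.EQUAL : Relationship.DISJOINT;
    }

    /** Pick a winner that does not depend on argument order. */
    public static Shard reconcile(Shard a, Shard b) {
        requireSameNode(a, b);
        if (a.clock != b.clock) {
            return a.clock > b.clock ? a : b;
        }
        if (a.count != b.count) {
            // Corruption: same node id and clock, different counts.
            corruptShardsDetected++;
            System.err.println("corrupt counter shard detected for node " + a.nodeId
                    + " at clock " + a.clock + ": counts " + a.count + " vs " + b.count);
            return a.count > b.count ? a : b; // deterministically keep the higher count
        }
        return a;
    }

    public static void main(String[] args) {
        Shard x = new Shard("node1", 5L, 10L);
        Shard y = new Shard("node1", 5L, 12L);
        System.out.println("diff: " + diff(x, y));
        System.out.println("reconcile(x, y).count = " + reconcile(x, y).count);
        System.out.println("reconcile(y, x).count = " + reconcile(y, x).count);
    }
}
```

The point of the sketch is that reconcile(x, y) and reconcile(y, x) choose the same shard, so the merged value no longer depends on the order in which replicas' shards are seen, and diff reports a difference so that read repair has something to send.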
