Date: Mon, 30 May 2011 17:57:46 -0700
From: Yang <teddyyyy123@gmail.com>
To: user@cassandra.apache.org
Subject: clarification of the consistency guarantees of Counters

I went through https://issues.apache.org/jira/browse/CASSANDRA-1072 and realized that the consistency guarantees of Counters are a bit different from those of regular columns, so could you please confirm that the following are true?
1) Comment https://issues.apache.org/jira/browse/CASSANDRA-1072?focusedCommentId=12900659&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12900659 still holds: "there is no way to create a write CL greater than ONE, and thus, no defense against *permanent* failures of single machines"
2) Due to the above, the best I can do to increase reliability is to enable REPLICATE_ON_WRITE, but this would still leave the most recent updates on the leader exposed to loss during a short window.
3) Without REPLICATE_ON_WRITE (or, equivalently, read repair) I would have to read at CL=ALL; in that case, if the leader fails, all future reads fail. So for counters I have to either enable REPLICATE_ON_WRITE or set read_repair_chance to a reasonably high value, and read at a CL other than ALL.
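(For concreteness, a minimal sketch of the two knobs I mean, assuming the 0.8 Thrift-generated CfDef; the keyspace and column family names are made up:)

    import org.apache.cassandra.thrift.CfDef;

    public class CounterCfSketch {
        public static void main(String[] args) {
            // "MyKeyspace" / "counters" are hypothetical names.
            CfDef cf = new CfDef("MyKeyspace", "counters");
            cf.setDefault_validation_class("CounterColumnType");
            cf.setReplicate_on_write(true);  // push the leader's new total to replicas on write
            cf.setRead_repair_chance(1.0);   // or lean on read repair firing on every read
            // then, over an open Thrift connection:
            // client.system_add_column_family(cf);
        }
    }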

Apart from the questions, some thoughts on Counters:
The idea of distributed counters can be seen, in distributed-algorithms terms, as a state machine (see Fred Schneider '93), where ideally we send the messages (delta increments) to each node, and the final state (the sum of deltas, i.e. the counter value) is deduced independently at each node. In the current implementation it's really not a distributed state machine, since state is deduced only at the leader, and what is replicated is just the final state. In fact, the data from different leaders are orthogonal, and within the data flow from one leader *it's really just a master-slave system; then we realize that this system is prone to single-master failure.*
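(To make the per-leader structure concrete, a toy model of my own, not the actual counter-context code: each leader folds deltas into its own shard, and replication only ships the summed state:)

    import java.util.HashMap;
    import java.util.Map;

    public class LeaderShardCounter {
        // One (clock, total) pair per leader; individual deltas never leave the leader.
        static class Shard { long clock; long total; }

        private final Map<String, Shard> shards = new HashMap<String, Shard>();

        // Only the leader itself calls this: it folds the delta into its own shard.
        void leaderIncrement(String leaderId, long delta) {
            Shard s = shards.get(leaderId);
            if (s == null) { s = new Shard(); shards.put(leaderId, s); }
            s.clock++;
            s.total += delta;
        }

        // Replication ships the leader's latest summed state; a replica keeps
        // whichever copy has the higher clock (master-slave per leader).
        void applyReplicated(String leaderId, long clock, long total) {
            Shard s = shards.get(leaderId);
            if (s == null) { s = new Shard(); shards.put(leaderId, s); }
            if (clock > s.clock) { s.clock = clock; s.total = total; }
        }

        // The counter value is the sum over all leaders' shards.
        long value() {
            long sum = 0;
            for (Shard s : shards.values()) sum += s.total;
            return sum;
        }
    }

(If the leader's not-yet-replicated shard is lost, its deltas are gone: exactly the single-master weakness above.)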

If we want to build a truly distributed state machine, I am afraid there are no easier/faster solutions than the existing ones (Paxos, etc.). But I guess a possible solution could lie in the fact that our goal allows for a relaxation of the traditional state machine: eventual consistency, and also that our operations are commutative (re-ordering 2 adds yields the same state when we apply the state changes). Taking advantage of these facts could probably lead us to a truly distributed counters solution.
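(The commutativity claim in one toy example: two replicas that receive the same deltas in different orders converge to the same value:)

    import java.util.Arrays;
    import java.util.List;

    public class CommutativeDeltas {
        static long apply(List<Integer> deltas) {
            long state = 0;
            for (int d : deltas) state += d;  // addition commutes and associates
            return state;
        }

        public static void main(String[] args) {
            List<Integer> order1 = Arrays.asList(5, -2, 7);
            List<Integer> order2 = Arrays.asList(7, -2, 5);  // same deltas, different delivery order
            System.out.println(apply(order1) == apply(order2));  // true: both are 10
        }
    }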

The route of keeping all the individual updates at each node, and later doing reconciliation on the history, has been mentioned in the JIRA. Because message losses are less common than successes, maybe this is not as bad a route as we thought?
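(A sketch of that route, again toy code of my own, assuming every update carries a globally unique id such as "nodeA:42": reconciliation becomes set union of the histories, which is commutative and idempotent, so both re-ordering and re-delivery are harmless:)

    import java.util.HashMap;
    import java.util.Map;

    public class HistoryCounter {
        // Full history: one entry per update, keyed by a globally unique id.
        // Re-applying the same update is a no-op.
        private final Map<String, Long> history = new HashMap<String, Long>();

        void record(String updateId, long delta) {
            history.put(updateId, delta);
        }

        // Reconciliation is set union of two histories; union is commutative,
        // associative and idempotent, so replicas converge regardless of order.
        void mergeFrom(HistoryCounter other) {
            history.putAll(other.history);
        }

        long value() {
            long sum = 0;
            for (long d : history.values()) sum += d;
            return sum;
        }
    }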

Thanks
Yang
