Subject: Re: clarification of the consistency guarantees of Counters
From: Yang
To: user@cassandra.apache.org
Date: Tue, 31 May 2011 01:21:47 -0700 (PDT)

thanks Sylvain,

I agree with what you said in the first few paragraphs ---- Jeremy corrected me on that just now.

regarding the last point, you are right to use the term "by operation", but you should also note that it is a form of leader "data ownership", in the sense that the leader has the authoritative say when it comes to reconciling the bucket of counts it owns ----- yes, you've convinced me that we DO need to use CL > ONE, but for the sake of argument, if CL = ONE were used, loss of the leader's data would leave the other replicas unable to reconcile that bucket. that's what I mean.
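to make concrete what I mean by leader "data ownership", here is a rough Python sketch of how I picture per-leader counter shards being reconciled. the names (merge_shards, counter_value, the clock tuples) are made up for illustration and are not the actual Cassandra code:

# toy model of per-leader counter shards -- illustration only, not Cassandra code
def merge_shards(local, remote):
    # for each leader, keep the shard with the higher clock:
    # the leader's latest total is authoritative, other replicas just copy it
    merged = dict(local)
    for leader, (clock, count) in remote.items():
        if leader not in merged or clock > merged[leader][0]:
            merged[leader] = (clock, count)
    return merged

def counter_value(shards):
    # the counter value is the sum of every leader's authoritative count
    return sum(count for _, count in shards.values())

# with CL = ONE, an increment can be acked while only its leader holds the
# new shard; lose that leader's data before it replicates, and the other
# replicas have nothing left to reconcile against
replica_a = {"A": (3, 10), "B": (1, 4)}   # A's shard at clock 3, only on A
replica_b = {"A": (2, 7),  "B": (1, 4)}   # B never received A's latest shard
print(counter_value(merge_shards(replica_a, replica_b)))  # -> 14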
but anyway it's not relevant now, since CL can be > ONE.

I'd really appreciate it if you could review my newer post on FIFO; I think that could be an interesting approach.

yang


On Tue, May 31, 2011 at 12:59 AM, Sylvain Lebresne wrote:
>
> > apart from the questions, some thoughts on Counters:
> > the idea of distributed counters can be seen, in distributed-algorithms
> > terms, as a state machine (see Fred Schneider '93), where ideally we send
> > the messages (delta increments) to each node, and the final state (the sum
> > of deltas, i.e. the counter value) is deduced independently at each node.
> > in the current implementation it's not really a distributed state machine,
> > since state is deduced only at the leader, and what is replicated is just
> > the final state. in fact, the data from different leaders are orthogonal,
> > and within the data flow from one leader it's really just a master-slave
> > system. then we realize that this system is prone to single-master failure.
>
> Don't get fooled by the term 'leader': there is one leader *per
> operation*, not one global leader. Again, the leader of an operation
> is really just the first of the replicas we're replicating to.
>
> It's no more a master-slave design than regular writes are, because
> they use a distinguished coordinator node for each operation. And it's
> not prone to single-node failure, because if you do counter increments
> at CL.QUORUM against, say, a cluster with RF=3, then you will still be
> able to write and read even if one node is down, and which node exactly
> doesn't matter at all.
>
> --
> Sylvain
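PS: a quick sanity check of your last point, for anyone following along: with RF=3 and QUORUM on both reads and writes, any write quorum and any read quorum share at least one replica, so a single down node can't hide an acknowledged increment. a small Python sketch, assuming the usual quorum = floor(RF/2) + 1:

# check that with RF=3 every write quorum and read quorum overlap,
# so losing any single node never hides an acknowledged counter increment
from itertools import combinations

replicas = {"n1", "n2", "n3"}          # RF = 3
quorum = len(replicas) // 2 + 1        # QUORUM = 2

for writes in combinations(replicas, quorum):
    for reads in combinations(replicas, quorum):
        assert set(writes) & set(reads), "quorums must intersect"
print("every QUORUM write intersects every QUORUM read")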
