Subject: Re: one way to make counter delete work better
From: Yang <teddyyyy123@gmail.com>
To: user@cassandra.apache.org
Date: Mon, 13 Jun 2011 11:26:12 -0700 (PDT)

OK, I think it's better to understand it this way; then it is really simple
and intuitive. My proposed way of counter update can simply be seen as a
combination of regular columns + the current counter columns:

    regular column: [ value: "wipes out every bucket to nil", clock: epoch number ]

Then, within each epoch, counter updates work as currently implemented.
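To make that concrete, here is a rough, self-contained sketch in Java (the
names are made up for illustration; this is not the actual CounterColumn
code): a delete starts a new epoch, adds merge only with adds from the same
epoch, and older-epoch adds are simply dropped.

public class EpochCounterSketch {

    // a clock prefixed with an epoch; ordered by epoch first, then clock
    record EpochClock(long epoch, long clock) implements Comparable<EpochClock> {
        public int compareTo(EpochClock o) {
            int byEpoch = Long.compare(epoch, o.epoch);
            return byEpoch != 0 ? byEpoch : Long.compare(clock, o.clock);
        }
    }

    // a single leader's delta: how much was added, stamped with epoch.clock
    record Delta(EpochClock clock, long value) {}

    // merge rule: same epoch -> sum the adds; different epochs -> the newer
    // epoch wins outright, because the older delta was wiped by a delete
    static Delta merge(Delta a, Delta b) {
        if (a.clock().epoch() == b.clock().epoch()) {
            EpochClock newer = a.clock().compareTo(b.clock()) >= 0 ? a.clock() : b.clock();
            return new Delta(newer, a.value() + b.value());
        }
        return a.clock().epoch() > b.clock().epoch() ? a : b;
    }

    public static void main(String[] args) {
        Delta add1 = new Delta(new EpochClock(0, 100), 1); // add 1 at epoch 0
        // a delete at clock 200 bumps the epoch, so the next add lands in epoch 1
        Delta add2 = new Delta(new EpochClock(1, 300), 2); // add 2 at epoch 1
        System.out.println(merge(add1, add2).value());     // prints 2, not 3
    }
}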
On Mon, Jun 13, 2011 at 10:12 AM, Yang <teddyyyy123@gmail.com> wrote:
> I think this approach also works for your scenario.
>
> I thought the issue was only about merging within the same leader, but you
> pointed out that a similar merge happens between leaders too. Now I see that
> the same rules on the epoch number also apply to inter-leader data merging;
> specifically, in your case:
>
> Everyone starts with an epoch of 0. (They should be the same; if they are
> not, it still works, we just consider them to represent different time
> snapshots of the same counter state.)
>
> node A    add 1     clock: 0.100   (epoch = 0, clock number = 100)
> node A    delete    clock: 0.200
> node B    add 2     clock: 0.300
>
> node A gets B's state (add 2, clock 0.300) but rejects it, because A has
> already produced a delete with epoch 0, so A considers epoch 0 ended and
> won't accept any replicated state with epoch < 1.
>
> node B gets A's delete (0.200), zeros its own count of "2", and updates its
> expected future epoch to 1.
>
> At this point the state of the system is:
> node A    expected epoch = 1   [A:nil] [B:nil]
> node B    same
>
> Say we have the following further writes:
>
> node B    add 3    clock 1.400
> node A    add 4    clock 1.500
>
> node B receives A's add 4 and updates its copy of A;
> node A receives B's add 3 and updates its copy of B.
>
> Then the state is:
> node A    expected epoch == 1   [A:4 clock=500] [B:3 clock=400]
> node B    same
>
> Generally, I think the scheme is complete if we add the following rule for
> inter-leader replication. Each leader keeps a variable in memory (also
> persisted to the sstable when flushing), expected_epoch, initially 0.
>
> Node P does:
>
>   on receiving updates from node Q:
>       if Q.expected_epoch > P.expected_epoch
>           /* an epoch bump inherently means a previous delete, which we
>              probably missed, so we need to apply the delete; a delete is
>              global to all leaders, so apply it to all my replicas */
>           for all leaders in my vector
>               count = nil
>           P.expected_epoch = Q.expected_epoch
>       if Q.expected_epoch == P.expected_epoch
>           update P's copy of Q according to the standard rules
>       /* if Q.expected_epoch < P.expected_epoch, Q is less up to date than
>          we are, so just ignore it */
>
>   replicate_on_write(to Q):
>       if P.operation == delete
>           P.expected_epoch++
>           set all my copies of all leaders to nil
>       send to Q (P.total, P.expected_epoch)
>
> Overall, I don't think the fact that delete is not commutative is a
> fundamental blocker: regular columns are not commutative either, yet we get
> a stable result no matter what order they are applied in, because of the
> ordering rule used in reconciliation. Here we just need to find a similar
> ordering rule, and the epoch could be a step in that direction.
>
> Thanks
> Yang
>
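To pin down the quoted merge rule in runnable form, a minimal sketch (again
with invented names, not the real Cassandra classes): each node keeps
expected_epoch plus one (count, clock) entry per leader, wipes everything
when it sees a higher epoch, and ignores lower epochs.

import java.util.HashMap;
import java.util.Map;

class EpochReplicaSketch {
    // one leader's replicated state: its partial count and its clock
    record Shard(long count, long clock) {}

    long expectedEpoch = 0;
    final Map<String, Shard> copies = new HashMap<>();

    // node P receives leader Q's shard, tagged with Q's expected_epoch
    void receive(String leaderQ, Shard qShard, long qEpoch) {
        if (qEpoch > expectedEpoch) {
            // an epoch bump implies a delete we missed: wipe every leader's copy
            copies.clear();
            expectedEpoch = qEpoch;
        }
        if (qEpoch == expectedEpoch) {
            // standard rule within an epoch: the larger clock wins
            copies.merge(leaderQ, qShard,
                    (oldS, newS) -> newS.clock() > oldS.clock() ? newS : oldS);
        }
        // qEpoch < expectedEpoch: Q is behind us, ignore it
    }

    // on a local delete, start a new epoch and wipe all copies before replicating
    void localDelete() {
        expectedEpoch++;
        copies.clear();
    }

    public static void main(String[] args) {
        // node B's view of leader A, following the scenario above
        EpochReplicaSketch nodeB = new EpochReplicaSketch();
        nodeB.receive("A", new Shard(1, 100), 0); // A: add 1, epoch 0
        nodeB.receive("A", new Shard(0, 200), 1); // A's delete bumped A to epoch 1
        nodeB.receive("A", new Shard(4, 500), 1); // A: add 4 in epoch 1
        System.out.println(nodeB.copies.get("A")); // Shard[count=4, clock=500]
    }
}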
> On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
>> I don't think that's bulletproof either. For instance, what if the
>> two adds go to replica 1 but the delete to replica 2?
>>
>> Bottom line (and this was discussed on the original delete-for-counters
>> ticket, https://issues.apache.org/jira/browse/CASSANDRA-2101), counter
>> deletes are not fully commutative, which makes them fragile.
>>
>> On Mon, Jun 13, 2011 at 10:54 AM, Yang <teddyyyy123@gmail.com> wrote:
>> > As https://issues.apache.org/jira/browse/CASSANDRA-2101 indicates, the
>> > problem with counter delete shows up in scenarios like the following:
>> >
>> > add 1,   clock 100
>> > delete,  clock 200
>> > add 2,   clock 300
>> >
>> > If the 1st and 3rd operations are merged in sstable compaction, then we
>> > have:
>> >
>> > delete,  clock 200
>> > add 3,   clock 300
>> >
>> > which shows the wrong result.
>> >
>> > I think a relatively simple extension can completely fix this issue:
>> > similar to ZooKeeper, we can prefix an "epoch" number to the clock, so
>> > that
>> >   1) a delete operation increases the future epoch number by 1;
>> >   2) merging of delta adds happens only between deltas of the same
>> >      epoch; deltas of an older epoch are simply ignored during merging,
>> >      and the merged result keeps the newest epoch number seen.
>> > Other operations remain the same as today. Note that the above two rules
>> > only concern merging of the deltas on the leader; they are not related
>> > to the replicated count, which is a simple final state and observes the
>> > "larger clock trumps" rule. Naturally, the ordering rule is:
>> > epoch1.clock1 > epoch2.clock2 iff
>> > epoch1 > epoch2 || (epoch1 == epoch2 && clock1 > clock2).
>> > Intuitively, the epoch can be seen as the serial number of a new
>> > "incarnation" of a counter.
>> >
>> > The code change should be mostly localized to CounterColumn.reconcile(),
>> > although, if an update does not find an existing entry in the memtable,
>> > we need to go to the sstable to fetch any possible epoch number. So,
>> > compared to the current write path, in the "no replicate-on-write" case
>> > we need to add a read of the sstable; in the "replicate-on-write" case
>> > we already do that read, so there is no extra time cost. "No
>> > replicate-on-write" is not a very useful setup in reality anyway.
>> >
>> > Does this sound like a feasible way? If it works, expiring counters
>> > should also naturally work.
>> >
>> > Thanks
>> > Yang
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
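PS, for anyone skimming the quoted thread: here is the compaction problem
from the original mail as a tiny illustration (made-up names again). Today's
rule, where a delete only shadows smaller clocks, is exactly what breaks once
the two adds get merged first.

class CounterDeleteProblemSketch {
    record Add(long value, long clock) {}
    record Delete(long clock) {}

    // compaction merges two adds into one: sum the values, keep the larger clock
    static Add merge(Add a, Add b) {
        return new Add(a.value() + b.value(), Math.max(a.clock(), b.clock()));
    }

    // a delete only shadows state whose clock is smaller than the delete's
    static long apply(Add state, Delete d) {
        return state.clock() > d.clock() ? state.value() : 0;
    }

    public static void main(String[] args) {
        Add add1 = new Add(1, 100);
        Delete del = new Delete(200);
        Add add2 = new Add(2, 300);

        // applied in timestamp order the answer is 2: the delete wipes add1
        // and only add2 survives
        System.out.println(apply(add2, del));               // 2
        // but if compaction merges the adds first, the delete at clock 200
        // can no longer shadow the merged (add 3, clock 300)
        System.out.println(apply(merge(add1, add2), del));  // 3 -- the wrong result
    }
}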