From: Yang <teddyyyy123@gmail.com>
To: user@cassandra.apache.org
Date: Tue, 14 Jun 2011 23:33:21 -0700
Subject: Re: one way to make counter delete work better

patch in https://issues.apache.org/jira/browse/CASSANDRA-2774

some of the code is messy and intended only for demonstration; we could refine it after we agree this is a feasible way to go.

Thanks
Yang

On Tue, Jun 14, 2011 at 11:21 AM, Sylvain Lebresne <sylvain@datastax.com> wrote:
> Who assigns those epoch numbers?
> You need all nodes to agree on the epoch number somehow to have this work,
> but then how do you maintain those in a partition-tolerant distributed
> system?
>
> I may have missed some parts of your proposal, but let me consider a
> scenario that we have to be able to handle: consider two nodes A and B
> (RF=2), each in one data center (DCA and DCB), and a counter c. Suppose
> you do a +2 increment on c that both nodes get. Now let's say you have a
> network split and the connection between your two data centers fails.
> In DCA you delete c; only A gets it. In DCB you do more increments on c
> (say +3); only B gets it. The partition can last for hours.
> For deletion to work, we would need that whenever the network partition
> is resolved, both nodes eventually agree on the value 3 (i.e., only the
> second increment). I don't see how you could assign epoch numbers or
> anything to fix that.
>
> --
> Sylvain
>
> On Mon, Jun 13, 2011 at 8:26 PM, Yang <teddyyyy123@gmail.com> wrote:
> > ok, I think it's better to understand it this way; then it is really
> > simple and intuitive:
> > my proposed way of counter update can simply be seen as a combination of
> > regular columns + current counter columns:
> > regular column: [ value: "wipes out every bucket to nil", clock: epoch
> > number ]
> > then within each epoch, counter updates work as currently implemented
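To make that combination concrete, here is a minimal Java sketch of the idea (illustration only, not the code in the CASSANDRA-2774 patch; the class and field names are invented, and the per-leader shards are collapsed into a single delta). The epoch plays the role of the regular column, so a newer epoch wipes older state, while deltas merge as counters normally do, but only within the same epoch:

    // Illustration only: a counter delta tagged with an epoch.
    final class EpochDelta {
        final long epoch;  // bumped by every delete; a new "incarnation" of the counter
        final long clock;  // ordinary counter clock within the epoch
        final long delta;  // partial count contributed within this epoch

        EpochDelta(long epoch, long clock, long delta) {
            this.epoch = epoch;
            this.clock = clock;
            this.delta = delta;
        }

        // Same epoch: deltas are commutative, just sum them (normal counter merge).
        // Different epochs: the newer epoch wins and the older delta is ignored.
        static EpochDelta merge(EpochDelta a, EpochDelta b) {
            if (a.epoch == b.epoch)
                return new EpochDelta(a.epoch, Math.max(a.clock, b.clock), a.delta + b.delta);
            return a.epoch > b.epoch ? a : b;
        }

        public static void main(String[] args) {
            // The add 1 / delete / add 2 compaction scenario quoted further below;
            // the delete is modeled here as a zero delta carrying the bumped epoch.
            EpochDelta add1 = new EpochDelta(0, 100, 1);
            EpochDelta del  = new EpochDelta(1, 200, 0);
            EpochDelta add2 = new EpochDelta(1, 300, 2);
            System.out.println(merge(merge(add1, add2), del).delta); // 2
            System.out.println(merge(merge(add1, del), add2).delta); // 2
        }
    }

Whatever order compaction merges them in, the epoch-0 delta is dropped and the result stays 2 instead of 3.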
> > On Mon, Jun 13, 2011 at 10:12 AM, Yang <teddyyyy123@gmail.com> wrote:
> >>
> >> I think this approach also works for your scenario:
> >> I thought the issue was only concerned with merging within the same
> >> leader, but you pointed out that a similar merging happens between
> >> leaders too; now I see that the same rules on epoch numbers also apply
> >> to inter-leader data merging, specifically in your case:
> >>
> >> everyone starts with epoch 0 (they should be the same; if not, it also
> >> works, we just consider them to be representing different time snapshots
> >> of the same counter state)
> >> node A   add 1     clock: 0.100  (epoch = 0, clock number = 100)
> >> node A   delete    clock: 0.200
> >> node B   add 2     clock: 0.300
> >> node A   gets B's state: add 2, clock 0.300, but rejects it because A has
> >> already produced a delete with epoch 0, so A considers epoch 0 already
> >> ended; it won't accept any replicated state with epoch < 1.
> >> node B   gets A's delete 0.200; it zeros its own count of "2" and
> >> updates its future expected epoch to 1.
> >> at this time, the state of the system is:
> >> node A   expected epoch = 1   [A:nil] [B:nil]
> >> same for node B
> >>
> >> let's say we have the following further writes:
> >> node B   add 3   clock 1.400
> >> node A   add 4   clock 1.500
> >> node B receives A's add 4 and updates its copy of A
> >> node A receives B's add 3 and updates its copy of B
> >>
> >> then the state is:
> >> node A   expected epoch == 1   [A:4 clock=500] [B:3 clock=400]
> >> node B   same
> >>
> >> generally I think it should be complete if we add the following rule for
> >> inter-leader replication:
> >> each leader keeps a variable in memory (and also persists it to the
> >> sstable when flushing), expected_epoch, initially set to 0.
> >> node P does:
> >>
> >> on receiving updates from node Q:
> >>     if Q.expected_epoch > P.expected_epoch
> >>         /* an epoch bump inherently means a previous delete, which we
> >>            probably missed, so we need to apply the delete; a delete is
> >>            global to all leaders, so apply it to all my replicas */
> >>         for all leaders in my vector
> >>             count = nil
> >>         P.expected_epoch = Q.expected_epoch
> >>     if Q.expected_epoch == P.expected_epoch
> >>         update P's copy of Q according to standard rules
> >>     /* if Q.expected_epoch < P.expected_epoch, Q is less up to date
> >>        than us, so just ignore it */
> >>
> >> replicate_on_write(to Q):
> >>     if P.operation == delete
> >>         P.expected_epoch++
> >>         set all my copies of all leaders to nil
> >>     send to Q (P.total, P.expected_epoch)
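The pseudocode above can also be read as the following self-contained Java sketch (again illustration only, not the actual patch: clocks are omitted, the names are invented, and the "standard rules" are reduced to overwriting the copy of the sending leader). Replaying the A/B walkthrough with it ends with both nodes at expected epoch 1 holding [A:4] [B:3]:

    import java.util.HashMap;
    import java.util.Map;

    // Illustration only: one counter, per-leader copies, and an expected_epoch
    // that propagates deletes between leaders.
    class CounterNode {
        final String name;
        long expectedEpoch = 0;                             // initially 0
        final Map<String, Long> copies = new HashMap<>();   // my copy of each leader's count

        CounterNode(String name) { this.name = name; }

        void add(long delta) {                              // local increment on this leader
            copies.merge(name, delta, Long::sum);
        }

        void delete() {                                     // a delete bumps the epoch...
            expectedEpoch++;
            copies.replaceAll((leader, count) -> 0L);       // ...and wipes every copy to nil
        }

        // replicate_on_write: ship my own total plus my expected epoch to Q.
        void replicateTo(CounterNode q) {
            q.onReceive(name, copies.getOrDefault(name, 0L), expectedEpoch);
        }

        // on receiving updates from node Q
        void onReceive(String fromLeader, long leaderTotal, long fromEpoch) {
            if (fromEpoch > expectedEpoch) {
                // An epoch bump implies a delete we missed: apply it to all leaders.
                copies.replaceAll((leader, count) -> 0L);
                expectedEpoch = fromEpoch;
            }
            if (fromEpoch == expectedEpoch)
                copies.put(fromLeader, leaderTotal);        // "standard rules", simplified
            // fromEpoch < expectedEpoch: the sender is behind, ignore it
        }

        long total() {
            return copies.values().stream().mapToLong(Long::longValue).sum();
        }

        public static void main(String[] args) {
            CounterNode a = new CounterNode("A"), b = new CounterNode("B");
            a.add(1); a.delete();       // A: add 1, then delete (A now expects epoch 1)
            b.add(2);                   // B: add 2, still at epoch 0
            b.replicateTo(a);           // A ignores B's epoch-0 state
            a.replicateTo(b);           // B learns of the delete, zeros itself, moves to epoch 1
            b.add(3); a.add(4);         // further writes, now within epoch 1
            a.replicateTo(b); b.replicateTo(a);
            System.out.println(a.total() + " " + b.total()); // 7 7
        }
    }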
> >>
> >> overall I don't think delete not being commutative is a fundamental
> >> blocker: regular columns are also not commutative, yet we achieve a
> >> stable result no matter what order they are applied in, because of the
> >> ordering rule used in reconciliation; here we just need to find a
> >> similar ordering rule. the epoch thing could be a step in this direction.
> >>
> >> Thanks
> >> Yang
> >>
> >> On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
> >>>
> >>> I don't think that's bulletproof either. For instance, what if the
> >>> two adds go to replica 1 but the delete to replica 2?
> >>>
> >>> Bottom line (and this was discussed on the original
> >>> delete-for-counters ticket,
> >>> https://issues.apache.org/jira/browse/CASSANDRA-2101), counter deletes
> >>> are not fully commutative, which makes them fragile.
> >>>
> >>> On Mon, Jun 13, 2011 at 10:54 AM, Yang <teddyyyy123@gmail.com> wrote:
> >>> > as https://issues.apache.org/jira/browse/CASSANDRA-2101
> >>> > indicates, the problem with counter delete is in scenarios like the
> >>> > following:
> >>> > add 1, clock 100
> >>> > delete, clock 200
> >>> > add 2, clock 300
> >>> > if the 1st and 3rd operations are merged in SSTable compaction, then
> >>> > we have
> >>> > delete, clock 200
> >>> > add 3, clock 300
> >>> > which shows the wrong result.
> >>> >
> >>> > I think a relatively simple extension can be used to completely fix
> >>> > this issue: similar to ZooKeeper, we can prefix an "epoch" number to
> >>> > the clock, so that
> >>> >    1) a delete operation increases the future epoch number by 1
> >>> >    2) merging of delta adds can happen only between deltas of the
> >>> > same epoch; deltas of an older epoch are simply ignored during
> >>> > merging. the merged result keeps the newest epoch number seen.
> >>> > other operations remain the same as they are now. note that the above
> >>> > 2 rules are only concerned with merging the deltas on the leader, and
> >>> > are not related to the replicated count, which is a simple final
> >>> > state and observes the rule of "larger clock trumps". naturally the
> >>> > ordering rule is:
> >>> > epoch1.clock1 > epoch2.clock2 iff
> >>> >     epoch1 > epoch2 || (epoch1 == epoch2 && clock1 > clock2)
> >>> > intuitively, "epoch" can be seen as the serial number on a new
> >>> > "incarnation" of a counter.
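For reference, that ordering rule written out as a plain comparison (purely illustrative; not code from any patch, and the names are made up):

    class EpochClockOrder {
        // Returns a positive value iff epoch1.clock1 > epoch2.clock2.
        static int compare(long epoch1, long clock1, long epoch2, long clock2) {
            if (epoch1 != epoch2)
                return Long.compare(epoch1, epoch2); // a newer incarnation always trumps
            return Long.compare(clock1, clock2);     // within an epoch, larger clock trumps
        }

        public static void main(String[] args) {
            System.out.println(compare(1, 100, 0, 300) > 0); // true: epoch wins over raw clock
            System.out.println(compare(0, 300, 0, 200) > 0); // true: same epoch, clock decides
        }
    }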
> >>> >
> >>> > the code change should be mostly localized to CounterColumn.reconcile(),
> >>> > although, if an update does not find an existing entry in the memtable,
> >>> > we need to go to the sstable to fetch any possible epoch number; so,
> >>> > compared to the current write path, in the "no replicate-on-write" case
> >>> > we need to add a read of the sstable. in the "replicate-on-write" case
> >>> > we already do that read, so there is no extra time cost. "no
> >>> > replicate-on-write" is not a very useful setup in practice anyway.
> >>> >
> >>> > does this sound like a feasible way? if this works, expiring counters
> >>> > should also naturally work.
> >>> >
> >>> > Thanks
> >>> > Yang
> >>>
> >>> --
> >>> Jonathan Ellis
> >>> Project Chair, Apache Cassandra
> >>> co-founder of DataStax, the source for professional Cassandra support
> >>> http://www.datastax.com