Subject: Re: one way to make counter delete work better
From: Yang <teddyyyy123@gmail.com>
To: user@cassandra.apache.org
Date: Mon, 13 Jun 2011 11:26:12 -0700 (PDT)

OK, I think it's better to understand it this way; then it is really simple
and intuitive. My proposed way of counter update can simply be seen as a
combination of regular columns + the current counter columns:

    regular column: [ value: "wipes out every bucket to nil", clock: epoch number ]

Then, within each epoch, counter updates work as currently implemented.
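To make that concrete, here is a rough, self-contained sketch in Java (the
names are made up for illustration; this is not the actual CounterColumn
code): a delete starts a new epoch, adds merge only with adds from the same
epoch, and older-epoch adds are simply dropped.

public class EpochCounterSketch {

    // a clock prefixed with an epoch; ordered by epoch first, then clock
    record EpochClock(long epoch, long clock) implements Comparable<EpochClock> {
        public int compareTo(EpochClock o) {
            int byEpoch = Long.compare(epoch, o.epoch);
            return byEpoch != 0 ? byEpoch : Long.compare(clock, o.clock);
        }
    }

    // a single leader's delta: how much was added, stamped with epoch.clock
    record Delta(EpochClock clock, long value) {}

    // merge rule: same epoch -> sum the adds; different epochs -> the newer
    // epoch wins outright, because the older delta was wiped by a delete
    static Delta merge(Delta a, Delta b) {
        if (a.clock().epoch() == b.clock().epoch()) {
            EpochClock newer = a.clock().compareTo(b.clock()) >= 0 ? a.clock() : b.clock();
            return new Delta(newer, a.value() + b.value());
        }
        return a.clock().epoch() > b.clock().epoch() ? a : b;
    }

    public static void main(String[] args) {
        Delta add1 = new Delta(new EpochClock(0, 100), 1); // add 1 at epoch 0
        // a delete at clock 200 bumps the epoch, so the next add lands in epoch 1
        Delta add2 = new Delta(new EpochClock(1, 300), 2); // add 2 at epoch 1
        System.out.println(merge(add1, add2).value());     // prints 2, not 3
    }
}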
On Mon, Jun 13, 2011 at 10:12 AM, Yang <teddyyyy123@gmail.com> wrote:
> I think this approach also works for your scenario.
>
> I thought the issue was only about merging within the same leader, but you
> pointed out that a similar merge happens between leaders too. Now I see that
> the same rules on the epoch number also apply to inter-leader data merging;
> specifically, in your case:
>
> Everyone starts with an epoch of 0. (They should be the same; if they are
> not, it still works, we just consider them to represent different time
> snapshots of the same counter state.)
>
> node A    add 1     clock: 0.100   (epoch = 0, clock number = 100)
> node A    delete    clock: 0.200
> node B    add 2     clock: 0.300
>
> node A gets B's state (add 2, clock 0.300) but rejects it, because A has
> already produced a delete with epoch 0, so A considers epoch 0 ended and
> won't accept any replicated state with epoch < 1.
>
> node B gets A's delete (0.200), zeros its own count of "2", and updates its
> expected future epoch to 1.
>
> At this point the state of the system is:
> node A    expected epoch = 1   [A:nil] [B:nil]
> node B    same
>
> Say we have the following further writes:
>
> node B    add 3    clock 1.400
> node A    add 4    clock 1.500
>
> node B receives A's add 4 and updates its copy of A;
> node A receives B's add 3 and updates its copy of B.
>
> Then the state is:
> node A    expected epoch == 1   [A:4 clock=500] [B:3 clock=400]
> node B    same
>
> Generally, I think the scheme is complete if we add the following rule for
> inter-leader replication. Each leader keeps a variable in memory (also
> persisted to the sstable when flushing), expected_epoch, initially 0.
>
> Node P does:
>
>   on receiving updates from node Q:
>       if Q.expected_epoch > P.expected_epoch
>           /* an epoch bump inherently means a previous delete, which we
>              probably missed, so we need to apply the delete; a delete is
>              global to all leaders, so apply it to all my replicas */
>           for all leaders in my vector
>               count = nil
>           P.expected_epoch = Q.expected_epoch
>       if Q.expected_epoch == P.expected_epoch
>           update P's copy of Q according to the standard rules
>       /* if Q.expected_epoch < P.expected_epoch, Q is less up to date than
>          we are, so just ignore it */
>
>   replicate_on_write(to Q):
>       if P.operation == delete
>           P.expected_epoch++
>           set all my copies of all leaders to nil
>       send to Q (P.total, P.expected_epoch)
>
> Overall, I don't think the fact that delete is not commutative is a
> fundamental blocker: regular columns are not commutative either, yet we get
> a stable result no matter what order they are applied in, because of the
> ordering rule used in reconciliation. Here we just need to find a similar
> ordering rule, and the epoch could be a step in that direction.
>
> Thanks
> Yang
>
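To pin down the quoted merge rule in runnable form, a minimal sketch (again
with invented names, not the real Cassandra classes): each node keeps
expected_epoch plus one (count, clock) entry per leader, wipes everything
when it sees a higher epoch, and ignores lower epochs.

import java.util.HashMap;
import java.util.Map;

class EpochReplicaSketch {
    // one leader's replicated state: its partial count and its clock
    record Shard(long count, long clock) {}

    long expectedEpoch = 0;
    final Map<String, Shard> copies = new HashMap<>();

    // node P receives leader Q's shard, tagged with Q's expected_epoch
    void receive(String leaderQ, Shard qShard, long qEpoch) {
        if (qEpoch > expectedEpoch) {
            // an epoch bump implies a delete we missed: wipe every leader's copy
            copies.clear();
            expectedEpoch = qEpoch;
        }
        if (qEpoch == expectedEpoch) {
            // standard rule within an epoch: the larger clock wins
            copies.merge(leaderQ, qShard,
                    (oldS, newS) -> newS.clock() > oldS.clock() ? newS : oldS);
        }
        // qEpoch < expectedEpoch: Q is behind us, ignore it
    }

    // on a local delete, start a new epoch and wipe all copies before replicating
    void localDelete() {
        expectedEpoch++;
        copies.clear();
    }

    public static void main(String[] args) {
        // node B's view of leader A, following the scenario above
        EpochReplicaSketch nodeB = new EpochReplicaSketch();
        nodeB.receive("A", new Shard(1, 100), 0); // A: add 1, epoch 0
        nodeB.receive("A", new Shard(0, 200), 1); // A's delete bumped A to epoch 1
        nodeB.receive("A", new Shard(4, 500), 1); // A: add 4 in epoch 1
        System.out.println(nodeB.copies.get("A")); // Shard[count=4, clock=500]
    }
}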
> On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
>> I don't think that's bulletproof either. For instance, what if the
>> two adds go to replica 1 but the delete to replica 2?
>>
>> Bottom line (and this was discussed on the original delete-for-counters
>> ticket, https://issues.apache.org/jira/browse/CASSANDRA-2101), counter
>> deletes are not fully commutative, which makes them fragile.
>>
>> On Mon, Jun 13, 2011 at 10:54 AM, Yang <teddyyyy123@gmail.com> wrote:
>> > As https://issues.apache.org/jira/browse/CASSANDRA-2101 indicates, the
>> > problem with counter delete shows up in scenarios like the following:
>> >
>> > add 1,   clock 100
>> > delete,  clock 200
>> > add 2,   clock 300
>> >
>> > If the 1st and 3rd operations are merged in sstable compaction, then we
>> > have:
>> >
>> > delete,  clock 200
>> > add 3,   clock 300
>> >
>> > which shows the wrong result.
>> >
>> > I think a relatively simple extension can completely fix this issue:
>> > similar to ZooKeeper, we can prefix an "epoch" number to the clock, so
>> > that
>> >   1) a delete operation increases the future epoch number by 1;
>> >   2) merging of delta adds happens only between deltas of the same
>> >      epoch; deltas of an older epoch are simply ignored during merging,
>> >      and the merged result keeps the newest epoch number seen.
>> > Other operations remain the same as today. Note that the above two rules
>> > only concern merging of the deltas on the leader; they are not related
>> > to the replicated count, which is a simple final state and observes the
>> > "larger clock trumps" rule. Naturally, the ordering rule is:
>> > epoch1.clock1 > epoch2.clock2 iff
>> > epoch1 > epoch2 || (epoch1 == epoch2 && clock1 > clock2).
>> > Intuitively, the epoch can be seen as the serial number of a new
>> > "incarnation" of a counter.
>> >
>> > The code change should be mostly localized to CounterColumn.reconcile(),
>> > although, if an update does not find an existing entry in the memtable,
>> > we need to go to the sstable to fetch any possible epoch number. So,
>> > compared to the current write path, in the "no replicate-on-write" case
>> > we need to add a read of the sstable; in the "replicate-on-write" case
>> > we already do that read, so there is no extra time cost. "No
>> > replicate-on-write" is not a very useful setup in reality anyway.
>> >
>> > Does this sound like a feasible way? If it works, expiring counters
>> > should also naturally work.
>> >
>> > Thanks
>> > Yang
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
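PS, for anyone skimming the quoted thread: here is the compaction problem
from the original mail as a tiny illustration (made-up names again). Today's
rule, where a delete only shadows smaller clocks, is exactly what breaks once
the two adds get merged first.

class CounterDeleteProblemSketch {
    record Add(long value, long clock) {}
    record Delete(long clock) {}

    // compaction merges two adds into one: sum the values, keep the larger clock
    static Add merge(Add a, Add b) {
        return new Add(a.value() + b.value(), Math.max(a.clock(), b.clock()));
    }

    // a delete only shadows state whose clock is smaller than the delete's
    static long apply(Add state, Delete d) {
        return state.clock() > d.clock() ? state.value() : 0;
    }

    public static void main(String[] args) {
        Add add1 = new Add(1, 100);
        Delete del = new Delete(200);
        Add add2 = new Add(2, 300);

        // applied in timestamp order the answer is 2: the delete wipes add1
        // and only add2 survives
        System.out.println(apply(add2, del));               // 2
        // but if compaction merges the adds first, the delete at clock 200
        // can no longer shadow the merged (add 3, clock 300)
        System.out.println(apply(merge(add1, add2), del));  // 3 -- the wrong result
    }
}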