From: Yang
Date: Mon, 3 Aug 2015 10:28:43 -0700
Subject: Re: linearizable consistency / Paxos ?
To: user@cassandra.apache.org

Thanks a lot for the info!

I see: the Paxos protocol used in the code today is actually the "single-decree synod" protocol, which votes on only one value. The scope of the implementation is only the CAS operation (which contains one read and one write), not a generic transaction (which could contain arbitrarily many operations); a generic transaction would need the multi-decree protocol. Here CAS is able to work on top of the synod because the read is essentially sandwiched/bounded between the prepare and the propose, so that no other ballot can get in between (the line https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageProxy.java#L273 checks this).
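To make the "sandwiched read" concrete, here is a minimal Python sketch of a CAS riding on a single-decree round. This is a toy model, not Cassandra's actual code: a single in-memory replica stands in for a quorum, and all class and function names are illustrative. The point it shows is that the promise taken at prepare time shields the read-then-propose window from any lower ballot.

```python
class PaxosCasReplica:
    """Toy single-replica stand-in for a Paxos quorum (illustrative only)."""

    def __init__(self, value=None):
        self.promised = 0      # highest ballot this replica has promised
        self.accepted = None   # (ballot, value) accepted but not yet committed
        self.value = value     # committed value

    def prepare(self, ballot):
        # Promise to reject any ballot at or below `ballot` from now on.
        if ballot <= self.promised:
            return False
        self.promised = ballot
        return True

    def read(self):
        # Safe to read here: the promise above blocks competing ballots
        # from sneaking in before our propose (the "sandwiched" read).
        return self.value

    def propose(self, ballot, value):
        if ballot < self.promised:
            return False       # a higher ballot got in; round is lost
        self.accepted = (ballot, value)
        return True

    def commit(self, ballot):
        if self.accepted and self.accepted[0] == ballot:
            self.value = self.accepted[1]
            self.accepted = None
            return True
        return False


def cas(replica, ballot, expected, new_value):
    """Compare-and-set: one read and one write inside a single Paxos round."""
    if not replica.prepare(ballot):
        return False                 # a higher ballot is already in flight
    if replica.read() != expected:   # the condition check (e.g. IF NOT EXISTS)
        return False
    if not replica.propose(ballot, new_value):
        return False
    return replica.commit(ballot)
```

For example, `cas(r, 1, None, "account-A")` on a fresh replica succeeds, while a second CAS that still expects `None` fails its condition check, which is exactly the "create the same account twice" race the thread discusses.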
On Mon, Aug 3, 2015 at 4:29 AM, DuyHai Doan wrote:

> "you seem to be suggesting that the 'other operations on the same
> partition key have to wait' because Paxos grouped the first series
> together, which have to be committed in the same order, before all other
> operations, essentially ___serializing___ the operations (with guaranteed
> same order)." --> No, the implementation does not group any Paxos
> operations together. And when I mentioned (INSERT, UPDATE, DELETE, ...) I
> didn't mean a group of operations, just individual INSERT, UPDATE, or
> DELETE operations that can occur at any moment.
>
> Indeed there are 3 scenarios for dueling proposals P1 & P2:
>
> 1. P1 has not been accepted yet and P2 has a higher ballot than P1: P1
> will abort, sleep for a random amount of time, and re-propose later. This
> gives P2 a chance to complete its Paxos round.
>
> 2. P1 has been accepted (Propose/Accept phase successful) and P2 has a
> higher ballot than P1: the coordinator that issued P2 has to commit P1
> first before re-proposing P2.
>
> 3. P2 has a lower ballot than P1: P2 is rejected at the Prepare/Promise
> phase.
>
> "I guess Cassandra must be doing something to prevent 'the second guy
> injecting his operation before DELETE' in the above scenario" --> Since
> there is no grouping of Paxos operations (not to be confused with a BATCH
> statement containing one Paxos operation!), C* does nothing to prevent a
> second client from issuing a Paxos operation between the UPDATE and the
> DELETE.
>
> If you're interested, you can look at the source code here:
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageProxy.java#L202
>
> The Javadoc is also worth reading because it briefly explains the
> semantics.
>
>
> On Mon, Aug 3, 2015 at 11:32 AM, Yang wrote:
>
>> Thanks for your answer, DuyHai.
>>
>> I understand Paxos, but I think your description is missing one
>> important point: in the example you gave, "a series of ongoing operations
>> (INSERT, UPDATE, DELETE, ...)", you seem to be suggesting that the "other
>> operations on the same partition key have to wait" because Paxos grouped
>> the first series together, which have to be committed in the same order,
>> before all other operations, essentially ___serializing___ the operations
>> (with guaranteed same order).
>>
>> But Paxos ONLY guarantees the order of operations as they are proposed;
>> Paxos itself cannot control when an operation is proposed. For example, in
>> the above sequence INSERT, UPDATE, DELETE, ..., a second client is fully
>> allowed to propose its operation (say another UPDATE) before the DELETE is
>> proposed, and hence obtain an earlier ballot number (smaller than the one
>> for DELETE), so the final committed sequence is INSERT, UPDATE,
>> op_from_another_client, DELETE, ...
>>
>> I guess Cassandra must be doing something to prevent "the second guy
>> injecting his operation before DELETE" in the above scenario; that seems
>> to require some transaction manager which is not yet clearly described in
>> the slides you gave.
>>
>> If that is correct, my point is: if we let the above transaction manager
>> work with the standard replication protocol, don't we also get
>> transactional behavior?
>>
>>
>> On Mon, Aug 3, 2015 at 12:14 AM, DuyHai Doan wrote:
>>
>>> "what is the fundamental difference between the standard replication
>>> protocol and Paxos that prevents us from implementing a 2PC on top of
>>> the standard protocol?"
>>>
>>> --> For a more detailed description of Paxos, look here:
>>> http://www.slideshare.net/doanduyhai/distributed-algorithms-for-big-data-geecon/41
>>>
>>> Long story short, when there is an ongoing operation (INSERT, UPDATE,
>>> DELETE, ...) on a particular partition key with Paxos, any other
>>> concurrent operation on the same partition key will have to wait until
>>> the ongoing operation commits.
>>>
>>> If the ongoing operation is validated by Paxos but fails before being
>>> able to commit (after the accept phase in the diagram), then any
>>> subsequent operation on this partition key will commit the stalled
>>> operation before starting its own.
>>>
>>>
>>> On Mon, Aug 3, 2015 at 4:30 AM, Yang wrote:
>>>
>>>> This link
>>>> http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
>>>> talks about linearizable consistency and lightweight transactions,
>>>> but I am still not completely following it just based on the article
>>>> itself.
>>>>
>>>> The standard replication protocol in Cassandra does establish a total
>>>> order (based on the client timestamp, though that can be wrong/unfair),
>>>> so in the case of the example mentioned in the article, "if 2 people
>>>> try to create the same account", yes: if both of them just brute-force
>>>> write, ultimately we will have one winner, whoever provided the higher
>>>> timestamp (this is applied consistently across all replicas).
>>>>
>>>> What really matters in the above situation is the ability to group the
>>>> 2 operations "check existing account" and "create account" together and
>>>> run them in an atomic way, so we need something like a 2-phase commit.
>>>>
>>>> I guess what is not clear from that article is: what is the
>>>> fundamental difference between the standard replication protocol and
>>>> Paxos that prevents us from implementing a 2PC on top of the standard
>>>> protocol?
>>>>
>>>> Thanks!
>>>> yang
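DuyHai's three dueling-proposal scenarios can be sketched as a single decision function. This is a toy model in Python, not Cassandra's actual logic (the real resolution lives in StorageProxy and is spread across the prepare and propose round-trips); the function name and return labels are purely illustrative.

```python
def resolve_duel(p1_ballot, p1_accepted, p2_ballot):
    """Decide what happens when proposal P2 collides with in-flight P1.

    p1_accepted: True once P1 has passed its Propose/Accept phase.
    Returns one of:
      "p2-rejected"                  - scenario 3
      "commit-p1-then-repropose-p2"  - scenario 2
      "p1-retries-later"             - scenario 1
    """
    if p2_ballot < p1_ballot:
        # Scenario 3: P2's ballot is lower, so P2 is rejected at the
        # Prepare/Promise phase.
        return "p2-rejected"
    if p1_accepted:
        # Scenario 2: P1 was already accepted; P2's coordinator must
        # commit the stalled P1 before re-proposing with a fresh ballot.
        return "commit-p1-then-repropose-p2"
    # Scenario 1: P1 is not yet accepted and P2 outranks it; P1 aborts,
    # sleeps for a random amount of time, and re-proposes later.
    return "p1-retries-later"
```

For instance, with P1 at ballot 5 already accepted and P2 arriving at ballot 7, the function returns "commit-p1-then-repropose-p2", mirroring scenario 2 above.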