From: Yang
Date: Mon, 3 Aug 2015 10:28:43 -0700
Subject: Re: linearizable consistency / Paxos ?
To: user@cassandra.apache.org

Thanks a lot for the info!

I see: the Paxos protocol used in the code today is actually the "single-decree synod" protocol, which votes on only one value. The scope of the implementation is only the CAS operation (which contains one read and one write), not a generic transaction (which could contain arbitrarily many operations); a generic transaction would need the multi-decree protocol. Here CAS is able to work on top of the synod because the read is essentially sandwiched/bounded between the prepare and the propose, so that no other ballot can get in between (the line https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageProxy.java#L273 checks this).
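To make the "sandwiched read" concrete, here is a minimal Python sketch of a CAS riding on a single-decree round. This is a toy model, not Cassandra's actual code: a single in-memory replica stands in for a quorum, and all class and function names are illustrative. The point it shows is that the promise taken at prepare time shields the read-then-propose window from any lower ballot.

```python
class PaxosCasReplica:
    """Toy single-replica stand-in for a Paxos quorum (illustrative only)."""

    def __init__(self, value=None):
        self.promised = 0      # highest ballot this replica has promised
        self.accepted = None   # (ballot, value) accepted but not yet committed
        self.value = value     # committed value

    def prepare(self, ballot):
        # Promise to reject any ballot at or below `ballot` from now on.
        if ballot <= self.promised:
            return False
        self.promised = ballot
        return True

    def read(self):
        # Safe to read here: the promise above blocks competing ballots
        # from sneaking in before our propose (the "sandwiched" read).
        return self.value

    def propose(self, ballot, value):
        if ballot < self.promised:
            return False       # a higher ballot got in; round is lost
        self.accepted = (ballot, value)
        return True

    def commit(self, ballot):
        if self.accepted and self.accepted[0] == ballot:
            self.value = self.accepted[1]
            self.accepted = None
            return True
        return False


def cas(replica, ballot, expected, new_value):
    """Compare-and-set: one read and one write inside a single Paxos round."""
    if not replica.prepare(ballot):
        return False                 # a higher ballot is already in flight
    if replica.read() != expected:   # the condition check (e.g. IF NOT EXISTS)
        return False
    if not replica.propose(ballot, new_value):
        return False
    return replica.commit(ballot)
```

For example, `cas(r, 1, None, "account-A")` on a fresh replica succeeds, while a second CAS that still expects `None` fails its condition check, which is exactly the "create the same account twice" race the thread discusses.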
On Mon, Aug 3, 2015 at 4:29 AM, DuyHai Doan wrote:

> "you seem to be suggesting that the 'other operations on the same
> partition key have to wait' because Paxos grouped the first series
> together, which have to be committed in the same order, before all other
> operations, essentially ___serializing___ the operations (with guaranteed
> same order)." --> No, the implementation does not group any Paxos
> operations together. And when I mentioned (INSERT, UPDATE, DELETE, ...) I
> didn't mean a group of operations, just individual INSERT, UPDATE, or
> DELETE operations that can occur at any moment.
>
> Indeed there are 3 scenarios for dueling proposals P1 & P2:
>
> 1. P1 has not been accepted yet and P2 has a higher ballot than P1: P1
> will abort, sleep for a random amount of time, and re-propose later. This
> gives P2 a chance to complete its Paxos round.
>
> 2. P1 has been accepted (Propose/Accept phase successful) and P2 has a
> higher ballot than P1: the coordinator that issued P2 has to commit P1
> first before re-proposing P2.
>
> 3. P2 has a lower ballot than P1: P2 is rejected at the Prepare/Promise
> phase.
>
> "I guess Cassandra must be doing something to prevent 'the second guy
> injecting his operation before DELETE' in the above scenario" --> Since
> there is no grouping of Paxos operations (not to be confused with a BATCH
> statement containing one Paxos operation!), C* does nothing to prevent a
> second client from issuing a Paxos operation between the UPDATE and the
> DELETE.
>
> If you're interested, you can look at the source code here:
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageProxy.java#L202
>
> The Javadoc is also worth reading because it briefly explains the
> semantics.
>
>
> On Mon, Aug 3, 2015 at 11:32 AM, Yang wrote:
>
>> Thanks for your answer, DuyHai.
>>
>> I understand Paxos, but I think your description is missing one
>> important point: in the example you gave, "a series of ongoing operations
>> (INSERT, UPDATE, DELETE, ...)", you seem to be suggesting that the "other
>> operations on the same partition key have to wait" because Paxos grouped
>> the first series together, which have to be committed in the same order,
>> before all other operations, essentially ___serializing___ the operations
>> (with guaranteed same order).
>>
>> But Paxos ONLY guarantees the order of operations as they are proposed;
>> Paxos itself cannot control when an operation is proposed. For example, in
>> the above sequence INSERT, UPDATE, DELETE, ..., a second client is fully
>> allowed to propose its operation (say another UPDATE) before the DELETE is
>> proposed, and hence obtain an earlier ballot number (smaller than the one
>> for DELETE), so the final committed sequence is INSERT, UPDATE,
>> op_from_another_client, DELETE, ...
>>
>> I guess Cassandra must be doing something to prevent "the second guy
>> injecting his operation before DELETE" in the above scenario; that seems
>> to require some transaction manager which is not yet clearly described in
>> the slides you gave.
>>
>> If that is correct, my point is: if we let the above transaction manager
>> work with the standard replication protocol, don't we also get
>> transactional behavior?
>>
>>
>> On Mon, Aug 3, 2015 at 12:14 AM, DuyHai Doan wrote:
>>
>>> "what is the fundamental difference between the standard replication
>>> protocol and Paxos that prevents us from implementing a 2PC on top of
>>> the standard protocol?"
>>>
>>> --> For a more detailed description of Paxos, look here:
>>> http://www.slideshare.net/doanduyhai/distributed-algorithms-for-big-data-geecon/41
>>>
>>> Long story short, when there is an ongoing operation (INSERT, UPDATE,
>>> DELETE, ...) on a particular partition key with Paxos, any other
>>> concurrent operation on the same partition key will have to wait until
>>> the ongoing operation commits.
>>>
>>> If the ongoing operation is validated by Paxos but fails before being
>>> able to commit (after the accept phase in the diagram), then any
>>> subsequent operation on this partition key will commit the stalled
>>> operation before starting its own.
>>>
>>>
>>> On Mon, Aug 3, 2015 at 4:30 AM, Yang wrote:
>>>
>>>> This link
>>>> http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
>>>> talks about linearizable consistency and lightweight transactions,
>>>> but I am still not completely following it just based on the article
>>>> itself.
>>>>
>>>> The standard replication protocol in Cassandra does establish a total
>>>> order (based on the client timestamp, though that can be wrong/unfair),
>>>> so in the case of the example mentioned in the article, "if 2 people
>>>> try to create the same account", yes: if both of them just brute-force
>>>> write, ultimately we will have one winner, whoever provided the higher
>>>> timestamp (this is applied consistently across all replicas).
>>>>
>>>> What really matters in the above situation is the ability to group the
>>>> 2 operations "check existing account" and "create account" together and
>>>> run them in an atomic way, so we need something like a 2-phase commit.
>>>>
>>>> I guess what is not clear from that article is: what is the
>>>> fundamental difference between the standard replication protocol and
>>>> Paxos that prevents us from implementing a 2PC on top of the standard
>>>> protocol?
>>>>
>>>> Thanks!
>>>> yang
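DuyHai's three dueling-proposal scenarios can be sketched as a single decision function. This is a toy model in Python, not Cassandra's actual logic (the real resolution lives in StorageProxy and is spread across the prepare and propose round-trips); the function name and return labels are purely illustrative.

```python
def resolve_duel(p1_ballot, p1_accepted, p2_ballot):
    """Decide what happens when proposal P2 collides with in-flight P1.

    p1_accepted: True once P1 has passed its Propose/Accept phase.
    Returns one of:
      "p2-rejected"                  - scenario 3
      "commit-p1-then-repropose-p2"  - scenario 2
      "p1-retries-later"             - scenario 1
    """
    if p2_ballot < p1_ballot:
        # Scenario 3: P2's ballot is lower, so P2 is rejected at the
        # Prepare/Promise phase.
        return "p2-rejected"
    if p1_accepted:
        # Scenario 2: P1 was already accepted; P2's coordinator must
        # commit the stalled P1 before re-proposing with a fresh ballot.
        return "commit-p1-then-repropose-p2"
    # Scenario 1: P1 is not yet accepted and P2 outranks it; P1 aborts,
    # sleeps for a random amount of time, and re-proposes later.
    return "p1-retries-later"
```

For instance, with P1 at ballot 5 already accepted and P2 arriving at ballot 7, the function returns "commit-p1-then-repropose-p2", mirroring scenario 2 above.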