Return-Path:
Delivered-To: apmail-cassandra-user-archive@www.apache.org
Received: (qmail 45995 invoked from network); 24 Feb 2011 17:34:37 -0000
Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3)
 by minotaur.apache.org with SMTP; 24 Feb 2011 17:34:37 -0000
Received: (qmail 30216 invoked by uid 500); 24 Feb 2011 17:34:35 -0000
Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org
Received: (qmail 29986 invoked by uid 500); 24 Feb 2011 17:34:32 -0000
Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: user@cassandra.apache.org
Delivered-To: mailing list user@cassandra.apache.org
Received: (qmail 29978 invoked by uid 99); 24 Feb 2011 17:34:32 -0000
Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230)
 by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 24 Feb 2011 17:34:32 +0000
X-ASF-Spam-Status: No, hits=1.5 required=5.0
 tests=FREEMAIL_FROM,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS
X-Spam-Check-By: apache.org
Received-SPF: pass (nike.apache.org: domain of chirayithaj@gmail.com designates 209.85.161.44 as permitted sender)
Received: from [209.85.161.44] (HELO mail-fx0-f44.google.com) (209.85.161.44)
 by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 24 Feb 2011 17:34:24 +0000
Received: by fxm15 with SMTP id 15so785084fxm.31 for ;
 Thu, 24 Feb 2011 09:34:04 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
 h=domainkey-signature:mime-version:in-reply-to:references:date
 :message-id:subject:from:to:cc:content-type;
 bh=wJ5aM8GlU1M/v0vXpmfA7VPgx9Kc7IHeBIxz+NjpbJc=;
 b=qJltq/WvN4cFBjMY1ZLliVY8yrm+rrKWfg+vRdd8+WSBcQkZHUny8wWSM1qs0Talzt
 CcNphZoISFH8hC/zmqFdzmyhA3ZosS2DKpeX7yHucg6KiR/NIDd+rTvI9RAt4Sib223V
 HQowvyPl50O47/fENqBVS1be28nyxtv9w0flM=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
 h=mime-version:in-reply-to:references:date:message-id:subject:from:to
 :cc:content-type;
b=cvKGd5T/7rTA4vl1lGNdegDSMs442fOWdQCHy/kOqzdk1CVtTFUAqHSA2G0cz/Rusm
 ZII4kiNXLGk05Jx0biUs+u4jTFzieO2TBwHDJepNz2xALbURF1h4G9Bmox+nJL9Bgtsr
 bzwfWdese7YxsDS+OVkKUW3cj/eB+zrBstdtw=
MIME-Version: 1.0
Received: by 10.223.86.199 with SMTP id t7mr1363458fal.29.1298568804512;
 Thu, 24 Feb 2011 09:33:24 -0800 (PST)
Received: by 10.223.151.2 with HTTP; Thu, 24 Feb 2011 09:33:24 -0800 (PST)
In-Reply-To:
References:
Date: Thu, 24 Feb 2011 11:33:24 -0600
Message-ID:
Subject: Re: New Chain for : Does Cassandra use vector clocks
From: Anthony John
To: Sylvain Lebresne
Cc: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=20cf3054a6bb4f331c049d0a9cf9
X-Virus-Checked: Checked by ClamAV on apache.org

--20cf3054a6bb4f331c049d0a9cf9
Content-Type: text/plain; charset=ISO-8859-1

Completely understand!

All that I am quibbling over is whether a CL of quorum guarantees
consistency or not. That is what the documentation says, right? If, for a
read at CL Quorum, the returned result depends on which node responds
first, or on other more convoluted conditions, then a Quorum read/write is
not consistent, by any definition.

I can still use Cassandra, and will use it, luv it!!! But let us not make
this statement on the Wiki architecture section:-

-------------------------------------------------------------
More specifically: R=read replica count W=write replica count
N=replication factor Q=*QUORUM* (Q = N / 2 + 1)

   - If W + R > N, you will have consistency
   - W=1, R=N
   - W=N, R=1
   - W=Q, R=Q where Q = N / 2 + 1

Cassandra provides consistency when R + W > N (read replica count + write
replica count > replication factor).
----------------------------------------------------

On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne wrote:

> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John wrote:
>
>> If you are correct and you are probably closer to the code - then CL of
>> Quorum does not guarantee consistency.
>
>
> If the operation succeeds, it does (for some definition of consistency,
> namely that subsequent reads at Quorum are guaranteed to see the new value
> of an update written at Quorum). If it fails, then no, it does not
> guarantee consistency.
>
> It is important to note that the word consistency has multiple meanings. In
> particular, when we are talking of consistency in Cassandra, we are not
> talking of the same definition as the C in ACID (see:
> http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
>
>>
>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne wrote:
>>
>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John wrote:
>>>
>>>> >>Time stamps are not used for conflict resolution - unless it is part
>>>> of the application logic!!!
>>>>
>>>> >>What is your definition of conflict resolution ? Because if you update
>>>> the same column twice (which
>>>> >>I'll call a conflict), then the timestamps are used to decide which
>>>> update wins (which I'll call a resolution).
>>>>
>>>> I understand what you are saying, and yes semantics is very important
>>>> here. And yes we are responding to the immediate questions without covering
>>>> all questions in the thread.
>>>>
>>>> The point being made here is that the timestamp of the column is not
>>>> used by Cassandra to figure out what data to return.
>>>>
>>>
>>> Not quite true.
>>>
>>>
>>>> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
>>>> A Quorum Write comes in and adds/updates the time stamp (TS2) of a
>>>> particular data element. It succeeds on N1 - fails on N2/3. So the write is
>>>> returned as failed - right ?
>>>> Now a Quorum read comes in for exactly the same piece of data that the
>>>> write failed for.
>>>> So N1 has TS2 but both N2/3 have the old TS (say TS1)
>>>> And the read succeeds - Will it return TS1 or TS2?
>>>>
>>>> I submit it will return TS1 - the old TS.
>>>>
>>>
>>> It all depends on which (first 2) nodes respond to the read (since RF=3,
>>> that can be any two of N1/N2/N3).
>>> If N1 is one of the two nodes that make up the
>>> quorum, then TS2 will be returned, because Cassandra will compare the
>>> timestamps and decide what to return based on this. If N2/N3 respond
>>> however, both timestamps will be TS1 and so, after timestamp resolution, it
>>> will still be TS1 that will be returned.
>>> So yes, the timestamp is used for conflict resolution.
>>>
>>> In your example, you could get TS1 back because a failed write can leave
>>> your cluster in an inconsistent state. You'd have to retry the quorum write,
>>> and only when it succeeds can you be guaranteed that a quorum read will
>>> always return TS2.
>>>
>>> This is because when a write fails, Cassandra doesn't guarantee that the
>>> write did not make it in (there is no revert).
>>>
>>>
>>>>
>>>> Are we on the same page with this interpretation ?
>>>>
>>>> Regards,
>>>>
>>>> -JA
>>>>
>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne <
>>>> sylvain@datastax.com> wrote:
>>>>
>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John wrote:
>>>>>
>>>>>> Sylvain,
>>>>>>
>>>>>> Time stamps are not used for conflict resolution - unless it is part
>>>>>> of the application logic!!!
>>>>>>
>>>>>
>>>>> What is your definition of conflict resolution ? Because if you update
>>>>> the same column twice (which
>>>>> I'll call a conflict), then the timestamps are used to decide which
>>>>> update wins (which I'll call a resolution).
>>>>>
>>>>>
>>>>>> You can have "lost updates" w/Cassandra. You need to use 3rd-party
>>>>>> products - cages for e.g. - to get ACID type consistency.
>>>>>>
>>>>>
>>>>> Then again, you'll have to define what you are calling "lost updates".
>>>>> Provided you use a reasonable consistency level, Cassandra provides fairly
>>>>> strong durability guarantees, so for some definition you don't "lose
>>>>> updates".
>>>>>
>>>>> That being said, I never claimed that Cassandra provided any ACID
>>>>> guarantee. ACID relates to transactions, which Cassandra doesn't support.
>>>>> If we're talking about the guarantees of transactions, then by all means,
>>>>> Cassandra won't provide them. And yes you can use cages or the like to get
>>>>> transactions. But that was not the point of the thread, was it ? The thread
>>>>> is about vector clocks, and that has nothing to do with transactions (vector
>>>>> clocks certainly don't give you transactions).
>>>>>
>>>>> Sorry if I wasn't clear in my mail, but I was only responding to why so
>>>>> far I don't think vector clocks would really provide much for Cassandra.
>>>>>
>>>>> --
>>>>> Sylvain
>>>>>
>>>>>
>>>>>> -JA
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne <
>>>>>> sylvain@datastax.com> wrote:
>>>>>>
>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John wrote:
>>>>>>>
>>>>>>>> Apologies : For some reason my response on the original mail keeps
>>>>>>>> bouncing back, thus this new one!
>>>>>>>> > On the other hand, the same article says:
>>>>>>>> > "For conditional writes to work, the condition must be evaluated
>>>>>>>> at all update
>>>>>>>> > sites before the write can be allowed to succeed."
>>>>>>>> >
>>>>>>>> > This means that when doing such an update CL=ALL must be used
>>>>>>>>
>>>>>>>> Sorry, but I am confused by that entire thread!
>>>>>>>>
>>>>>>>> Questions:-
>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>>>>> granularity whether it be row/colF/Col ?
>>>>>>>>
>>>>>>>
>>>>>>> No locking, no.
>>>>>>>
>>>>>>>
>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
>>>>>>>> conflicts? Concurrent updates on exactly the same piece of data on different
>>>>>>>> nodes can still mess each other up, right ?
>>>>>>>>
>>>>>>>
>>>>>>> Not sure why you are taking CL.ALL specifically. But in any CL,
>>>>>>> updating the same piece of data means the same column value.
>>>>>>> In that case,
>>>>>>> the resolution rules are the following:
>>>>>>> - If the updates have different timestamps, keep the one with the
>>>>>>> higher timestamp. That is, the more recent of two updates wins.
>>>>>>> - If the timestamps are the same, then it compares the values (byte
>>>>>>> comparison) and keeps the highest value. This is just to break ties in a
>>>>>>> consistent manner.
>>>>>>>
>>>>>>> So if you do two truly concurrent updates (that is, from two places at
>>>>>>> the same instant), then you'll end up with one of the updates. This
>>>>>>> happens at the column level.
>>>>>>>
>>>>>>> However, if that simple conflict detection/resolution mechanism is
>>>>>>> not good enough for some of your use cases and you need to keep two
>>>>>>> concurrent updates, it is easy enough. Just make sure that the updates don't
>>>>>>> end up in the same column. This is easily achieved by appending some unique
>>>>>>> identifier to the column name for instance. And when reading, do a slice and
>>>>>>> reconcile whatever you get back with whatever logic makes sense. If you do
>>>>>>> that, congrats, you've roughly emulated what vector clocks would do. Btw, no
>>>>>>> locking or anything needed.
>>>>>>>
>>>>>>> In my experience, for most things the timestamp resolution is enough.
>>>>>>> If the same user updates their profile picture twice on your web site at the
>>>>>>> same microsecond, it's usually fine to end up with one of the two pictures.
>>>>>>> In the rare case where you need something more specific, using the Cassandra
>>>>>>> data model usually solves the problem easily. The reason for not having
>>>>>>> vector clocks in Cassandra is that so far, we haven't really found many
>>>>>>> examples where that is not the case.
>>>>>>>
>>>>>>> --
>>>>>>> Sylvain
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

--20cf3054a6bb4f331c049d0a9cf9--
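
The resolution rules and the N1/N2/N3 failed-write example discussed in this thread can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not Cassandra's code: `Column`, `reconcile`, `quorum` and `quorum_read` are hypothetical names, timestamps are plain integers, and a "read" is simply a reconciliation over the replicas that happened to answer.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, Sequence


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for one column value stored on one replica."""
    value: bytes
    timestamp: int


def reconcile(a: Column, b: Column) -> Column:
    """Rule from the thread: higher timestamp wins; on a tie,
    the higher value (byte comparison) wins."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    return a if a.value >= b.value else b


def quorum(n: int) -> int:
    """Q = N / 2 + 1, using integer division."""
    return n // 2 + 1


def quorum_read(replicas: Dict[str, Column], responders: Sequence[str]) -> Column:
    """Reconcile the copies held by the replicas that answered the read."""
    result = replicas[responders[0]]
    for name in responders[1:]:
        result = reconcile(result, replicas[name])
    return result


# State after the failed quorum write: only N1 took the new value (TS2).
TS1, TS2 = 1, 2
replicas = {
    "N1": Column(b"new", TS2),
    "N2": Column(b"old", TS1),
    "N3": Column(b"old", TS1),
}

# Try every pair of replicas that could form the read quorum.
for responders in combinations(sorted(replicas), quorum(len(replicas))):
    print(responders, "->", quorum_read(replicas, responders).timestamp)
# ('N1', 'N2') -> 2
# ('N1', 'N3') -> 2
# ('N2', 'N3') -> 1
```

In this model the read returns TS2 whenever N1 is among the responders and TS1 otherwise, which is the behaviour described for a write that failed at Quorum. Once the write is retried and succeeds on at least two nodes, every pair of responders contains a node holding TS2.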