Subject: Re: New Chain for : Does Cassandra use vector clocks
From: Anthony John
To: A J
Cc: user@cassandra.apache.org, Sylvain Lebresne
Date: Thu, 24 Feb 2011 13:47:12 -0600

The leap of faith here is that an error does not mean a clean backing out
to the prior state, as we are used to with databases. It means that the
operation in error could have gone through partially.

Again, this is not absolutely unfamiliar territory and can be dealt with.

-JA

On Thu, Feb 24, 2011 at 1:16 PM, A J wrote:
> >>but could be broken in case of a failed write<<
> You can think of a scenario where R + W > N still leads to
> inconsistency even for successful writes. Say you keep W=1 and R=N.
> Let's say the one node where a write happened with success goes down
> before the write made it to the other N-1 nodes. Let's say it goes down
> for good and is unrecoverable. The only option is to build a new node
> from scratch from the other active nodes. This will lead to a write that
> was lost, and you will end up serving a stale copy of the data.
>
> It is better to talk in terms of use cases and whether Cassandra will be
> a fit for them. Otherwise, unless you have W=R=N and fsync before each
> write commit, there will be scope for inconsistency.
>
>
> On Thu, Feb 24, 2011 at 1:25 PM, Anthony John wrote:
> > I see the point - apologies for putting everyone through this!
> > It was just militating against my mental model.
> > In summary, here is my takeaway - simple stuff but, IMO, important to
> > conclude this thread (I hope):
> > 1. I was splitting hairs over a failed (partial) Q write. Such an event
> > should be immediately followed by the same write going over a connection
> > to another node (potentially using the connection caches of client
> > implementations) or by a read at CL of ALL, because the write could have
> > partially gone through.
> > 2. Timestamps are used in determining the latest version (correcting the
> > false impression I was propagating).
> > Finally, wrt the "W + R > N for Q CL" statement: it holds, but could be
> > broken in case of a failed write, as it is unsure whether the new value
> > got written on any server or not. Is that a fair characterization?
> > Bottom line - unlike a traditional DBMS, errors do not ensure automatic
> > cleanup and revert back; app code has to follow up if immediate - and
> > not eventual - consistency is desired. I made that leap in almost all
> > cases - I think - but the case of a failed write.
> > My bad and I can live with this!
> > Regards,
> > -JA
> >
> > On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne wrote:
> >>
> >> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John wrote:
> >>>
> >>> Completely understand!
> >>> All that I am quibbling over is whether a CL of quorum guarantees
> >>> consistency or not. That is what the documentation says, right? If,
> >>> for a CL of Q read, it depends on which node returns the read first
> >>> to determine the actual returned result, or on other more convoluted
> >>> conditions, then a Quorum read/write is not consistent, by any
> >>> definition.
> >>
> >> But that's the point. The definition of consistency we are talking
> >> about has no meaning if you consider only a quorum read. The definition
> >> (which is the de facto definition of consistency in 'eventually
> >> consistent') makes sense if we talk about a write followed by a read.
> >> And it is considering a successful write followed by a successful read.
> >> And that is the statement the wiki is making.
> >> Honestly, we could debate forever on the definition of consistency and
> >> whatnot. Cassandra guarantees that if you do a (successful) write on W
> >> replicas and then a (successful) read on R replicas, and if R+W>N, then
> >> it is guaranteed that the read will see the preceding write. And this
> >> is what is called consistency in the context of eventual consistency
> >> (which is not the context of ACID).
> >> If this is not the definition of consistency you had in mind then, by
> >> all means, Cassandra probably doesn't guarantee that definition. But
> >> given that the paragraph preceding what you pasted states clearly that
> >> we are not talking about ACID consistency, but eventual consistency, I
> >> don't think the wiki is making any unfair statement.
> >> That being said, the wiki may not always be as clear as it could be.
> >> But it's an editable wiki :)
> >> --
> >> Sylvain
> >>
> >>>
> >>> I can still use Cassandra, and will use it, luv it!!! But let us not
> >>> make this statement on the Wiki architecture section:
> >>> -------------------------------------------------------------
> >>>
> >>> More specifically: R=read replica count W=write replica
> >>> count N=replication factor Q=QUORUM (Q = N / 2 + 1)
> >>>
> >>> If W + R > N, you will have consistency
> >>>
> >>> W=1, R=N
> >>> W=N, R=1
> >>> W=Q, R=Q where Q = N / 2 + 1
> >>>
> >>> Cassandra provides consistency when R + W > N (read replica count +
> >>> write replica count > replication factor).
> >>>
> >>> ----------------------------------------------------
> >>>
> >>> .
> >>>
> >>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne
> >>> <sylvain@datastax.com> wrote:
> >>>>
> >>>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John wrote:
> >>>>>
> >>>>> If you are correct, and you are probably closer to the code, then a
> >>>>> CL of Quorum does not guarantee consistency.
> >>>>
> >>>> If the operation succeeds, it does (for some definition of
> >>>> consistency, which is: subsequent reads at Quorum are guaranteed to
> >>>> see the new value of an update at Quorum). If it fails, then no, it
> >>>> does not guarantee consistency.
> >>>> It is important to note that the word consistency has multiple
> >>>> meanings. In particular, when we are talking of consistency in
> >>>> Cassandra, we are not talking of the same definition as the C in ACID
> >>>> (see:
> >>>> http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
> >>>>>
> >>>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne wrote:
> >>>>>>
> >>>>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John
> >>>>>> <chirayithaj@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> >>Time stamps are not used for conflict resolution - unless it is
> >>>>>>>> >> part of the application logic!!!
> >>>>>>>
> >>>>>>> >>What is your definition of conflict resolution? Because if you
> >>>>>>> >> update the same column twice (which I'll call a conflict), then
> >>>>>>> >> the timestamps are used to decide which update wins (which I'll
> >>>>>>> >> call a resolution).
> >>>>>>> I understand what you are saying, and yes, semantics is very
> >>>>>>> important here. And yes, we are responding to the immediate
> >>>>>>> questions without covering all questions in the thread.
> >>>>>>> The point being made here is that the timestamp of the column is
> >>>>>>> not used by Cassandra to figure out what data to return.
> >>>>>>
> >>>>>> Not quite true.
> >>>>>>>
> >>>>>>> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
> >>>>>>> A Quorum write comes in and adds/updates the timestamp (TS2) of a
> >>>>>>> particular data element. It succeeds on N1 - fails on N2/3. So the
> >>>>>>> write is returned as failed - right?
> >>>>>>> Now a Quorum read comes in for exactly the same piece of data that
> >>>>>>> the write failed for.
> >>>>>>> So N1 has TS2 but both N2/3 have the old TS (say TS1).
> >>>>>>> And the read succeeds - will it return TS1 or TS2?
> >>>>>>> I submit it will return TS1 - the old TS.
> >>>>>>
> >>>>>> It all depends on which (first 2) nodes respond to the read (since
> >>>>>> RF=3, that can be any two of N1/N2/N3). If N1 is part of the two
> >>>>>> that make the quorum, then TS2 will be returned, because Cassandra
> >>>>>> will compare the timestamps and decide what to return based on
> >>>>>> this. If N2/N3 respond, however, both timestamps will be TS1 and so,
> >>>>>> after timestamp resolution, it will still be TS1 that is returned.
> >>>>>> So yes, the timestamp is used for conflict resolution.
> >>>>>> In your example, you could get TS1 back because a failed write can
> >>>>>> leave your cluster in an inconsistent state. You'd have to retry the
> >>>>>> quorum write, and only when it succeeds can you be guaranteed that a
> >>>>>> quorum read will always return TS2.
> >>>>>> This is because when a write fails, Cassandra doesn't guarantee
> >>>>>> that the write did not make it in (there is no revert).
> >>>>>>
> >>>>>>>
> >>>>>>> Are we on the same page with this interpretation?
> >>>>>>> Regards,
> >>>>>>> -JA
> >>>>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne wrote:
> >>>>>>>>
> >>>>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John wrote:
> >>>>>>>>>
> >>>>>>>>> Sylvain,
> >>>>>>>>> Time stamps are not used for conflict resolution - unless it is
> >>>>>>>>> part of the application logic!!!
> >>>>>>>>
> >>>>>>>> What is your definition of conflict resolution?
> >>>>>>>> Because if you update the same column twice (which I'll call a
> >>>>>>>> conflict), then the timestamps are used to decide which update
> >>>>>>>> wins (which I'll call a resolution).
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> You can have "lost updates" w/Cassandra. You need to use 3rd
> >>>>>>>>> party products - cages for e.g. - to get ACID type consistency.
> >>>>>>>>
> >>>>>>>> Then again, you'll have to define what you are calling "lost
> >>>>>>>> updates". Provided you use a reasonable consistency level,
> >>>>>>>> Cassandra provides a fairly strong durability guarantee, so for
> >>>>>>>> some definition you don't "lose updates".
> >>>>>>>> That being said, I never pretended that Cassandra provided any
> >>>>>>>> ACID guarantee. ACID relates to transactions, which Cassandra
> >>>>>>>> doesn't support. If we're talking about the guarantees of
> >>>>>>>> transactions, then by all means, Cassandra won't provide them.
> >>>>>>>> And yes, you can use cages or the like to get transactions. But
> >>>>>>>> that was not the point of the thread, was it? The thread is about
> >>>>>>>> vector clocks, and that has nothing to do with transactions
> >>>>>>>> (vector clocks certainly don't give you transactions).
> >>>>>>>> Sorry if I wasn't clear in my mail, but I was only responding to
> >>>>>>>> why so far I don't think vector clocks would really provide much
> >>>>>>>> for Cassandra.
> >>>>>>>> --
> >>>>>>>> Sylvain
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> -JA
> >>>>>>>>>
> >>>>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Apologies: for some reason my response on the original mail
> >>>>>>>>>>> keeps bouncing back, thus this new one!
> >>>>>>>>>>>
> >>>>>>>>>>> > From the other hand, the same article says:
> >>>>>>>>>>> > "For conditional writes to work, the condition must be
> >>>>>>>>>>> > evaluated at all update sites before the write can be
> >>>>>>>>>>> > allowed to succeed."
> >>>>>>>>>>> >
> >>>>>>>>>>> > This means, that when doing such an update CL=ALL must be
> >>>>>>>>>>> > used
> >>>>>>>>>>>
> >>>>>>>>>>> Sorry, but I am confused by that entire thread!
> >>>>>>>>>>> Questions:
> >>>>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
> >>>>>>>>>>> granularity, whether it be row/colF/Col?
> >>>>>>>>>>
> >>>>>>>>>> No locking, no.
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
> >>>>>>>>>>> conflicts? Concurrent updates on exactly the same piece of
> >>>>>>>>>>> data on different nodes can still mess each other up, right?
> >>>>>>>>>>
> >>>>>>>>>> Not sure why you are taking CL.ALL specifically. But at any CL,
> >>>>>>>>>> updating the same piece of data means the same column value. In
> >>>>>>>>>> that case, the resolution rules are the following:
> >>>>>>>>>>   - If the updates have different timestamps, keep the one with
> >>>>>>>>>> the higher timestamp. That is, the more recent of two updates
> >>>>>>>>>> wins.
> >>>>>>>>>>   - If the timestamps are the same, then it compares the values
> >>>>>>>>>> (byte comparison) and keeps the highest value. This is just to
> >>>>>>>>>> break ties in a consistent manner.
> >>>>>>>>>> So if you do two truly concurrent updates (that is, from two
> >>>>>>>>>> places at the same instant), then you'll end up with one of the
> >>>>>>>>>> updates. This is at the column level.
> >>>>>>>>>> However, if that simple conflict detection/resolution mechanism
> >>>>>>>>>> is not good enough for some of your use cases and you need to
> >>>>>>>>>> keep two concurrent updates, it is easy enough. Just make sure
> >>>>>>>>>> that the updates don't end up in the same column.
> >>>>>>>>>> This is easily achieved by appending some unique identifier to
> >>>>>>>>>> the column name, for instance. And when reading, do a slice and
> >>>>>>>>>> reconcile whatever you get back with whatever logic makes
> >>>>>>>>>> sense. If you do that, congrats, you've roughly emulated what
> >>>>>>>>>> vector clocks would do. Btw, no locking or anything needed.
> >>>>>>>>>> In my experience, for most things the timestamp resolution is
> >>>>>>>>>> enough. If the same user updates their profile picture twice on
> >>>>>>>>>> your web site at the same microsecond, it's usually fine to end
> >>>>>>>>>> up with one of the two pictures. In the rare case where you
> >>>>>>>>>> need something more specific, using the Cassandra data model
> >>>>>>>>>> usually solves the problem easily. The reason for not having
> >>>>>>>>>> vector clocks in Cassandra is that so far, we haven't really
> >>>>>>>>>> found many examples where that is not the case.
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Sylvain
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
> >
>
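For readers who want the column resolution rules from the thread in concrete form, here is a minimal Python sketch: the higher timestamp wins, and equal timestamps are broken by comparing the value bytes. It also shows the "append a unique identifier to the column name" trick for keeping concurrent updates side by side. This is an illustration only, not Cassandra's code or any client API: `Column`, `reconcile`, `sibling_name`, `slice_siblings` and the `:` separator are all hypothetical names chosen for the example.

```python
from typing import Dict, List, NamedTuple


class Column(NamedTuple):
    name: str
    value: bytes
    timestamp: int  # client-supplied, e.g. microseconds since the epoch


def reconcile(a: Column, b: Column) -> Column:
    """Pick the winner between two versions of the same column."""
    if a.timestamp != b.timestamp:
        # Different timestamps: the more recent update wins.
        return a if a.timestamp > b.timestamp else b
    # Same timestamp: break the tie deterministically on the value bytes.
    return a if a.value >= b.value else b


def sibling_name(base: str, unique_id: str) -> str:
    """Column name that keeps concurrent updates apart (assumed naming scheme)."""
    return f"{base}:{unique_id}"


def slice_siblings(row: Dict[str, Column], base: str) -> List[Column]:
    """Emulate a slice read: every column whose name starts with 'base:'."""
    prefix = base + ":"
    return [col for name, col in sorted(row.items()) if name.startswith(prefix)]
```

With plain `reconcile`, two writers to the same column leave exactly one value behind. With `sibling_name`, each writer lands in its own column, and the application decides how to merge whatever `slice_siblings` returns.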

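The failed-write example discussed in the thread (RF=3, quorum of 2, the write reaching only N1) can be replayed with a small Python sketch. It is a toy model under stated assumptions, not Cassandra behavior verified against the code: replicas are a plain dict of `(timestamp, value)` pairs, a read returns the highest-timestamp version among the responders, and `quorum`, `overlaps` and `quorum_read_results` are hypothetical helper names.

```python
from itertools import combinations
from typing import Dict, Tuple

Version = Tuple[int, str]  # (timestamp, value)


def quorum(n: int) -> int:
    """Q = N / 2 + 1, using integer division."""
    return n // 2 + 1


def overlaps(r: int, w: int, n: int) -> bool:
    """True when every read set of size r must intersect every write set of size w."""
    return r + w > n


def quorum_read_results(replicas: Dict[str, Version], r: int) -> Dict[Tuple[str, ...], str]:
    """Value returned for each possible set of r responding replicas."""
    results = {}
    for responders in combinations(sorted(replicas), r):
        # Timestamp resolution: the highest timestamp among responders wins.
        _timestamp, value = max(replicas[node] for node in responders)
        results[responders] = value
    return results


# Failed quorum write: TS2 reached N1 only; N2 and N3 still hold TS1.
after_failed_write = {"N1": (2, "new"), "N2": (1, "old"), "N3": (1, "old")}
for responders, value in quorum_read_results(after_failed_write, quorum(3)).items():
    print(responders, "->", value)
```

In this model a read answered by N2 and N3 returns the old value, while any read that includes N1 returns the new one. Once the write is retried and succeeds on a second replica, every pair of responders contains at least one copy of the new value, which is the R + W > N guarantee for successful writes.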