Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
MIME-Version: 1.0
In-Reply-To: <AANLkTi=fpNOPTunD2190jHeY144O3we7B20xsfrT_=nY@mail.gmail.com>
References: <AANLkTi=PzLdZZZgoY+vmWkRVJB9A_5WFPMAMb3AHyNs3@mail.gmail.com>
	<AANLkTikHWpYzBX0Z3x0E_1jSZdEpuNO-D3UrYCyXoqz4@mail.gmail.com>
	<AANLkTi=UVWHmxFtq=T48QWZr48B0TaE21KzN-2YpH_vM@mail.gmail.com>
	<AANLkTindPaHaw9kWipeuRNpGmKvW90aJzQTVj+WBkRqn@mail.gmail.com>
	<AANLkTinMg+NkTXcpZ9QW_3UoXdGmWS2F129-+UN_P8DZ@mail.gmail.com>
	<AANLkTimNAkD9Q88U51dH5=BCkK3pwaUZh4fniLydcyHh@mail.gmail.com>
	<AANLkTiky6s02SpsQvBkHaDrjHbQ3b=YvLFOANKKiW5X4@mail.gmail.com>
	<AANLkTi==3sj10StL3pY2MvaMwVn+gwCzvnBXf9HQ9PcK@mail.gmail.com>
	<AANLkTikSshW-nrMCrdUOy=StskmSbwoAopHCR4JQS8QQ@mail.gmail.com>
	<AANLkTi=TRtOuEotCKZ6okf9mF6bDLAn6Wpb4ztru-uuE@mail.gmail.com>
	<AANLkTi=fpNOPTunD2190jHeY144O3we7B20xsfrT_=nY@mail.gmail.com>
Date: Thu, 24 Feb 2011 10:50:10 -0800
Message-ID: <AANLkTi=GoZvxEWc0cPWEWCYQ2oJX4p3g8M6jc4ca-3hy@mail.gmail.com>
Subject: Re: New Chain for : Does Cassandra use vector clocks
From: Ritesh Tijoriwala <tijoriwala.ritesh@gmail.com>
To: user@cassandra.apache.org
Cc: Anthony John <chirayithaj@gmail.com>,
 Sylvain Lebresne <sylvain@datastax.com>
Content-Type: multipart/alternative; boundary=000e0cd70e84d7612c049d0bae88

--000e0cd70e84d7612c049d0bae88
Content-Type: text/plain; charset=ISO-8859-1

Thanks, all, for the detail and clarification. I just wanted to understand
correctly what behavior to expect from Cassandra under various failure
conditions, so that the application can be designed accordingly and provide
proper locking/synchronization if required.

Thanks,
Ritesh

On Thu, Feb 24, 2011 at 10:25 AM, Anthony John <chirayithaj@gmail.com>wrote:

> I see the point - apologies for putting everyone through this!
>
> It was just militating against my mental model.
>
> In summary, here is my takeaway - simple stuff but, IMO, important to
> conclude this thread (I hope):
> 1. I was splitting hairs over a failed (partial) quorum write. Such an event
> should be immediately followed by the same write sent over a connection to
> another node (potentially using the connection caches of client
> implementations) or by a read at CL ALL, because the write could have
> partially gone through.
> 2. Timestamps are used in determining the latest version (correcting the
> false impression I was propagating).
>
> Finally, the "W + R > N" statement for quorum CL holds, but it can be broken
> by a failed write, since it is uncertain whether the new value was written to
> any server or not. Is that a fair characterization?
>
> Bottom line - unlike a traditional DBMS, errors do not trigger automatic
> cleanup and rollback; app code has to follow up if immediate - and not
> eventual - consistency is desired. I made that leap in almost all cases, I
> think, except the case of a failed write.
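
A minimal Python sketch of the "follow up on a failed write" point above. The
`send` callable and the `WriteFailed` exception are hypothetical stand-ins for
whatever client library is in use; they are not a real Cassandra API. Fixing
the timestamp once and reusing it on every retry is my assumption about how to
make each retry the same logical write, so that repeating it is harmless even
if an earlier attempt partially went through.

```python
import time


class WriteFailed(Exception):
    """Hypothetical error raised when a quorum write is not acknowledged."""


def write_with_retry(send, key, value, attempts=3):
    """Retry a write until it is acknowledged, reusing one timestamp.

    `send(key, value, timestamp)` is a hypothetical function that performs a
    quorum write and raises WriteFailed when the quorum is not reached.
    """
    # Microsecond timestamp, chosen once so every retry is the same write.
    timestamp = int(time.time() * 1_000_000)
    for attempt in range(1, attempts + 1):
        try:
            send(key, value, timestamp)
            return timestamp
        except WriteFailed:
            if attempt == attempts:
                raise  # give up; the cluster state is unknown to the caller
```

Only once `write_with_retry` returns can the caller rely on later quorum reads
seeing the value; if it raises, the write may or may not have landed on some
replicas.
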
>
> My bad and I can live with this!
>
> Regards,
>
> -JA
>
>
> On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne <sylvain@datastax.com>wrote:
>
>> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John <chirayithaj@gmail.com>wrote:
>>
>>> Completely understand!
>>>
>>> All that I am quibbling over is whether a CL of quorum guarantees
>>> consistency or not. That is what the documentation says, right? If, for a
>>> quorum read, the returned result depends on which node answers first, or on
>>> other more convoluted conditions, then a quorum read/write is not
>>> consistent, by any definition.
>>>
>>
>> But that's the point. The definition of consistency we are talking about
>> has no meaning if you consider only a quorum read. The definition (which is
>> the de facto definition of consistency in 'eventually consistent') makes
>> sense only if we talk about a write followed by a read, and specifically a
>> successful write followed by a successful read.
>> And that is the statement the wiki is making.
>>
>> Honestly, we could debate forever on the definition of consistency and
>> whatnot. Cassandra guarantees that if you do a (successful) write on W
>> replicas and then a (successful) read on R replicas, and if R+W>N, then the
>> read is guaranteed to see the preceding write. And this is what is
>> called consistency in the context of eventual consistency (which is not the
>> context of ACID).
>>
>> If this is not the definition of consistency you had in mind then, by all
>> means, Cassandra probably doesn't guarantee yours. But given that the
>> paragraph preceding what you pasted states clearly that we are not talking
>> about ACID consistency, but eventual consistency, I don't think the wiki is
>> making any unfair statement.
>>
>> That being said, the wiki may not always be as clear as it could be. But
>> it's an editable wiki :)
>>
>> --
>> Sylvain
>>
>>
>>>
>>> I can still use Cassandra, and will use it, luv it!!! But let us not make
>>> this statement on the Wiki architecture section:-
>>>
>>> -------------------------------------------------------------
>>>
>>> More specifically: R=read replica count, W=write replica count,
>>> N=replication factor, Q=*QUORUM* (Q = N / 2 + 1)
>>>
>>>    - If W + R > N, you will have consistency
>>>    - W=1, R=N
>>>    - W=N, R=1
>>>    - W=Q, R=Q where Q = N / 2 + 1
>>>
>>> Cassandra provides consistency when R + W > N (read replica count + write
>>> replica count > replication factor).
>>>
>>> ----------------------------------------------------
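
The arithmetic in the wiki excerpt can be checked with a few lines of Python.
This is only a sketch of the counting argument; the function names are mine,
and I am assuming "N / 2" means integer division, which is what makes Q a
strict majority.

```python
def quorum(n: int) -> int:
    """Quorum size for replication factor n (integer division assumed)."""
    return n // 2 + 1


def overlaps(w: int, r: int, n: int) -> bool:
    """True when any r read replicas must include one of any w written replicas."""
    return w + r > n


n = 3
q = quorum(n)
# The three combinations from the wiki, plus W=1/R=1 as a counter-example.
for w, r in [(1, n), (n, 1), (q, q), (1, 1)]:
    print(f"N={n} W={w} R={r} overlap={overlaps(w, r, n)}")
```

With N=3 the quorum is 2, and only the W=1, R=1 row reports no overlap. Note
that the overlap only says something about writes that succeeded on W
replicas, which is exactly the caveat discussed in this thread.
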
>>>
>>>
>>> .
>>>
>>>
>>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne <sylvain@datastax.com
>>> > wrote:
>>>
>>>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John <chirayithaj@gmail.com>wrote:
>>>>
>>>>> If you are correct, and you are probably closer to the code, then a CL
>>>>> of Quorum does not guarantee consistency.
>>>>
>>>>
>>>> If the operation succeeds, it does (for some definition of consistency,
>>>> namely that subsequent reads at quorum are guaranteed to see the new value
>>>> of an update at quorum). If it fails, then no, it does not guarantee
>>>> consistency.
>>>>
>>>> It is important to note that the word consistency has multiple meanings.
>>>> In particular, when we are talking of consistency in Cassandra, we are not
>>>> talking of the same definition as the C in ACID (see:
>>>> http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
>>>>
>>>>>
>>>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne <
>>>>> sylvain@datastax.com> wrote:
>>>>>
>>>>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John <chirayithaj@gmail.com>wrote:
>>>>>>
>>>>>>>  >>Time stamps are not used for conflict resolution - unless it is
>>>>>>>> part of the application logic!!!
>>>>>>>>
>>>>>>>
>>>>>>> >>What is your definition of conflict resolution? Because if you
>>>>>>> update the same column twice (which
>>>>>>> >>I'll call a conflict), then the timestamps are used to decide which
>>>>>>> update wins (which I'll call a resolution).
>>>>>>>
>>>>>>> I understand what you are saying, and yes semantics is very important
>>>>>>> here. And yes we are responding to the immediate questions without covering
>>>>>>> all questions in the thread.
>>>>>>>
>>>>>>> The point being made here is that the timestamp of the column is not
>>>>>>> used by Cassandra to figure out what data to return.
>>>>>>>
>>>>>>
>>>>>> Not quite true.
>>>>>>
>>>>>>
>>>>>>> E.g. - quorum is 2 nodes, with an RF of 3 over N1/N2/N3.
>>>>>>> A quorum write comes in and adds/updates a particular data element
>>>>>>> with timestamp TS2. It succeeds on N1 and fails on N2/N3. So the write
>>>>>>> is returned as failed - right?
>>>>>>> Now a quorum read comes in for exactly the same piece of data that the
>>>>>>> write failed for.
>>>>>>> So N1 has TS2 but both N2/N3 have the old TS (say TS1).
>>>>>>> And the read succeeds - will it return TS1 or TS2?
>>>>>>>
>>>>>>> I submit it will return TS1 - the old TS.
>>>>>>>
>>>>>>
>>>>>> It all depends on which (first 2) nodes respond to the read (since
>>>>>> RF=3, that can be any two of N1/N2/N3). If N1 is one of the two that make
>>>>>> up the quorum, then TS2 will be returned, because Cassandra will compare
>>>>>> the timestamps and decide what to return based on them. If N2 and N3
>>>>>> respond, however, both timestamps will be TS1 and so, after timestamp
>>>>>> resolution, it will still be TS1 that is returned.
>>>>>> So yes, timestamps are used for conflict resolution.
>>>>>>
>>>>>> In your example, you could get TS1 back because a failed write can leave
>>>>>> your cluster in an inconsistent state. You'd have to retry the quorum
>>>>>> write, and only when it succeeds are you guaranteed that a quorum read
>>>>>> will always return TS2.
>>>>>>
>>>>>> This is because when a write fails, Cassandra doesn't guarantee that
>>>>>> the write did not make it in (there is no revert).
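
The N1/N2/N3 scenario can be replayed as a small Python simulation. It is a
toy model, not Cassandra code: the `replicas` dict, the "new"/"old" values and
the numeric timestamps 2 and 1 (standing in for TS2 and TS1) are made up for
illustration. Each possible pair of responders is resolved by picking the cell
with the highest timestamp.

```python
from itertools import combinations

# Replica state after the failed quorum write: only N1 applied it.
# Each cell is (value, timestamp); 2 stands for TS2 and 1 for TS1.
replicas = {"N1": ("new", 2), "N2": ("old", 1), "N3": ("old", 1)}


def quorum_read(responders):
    """Return the cell with the highest timestamp among the responders."""
    return max((replicas[name] for name in responders), key=lambda cell: cell[1])


# Try every pair of nodes that could form the read quorum.
results = {pair: quorum_read(pair) for pair in combinations(sorted(replicas), 2)}
for pair, cell in results.items():
    print(pair, "->", cell)
# ('N1', 'N2') -> ('new', 2)
# ('N1', 'N3') -> ('new', 2)
# ('N2', 'N3') -> ('old', 1)
```

Two of the three possible quorums see the new value and one does not, which is
why the result after a failed write depends on who answers first.
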
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Are we on the same page with this interpretation ?
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> -JA
>>>>>>>
>>>>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne <
>>>>>>> sylvain@datastax.com> wrote:
>>>>>>>
>>>>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John <
>>>>>>>> chirayithaj@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Sylvain,
>>>>>>>>>
>>>>>>>>> Time stamps are not used for conflict resolution - unless it is
>>>>>>>>> part of the application logic!!!
>>>>>>>>>
>>>>>>>>
>>>>>>>> What is your definition of conflict resolution? Because if you
>>>>>>>> update the same column twice (which
>>>>>>>> I'll call a conflict), then the timestamps are used to decide which
>>>>>>>> update wins (which I'll call a resolution).
>>>>>>>>
>>>>>>>>
>>>>>>>>> You can have "lost updates" w/Cassandra. You need to use third-party
>>>>>>>>> products - Cages, e.g. - to get ACID-type consistency.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Then again, you'll have to define what you are calling "lost
>>>>>>>> updates". Provided you use a reasonable consistency level, Cassandra
>>>>>>>> provides fairly strong durability guarantees, so for some definition
>>>>>>>> you don't "lose updates".
>>>>>>>>
>>>>>>>> That being said, I never claimed that Cassandra provides any ACID
>>>>>>>> guarantee. ACID relates to transactions, which Cassandra doesn't
>>>>>>>> support. If we're talking about the guarantees of transactions, then by
>>>>>>>> all means, Cassandra won't provide them. And yes, you can use Cages or
>>>>>>>> the like to get transactions. But that was not the point of the thread,
>>>>>>>> was it? The thread is about vector clocks, and that has nothing to do
>>>>>>>> with transactions (vector clocks certainly don't give you transactions).
>>>>>>>>
>>>>>>>> Sorry if I wasn't clear in my mail, but I was only responding to why
>>>>>>>> so far I don't think vector clocks would really provide much for Cassandra.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sylvain
>>>>>>>>
>>>>>>>>
>>>>>>>>> -JA
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne <
>>>>>>>>> sylvain@datastax.com> wrote:
>>>>>>>>>
>>>>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John <
>>>>>>>>>> chirayithaj@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Apologies : For some reason my response on the original mail
>>>>>>>>>>> keeps bouncing back, thus this new one!
>>>>>>>>>>> > From the other hand, the same article says:
>>>>>>>>>>> > "For conditional writes to work, the condition must be
>>>>>>>>>>> evaluated at all update
>>>>>>>>>>> > sites before the write can be allowed to succeed."
>>>>>>>>>>> >
>>>>>>>>>>> > This means, that when doing such an update CL=ALL must be used
>>>>>>>>>>>
>>>>>>>>>>> Sorry, but I am confused by that entire thread!
>>>>>>>>>>>
>>>>>>>>>>> Questions:-
>>>>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>>>>>>>> granularity whether it be row/colF/Col ?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> No locking, no.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
>>>>>>>>>>> conflicts? Concurrent updates on exactly the same piece of data on
>>>>>>>>>>> different nodes can still mess each other up, right?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Not sure why you are singling out CL.ALL. But at any CL,
>>>>>>>>>> updating the same piece of data means updating the same column
>>>>>>>>>> value. In that case, the resolution rules are the following:
>>>>>>>>>>   - If the updates have different timestamps, keep the one with
>>>>>>>>>>     the higher timestamp. That is, the more recent of two updates
>>>>>>>>>>     wins.
>>>>>>>>>>   - If the timestamps are the same, compare the values
>>>>>>>>>>     (byte comparison) and keep the highest value. This is just to
>>>>>>>>>>     break ties in a consistent manner.
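
The two resolution rules above fit in one small Python function. This is a
sketch of the rule as described in this mail, not Cassandra's actual code:
modeling an update as a `(timestamp, value)` tuple is my assumption, and
Python's lexicographic comparison of `bytes` stands in for the "byte
comparison" mentioned above.

```python
def reconcile(a, b):
    """Pick the winning update for one column.

    Each update is a (timestamp, value) tuple with an int timestamp and a
    bytes value. Tuples compare element by element, so the higher timestamp
    wins, and on equal timestamps the greater value (byte-wise) wins.
    """
    return max(a, b)


# Different timestamps: the more recent update wins, whatever the values are.
print(reconcile((1, b"zzz"), (2, b"aaa")))  # (2, b'aaa')
# Same timestamp: the tie is broken by comparing the values byte by byte.
print(reconcile((5, b"abc"), (5, b"abd")))  # (5, b'abd')
```

Because the rule depends only on the two updates and not on their arrival
order, every replica that sees both ends up keeping the same one.
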
>>>>>>>>>>
>>>>>>>>>> So if you do two truly concurrent updates (that is, from two places
>>>>>>>>>> at the same instant), then you'll end up with one of the updates.
>>>>>>>>>> This is at the column level.
>>>>>>>>>>
>>>>>>>>>> However, if that simple conflict detection/resolution mechanism is
>>>>>>>>>> not good enough for some of your use cases and you need to keep two
>>>>>>>>>> concurrent updates, it is easy enough. Just make sure that the
>>>>>>>>>> updates don't end up in the same column. This is easily achieved by
>>>>>>>>>> appending some unique identifier to the column name, for instance.
>>>>>>>>>> And when reading, do a slice and reconcile whatever you get back with
>>>>>>>>>> whatever logic makes sense. If you do that, congrats, you've roughly
>>>>>>>>>> emulated what vector clocks would do. Btw, no locking or anything
>>>>>>>>>> needed.
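
A rough Python sketch of the "unique identifier in the column name" idea. A
plain dict stands in for one Cassandra row, and the helper names, the
`base:writer_id` naming scheme and the set-union merge are all assumptions
made for illustration; a real application would do the slice through its
client library and pick its own reconciliation logic.

```python
import uuid

row = {}  # stands in for one row: column name -> value


def write_version(row, base_name, value, writer_id=None):
    """Store the value under a column name made unique per writer."""
    writer_id = writer_id or uuid.uuid4().hex
    row[f"{base_name}:{writer_id}"] = value


def read_versions(row, base_name):
    """Emulate a slice: return every column whose name starts with the prefix."""
    prefix = base_name + ":"
    return {name: value for name, value in row.items() if name.startswith(prefix)}


# Two concurrent writers update the "same" logical column.
write_version(row, "cart", {"apple"}, "client-a")
write_version(row, "cart", {"pear"}, "client-b")

# Neither update overwrote the other; the reader reconciles them (here: union).
merged = set().union(*read_versions(row, "cart").values())
print(sorted(merged))  # ['apple', 'pear']
```

Both updates survive because they never land in the same column, so the
timestamp rule is never asked to choose between them.
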
>>>>>>>>>>
>>>>>>>>>> In my experience, for most things the timestamp resolution is
>>>>>>>>>> enough. If the same user updates their profile picture twice on your
>>>>>>>>>> web site at the same microsecond, it's usually fine to end up with one
>>>>>>>>>> of the two pictures. In the rare case where you need something more
>>>>>>>>>> specific, using the Cassandra data model usually solves the problem
>>>>>>>>>> easily. The reason for not having vector clocks in Cassandra is that
>>>>>>>>>> so far, we haven't really found many examples where that is not the
>>>>>>>>>> case.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Sylvain
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

--000e0cd70e84d7612c049d0bae88--