Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: domain of chirayithaj@gmail.com
 designates 209.85.161.44 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=mime-version:in-reply-to:references:date:message-id:subject:from:to
         :cc:content-type;
        b=nLZQCfLQsa+B0lZEzcXVn343lKGzBuAOc12LsAb1/oGo1brRtq/WCu+MC5PWgY6Mu1
         +nnDLYkekGPFDBP320VruBkKt38EO1B8e6vAOe8WZz/S0r/CPCuglEiaJ7wi64K/oL8s
         3U0VARUQDD2oPKs4QDUreoyKhgMGohMqO9IJ0=
MIME-Version: 1.0
In-Reply-To: <AANLkTim5b3wYa5K4isB1ihczeNvO+H3zyU-8oKVJX48k@mail.gmail.com>
References: <AANLkTi=PzLdZZZgoY+vmWkRVJB9A_5WFPMAMb3AHyNs3@mail.gmail.com>
	<AANLkTikHWpYzBX0Z3x0E_1jSZdEpuNO-D3UrYCyXoqz4@mail.gmail.com>
	<AANLkTi=UVWHmxFtq=T48QWZr48B0TaE21KzN-2YpH_vM@mail.gmail.com>
	<AANLkTindPaHaw9kWipeuRNpGmKvW90aJzQTVj+WBkRqn@mail.gmail.com>
	<AANLkTinMg+NkTXcpZ9QW_3UoXdGmWS2F129-+UN_P8DZ@mail.gmail.com>
	<AANLkTimNAkD9Q88U51dH5=BCkK3pwaUZh4fniLydcyHh@mail.gmail.com>
	<AANLkTiky6s02SpsQvBkHaDrjHbQ3b=YvLFOANKKiW5X4@mail.gmail.com>
	<AANLkTi==3sj10StL3pY2MvaMwVn+gwCzvnBXf9HQ9PcK@mail.gmail.com>
	<AANLkTikSshW-nrMCrdUOy=StskmSbwoAopHCR4JQS8QQ@mail.gmail.com>
	<AANLkTi=TRtOuEotCKZ6okf9mF6bDLAn6Wpb4ztru-uuE@mail.gmail.com>
	<AANLkTi=fpNOPTunD2190jHeY144O3we7B20xsfrT_=nY@mail.gmail.com>
	<AANLkTim5b3wYa5K4isB1ihczeNvO+H3zyU-8oKVJX48k@mail.gmail.com>
Date: Thu, 24 Feb 2011 13:47:12 -0600
Message-ID: <AANLkTin6LjgXz+ie4KA7+bXkbD=SKGXV7SpOzdM0eyfk@mail.gmail.com>
Subject: Re: New Chain for : Does Cassandra use vector clocks
From: Anthony John <chirayithaj@gmail.com>
To: A J <s5alye@gmail.com>
Cc: user@cassandra.apache.org, Sylvain Lebresne <sylvain@datastax.com>
Content-Type: multipart/alternative; boundary=20cf30433ed2ce30e0049d0c7ae4

--20cf30433ed2ce30e0049d0c7ae4
Content-Type: text/plain; charset=ISO-8859-1

The leap of faith here is that an error does not mean a clean backing out to
the prior state - as we are used to with databases. It means that the
operation in error could have gone through partially.

Again, this is not entirely unfamiliar territory, and it can be dealt with.
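
A rough sketch of the failed-write case discussed in this thread, as a toy
in-memory model in Python. This is not Cassandra's client API: Replica,
quorum_write and quorum_read are names made up for illustration, and the
reconciliation rule (higher timestamp wins, ties broken by the larger value)
is only what was described in the quoted mails below.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy in-memory replica (hypothetical, not a Cassandra class)."""
    up: bool = True
    cells: dict = field(default_factory=dict)  # key -> (timestamp, value)

    def write(self, key, ts, value):
        if not self.up:
            return False  # node unreachable: no ack
        current = self.cells.get(key)
        # Higher timestamp wins; equal timestamps fall back to the larger value.
        if current is None or (ts, value) > current:
            self.cells[key] = (ts, value)
        return True


def quorum(n):
    return n // 2 + 1


def quorum_write(replicas, key, ts, value):
    """True only if a quorum acked. Nothing is rolled back on failure."""
    acks = sum(r.write(key, ts, value) for r in replicas)
    return acks >= quorum(len(replicas))


def quorum_read(responders, key):
    """Reconcile the answers of the replicas that happened to respond."""
    answers = [r.cells[key] for r in responders if key in r.cells]
    return max(answers) if answers else None


n1, n2, n3 = Replica(), Replica(), Replica()
replicas = [n1, n2, n3]
first_ok = quorum_write(replicas, "k", 1, b"old")    # TS1 lands everywhere

n2.up = n3.up = False
failed = quorum_write(replicas, "k", 2, b"new")      # only N1 takes TS2
n2.up = n3.up = True

read_without_n1 = quorum_read([n2, n3], "k")         # sees TS1
read_with_n1 = quorum_read([n1, n2], "k")            # sees TS2

retried = quorum_write(replicas, "k", 2, b"new")     # same write, now acked
read_after_retry = quorum_read([n2, n3], "k")        # sees TS2
```

The failed write is reported as failed yet still sits on N1, so what a quorum
read returns depends on which replicas answer. Retrying the same write with
the same timestamp until it succeeds is what restores the R + W > N guarantee
in this model.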

-JA

On Thu, Feb 24, 2011 at 1:16 PM, A J <s5alye@gmail.com> wrote:

> >>but could be broken in case of a failed write<<
> You can think of a scenario where R + W >N still leads to
> inconsistency even for successful writes. Say you keep W=1 and R=N .
> Lets say the one node where a write happened with success goes down
> before it made to the other N-1 nodes. Lets say it goes down for good
> and is unrecoverable. The only option is to build a new node from
> scratch from other active nodes. This will lead to a write that was
> lost and you will end up serving stale copy of it.
>
> It is better to talk in terms of use cases and if cassandra will be a
> fit for it. Otherwise unless you have W=R=N and fsync before each
> write commit, there will be scope for inconsistency.
>
>
> On Thu, Feb 24, 2011 at 1:25 PM, Anthony John <chirayithaj@gmail.com>
> wrote:
> > I see the point - apologies for putting everyone through this!
> > It was just militating against my mental model.
> > In summary, here is my take away - simple stuff but - IMO - important to
> > conclude this thread (I hope):-
> > 1. I was splitting hair over a failed ( partial ) Q Write. Such an event
> > should be immediately followed by the same write going to a connection on
> > to another node ( potentially using connection caches of client
> > implementations ) or a Read at CL of All. Because a write could have
> > partially gone through.
> > 2. Timestamps are used in determining the latest version ( correcting the
> > false impression I was propagating)
> > Finally, wrt "W + R > N for Q CL statement" holds, but could be broken in
> > case of a failed write as it is unsure whether the new value got written
> > on any server or not. Is that a fair characterization ?
> > Bottom line - unlike traditional DBMS, errors do not ensure automatic
> > cleanup and revert back, app code has to follow up if immediate - and
> > not eventual - consistency is desired. I made that leap in almost all
> > cases - I think - but the case of a failed write.
> > My bad and I can live with this!
> > Regards,
> > -JA
> >
> > On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne <sylvain@datastax.com>
> > wrote:
> >>
> >> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John <chirayithaj@gmail.com>
> >> wrote:
> >>>
> >>> Completely understand!
> >>> All that I am quibbling over is whether a CL of quorum guarantees
> >>> consistency or not. That is what the documentation says - right. IF for
> >>> a CL of Q read - it depends on which node returns read first to
> >>> determine the actual returned result or other more convoluted
> >>> conditions , then a Quorum read/write is not consistent, by any
> >>> definition.
> >>
> >> But that's the point. The definition of consistency we are talking about
> >> has no meaning if you consider only a quorum read. The definition (which
> >> is the de facto definition of consistency in 'eventually consistent')
> >> make sense if we talk about a write followed by a read. And it is
> >> considering succeeding write followed by succeeding read.
> >> And that is the statement the wiki is making.
> >> Honestly, we could debate forever on the definition of consistency and
> >> whatnot. Cassandra guaranties that if you do a (succeeding) write on W
> >> replica and then a (succeeding) read on R replica and if R+W>N, then it
> >> is guaranteed that the read will see the preceding write. And this is
> >> what is called consistency in the context of eventual consistency (which
> >> is not the context of ACID).
> >> If this is not the definition of consistency you had in mind then by all
> >> mean, Cassandra probably don't guarantee this definition. But given that
> >> the paragraph preceding what you pasted state clearly we are not talking
> >> about ACID consistency, but eventual consistency, I don't think the wiki
> >> is making any unfair statement.
> >> That being said, the wiki may not be always as clear as it could. But
> >> it's an editable wiki :)
> >> --
> >> Sylvain
> >>
> >>>
> >>> I can still use Cassandra, and will use it, luv it!!! But let us not
> >>> make this statement on the Wiki architecture section:-
> >>> -------------------------------------------------------------
> >>>
> >>> More specifically: R=read replica count W=write replica
> >>> count N=replication factor Q=QUORUM (Q = N / 2 + 1)
> >>>
> >>> If W + R > N, you will have consistency
> >>>
> >>> W=1, R=N
> >>> W=N, R=1
> >>> W=Q, R=Q where Q = N / 2 + 1
> >>>
> >>> Cassandra provides consistency when R + W > N (read replica count
> >>> + write replica count > replication factor).
> >>>
> >>> ----------------------------------------------------
> >>>
> >>> .
> >>>
> >>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne
> >>> <sylvain@datastax.com> wrote:
> >>>>
> >>>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John <chirayithaj@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> If you are correct and you are probably closer to the code - then CL
> >>>>> of Quorum does not guarantee a consistency.
> >>>>
> >>>> If the operation succeed, it does (for some definition of consistency
> >>>> which is, following reads at Quorum will be guaranteed to see the new
> >>>> value of a update at quorum). If it fails, then no, it does not
> >>>> guarantee consistency.
> >>>> It is important to note that the word consistency has multiple
> >>>> meaning. In particular, when we are talking of consistency in
> >>>> Cassandra, we are not talking of the same definition as the C in ACID
> >>>> (see:
> >>>> http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
> >>>>>
> >>>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne
> >>>>> <sylvain@datastax.com> wrote:
> >>>>>>
> >>>>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John
> >>>>>> <chirayithaj@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> >>Time stamps are not used for conflict resolution - unless is is
> >>>>>>>> >> part of the application logic!!!
> >>>>>>>
> >>>>>>> >>What is you definition of conflict resolution ? Because if you
> >>>>>>> >> update twice the same column (which
> >>>>>>> >>I'll call a conflict), then the timestamps are used to decide
> >>>>>>> >> which update wins (which I'll call a resolution).
> >>>>>>> I understand what you are saying, and yes semantics is very
> >>>>>>> important here. And yes we are responding to the immediate
> >>>>>>> questions without covering all questions in the thread.
> >>>>>>> The point being made here is that the timestamp of the column is
> >>>>>>> not used by Cassandra to figure out what data to return.
> >>>>>>
> >>>>>> Not quite true.
> >>>>>>>
> >>>>>>> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
> >>>>>>> A Quorum Write comes and add/updates the time stamp (TS2) of a
> >>>>>>> particular data element. It succeeds on N1 - fails on N2/3. So the
> >>>>>>> write is returned as failed - right ?
> >>>>>>> Now Quorum read comes in for exactly the same piece of data that
> >>>>>>> the write failed for.
> >>>>>>> So N1 has TS2 but both N2/3 have the old TS (say TS1)
> >>>>>>> And the read succeeds - Will it return TS1 or TS2.
> >>>>>>> I submit it will return TS1 - the old TS.
> >>>>>>
> >>>>>> It all depends on which (first 2) nodes respond to the read (since
> >>>>>> RF=3, that can any two of N1/N2/N3). If N1 is part of the two that
> >>>>>> makes the quorum, then TS2 will be returned, because cassandra will
> >>>>>> compare the timestamp and decide what to return based on this. If
> >>>>>> N2/N3 responds however, both timestamp will be TS1 and so, after
> >>>>>> timestamp resolution, it will stil be TS1 that will be returned.
> >>>>>> So yes timestamp is used for conflict resolution.
> >>>>>> In your example, you could get TS1 back because a failed write can
> >>>>>> let you cluster in an inconsistent state. You'd have to retry the
> >>>>>> quorum and only when it succeeds can you be guaranteed that quorum
> >>>>>> read will always return TS2.
> >>>>>> This is because when a write fails, Cassandra doesn't guarantee that
> >>>>>> the write did not made it in (there is no revert).
> >>>>>>
> >>>>>>>
> >>>>>>> Are we on the same page with this interpretation ?
> >>>>>>> Regards,
> >>>>>>> -JA
> >>>>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne
> >>>>>>> <sylvain@datastax.com> wrote:
> >>>>>>>>
> >>>>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John
> >>>>>>>> <chirayithaj@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Sylvan,
> >>>>>>>>> Time stamps are not used for conflict resolution - unless is is
> >>>>>>>>> part of the application logic!!!
> >>>>>>>>
> >>>>>>>> What is you definition of conflict resolution ? Because if you
> >>>>>>>> update twice the same column (which
> >>>>>>>> I'll call a conflict), then the timestamps are used to decide
> >>>>>>>> which update wins (which I'll call a resolution).
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> You can have "lost updates" w/Cassandra. You need to to use 3rd
> >>>>>>>>> products - cages for e.g. - to get ACID type consistency.
> >>>>>>>>
> >>>>>>>> Then again, you'll have to define what you are calling "lost
> >>>>>>>> updates". Provided you use a reasonable consistency level,
> >>>>>>>> Cassandra provides fairly strong durability guarantee, so for some
> >>>>>>>> definition you don't "lose updates".
> >>>>>>>> That being said, I never pretended that Cassandra provided any
> >>>>>>>> ACID guarantee. ACID relates to transaction, which Cassandra
> >>>>>>>> doesn't support. If we're talking about the guarantees of
> >>>>>>>> transaction, then by all means, cassandra won't provide it. And
> >>>>>>>> yes you can use cages or the like to get transaction. But that was
> >>>>>>>> not the point of the thread, was it ? The thread is about vector
> >>>>>>>> clocks, and that has nothing to do with transaction (vector clocks
> >>>>>>>> certainly don't give you transactions).
> >>>>>>>> Sorry if I wasn't clear in my mail, but I was only responding to
> >>>>>>>> why so far I don't think vector clocks would really provide much
> >>>>>>>> for Cassandra.
> >>>>>>>> --
> >>>>>>>> Sylvain
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> -JA
> >>>>>>>>>
> >>>>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne
> >>>>>>>>> <sylvain@datastax.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John
> >>>>>>>>>> <chirayithaj@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Apologies : For some reason my response on the original mail
> >>>>>>>>>>> keeps bouncing back, thus this new one!
> >>>>>>>>>>>
> >>>>>>>>>>> > From the other hand, the same article says:
> >>>>>>>>>>> > "For conditional writes to work, the condition must be
> >>>>>>>>>>> > evaluated at all update
> >>>>>>>>>>> > sites before the write can be allowed to succeed."
> >>>>>>>>>>> >
> >>>>>>>>>>> > This means, that when doing such an update CL=ALL must be
> >>>>>>>>>>> > used
> >>>>>>>>>>>
> >>>>>>>>>>> Sorry, but I am confused by that entire thread!
> >>>>>>>>>>> Questions:-
> >>>>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
> >>>>>>>>>>> granularity whether it be row/colF/Col ?
> >>>>>>>>>>
> >>>>>>>>>> No locking, no.
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
> >>>>>>>>>>> conflicts. Concurrent updates on exactly the same piece of data
> >>>>>>>>>>> on different nodes can still mess each other up, right ?
> >>>>>>>>>>
> >>>>>>>>>> Not sure why you are taking CL.ALL specifically. But in any CL,
> >>>>>>>>>> updating the same piece of data means the same column value. In
> >>>>>>>>>> that case, the resolution rules are the following:
> >>>>>>>>>>   - If the updates have a different timestamp, keep the one with
> >>>>>>>>>> the higher timestamp. That is, the more recent of two updates
> >>>>>>>>>> win.
> >>>>>>>>>>   - It the timestamps are the same, then it compares the values
> >>>>>>>>>> (byte comparison) and keep the highest value. This is just to
> >>>>>>>>>> break ties in a consistent manner.
> >>>>>>>>>> So if you do two truly concurrent updates (that is from two
> >>>>>>>>>> place at the same instant), then you'll end with one of the
> >>>>>>>>>> update. This is the column level.
> >>>>>>>>>> However, if that simple conflict detection/resolution mechanism
> >>>>>>>>>> is not good enough for some of your use case and you need to keep
> >>>>>>>>>> two concurrent updates, it is easy enough. Just make sure that
> >>>>>>>>>> the update don't end up in the same column. This is easily
> >>>>>>>>>> achieved by appending some unique identifier to the column name
> >>>>>>>>>> for instance. And when reading, do a slice and reconcile whatever
> >>>>>>>>>> you get back with whatever logic make sense. If you do that,
> >>>>>>>>>> congrats, you've roughly emulated what vector clocks would do.
> >>>>>>>>>> Btw, no locking or anything needed.
> >>>>>>>>>> In my experience, for most things the timestamp resolution is
> >>>>>>>>>> enough. If the same user update twice it's profile picture on you
> >>>>>>>>>> web site at the same microsecond, it's usually fine to end up
> >>>>>>>>>> with one of the two pictures. In the rare case where you need
> >>>>>>>>>> something more specific, using the cassandra data model usually
> >>>>>>>>>> solves the problem easily. The reason for not having vector
> >>>>>>>>>> clocks in Cassandra is that so far, we haven't really found much
> >>>>>>>>>> example where it is no the case.
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Sylvain
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
> >
>

--20cf30433ed2ce30e0049d0c7ae4
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

The leap of faith here is that an error does not mean a clean backing out t=
o prior state - as we are used to with databases. It means that the operati=
on in error could have gone through partially<br><br><div>Again, this is no=
t an absolutely unfamiliar territory and can be dealt with.</div>
<div><br></div><div>-JA</div><div><br><div class=3D"gmail_quote">On Thu, Fe=
b 24, 2011 at 1:16 PM, A J <span dir=3D"ltr">&lt;<a href=3D"mailto:s5alye@g=
mail.com">s5alye@gmail.com</a>&gt;</span> wrote:<br><blockquote class=3D"gm=
ail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-le=
ft:1ex;">
<div class=3D"im">&gt;&gt;but could be broken in case of a failed write&lt;=
&lt;<br>
</div>You can think of a scenario where R + W &gt;N still leads to<br>
inconsistency even for successful writes. Say you keep W=3D1 and R=3DN .<br=
>
Lets say the one node where a write happened with success goes down<br>
before it made to the other N-1 nodes. Lets say it goes down for good<br>
and is unrecoverable. The only option is to build a new node from<br>
scratch from other active nodes. This will lead to a write that was<br>
lost and you will end up serving stale copy of it.<br>
<br>
It is better to talk in terms of use cases and if cassandra will be a<br>
fit for it. Otherwise unless you have W=3DR=3DN and fsync before each<br>
write commit, there will be scope for inconsistency.<br>
<div><div></div><div class=3D"h5"><br>
<br>
On Thu, Feb 24, 2011 at 1:25 PM, Anthony John &lt;<a href=3D"mailto:chirayi=
thaj@gmail.com">chirayithaj@gmail.com</a>&gt; wrote:<br>
&gt; I see the point - apologies for putting everyone through this!<br>
&gt; It was just militating against my mental model.<br>
&gt; In summary, here is my take away - simple stuff but - IMO - important =
to<br>
&gt; conclude this thread (I hope):-<br>
&gt; 1. I was splitting hair over a failed ( partial ) Q Write. Such an eve=
nt<br>
&gt; should be immediately followed by the same write going to a connection=
 on to<br>
&gt; another node ( potentially using connection caches of client implement=
ations<br>
&gt; ) or a Read at CL of All. Because a write could have partially gone th=
rough.<br>
&gt; 2. Timestamps are used in determining the latest version ( correcting =
the<br>
&gt; false impression I was propagating)<br>
&gt; Finally, wrt &quot;W + R &gt; N for Q CL statement&quot; holds, but co=
uld be broken in<br>
&gt; case of a failed write as it is unsure whether the new value got writt=
en on<br>
&gt; =A0any server or not. Is that a fair=A0characterization=A0?<br>
&gt; Bottom line - unlike traditional DBMS, errors do not ensure automatic<=
br>
&gt; cleanup and revert back, app code has to follow up if =A0immediate - a=
nd not<br>
&gt; eventual - =A0consistency is desired. I made that leap in almost all c=
ases - I<br>
&gt; think - but the case of a failed write.<br>
&gt; My bad and=A0I can live with this!<br>
&gt; Regards,<br>
&gt; -JA<br>
&gt;<br>
&gt; On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne &lt;<a href=3D"mail=
to:sylvain@datastax.com">sylvain@datastax.com</a>&gt;<br>
&gt; wrote:<br>
&gt;&gt;<br>
&gt;&gt; On Thu, Feb 24, 2011 at 6:33 PM, Anthony John &lt;<a href=3D"mailt=
o:chirayithaj@gmail.com">chirayithaj@gmail.com</a>&gt;<br>
&gt;&gt; wrote:<br>
&gt;&gt;&gt;<br>
&gt;&gt;&gt; Completely understand!<br>
&gt;&gt;&gt; All that I am quibbling over is whether a CL of quorum guarant=
ees<br>
&gt;&gt;&gt; consistency or not. That is what the documentation says - righ=
t. IF for a CL<br>
&gt;&gt;&gt; of Q read - it depends on which node returns read first to det=
ermine the<br>
&gt;&gt;&gt; actual returned result or other more convoluted conditions , t=
hen a Quorum<br>
&gt;&gt;&gt; read/write is not consistent, by any definition.<br>
&gt;&gt;<br>
&gt;&gt; But that&#39;s the point. The definition of consistency we are tal=
king about<br>
&gt;&gt; has no meaning if you consider only a quorum read. The definition =
(which is<br>
&gt;&gt; the de facto definition of consistency in &#39;eventually consiste=
nt&#39;) make<br>
&gt;&gt; sense if we talk about a write followed by a read. And it is<br>
&gt;&gt; considering=A0succeeding=A0write followed by succeeding read.<br>
&gt;&gt; And that is the statement the wiki is making.<br>
&gt;&gt; Honestly, we could debate forever on the definition of consistency=
 and<br>
&gt;&gt; whatnot. Cassandra guaranties that if you do a (succeeding) write =
on W<br>
&gt;&gt; replica and then a (succeeding) read on R replica and if R+W&gt;N,=
 then it is<br>
&gt;&gt; guaranteed that the read will see the preceding write. And this is=
 what is<br>
&gt;&gt; called consistency in the context of eventual consistency (which i=
s not the<br>
&gt;&gt; context of ACID).<br>
&gt;&gt; If this is not the definition of consistency you had in mind then =
by all<br>
&gt;&gt; mean, Cassandra probably don&#39;t guarantee this definition. But =
given that the<br>
&gt;&gt; paragraph preceding what you pasted state clearly we are not talki=
ng about<br>
&gt;&gt; ACID consistency, but eventual consistency, I don&#39;t think the =
wiki is making<br>
&gt;&gt; any unfair statement.<br>
&gt;&gt; That being said, the wiki may not be always as clear as it could. =
But it&#39;s<br>
&gt;&gt; an editable wiki :)<br>
&gt;&gt; --<br>
&gt;&gt; Sylvain<br>
&gt;&gt;<br>
&gt;&gt;&gt;<br>
&gt;&gt;&gt; I can still use Cassandra, and will use it, luv it!!! But let =
us not make<br>
&gt;&gt;&gt; this statement on the Wiki architecture section:-<br>
&gt;&gt;&gt; -------------------------------------------------------------<=
br>
&gt;&gt;&gt;<br>
&gt;&gt;&gt; More specifically:=A0R=3Dread replica count=A0W=3Dwrite replic=
a<br>
&gt;&gt;&gt; count=A0N=3Dreplication factor=A0Q=3DQUORUM=A0(Q =3D N / 2 + 1=
)<br>
&gt;&gt;&gt;<br>
&gt;&gt;&gt; If W + R &gt; N, you will have consistency<br>
&gt;&gt;&gt;<br>
&gt;&gt;&gt; W=3D1, R=3DN<br>
&gt;&gt;&gt; W=3DN, R=3D1<br>
&gt;&gt;&gt; W=3DQ, R=3DQ where Q =3D N / 2 + 1<br>
&gt;&gt;&gt;<br>
&gt;&gt;&gt; Cassandra provides consistency when R + W &gt; N (read replica=
 count +=A0write<br>
&gt;&gt;&gt; replica count &gt; replication factor).<br>
&gt;&gt;&gt;<br>
&gt;&gt;&gt; ----------------------------------------------------<br>
&gt;&gt;&gt;<br>
&gt;&gt;&gt; .<br>
&gt;&gt;&gt;<br>
&gt;&gt;&gt; On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne &lt;<a href=
=3D"mailto:sylvain@datastax.com">sylvain@datastax.com</a>&gt;<br>
&gt;&gt;&gt; wrote:<br>
&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt; On Thu, Feb 24, 2011 at 6:01 PM, Anthony John &lt;<a href=
=3D"mailto:chirayithaj@gmail.com">chirayithaj@gmail.com</a>&gt;<br>
&gt;&gt;&gt;&gt; wrote:<br>
&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt; If you are correct and you are probably closer to the =
code - then CL of<br>
&gt;&gt;&gt;&gt;&gt; Quorum does not guarantee a consistency.<br>
&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt; If the operation succeed, it does (for some definition of =
consistency<br>
&gt;&gt;&gt;&gt; which is, following reads at Quorum will be guaranteed to =
see the new value<br>
&gt;&gt;&gt;&gt; of a update at quorum). If it fails, then no, it does not =
guarantee<br>
&gt;&gt;&gt;&gt; consistency.<br>
&gt;&gt;&gt;&gt; It is important to note that the word consistency has mult=
iple meaning.<br>
&gt;&gt;&gt;&gt; In particular, when we are talking of consistency in Cassa=
ndra, we are not<br>
&gt;&gt;&gt;&gt; talking of the same definition as the C in ACID<br>
&gt;&gt;&gt;&gt; (see:=A0<a href=3D"http://www.allthingsdistributed.com/200=
7/12/eventually_consistent.html" target=3D"_blank">http://www.allthingsdist=
ributed.com/2007/12/eventually_consistent.html</a>)<br>
&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt; On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne<br>
&gt;&gt;&gt;&gt;&gt; &lt;<a href=3D"mailto:sylvain@datastax.com">sylvain@da=
tastax.com</a>&gt; wrote:<br>
&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt; On Thu, Feb 24, 2011 at 5:34 PM, Anthony John &lt;=
<a href=3D"mailto:chirayithaj@gmail.com">chirayithaj@gmail.com</a>&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt; wrote:<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; &gt;&gt;Time stamps are not used for confl=
ict resolution - unless is is<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; &gt;&gt; part of the application logic!!!<=
br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt; &gt;&gt;What is you definition of conflict res=
olution ? Because if you<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt; &gt;&gt; update twice the same column (which<b=
r>
&gt;&gt;&gt;&gt;&gt;&gt;&gt; &gt;&gt;I&#39;ll call a conflict), then the ti=
mestamps are used to decide which<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt; &gt;&gt; update wins (which I&#39;ll call a re=
solution).<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt; I understand what you are saying, and yes sema=
ntics is very important<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt; here. And yes we are responding to the immedia=
te questions without covering<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt; all questions in the thread.<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt; The point being made here is that the timestam=
p of the column is not<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt; used by Cassandra to figure out what data to r=
eturn.<br>
&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt; Not quite true.<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt; E.g. - Quorum is 2 nodes - and RF of 3 over N1=
/2/3<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt; A Quorum =A0Write comes and add/updates the ti=
me stamp (TS2) of a<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt; particular data element. It succeeds on N1 - f=
ails on N2/3. So the write is<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt; returned as failed - right ?<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt; Now Quorum read comes in for exactly the same =
piece of data that the<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt; write failed for.<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt; So N1 has TS2 but both N2/3 have the old TS (s=
ay TS1)<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt; And the read succeeds - Will it return TS1 or =
TS2.<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt; I submit it will return TS1 - the old TS.<br>
&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt; It all depends on which (first 2) nodes respond to=
 the read (since<br>
&gt;&gt;&gt;&gt;&gt;&gt; RF=3D3, that can any two of N1/N2/N3). If N1 is pa=
rt of the two that makes the<br>
&gt;&gt;&gt;&gt;&gt;&gt; quorum, then TS2 will be returned, because cassand=
ra will compare the<br>
&gt;&gt;&gt;&gt;&gt;&gt; timestamp and decide what to return based on this.=
 If N2/N3 responds<br>
&gt;&gt;&gt;&gt;&gt;&gt; however, both timestamp will be TS1 and so, after =
timestamp resolution, it<br>
&gt;&gt;&gt;&gt;&gt;&gt; will stil be TS1 that will be returned.<br>
&gt;&gt;&gt;&gt;&gt;&gt; So yes timestamp is used for conflict resolution.<=
br>
&gt;&gt;&gt;&gt;&gt;&gt; In your example, you could get TS1 back because a =
failed write can let<br>
&gt;&gt;&gt;&gt;&gt;&gt; you cluster in an inconsistent state. You&#39;d ha=
ve to retry the quorum and<br>
&gt;&gt;&gt;&gt;&gt;&gt; only when it succeeds can you be guaranteed that q=
uorum read will always<br>
&gt;&gt;&gt;&gt;&gt;&gt; return TS2.<br>
&gt;&gt;&gt;&gt;&gt;&gt; This is because when a write fails, Cassandra does=
n&#39;t guarantee that<br>
&gt;&gt;&gt;&gt;&gt;&gt; the write did not made it in (there is no revert).=
<br>
&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt; Are we on the same page with this interpretati=
on ?<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt; Regards,<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt; -JA<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt; On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebr=
esne<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt; &lt;<a href=3D"mailto:sylvain@datastax.com">sy=
lvain@datastax.com</a>&gt; wrote:<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; On Thu, Feb 24, 2011 at 4:52 PM, Anthony J=
ohn<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; &lt;<a href=3D"mailto:chirayithaj@gmail.co=
m">chirayithaj@gmail.com</a>&gt; wrote:<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Sylvan,<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Time stamps are not used for conflict =
resolution - unless is is<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; part of the application logic!!!<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; What is you definition of conflict resolut=
ion ? Because if you<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; update twice the same column (which<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; I&#39;ll call a conflict), then the timest=
amps are used to decide which<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; update wins (which I&#39;ll call a resolut=
ion).<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; You can have &quot;lost updates&quot; =
w/Cassandra. You need to to use 3rd<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; products - cages for e.g. - to get ACI=
D type consistency.<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Then again, you&#39;ll have to define what=
 you are calling &quot;lost<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; updates&quot;. Provided you use a reasonab=
le consistency level, Cassandra<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; provides fairly strong durability guarante=
e, so for some definition you<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; don&#39;t &quot;lose updates&quot;.<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; That being said,=A0I never pretended that =
Cassandra provided any ACID<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; guarantee. ACID relates to transaction, wh=
ich Cassandra doesn&#39;t support. If<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; we&#39;re talking about the guarantees of =
transaction, then by all means,<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; cassandra won&#39;t provide it. And yes yo=
u can use cages or the like to get<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; transaction. But that was not the point of=
 the thread, was it ? The thread<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; is about vector clocks, and that has nothi=
ng to do with transaction (vector<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; clocks certainly don&#39;t give you transa=
ctions).<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Sorry if I wasn&#39;t clear in my mail, bu=
t I was only responding to why<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; so far I don&#39;t think vector clocks wou=
ld really provide much for Cassandra.<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; --<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Sylvain<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; -JA<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; On Thu, Feb 24, 2011 at 7:41 AM, Sylva=
in Lebresne<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; &lt;<a href=3D"mailto:sylvain@datastax=
.com">sylvain@datastax.com</a>&gt; wrote:<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; On Thu, Feb 24, 2011 at 3:22 AM, A=
nthony John<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; &lt;<a href=3D"mailto:chirayithaj@=
gmail.com">chirayithaj@gmail.com</a>&gt; wrote:<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Apologies : For some reason my=
 response on the original mail<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; keeps bouncing back, thus this=
 new one!<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; &gt; From the other hand, the =
same article says:<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; &gt; &quot;For conditional wri=
tes to work, the condition must be<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; &gt; evaluated at all update<b=
r>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; &gt; sites before the write ca=
n be allowed to succeed.&quot;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; &gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; &gt; This means, that when doi=
ng such an update CL=3DALL must be used<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Sorry, but I am confused by th=
at entire thread!<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Questions:-<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; 1. Does Cassandra implement an=
y kind of data locking - at any<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; granularity whether it be row/=
colF/Col ?<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; No locking, no.<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; 2. If the answer to 1 above is=
 NO! - how does CL ALL prevent<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; conflicts. Concurrent updates =
on exactly the same piece of data on different<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; nodes can still mess each othe=
r up, right ?<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Not sure why you are taking CL.ALL=
 specifically. But in any CL,<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; updating the same piece of data me=
ans the same column value. In that case,<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; the resolution rules are the follo=
wing:<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; =A0=A0- If the updates have a diff=
erent timestamp, keep the one with<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; the higher timestamp. That is, the=
 more recent of two updates win.<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; =A0=A0- If the timestamps are the =
same, then it compares the values<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; (byte comparison) and keeps the hi=
ghest value. This is just to break ties in<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; a consistent manner.<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; So if you do two truly concurrent =
updates (that is, from two places<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; at the same instant), then you&#39=
;ll end up with one of the updates. This is at the<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; column level.<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; However, if that simple conflict d=
etection/resolution mechanism is<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; not good enough for some of your u=
se cases and you need to keep two<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; concurrent updates, it is easy eno=
ugh. Just make sure that the updates don&#39;t<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; end up in the same column. This is=
 easily achieved by appending some unique<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; identifier to the column name for =
instance. And when reading, do a slice and<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; reconcile whatever you get back wi=
th whatever logic makes sense. If you do<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; that, congrats, you&#39;ve roughly=
 emulated what vector clocks would do. Btw, no<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; locking or anything needed.<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; In my experience, for most things =
the timestamp resolution is<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; enough. If the same user updates i=
ts profile picture twice on your web site<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; at the same microsecond, it&#39;s =
usually fine to end up with one of the two<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; pictures. In the rare case where y=
ou need something more specific, using the<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Cassandra data model usually solve=
s the problem easily. The reason for not<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; having vector clocks in Cassandra =
is that so far, we haven&#39;t really found<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; many examples where it is not the =
case.<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; --<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Sylvain<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;<br>
&gt;&gt;<br>
&gt;<br>
&gt;<br>
</div></div></blockquote></div><br></div>

--20cf30433ed2ce30e0049d0c7ae4--