Subject: Re: How does Cassandra handle failure during synchronous writes
From: Ritesh Tijoriwala <tijoriwala.ritesh@gmail.com>
To: user@cassandra.apache.org
Cc: Anthony John
Date: Wed, 23 Feb 2011 14:40:26 -0800

Hi Anthony,

While you state the facts correctly, I don't see how they relate to the
question I asked. Can you elaborate on what specifically happens in the
case I described to Dave above?

Thanks,
Ritesh

On Wed, Feb 23, 2011 at 1:57 PM, Anthony John wrote:

> Seems to me that the explanations are getting incredibly complicated -
> while I submit the real issue is not!
>
> Salient points here:-
> 1. To be guaranteed data consistency, the writes and reads have to be at
> QUORUM CL or higher.
> 2. Any write/read at a lesser CL means that the application has to handle
> the inconsistency, or has to be tolerant of it.
> 3. Writing at ANY CL - a special case - means that writes will always go
> through as long as any node is up, even if the destination replicas are
> not. This is done via hinted handoff. It can result in inconsistent
> reads, and yes, that is a problem - but refer to point 2 above.
> 4. At QUORUM CL for reads and writes, hinted handoff is used after quorum
> is met to handle the case where a particular node is down and the write
> still needs to be replicated to it. This will not cause inconsistent
> reads, because the hint (in this case) applies only after quorum is met -
> so a quorum read does not depend on the down node coming back up and
> having received the hint.
>
> Hope I state this appropriately!
>
> HTH,
>
> -JA
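For concreteness, points 1 and 2 translate into client code roughly as
follows. This is a minimal sketch using the DataStax Python driver and CQL,
both of which postdate this thread (clients in 2011 spoke Thrift); the
contact point, keyspace, table, and replication factor are all hypothetical.

    # Sketch only: modern DataStax Python driver, hypothetical schema.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("demo_ks")  # hypothetical keyspace, RF = 3

    # W = QUORUM (2 of 3) and R = QUORUM (2 of 3), so R + W = 4 > N = 3:
    # every read set overlaps every acknowledged write set on >= 1 replica.
    write = SimpleStatement(
        "UPDATE users SET email = %s WHERE id = %s",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    session.execute(write, ("ritesh@example.com", 42))

    read = SimpleStatement(
        "SELECT email FROM users WHERE id = %s",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    row = session.execute(read, (42,)).one()

For point 3 you would instead pass ConsistencyLevel.ANY on the write,
trading read consistency for write availability.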
>
> On Wed, Feb 23, 2011 at 3:39 PM, Ritesh Tijoriwala <
> tijoriwala.ritesh@gmail.com> wrote:
>
>> > Read repair will probably occur at that point (depending on your
>> > config), which would cause the newest value to propagate to more
>> > replicas.
>>
>> Is the newest value the "quorum" value - meaning the old value gets
>> written back to the nodes holding the newer, non-quorum value - or is it
>> the genuinely new value? :) If the latter, then this seems kind of odd
>> to me, and I don't see how it would be useful to an application. A bug?
>>
>> Thanks,
>> Ritesh
>>
>> On Wed, Feb 23, 2011 at 12:43 PM, Dave Revell wrote:
>>
>>> Ritesh,
>>>
>>> You have seen the problem. Clients may read the newly written value
>>> even though the client performing the write saw it as a failure. When
>>> the client reads, it will use the correct number of replicas for the
>>> chosen CL, then return the newest value seen at any replica. This
>>> "newest value" could be the result of a failed write.
>>>
>>> Read repair will probably occur at that point (depending on your
>>> config), which would cause the newest value to propagate to more
>>> replicas.
>>>
>>> R+W>N guarantees serial order of operations: any read at CL=R that
>>> occurs after a write at CL=W will observe the write. I don't think
>>> this property is relevant to your current question, though.
>>>
>>> Cassandra has no mechanism to "roll back" the partial write, other
>>> than to simply write again. That retry may also fail.
>>>
>>> Best,
>>> Dave
>>>
>>> On Wed, Feb 23, 2011 at 10:12 AM, <tijoriwala.ritesh@gmail.com> wrote:
>>>
>>>> Hi Dave,
>>>> Thanks for your input. In the steps you mention, what happens when a
>>>> client tries to read the value after step 5? Is it possible that the
>>>> client may see the new value? My understanding was that if R + W > N,
>>>> then the client will not see the new value, since a quorum of nodes
>>>> will not agree on it. If that is the case, then it's alright to return
>>>> failure to the client. If not, however, it is difficult to program
>>>> against: after every failure, you as a client are not sure whether it
>>>> was a pseudo-failure with side effects or a real failure.
>>>>
>>>> Thanks,
>>>> Ritesh
>>>>
>>>> Ritesh,
>>>>
>>>> There is no commit protocol. Writes may be persisted on some replicas
>>>> even though the quorum fails. Here's a sequence of events that shows
>>>> the "problem":
>>>>
>>>> 1. Some replica R fails, but recently, so its failure has not yet
>>>> been detected
>>>> 2. A client writes with consistency > 1
>>>> 3. The write goes to all replicas; all replicas except R persist the
>>>> write to disk
>>>> 4. Replica R never responds
>>>> 5. Failure is returned to the client, but the new value is still in
>>>> the cluster, on all replicas except R
>>>>
>>>> Something very similar could happen for CL QUORUM.
>>>>
>>>> This is a conscious design decision: a commit protocol would
>>>> constitute tight coupling between nodes, which goes against the
>>>> Cassandra philosophy. But unfortunately you do have to write your app
>>>> with this case in mind.
>>>>
>>>> Best,
>>>> Dave
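Dave's sequence can be reproduced with a toy model. Below is a minimal,
self-contained simulation in plain Python - no Cassandra involved, all
names illustrative - in which the coordinator sends the write to every
replica, counts acknowledgements against the quorum, and reports failure,
yet the value remains durable on the one replica that did respond.

    N, W = 3, 2  # replication factor 3, QUORUM = 2

    replicas = {"r1": {}, "r2": {}, "r3": {}}
    failed = {"r2", "r3"}  # recently failed; not yet detected, so the
                           # coordinator still sends them the write

    def quorum_write(key, value):
        """Send the write to every replica; succeed only on >= W acks."""
        acks = 0
        for name, store in replicas.items():
            if name in failed:
                continue  # steps 1 and 4: the replica never responds
            store[key] = value  # step 3: surviving replicas persist it
            acks += 1
        return acks >= W

    ok = quorum_write("x", "new")   # steps 2-5
    print(ok)                       # False: client is told the write failed
    print(replicas["r1"].get("x"))  # 'new': the value is still in the cluster

A later read that reaches r1 can return "new", and read repair would then
spread it further - exactly the pseudo-failure with side effects discussed
above.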
>>>>
>>>> On Tue, Feb 22, 2011 at 8:22 PM, tijoriwala.ritesh <
>>>> tijoriwala.ritesh@gmail.com> wrote:
>>>>
>>>> > Hi,
>>>> > I wanted to get details on how Cassandra does synchronous writes to
>>>> > W replicas (out of N). Does it do a 2PC? If not, how does it deal
>>>> > with failures of nodes before it gets to write to W replicas? If
>>>> > the orchestrating node cannot write to W nodes successfully, I
>>>> > guess it will fail the write operation, but what happens to the
>>>> > completed writes on X (W > X) nodes?
>>>> >
>>>> > Thanks,
>>>> > Ritesh
>>>> > --
>>>> > View this message in context:
>>>> > http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055152.html
>>>> > Sent from the cassandra-user@incubator.apache.org mailing list
>>>> > archive at Nabble.com.
>>>>
>>>> Quoted from:
>>>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055408.html
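The invariant underlying the whole exchange - that R + W > N forces every
read quorum to overlap every acknowledged write quorum - can be checked
exhaustively for small N. A short self-contained sketch, plain Python and
nothing Cassandra-specific:

    from itertools import combinations

    def quorums_always_overlap(n, w, r):
        """True iff every size-r read set meets every size-w write set."""
        nodes = range(n)
        return all(
            set(ws) & set(rs)
            for ws in combinations(nodes, w)
            for rs in combinations(nodes, r)
        )

    print(quorums_always_overlap(3, 2, 2))  # True:  R + W = 4 > N = 3
    print(quorums_always_overlap(3, 1, 1))  # False: R + W = 2 <= N,
                                            # stale reads are possible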