Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org
Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Delivered-To: mailing list user@cassandra.apache.org
MIME-Version: 1.0
In-Reply-To: <16920482.109351.1298484740112.JavaMail.nabble@jim.nabble.com>
References: <16920482.109351.1298484740112.JavaMail.nabble@jim.nabble.com>
Date: Wed, 23 Feb 2011 12:43:49 -0800
Subject: Re: How does Cassandra handle failure during synchronous writes
From: Dave Revell
To: tijoriwala.ritesh@gmail.com, user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=bcaec53f8f0d6f4abf049cf92709

--bcaec53f8f0d6f4abf049cf92709
Content-Type: text/plain; charset=ISO-8859-1

Ritesh,

You have seen the problem. Clients may read the newly written value even
though the client performing the write saw it as a failure. When the
client reads, it will use the correct number of replicas for the chosen
CL, then return the newest value seen at any replica. This "newest value"
could be the result of a failed write.

Read repair will probably occur at that point (depending on your config),
which would cause the newest value to propagate to more replicas.

R+W>N guarantees serial order of operations: any read at CL=R that occurs
after a write at CL=W will observe the write. I don't think this property
is relevant to your current question, though.

Cassandra has no mechanism to "roll back" the partial write, other than
to simply write again. This may also fail.

Best,
Dave

On Wed, Feb 23, 2011 at 10:12 AM, <tijoriwala.ritesh@gmail.com> wrote:

> Hi Dave,
> Thanks for your input. In the steps you mention, what happens when the
> client tries to read the value at step 6? Is it possible that the client
> may see the new value? My understanding was that if R + W > N, then the
> client will not see the new value, as quorum nodes will not agree on the
> new value. If that is the case, then it's alright to return failure to
> the client. However, if not, then it is difficult to program, because
> after every failure you as a client are not sure whether the failure is
> a pseudo-failure with some side effects or a real failure.
>
> Thanks,
> Ritesh
>
> <quote author='Dave Revell'>
> Ritesh,
>
> There is no commit protocol. Writes may be persisted on some replicas
> even though the quorum fails. Here's a sequence of events that shows the
> "problem":
>
> 1. Some replica R fails, but recently, so its failure has not yet been
>    detected
> 2. A client writes with consistency > 1
> 3. The write goes to all replicas; all replicas except R persist the
>    write to disk
> 4. Replica R never responds
> 5. Failure is returned to the client, but the new value is still in the
>    cluster, on all replicas except R.
>
> Something very similar could happen for CL QUORUM.
>
> This is a conscious design decision because a commit protocol would
> constitute tight coupling between nodes, which goes against the
> Cassandra philosophy. But unfortunately you do have to write your app
> with this case in mind.
>
> Best,
> Dave
>
> On Tue, Feb 22, 2011 at 8:22 PM, tijoriwala.ritesh <
> tijoriwala.ritesh@gmail.com> wrote:
>
> >
> > Hi,
> > I wanted to get details on how Cassandra does synchronous writes to W
> > replicas (out of N). Does it do a 2PC? If not, how does it deal with
> > failures of nodes before it gets to write to W replicas? If the
> > orchestrating node cannot write to W nodes successfully, I guess it
> > will fail the write operation, but what happens to the completed
> > writes on X (W > X) nodes?
> >
> > Thanks,
> > Ritesh
> > --
> > View this message in context:
> > http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055152.html
> > Sent from the cassandra-user@incubator.apache.org mailing list archive
> > at Nabble.com.
> >
>
> </quote>
> Quoted from:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055408.html

--bcaec53f8f0d6f4abf049cf92709--
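
The sequence of events described in the thread can be sketched as a small simulation. This is only a toy model under stated assumptions, not Cassandra's implementation or API: the names Replica, coordinator_write, coordinator_read and demo are hypothetical, timestamps are plain integers, and a read at CL=R simply contacts the first R live replicas and returns the newest value among them. It shows a write that is reported to the client as failed and is nevertheless visible to a later read, because nothing rolls back the replicas that already persisted it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Replica:
    """Hypothetical stand-in for one replica node (not a Cassandra class)."""
    name: str
    alive: bool = True
    # key -> (timestamp, value); the highest timestamp wins
    store: Dict[str, Tuple[int, str]] = field(default_factory=dict)

    def write(self, key: str, ts: int, value: str) -> bool:
        """Persist the cell and acknowledge; a dead replica never responds."""
        if not self.alive:
            return False
        current = self.store.get(key)
        if current is None or ts > current[0]:
            self.store[key] = (ts, value)
        return True


def coordinator_write(replicas: List[Replica], key: str, ts: int,
                      value: str, w: int) -> bool:
    """Send the write to every replica; succeed only if at least w acknowledge.

    There is no commit protocol: replicas that persisted the write keep it
    even when this function reports failure.
    """
    acks = sum(rep.write(key, ts, value) for rep in replicas)
    return acks >= w


def coordinator_read(replicas: List[Replica], key: str, r: int) -> Optional[str]:
    """Read from r live replicas and return the newest value among them."""
    live = [rep for rep in replicas if rep.alive]
    if len(live) < r:
        raise RuntimeError("not enough live replicas for this consistency level")
    cells = [rep.store[key] for rep in live[:r] if key in rep.store]
    if not cells:
        return None
    return max(cells, key=lambda cell: cell[0])[1]


def demo() -> Tuple[bool, bool, Optional[str]]:
    replicas = [Replica("A"), Replica("B"), Replica("R")]
    n = len(replicas)
    # An initial write at W = N succeeds on all three replicas.
    first_ok = coordinator_write(replicas, "k", ts=1, value="old", w=n)
    # Step 1: replica R fails, and the failure has not been detected yet.
    replicas[2].alive = False
    # Steps 2-5: the write reaches A and B, R never responds, the client sees failure.
    second_ok = coordinator_write(replicas, "k", ts=2, value="new", w=n)
    # A later read at R = 1 still observes the value from the "failed" write.
    seen = coordinator_read(replicas, "k", r=1)
    return first_ok, second_ok, seen


if __name__ == "__main__":
    print(demo())  # (True, False, 'new')
```

In this model the only way for the client to converge after a failure is to retry the write with the same or a newer timestamp until it succeeds, which matches the advice above to write the application with this case in mind.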