Subject: Re: How does Cassandra handle failure during synchronous writes
From: Ritesh Tijoriwala <tijoriwala.ritesh@gmail.com>
To: user@cassandra.apache.org
Cc: Anthony John
Date: Wed, 23 Feb 2011 14:40:26 -0800

Hi Anthony,

While you state the facts correctly, I don't see how they relate to the
question I asked. Can you elaborate on what specifically happens in the
case I described to Dave above?

Thanks,
Ritesh

On Wed, Feb 23, 2011 at 1:57 PM, Anthony John wrote:

> Seems to me that the explanations are getting incredibly complicated -
> while I submit the real issue is not!
>
> Salient points here:-
> 1. To be guaranteed data consistency, the writes and reads have to be at
> QUORUM CL or higher.
> 2. Any write/read at a lesser CL means that the application has to handle
> the inconsistency, or has to be tolerant of it.
> 3. Writing at ANY CL - a special case - means that writes will always go
> through as long as any node is up, even if the destination replicas are
> not. This is done via hinted handoff. It can result in inconsistent
> reads, and yes, that is a problem - but refer to point 2 above.
> 4. At QUORUM CL for reads and writes, hinted handoff is used after quorum
> is met to handle the case where a particular node is down and the write
> still needs to be replicated to it. This will not cause inconsistent
> reads, because the hint (in this case) applies only after quorum is met -
> so a quorum read does not depend on the down node coming back up and
> having received the hint.
>
> Hope I state this appropriately!
>
> HTH,
>
> -JA
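For concreteness, points 1 and 2 translate into client code roughly as
follows. This is a minimal sketch using the DataStax Python driver and CQL,
both of which postdate this thread (clients in 2011 spoke Thrift); the
contact point, keyspace, table, and replication factor are all hypothetical.

    # Sketch only: modern DataStax Python driver, hypothetical schema.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("demo_ks")  # hypothetical keyspace, RF = 3

    # W = QUORUM (2 of 3) and R = QUORUM (2 of 3), so R + W = 4 > N = 3:
    # every read set overlaps every acknowledged write set on >= 1 replica.
    write = SimpleStatement(
        "UPDATE users SET email = %s WHERE id = %s",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    session.execute(write, ("ritesh@example.com", 42))

    read = SimpleStatement(
        "SELECT email FROM users WHERE id = %s",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    row = session.execute(read, (42,)).one()

For point 3 you would instead pass ConsistencyLevel.ANY on the write,
trading read consistency for write availability.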
>
> On Wed, Feb 23, 2011 at 3:39 PM, Ritesh Tijoriwala <
> tijoriwala.ritesh@gmail.com> wrote:
>
>> > Read repair will probably occur at that point (depending on your
>> > config), which would cause the newest value to propagate to more
>> > replicas.
>>
>> Is the newest value the "quorum" value - meaning the old value gets
>> written back to the nodes holding the newer, non-quorum value - or is it
>> the genuinely new value? :) If the latter, then this seems kind of odd
>> to me, and I don't see how it would be useful to an application. A bug?
>>
>> Thanks,
>> Ritesh
>>
>> On Wed, Feb 23, 2011 at 12:43 PM, Dave Revell wrote:
>>
>>> Ritesh,
>>>
>>> You have seen the problem. Clients may read the newly written value
>>> even though the client performing the write saw it as a failure. When
>>> the client reads, it will use the correct number of replicas for the
>>> chosen CL, then return the newest value seen at any replica. This
>>> "newest value" could be the result of a failed write.
>>>
>>> Read repair will probably occur at that point (depending on your
>>> config), which would cause the newest value to propagate to more
>>> replicas.
>>>
>>> R+W>N guarantees serial order of operations: any read at CL=R that
>>> occurs after a write at CL=W will observe the write. I don't think
>>> this property is relevant to your current question, though.
>>>
>>> Cassandra has no mechanism to "roll back" the partial write, other
>>> than to simply write again. That retry may also fail.
>>>
>>> Best,
>>> Dave
>>>
>>> On Wed, Feb 23, 2011 at 10:12 AM, <tijoriwala.ritesh@gmail.com> wrote:
>>>
>>>> Hi Dave,
>>>> Thanks for your input. In the steps you mention, what happens when a
>>>> client tries to read the value after step 5? Is it possible that the
>>>> client may see the new value? My understanding was that if R + W > N,
>>>> then the client will not see the new value, since a quorum of nodes
>>>> will not agree on it. If that is the case, then it's alright to return
>>>> failure to the client. If not, however, it is difficult to program
>>>> against: after every failure, you as a client are not sure whether it
>>>> was a pseudo-failure with side effects or a real failure.
>>>>
>>>> Thanks,
>>>> Ritesh
>>>>
>>>> Ritesh,
>>>>
>>>> There is no commit protocol. Writes may be persisted on some replicas
>>>> even though the quorum fails. Here's a sequence of events that shows
>>>> the "problem":
>>>>
>>>> 1. Some replica R fails, but recently, so its failure has not yet
>>>> been detected
>>>> 2. A client writes with consistency > 1
>>>> 3. The write goes to all replicas; all replicas except R persist the
>>>> write to disk
>>>> 4. Replica R never responds
>>>> 5. Failure is returned to the client, but the new value is still in
>>>> the cluster, on all replicas except R
>>>>
>>>> Something very similar could happen for CL QUORUM.
>>>>
>>>> This is a conscious design decision: a commit protocol would
>>>> constitute tight coupling between nodes, which goes against the
>>>> Cassandra philosophy. But unfortunately you do have to write your app
>>>> with this case in mind.
>>>>
>>>> Best,
>>>> Dave
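Dave's sequence can be reproduced with a toy model. Below is a minimal,
self-contained simulation in plain Python - no Cassandra involved, all
names illustrative - in which the coordinator sends the write to every
replica, counts acknowledgements against the quorum, and reports failure,
yet the value remains durable on the one replica that did respond.

    N, W = 3, 2  # replication factor 3, QUORUM = 2

    replicas = {"r1": {}, "r2": {}, "r3": {}}
    failed = {"r2", "r3"}  # recently failed; not yet detected, so the
                           # coordinator still sends them the write

    def quorum_write(key, value):
        """Send the write to every replica; succeed only on >= W acks."""
        acks = 0
        for name, store in replicas.items():
            if name in failed:
                continue  # steps 1 and 4: the replica never responds
            store[key] = value  # step 3: surviving replicas persist it
            acks += 1
        return acks >= W

    ok = quorum_write("x", "new")   # steps 2-5
    print(ok)                       # False: client is told the write failed
    print(replicas["r1"].get("x"))  # 'new': the value is still in the cluster

A later read that reaches r1 can return "new", and read repair would then
spread it further - exactly the pseudo-failure with side effects discussed
above.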
>>>>
>>>> On Tue, Feb 22, 2011 at 8:22 PM, tijoriwala.ritesh <
>>>> tijoriwala.ritesh@gmail.com> wrote:
>>>>
>>>> > Hi,
>>>> > I wanted to get details on how Cassandra does synchronous writes to
>>>> > W replicas (out of N). Does it do a 2PC? If not, how does it deal
>>>> > with failures of nodes before it gets to write to W replicas? If
>>>> > the orchestrating node cannot write to W nodes successfully, I
>>>> > guess it will fail the write operation, but what happens to the
>>>> > completed writes on X (W > X) nodes?
>>>> >
>>>> > Thanks,
>>>> > Ritesh
>>>> > --
>>>> > View this message in context:
>>>> > http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055152.html
>>>> > Sent from the cassandra-user@incubator.apache.org mailing list
>>>> > archive at Nabble.com.
>>>>
>>>> Quoted from:
>>>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055408.html
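The invariant underlying the whole exchange - that R + W > N forces every
read quorum to overlap every acknowledged write quorum - can be checked
exhaustively for small N. A short self-contained sketch, plain Python and
nothing Cassandra-specific:

    from itertools import combinations

    def quorums_always_overlap(n, w, r):
        """True iff every size-r read set meets every size-w write set."""
        nodes = range(n)
        return all(
            set(ws) & set(rs)
            for ws in combinations(nodes, w)
            for rs in combinations(nodes, r)
        )

    print(quorums_always_overlap(3, 2, 2))  # True:  R + W = 4 > N = 3
    print(quorums_always_overlap(3, 1, 1))  # False: R + W = 2 <= N,
                                            # stale reads are possible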