From: Jeremiah Jordan <JEREMIAH.JORDAN@morningstar.com>
To: user@cassandra.apache.org
Date: Thu, 4 Aug 2011 12:25:52 -0500
Subject: RE: Write everywhere, read anywhere

If you have RF=3, QUORUM won't fail with one node down, so QUORUM writes and reads stay consistent in that case. If two nodes go down at the same time, you can get inconsistent data from QUORUM writes/reads: the write fails with a timeout, the nodes come back up, and then one read asks the two nodes that were down what the value is, while another read asks the node that stayed up plus a node that was down. Those two reads will get different answers.
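To make that failure sequence concrete, here is a toy Python sketch of the two-nodes-down case (plain Python, not Cassandra code; the node names, values, and timestamps are invented for illustration):

# RF=3, so QUORUM needs 2 replicas.  Each replica stores a (value, timestamp) column.
def read_quorum(replicas, nodes):
    # A quorum read returns the value with the highest timestamp among the nodes asked.
    return max((replicas[n] for n in nodes), key=lambda col: col[1])[0]

replicas = {'A': ('old', 1), 'B': ('old', 1), 'C': ('old', 1)}  # all consistent to start

# B and C go down.  A QUORUM write of 'new' reaches only A and then times out:
replicas['A'] = ('new', 2)

# B and C come back up, still holding the old value.  Two QUORUM reads now disagree:
print(read_quorum(replicas, ['B', 'C']))  # -> 'old' (asked the two nodes that were down)
print(read_quorum(replicas, ['A', 'B']))  # -> 'new' (asked the node that stayed up plus one that was down)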

 

From: Mike Malone [mailto:mike@simplegeo.com]
Sent: Thursday, August 04, 2011 12:16 PM
To: user@cassandra.apache.org
Subject: Re: Write everywhere, read anywhere

 

 

2011/8/3 Patricio Echagüe <patricioe@gmail.com>

 

On Wed, Aug 3, 2011 at 4:00 PM, Philippe <watcherfr@gmail.com> wrote:

Hello,

I have a 3-node, RF=3 cluster configured to write at CL.ALL and read at CL.ONE. When I take one of the nodes down, writes fail, which is what I expect.

When I run a repair, I see data being streamed from those column families, which I didn't expect. How can the nodes diverge? Does this mean that reading at CL.ONE may return inconsistent data?
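For reference, the setup Philippe describes (RF=3, writes at CL.ALL, reads at CL.ONE) would look roughly like the sketch below with the pycassa Thrift client of that era; the host names, keyspace, and column family names are placeholders, and the API details are quoted from memory rather than from this thread:

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily
from pycassa.cassandra.ttypes import ConsistencyLevel

pool = ConnectionPool('MyKeyspace',
                      server_list=['node1:9160', 'node2:9160', 'node3:9160'])

cf = ColumnFamily(pool, 'MyColumnFamily',
                  write_consistency_level=ConsistencyLevel.ALL,  # write everywhere
                  read_consistency_level=ConsistencyLevel.ONE)   # read anywhere

cf.insert('row1', {'col': 'value'})  # rejected up front if any replica is down
print(cf.get('row1'))                # answered by a single replica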

 

We abort the mutation up front when there are not enough replicas alive. If a mutation goes through and a replica goes down in the middle of it, you can end up writing to only some nodes, and the request will time out.

In that case a read at CL.ONE may return inconsistent data.
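That distinction can be sketched as coordinator pseudologic in Python (an illustration of the behaviour described above, not Cassandra source code; the names Unavailable and TimedOut mirror the Thrift exceptions UnavailableException and TimedOutException):

class Unavailable(Exception):
    """Raised up front: not enough replicas alive, nothing was written."""

class TimedOut(Exception):
    """Raised mid-request: the write may have reached a subset of replicas."""

def coordinate_write(alive_replicas, required_acks, acks_received):
    if alive_replicas < required_acks:
        raise Unavailable("mutation aborted before touching any replica")
    if acks_received < required_acks:
        raise TimedOut("partial write is possible; CL.ONE reads may now disagree")
    return "ok"

# CL.ALL on RF=3 with one node already down -> aborted cleanly:
#   coordinate_write(alive_replicas=2, required_acks=3, acks_received=0)  raises Unavailable
# CL.ALL with all nodes up but one dying mid-write -> partial write:
#   coordinate_write(alive_replicas=3, required_acks=3, acks_received=2)  raises TimedOut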

 

Doesn't CL.QUORUM suffer from the same problem? There's no isolation or rollback with CL.QUORUM either. So if I do a quorum write with RF=3 and it fails after hitting a single node, a subsequent quorum read could return the old data (if it hits the two nodes that didn't receive the write) or the new data that failed mid-write (if it hits the node that did receive the write).
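That partial-write case can be enumerated directly (again a toy model in plain Python, with invented node names and timestamps):

from itertools import combinations

# RF=3: the QUORUM write of 'new' failed after reaching only node A.
replicas = {'A': ('new', 2), 'B': ('old', 1), 'C': ('old', 1)}

for quorum in combinations(sorted(replicas), 2):   # every possible 2-node read quorum
    value = max((replicas[n] for n in quorum), key=lambda col: col[1])[0]
    print(quorum, '->', value)
# ('A', 'B') -> new
# ('A', 'C') -> new
# ('B', 'C') -> old   <- the one quorum that never sees the failed write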

 

Basically, the scenarios where CL.ALL + CL.ONE results in a read of inconsistent data could also cause a CL.QUORUM write followed by a CL.QUORUM read to return inconsistent data. Right? The problem (if there is one) is that even in the quorum case, columns with the most recent timestamp win during repair resolution, not columns that have quorum consensus.
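That last point (most-recent-timestamp wins, not majority) is essentially the whole resolution rule; a one-function illustration in plain Python:

def resolve(columns):
    """columns: (value, timestamp) pairs returned by different replicas."""
    return max(columns, key=lambda col: col[1])[0]

# Two replicas agree on 'old'; one holds a newer 'new' from a failed write.
# The lone newer column still wins the reconciliation:
print(resolve([('old', 1), ('old', 1), ('new', 2)]))  # -> 'new'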

 

Mike
