From: Jeremiah Jordan <JEREMIAH.JORDAN@morningstar.com>
To: user@cassandra.apache.org
Date: Thu, 4 Aug 2011 12:25:52 -0500
Subject: RE: Write everywhere, read anywhere

If you have RF=3, QUORUM won't fail with one node down, so QUORUM writes and reads stay consistent in that case. If two nodes go down at the same time, you can get inconsistent data from QUORUM writes/reads: the write fails with a timeout, the nodes come back up, and then one read asks the two nodes that were down what the value is, while another read asks the node that stayed up plus a node that was down. Those two reads will get different answers.
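To make that failure sequence concrete, here is a toy Python sketch of the two-nodes-down case (plain Python, not Cassandra code; the node names, values, and timestamps are invented for illustration):

# RF=3, so QUORUM needs 2 replicas.  Each replica stores a (value, timestamp) column.
def read_quorum(replicas, nodes):
    # A quorum read returns the value with the highest timestamp among the nodes asked.
    return max((replicas[n] for n in nodes), key=lambda col: col[1])[0]

replicas = {'A': ('old', 1), 'B': ('old', 1), 'C': ('old', 1)}  # all consistent to start

# B and C go down.  A QUORUM write of 'new' reaches only A and then times out:
replicas['A'] = ('new', 2)

# B and C come back up, still holding the old value.  Two QUORUM reads now disagree:
print(read_quorum(replicas, ['B', 'C']))  # -> 'old' (asked the two nodes that were down)
print(read_quorum(replicas, ['A', 'B']))  # -> 'new' (asked the node that stayed up plus one that was down)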

 

From: Mike Malone [mailto:mike@simplegeo.com]
Sent: Thursday, August 04, 2011 12:16 PM
To: user@cassandra.apache.org
Subject: Re: Write everywhere, read anywhere

 

 

2011/8/3 Patricio Echagüe <patricioe@gmail.com>

 

On Wed, Aug 3, 2011 at 4:00 PM, Philippe <watcherfr@gmail.com> wrote:

Hello,

I have a 3-node, RF=3 cluster configured to write at CL.ALL and read at CL.ONE. When I take one of the nodes down, writes fail, which is what I expect.

When I run a repair, I see data being streamed from those column families, which I didn't expect. How can the nodes diverge? Does this mean that reading at CL.ONE may return inconsistent data?
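For reference, the setup Philippe describes (RF=3, writes at CL.ALL, reads at CL.ONE) would look roughly like the sketch below with the pycassa Thrift client of that era; the host names, keyspace, and column family names are placeholders, and the API details are quoted from memory rather than from this thread:

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily
from pycassa.cassandra.ttypes import ConsistencyLevel

pool = ConnectionPool('MyKeyspace',
                      server_list=['node1:9160', 'node2:9160', 'node3:9160'])

cf = ColumnFamily(pool, 'MyColumnFamily',
                  write_consistency_level=ConsistencyLevel.ALL,  # write everywhere
                  read_consistency_level=ConsistencyLevel.ONE)   # read anywhere

cf.insert('row1', {'col': 'value'})  # rejected up front if any replica is down
print(cf.get('row1'))                # answered by a single replica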

 

We abort the mutation up front when there are not enough replicas alive. If a mutation goes through and a replica goes down in the middle of it, you can end up writing to only some nodes, and the request will time out.

In that case a read at CL.ONE may return inconsistent data.
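That distinction can be sketched as coordinator pseudologic in Python (an illustration of the behaviour described above, not Cassandra source code; the names Unavailable and TimedOut mirror the Thrift exceptions UnavailableException and TimedOutException):

class Unavailable(Exception):
    """Raised up front: not enough replicas alive, nothing was written."""

class TimedOut(Exception):
    """Raised mid-request: the write may have reached a subset of replicas."""

def coordinate_write(alive_replicas, required_acks, acks_received):
    if alive_replicas < required_acks:
        raise Unavailable("mutation aborted before touching any replica")
    if acks_received < required_acks:
        raise TimedOut("partial write is possible; CL.ONE reads may now disagree")
    return "ok"

# CL.ALL on RF=3 with one node already down -> aborted cleanly:
#   coordinate_write(alive_replicas=2, required_acks=3, acks_received=0)  raises Unavailable
# CL.ALL with all nodes up but one dying mid-write -> partial write:
#   coordinate_write(alive_replicas=3, required_acks=3, acks_received=2)  raises TimedOut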

 

Doesn't CL.QUORUM suffer from the same problem? There's no isolation or rollback with CL.QUORUM either. So if I do a quorum write with RF=3 and it fails after hitting a single node, a subsequent quorum read could return the old data (if it hits the two nodes that didn't receive the write) or the new data that failed mid-write (if it hits the node that did receive the write).
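That partial-write case can be enumerated directly (again a toy model in plain Python, with invented node names and timestamps):

from itertools import combinations

# RF=3: the QUORUM write of 'new' failed after reaching only node A.
replicas = {'A': ('new', 2), 'B': ('old', 1), 'C': ('old', 1)}

for quorum in combinations(sorted(replicas), 2):   # every possible 2-node read quorum
    value = max((replicas[n] for n in quorum), key=lambda col: col[1])[0]
    print(quorum, '->', value)
# ('A', 'B') -> new
# ('A', 'C') -> new
# ('B', 'C') -> old   <- the one quorum that never sees the failed write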

 

Basically, the scenarios where CL.ALL + CL.ONE results in a read of inconsistent data could also cause a CL.QUORUM write followed by a CL.QUORUM read to return inconsistent data. Right? The problem (if there is one) is that even in the quorum case, columns with the most recent timestamp win during repair resolution, not columns that have quorum consensus.
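That last point (most-recent-timestamp wins, not majority) is essentially the whole resolution rule; a one-function illustration in plain Python:

def resolve(columns):
    """columns: (value, timestamp) pairs returned by different replicas."""
    return max(columns, key=lambda col: col[1])[0]

# Two replicas agree on 'old'; one holds a newer 'new' from a failed write.
# The lone newer column still wins the reconciliation:
print(resolve([('old', 1), ('old', 1), ('new', 2)]))  # -> 'new'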

 

Mike
