From: aaron morton <aaron@thelastpickle.com>
Subject: Re: Replication Factor and Consistency Level Confusion
Date: Fri, 21 Dec 2012 16:39:44 +1300
To: user@cassandra.apache.org

>> this is actually what is happening, how is it possible to ever have a
>> node-failure resilient cassandra cluster?

Background: http://thelastpickle.com/2011/06/13/Down-For-Me/

> I would suggest double-checking your test setup; also, make sure you
> use the same row keys every time (if this is not already the case) so
> that you have repeatable results.

Take a look at the nodetool getendpoints command. It will tell you which
nodes a key is stored on.
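For example (the keyspace, column family and key here are made-up
placeholders; substitute your own):

    nodetool -h 127.0.0.1 getendpoints MyKeyspace MyColumnFamily my_row_key

That prints the addresses of the replicas that own the given key, so you
can check whether the downed node was one of them.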
Though for RF 3 and N 3 it's all of them :)

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/12/2012, at 12:54 AM, Tristan Seligmann <mithrandi@mithrandi.net> wrote:

> On Thu, Dec 20, 2012 at 11:26 AM, Vasileios Vlachos
> <vasileiosvlachos@gmail.com> wrote:
>> Initially we were thinking the same thing, that an explanation would
>> be that the "wrong" node could be down, but then isn't this something
>> that hinted handoff sorts out?
>
> If a node is partitioned from the rest of the cluster (i.e. the node
> goes down, but later comes back with the same data it had), it will
> obviously be out of date with regard to any writes that happened while
> it was down. Anti-entropy (nodetool repair) and read repair will
> repair this inconsistency over time, but not right away; hinted
> handoff is an optimization that allows the node to become mostly
> consistent right away on rejoining the cluster, as the other nodes
> will have stored hints for it while it was down, and will send them to
> it once the node is back up.
>
> However, the important thing to note is that this is an
> /optimization/. If a replica is down, then it will not be able to help
> satisfy any consistency level requirement, except for the special
> case of CL=ANY. If you use another CL like TWO, then two actual
> replica nodes must be up for the ranges you are writing to; a node
> that is not a replica but will write a hint does not count.
>
>> Test 2 (2/3 Nodes UP):
>> CL  :  ANY  ONE  TWO  THREE  QUORUM  ALL
>> RF 2:  OK   OK   x    x      OK      x
>
> For this test, QUORUM = RF/2+1 = 2/2+1 = 2. A write at QUORUM should
> have succeeded if both of the replicas for the range were up, but if
> one of the replicas for the range was the downed node, then it would
> have failed. I think you can use the 'nodetool getendpoints' command
> to list the nodes that are replicas for the given row key.
>
> I am unable to explain how a write at QUORUM could succeed if a write
> at TWO for the same key failed.
>
>> Test 3 (2/3 Nodes UP):
>> CL  :  ANY  ONE  TWO  THREE  QUORUM  ALL
>> RF 3:  OK   OK   x    x      OK      OK
>
> For this test, QUORUM = RF/2+1 = 3/2+1 = 2 (integer division). Again,
> I am unable to explain why a write at QUORUM would succeed if a write
> at TWO failed, and I am also unable to explain how a write at ALL
> could succeed, for any key, if one of the nodes is down.
>
> I would suggest double-checking your test setup; also, make sure you
> use the same row keys every time (if this is not already the case) so
> that you have repeatable results.
>
>> Furthermore, with regards to being "unlucky" with the "wrong node":
>> if this is actually what is happening, how is it possible to ever
>> have a node-failure resilient cassandra cluster? My understanding of
>> this implies that even with 100 nodes, every 1/100 writes would fail
>> until the node is replaced/repaired.
>
> RF is the important number when considering fault tolerance in your
> cluster, not the number of nodes. If RF=3, and you read and write at
> QUORUM, then you can tolerate one node being down in the range you are
> operating on. If you need to be able to tolerate two nodes being down,
> RF=5 and QUORUM would work. In other words, if you need better fault
> tolerance, RF is what you need to increase; if you need better
> performance, or you need to store more data, then N (the number of
> nodes in the cluster) is what you need to increase. Of course, N must
> be at least as big as RF...
> --
> mithrandi, i Ainil en-Balandor, a faer Ambar
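To make the quorum arithmetic above concrete, here is a tiny
illustrative sketch (plain Python, nothing Cassandra-specific; the
helper names are made up). It reproduces the outcomes Tristan's
reasoning predicts for Test 3, which is what makes the reported
TWO/ALL results look anomalous:

    def quorum(rf):
        # Cassandra computes QUORUM as RF/2 + 1 using integer division.
        return rf // 2 + 1

    def write_should_succeed(required, live_replicas):
        # A write at a given CL needs `required` actual replicas of the
        # key's range to be up; a non-replica node that merely stores a
        # hint does not count (CL ANY is the one exception).
        return live_replicas >= required

    # Test 3 from the thread: RF=3 on a 3-node cluster, one node down,
    # so 2 of the key's 3 replicas are live.
    rf, live = 3, 2
    for name, required in [("ONE", 1), ("TWO", 2), ("THREE", 3),
                           ("QUORUM", quorum(rf)), ("ALL", rf)]:
        print(name, "OK" if write_should_succeed(required, live) else "x")
    # -> ONE OK, TWO OK, THREE x, QUORUM OK, ALL x

    # Fault tolerance at QUORUM is RF - quorum(RF), independent of N:
    for rf in (2, 3, 5):
        print("RF", rf, "tolerates", rf - quorum(rf), "replicas down")
    # -> RF 2 tolerates 0, RF 3 tolerates 1, RF 5 tolerates 2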