From: aaron morton <aaron@thelastpickle.com>
Subject: Re: Replication Factor and Consistency Level Confusion
Date: Fri, 21 Dec 2012 16:39:44 +1300
To: user@cassandra.apache.org

>> this is actually what is happening, how is it possible to ever have a
>> node-failure resilient cassandra cluster?

Background: http://thelastpickle.com/2011/06/13/Down-For-Me/

> I would suggest double-checking your test setup; also, make sure you
> use the same row keys every time (if this is not already the case) so
> that you have repeatable results.

Take a look at the nodetool getendpoints command. It will tell you which
nodes a key is stored on.
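For example (the keyspace, column family and key here are made-up
placeholders; substitute your own):

    nodetool -h 127.0.0.1 getendpoints MyKeyspace MyColumnFamily my_row_key

That prints the addresses of the replicas that own the given key, so you
can check whether the downed node was one of them.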
Though for RF 3 and N 3 it's all of them :)

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/12/2012, at 12:54 AM, Tristan Seligmann <mithrandi@mithrandi.net> wrote:

> On Thu, Dec 20, 2012 at 11:26 AM, Vasileios Vlachos
> <vasileiosvlachos@gmail.com> wrote:
>> Initially we were thinking the same thing, that an explanation would
>> be that the "wrong" node could be down, but then isn't this something
>> that hinted handoff sorts out?
>
> If a node is partitioned from the rest of the cluster (i.e. the node
> goes down, but later comes back with the same data it had), it will
> obviously be out of date with regard to any writes that happened while
> it was down. Anti-entropy (nodetool repair) and read repair will
> repair this inconsistency over time, but not right away; hinted
> handoff is an optimization that allows the node to become mostly
> consistent right away on rejoining the cluster, as the other nodes
> will have stored hints for it while it was down, and will send them to
> it once the node is back up.
>
> However, the important thing to note is that this is an
> /optimization/. If a replica is down, then it will not be able to help
> satisfy any consistency level requirement, except for the special
> case of CL=ANY. If you use another CL like TWO, then two actual
> replica nodes must be up for the ranges you are writing to; a node
> that is not a replica but will write a hint does not count.
>
>> Test 2 (2/3 Nodes UP):
>> CL  :  ANY  ONE  TWO  THREE  QUORUM  ALL
>> RF 2:  OK   OK   x    x      OK      x
>
> For this test, QUORUM = RF/2+1 = 2/2+1 = 2. A write at QUORUM should
> have succeeded if both of the replicas for the range were up, but if
> one of the replicas for the range was the downed node, then it would
> have failed. I think you can use the 'nodetool getendpoints' command
> to list the nodes that are replicas for the given row key.
>
> I am unable to explain how a write at QUORUM could succeed if a write
> at TWO for the same key failed.
>
>> Test 3 (2/3 Nodes UP):
>> CL  :  ANY  ONE  TWO  THREE  QUORUM  ALL
>> RF 3:  OK   OK   x    x      OK      OK
>
> For this test, QUORUM = RF/2+1 = 3/2+1 = 2 (integer division). Again,
> I am unable to explain why a write at QUORUM would succeed if a write
> at TWO failed, and I am also unable to explain how a write at ALL
> could succeed, for any key, if one of the nodes is down.
>
> I would suggest double-checking your test setup; also, make sure you
> use the same row keys every time (if this is not already the case) so
> that you have repeatable results.
>
>> Furthermore, with regards to being "unlucky" with the "wrong node":
>> if this is actually what is happening, how is it possible to ever
>> have a node-failure resilient cassandra cluster? My understanding of
>> this implies that even with 100 nodes, every 1/100 writes would fail
>> until the node is replaced/repaired.
>
> RF is the important number when considering fault tolerance in your
> cluster, not the number of nodes. If RF=3, and you read and write at
> QUORUM, then you can tolerate one node being down in the range you are
> operating on. If you need to be able to tolerate two nodes being down,
> RF=5 and QUORUM would work. In other words, if you need better fault
> tolerance, RF is what you need to increase; if you need better
> performance, or you need to store more data, then N (the number of
> nodes in the cluster) is what you need to increase. Of course, N must
> be at least as big as RF...
> --
> mithrandi, i Ainil en-Balandor, a faer Ambar
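To make the quorum arithmetic above concrete, here is a tiny
illustrative sketch (plain Python, nothing Cassandra-specific; the
helper names are made up). It reproduces the outcomes Tristan's
reasoning predicts for Test 3, which is what makes the reported
TWO/ALL results look anomalous:

    def quorum(rf):
        # Cassandra computes QUORUM as RF/2 + 1 using integer division.
        return rf // 2 + 1

    def write_should_succeed(required, live_replicas):
        # A write at a given CL needs `required` actual replicas of the
        # key's range to be up; a non-replica node that merely stores a
        # hint does not count (CL ANY is the one exception).
        return live_replicas >= required

    # Test 3 from the thread: RF=3 on a 3-node cluster, one node down,
    # so 2 of the key's 3 replicas are live.
    rf, live = 3, 2
    for name, required in [("ONE", 1), ("TWO", 2), ("THREE", 3),
                           ("QUORUM", quorum(rf)), ("ALL", rf)]:
        print(name, "OK" if write_should_succeed(required, live) else "x")
    # -> ONE OK, TWO OK, THREE x, QUORUM OK, ALL x

    # Fault tolerance at QUORUM is RF - quorum(RF), independent of N:
    for rf in (2, 3, 5):
        print("RF", rf, "tolerates", rf - quorum(rf), "replicas down")
    # -> RF 2 tolerates 0, RF 3 tolerates 1, RF 5 tolerates 2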