From: aaron morton <aaron@thelastpickle.com>
Subject: Re: Virtual Nodes, lots of physical nodes and potentially increasing outage count?
Date: Wed, 12 Dec 2012 09:23:09 +1300
To: user@cassandra.apache.org
> Is it possible to configure or write a snitch that would create separate distribution zones within the cluster? (e.g. 144 nodes in cluster, split into 12 zones. Data stored to node 1 could only be replicated to one of 11 other nodes in the same distribution zone).
This is kind of what NTS (NetworkTopologyStrategy) does if you have nodes in different racks.

A replica is placed in each rack, and the process wraps around and continues until RF replicas are located. If the number of racks is not equal to the RF you then get some unevenness (hey, what do you know, that's a real word :) ).
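For anyone who wants to see the shape of that wrap-around, here is a minimal Python sketch of rack-aware placement within one DC. It is not Cassandra's actual NTS code, and the ring, rack map and tokens below are invented for illustration: walk the ring clockwise from the key's token, take the first node seen in each unused rack, and fall back to already-used racks only when there are fewer racks than RF.

from bisect import bisect_right

# Sketch of rack-aware placement (the idea Aaron describes, not the real code).
# ring: list of (token, node) sorted by token; rack_of: node -> rack; rf: replication factor.
def place_replicas(key_token, ring, rack_of, rf):
    tokens = [t for t, _ in ring]
    replicas, used_racks, skipped = [], set(), []
    start = bisect_right(tokens, key_token)
    for i in range(len(ring)):                    # walk clockwise, wrapping around
        node = ring[(start + i) % len(ring)][1]
        if node in replicas:
            continue
        if rack_of[node] not in used_racks:       # first node seen in a new rack
            replicas.append(node)
            used_racks.add(rack_of[node])
        else:                                     # remember it in case racks run out
            skipped.append(node)
        if len(replicas) == rf:
            return replicas
    for node in skipped:                          # fewer racks than RF: the "unevenness" case
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == rf:
            break
    return replicas

# Toy example: 6 nodes in 3 racks, RF=3 picks one replica per rack.
ring = [(0, "n1"), (100, "n2"), (200, "n3"), (300, "n4"), (400, "n5"), (500, "n6")]
racks = {"n1": "r1", "n2": "r2", "n3": "r3", "n4": "r1", "n5": "r2", "n6": "r3"}
print(place_replicas(150, ring, racks, 3))        # ['n3', 'n4', 'n5']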

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 12/12/2012, at 6:42 AM, Eric Parusel <ericparusel@gmail.com> wrote:

Ok, thanks Richard. That's good to hear.

However, I still contend that as node count increases to infinity, the probability of there being at least two node failures in the cluster at any time would increase to 100%.
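That intuition is easy to check with a quick binomial calculation. Assuming each node is independently down with some small probability at any instant (the 0.1% figure below is invented purely for illustration), the chance of at least two nodes being down at once climbs steadily with cluster size:

# P(at least 2 of n nodes down at the same instant), assuming independent
# failures with per-node probability p.  p = 0.001 is an invented number.
def p_two_or_more_down(n, p=0.001):
    none_down = (1 - p) ** n
    one_down = n * p * (1 - p) ** (n - 1)
    return 1 - none_down - one_down

for n in (12, 144, 1000, 10000):
    print(n, round(p_two_or_more_down(n), 4))
# roughly: 12 -> 0.0001, 144 -> 0.0094, 1000 -> 0.2642, 10000 -> 0.9995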

I think of this as somewhat analogous to RAID -- I would not be comfortable with a 144+ disk RAID 6 array, no matter the rebuild speed :)

Is it possible to configure or write a snitch that would create separate distribution zones within the cluster? (e.g. 144 nodes in cluster, split into 12 zones. Data stored to node 1 could only be replicated to one of 11 other nodes in the same distribution zone).
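The attraction of those zones is the same as RAID groups: if replicas never leave a zone, a double failure only threatens rows whose zone lost both nodes. A rough way to put a number on it, using Eric's own 144-node / 12-zone layout (this is just the arithmetic, not a Cassandra feature):

from math import comb

nodes, zones = 144, 12
per_zone = nodes // zones          # 12 nodes per zone, as in the example

# Given exactly two nodes are down at the same time, how often do both fall in
# the same zone?  Only that case can cost a row two of its replicas when
# replication is confined to the zone (and with RF=3, losing one replica still
# leaves quorum available).
same_zone = zones * comb(per_zone, 2) / comb(nodes, 2)
print(round(same_zone, 4))         # ~0.0769, so ~92% of double failures are benign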


On Tue, Dec 11, 2012 at 3:24 AM, Richard Low <rlow@acunu.com> wrote:

Hi Eric,

The time to recover one node is limited by that node, but the recovery time that matters most is just the time to replicate the data that is missing from that node. This is the removetoken operation (called removenode in 1.2), and it gets faster the more nodes you have.
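Back-of-the-envelope, that scaling looks like this: the dead node's replicas are streamed by the surviving peers in parallel, so each peer moves a smaller slice as the cluster grows. The data size and bandwidth below are invented just to show the shape of the curve:

# Rough model of re-replicating a dead node's data when the work is spread
# across all surviving peers (as with vnodes).  All numbers are invented.
data_per_node_gb = 1000            # assumed data held by the dead node
stream_mb_per_s = 50               # assumed usable streaming bandwidth per peer

for n in (6, 12, 48, 144):
    peers = n - 1
    seconds = data_per_node_gb * 1024 / (stream_mb_per_s * peers)
    print(f"{n:4d} nodes: ~{seconds / 3600:.2f} h to re-replicate")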

Richard.

On 11 December 2012 08:39, Eric Parusel <ericparusel@gmail.com> wrote:
Thanks for your thoughts guys.

I agree that with vnodes total downtime is lessened, although it also seems that the total number of outages (however small) would be greater.

But I think downtime is only lessened up to a certain cluster size.

I'm thinking that as the cluster continues to grow:
  - node rebuild time will max out (a single node only has so much write bandwidth)
  - the probability of 2 nodes being down at any given time will continue to increase -- even if you consider only non-correlated failures.

Therefore, when adding nodes beyond the point where node rebuild time maxes out, both the total number of outages *and* overall downtime would increase?
Thanks,
Eric
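Eric's closing point can also be sketched numerically: once rebuild time bottoms out (a single replacement node can only write so fast), the vulnerability window per failure stops shrinking, so the expected number of overlapping failures grows roughly with the square of the cluster size. Every number below is invented; only the shape matters:

from math import comb

fail_rate = 0.5          # assumed node failures per node per year
window_h = 3.0           # assumed floor on rebuild time, in hours
hours_per_year = 8760

for n in (48, 144, 432):
    failures = int(n * fail_rate)
    # expected pairs of failures whose rebuild windows overlap, treating
    # failure times as uniform over the year
    overlaps = comb(failures, 2) * (2 * window_h / hours_per_year)
    print(f"{n:4d} nodes: ~{failures} rebuilds/yr, "
          f"~{overlaps:.1f} with a second node down at the same time")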

On Mon, Dec 10, 2012 at 7:00 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
Assume you need to work at quorum in a non-vnode scenario. If 2 nodes in a row in the ring are down, some number of quorum operations will fail with UnavailableException (TimeoutException right after the failures), because for a given range of tokens quorum will be impossible while it remains possible for others.

In a vnode world, if any two nodes are down, then the intersection of the vnode token ranges they hold is unavailable.

I think it is two sides of the same coin.
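A small simulation makes the coin visible. With two random nodes down and RF=3 (SimpleStrategy-style placement, no racks), the single-token ring rarely loses quorum anywhere, while a vnode ring almost always loses quorum for some sliver, yet the average fraction of the ring affected comes out about the same either way. Cluster size, token counts and trial counts below are arbitrary:

import random

def unavailable_fraction(n_nodes, tokens_per_node, rf=3):
    # Random ring: each node owns tokens_per_node random tokens in [0, 1).
    ring = sorted((random.random(), node)
                  for node in range(n_nodes)
                  for _ in range(tokens_per_node))
    down = set(random.sample(range(n_nodes), 2))
    lost = 0.0
    for i, (tok, _) in enumerate(ring):
        # Replicas of the range ending at ring[i]: next rf distinct nodes clockwise.
        replicas, j = [], i
        while len(replicas) < rf:
            node = ring[j % len(ring)][1]
            if node not in replicas:
                replicas.append(node)
            j += 1
        if sum(r in down for r in replicas) >= 2:      # quorum lost for this range
            prev = ring[i - 1][0] if i else ring[-1][0] - 1.0
            lost += tok - prev
    return lost

for v in (1, 256):
    runs = [unavailable_fraction(48, v) for _ in range(200)]
    hit = sum(l > 0 for l in runs) / len(runs)
    avg = sum(runs) / len(runs)
    print(f"{v:3d} tokens/node: quorum lost somewhere in {hit:.0%} of trials, "
          f"avg {avg:.3%} of the ring affected")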


On Mon, Dec 10, 2012 at 7:41 AM, Richard Low <rlow@acunu.com> wrote:

Hi Tyler,

You're right, the math does assume independence, which is unlikely to be accurate. But if you do have correlated failure modes, e.g. same power, racks, DC, etc., then you can still use Cassandra's rack-aware or DC-aware features to ensure replicas are spread around so your cluster can survive the correlated failure mode. So I would expect vnodes to improve uptime in all scenarios, but I haven't done the math to prove it.

Richard.

--
Richard Low
Acunu | http://www.acunu.com | @acunu

