From: Eric Parusel
To: user@cassandra.apache.org
Date: Tue, 11 Dec 2012 09:42:06 -0800
Subject: Re: Virtual Nodes, lots of physical nodes and potentially increasing outage count?

Ok, thanks Richard. That's good to hear.

However, I still contend that as node count increases to infinity, the probability of there being at least two node failures in the cluster at any time would increase to 100%.

I think of this as somewhat analogous to RAID -- I would not be comfortable with a 144+ disk RAID 6 array, no matter the rebuild speed :)

Is it possible to configure or write a snitch that would create separate distribution zones within the cluster? (e.g. 144 nodes in cluster, split into 12 zones. Data stored to node 1 could only be replicated to one of 11 other nodes in the same distribution zone).

On Tue, Dec 11, 2012 at 3:24 AM, Richard Low wrote:
> Hi Eric,
>
> The time to recover one node is limited by that node, but the time to
> recover that's most important is just the time to replicate the data that
> is missing from that node. This is the removetoken operation (called
> removenode in 1.2), and this gets faster the more nodes you have.
>
> Richard.
>
>
> On 11 December 2012 08:39, Eric Parusel wrote:
>
>> Thanks for your thoughts guys.
>>
>> I agree that with vnodes total downtime is lessened. Although it also
>> seems that the total number of outages (however small) would be greater.
>>
>> But I think downtime is only lessened up to a certain cluster size.
>>
>> I'm thinking that as the cluster continues to grow:
>>   - node rebuild time will max out (a single node only has so much write
>>     bandwidth)
>>   - the probability of 2 nodes being down at any given time will continue
>>     to increase -- even if you consider only non-correlated failures.
>>
>> Therefore, when adding nodes beyond the point where node rebuild time
>> maxes out, both the total number of outages *and* overall downtime would
>> increase?
>>
>> Thanks,
>> Eric
>>
>>
>> On Mon, Dec 10, 2012 at 7:00 AM, Edward Capriolo wrote:
>>
>>> Assuming you need to work with quorum in a non-vnode scenario. That
>>> means that if 2 nodes in a row in the ring are down some number of quorum
>>> operations will fail with UnavailableException (TimeoutException right
>>> after the failures). This is because for a given range of tokens quorum
>>> will be impossible, but quorum will be possible for others.
>>>
>>> In a vnode world if any two nodes are down, then the intersection of
>>> vnode token ranges they have are unavailable.
>>>
>>> I think it is two sides of the same coin.
>>>
>>>
>>> On Mon, Dec 10, 2012 at 7:41 AM, Richard Low wrote:
>>>
>>>> Hi Tyler,
>>>>
>>>> You're right, the math does assume independence which is unlikely to be
>>>> accurate. But if you do have correlated failure modes e.g. same power,
>>>> racks, DC, etc. then you can still use Cassandra's rack-aware or DC-aware
>>>> features to ensure replicas are spread around so your cluster can survive
>>>> the correlated failure mode. So I would expect vnodes to improve uptime in
>>>> all scenarios, but haven't done the math to prove it.
>>>>
>>>> Richard.
>>>>
>>>
>>
>
> --
> Richard Low
> Acunu | http://www.acunu.com | @acunu

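
A rough sketch of the arithmetic behind the "at least two nodes down" argument, assuming independent failures as discussed in the thread. The per-node down probability `p` and both function names are made up for illustration; nothing here is a Cassandra API. Note that two nodes being down at once only causes unavailable ranges when those nodes share replicas, so the first number is an upper bound on trouble, not a direct outage rate. The second function covers the 144-node, 12-zone example: the chance that two randomly chosen down nodes sit in the same zone.

```python
def p_at_least_two_down(n: int, p: float) -> float:
    """P(two or more of n nodes are down at once), nodes failing independently.

    p is the assumed probability that any single node is down at a given instant.
    """
    p_none = (1 - p) ** n                 # no node is down
    p_one = n * p * (1 - p) ** (n - 1)    # exactly one node is down
    return 1 - p_none - p_one


def p_pair_shares_zone(n: int, zones: int) -> float:
    """P(two distinct random nodes are in the same zone), zones of equal size."""
    per_zone = n // zones
    # Fix the first node; per_zone - 1 of the remaining n - 1 nodes share its zone.
    return (per_zone - 1) / (n - 1)


if __name__ == "__main__":
    p = 0.001  # assumed value, purely illustrative
    for n in (12, 144, 1000, 10000):
        print(n, p_at_least_two_down(n, p))
    print(p_pair_shares_zone(144, 12))  # 11/143
```

For any fixed `p` greater than zero the first value climbs toward 1 as `n` grows, which is the point being made about very large clusters; the zone split lowers the share of double failures that land on nodes able to hold the same data.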