From: Richard Low
Date: Mon, 10 Dec 2012 12:41:11 +0000
To: user@cassandra.apache.org
Subject: Re: Virtual Nodes, lots of physical nodes and potentially increasing outage count?

Hi Tyler,

You're right, the math does assume independence, which is unlikely to be accurate. But if you do have correlated failure modes, e.g. same power, racks, DC, etc., then you can still use Cassandra's rack-aware or DC-aware features to ensure replicas are spread around, so your cluster can survive the correlated failure mode.
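For example, something like the following (a rough sketch; the keyspace, DC and rack names are hypothetical) uses NetworkTopologyStrategy together with GossipingPropertyFileSnitch, so that each range's replicas land in both data centres and, within a DC, on distinct racks where possible:

    # cassandra.yaml
    endpoint_snitch: GossipingPropertyFileSnitch

    # cassandra-rackdc.properties on each node, describing where it lives
    dc=DC1
    rack=RACK1

    -- CQL: three replicas in each of two data centres
    CREATE KEYSPACE myapp
      WITH replication = {'class': 'NetworkTopologyStrategy',
                          'DC1': 3, 'DC2': 3};

With that layout (and at least RF racks per DC), a whole rack failing takes out at most one replica of any range in that DC, and losing an entire DC still leaves a complete replica set in the other one.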
So I would expect vnodes to improve uptime in all scenarios, but haven't done the math to prove it.

Richard.

On 9 December 2012 17:50, Tyler Hobbs wrote:
> Nicolas,
>
> Strictly speaking, your math makes the assumption that the failures of
> different nodes are probabilistically independent events. This is, of
> course, not an accurate assumption for real-world conditions. Nodes share
> racks, networking equipment, power, availability zones, data centers, etc.
> So, I think the mathematical assertion is not quite as strong as one would
> like, but it's certainly a good argument for handling certain types of
> node failures.
>
>
> On Fri, Dec 7, 2012 at 11:27 AM, Nicolas Favre-Felix wrote:
>
>> Hi Eric,
>>
>> Your concerns are perfectly valid.
>>
>> We (Acunu) led the design and implementation of this feature and spent a
>> long time looking at the impact of such a large change.
>> We summarized some of our notes and wrote about the impact of virtual
>> nodes on cluster uptime a few months back:
>> http://www.acunu.com/2/post/2012/10/improving-cassandras-uptime-with-virtual-nodes.html
>>
>> The main argument in this blog post is that you only have a failure to
>> perform quorum reads/writes if at least RF replicas fail within the time
>> it takes to rebuild the first dead node. We show that virtual nodes
>> actually decrease the probability of failure, by streaming data from all
>> nodes and thereby improving the rebuild time.
>>
>> Regards,
>>
>> Nicolas
>>
>>
>> On Wed, Dec 5, 2012 at 4:45 PM, Eric Parusel wrote:
>>
>>> Hi all,
>>>
>>> I've been wondering about virtual nodes and how cluster uptime might
>>> change as cluster size increases.
>>>
>>> I understand clusters will benefit from increased reliability due to
>>> faster rebuild time, but does that hold true for large clusters?
>>>
>>> It seems that (and correct me if I'm wrong here), since every physical
>>> node will likely share some small amount of data with every other node,
>>> as the count of physical nodes in a Cassandra cluster increases (let's
>>> say into the triple digits) the probability of at least one failure to
>>> Quorum read/write occurring in a given time period would *increase*.
>>>
>>> Would this hold true, at least until the number of physical nodes
>>> becomes greater than num_tokens per node?
>>>
>>> I understand that the window of failure for affected ranges would
>>> probably be small, but we do Quorum reads of many keys, so we'd likely
>>> hit every virtual range with our queries, even if num_tokens was 256.
>>>
>>> Thanks,
>>> Eric
>>>
>>
>
> --
> Tyler Hobbs
> DataStax
>

--
Richard Low
Acunu | http://www.acunu.com | @acunu
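To put very rough numbers on the uptime argument above (a back-of-the-envelope sketch only: it keeps the independence assumption that Tyler notes is optimistic, assumes RF=3 with ring-neighbour placement, and treats rebuild-time scaling as the blog post describes; all of these are simplifying assumptions, not measurements). A range loses quorum if a second replica fails while the first dead node is still being rebuilt, so approximately:

    P(some range loses quorum) = 1 - (1 - T/MTBF)^k  ~  k * T/MTBF    (for small T/MTBF)

where T is the rebuild window and k is the number of other nodes that share at least one range with the dead node. Without vnodes, k ~ 2*(RF-1) = 4 (only the ring neighbours), but T is the full rebuild time from those few sources. With vnodes, k grows towards N-1, but if the rebuild can stream from all of those nodes then T shrinks by roughly the same factor. To first order the two effects cancel, so the wider data sharing does not by itself make quorum failures more likely; whether vnodes then come out ahead hinges on how much the rebuild actually speeds up, which is precisely the point the blog post argues.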
