However, I still contend that as node count increases, the probability of there being at least two simultaneous node failures in the cluster at any given time approaches 100%.
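For concreteness, here's a quick back-of-the-envelope sketch of that claim, assuming each node is independently down with some small probability p (the function name and the example numbers are mine, purely for illustration):

```python
def p_at_least_two_down(n, p):
    """Probability of >= 2 simultaneous node failures among n nodes,
    assuming each node is independently down with probability p."""
    # 1 minus P(zero nodes down) minus P(exactly one node down)
    return 1 - (1 - p) ** n - n * p * (1 - p) ** (n - 1)

# With p = 0.1% per node, the probability climbs steadily with cluster size:
for n in (12, 144, 1000, 10000):
    print(n, round(p_at_least_two_down(n, 0.001), 6))
```

Under the (admittedly unrealistic) independence assumption, this tends to 1 as n grows, which is the intuition above.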

I think of this as somewhat analogous to RAID -- I would not be comfortable with a 144+ disk RAID 6 array, no matter the rebuild speed :)

Is it possible to configure or write a snitch that would create separate distribution zones within the cluster?  (e.g. 144 nodes in a cluster, split into 12 zones: data stored to node 1 could only be replicated to one of the 11 other nodes in the same distribution zone.)

On Tue, Dec 11, 2012 at 3:24 AM, Richard Low <rlow@acunu.com> wrote:

Hi Eric,

The time to rebuild one node is limited by that node, but the recovery time that matters most is just the time to re-replicate the data that is missing from that node.  This is the removetoken operation (called removenode in 1.2), and it gets faster the more nodes you have.

Richard.


On 11 December 2012 08:39, Eric Parusel <ericparusel@gmail.com> wrote:

Thanks for your thoughts guys.

I agree that with vnodes total downtime is lessened, although it also seems that the total number of outages (however small) would be greater.

But I think downtime is only lessened up to a certain cluster size.

I'm thinking that as the cluster continues to grow:

  - node rebuild time will max out (a single node only has so much write bandwidth)

  - the probability of 2 nodes being down at any given time will continue to increase -- even if you consider only non-correlated failures.

Therefore, when adding nodes beyond the point where node rebuild time maxes out, both the total number of outages *and* overall downtime would increase?

Thanks,

Eric

On Mon, Dec 10, 2012 at 7:00 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:

Assuming you need to work with quorum in a non-vnode scenario: if 2 adjacent nodes in the ring are down, some number of quorum operations will fail with UnavailableException (or TimeoutException right after the failures).  This is because quorum will be impossible for a given range of tokens, but still possible for others.

In a vnode world, if any two nodes are down, then the intersection of the vnode token ranges they share is unavailable.
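A tiny sketch of the non-vnode half of that point, assuming RF=3 with replicas placed on consecutive ring positions (SimpleStrategy-style) and quorum = 2 -- the function and node counts are my own illustration, not anything from Cassandra itself:

```python
def quorum_broken(n, down, rf=3):
    """Non-vnode, consecutive-replica placement: the range owned by
    node i is replicated on nodes i, i+1, ..., i+rf-1 (mod n).
    Quorum (a majority of rf) fails for a range if a majority of its
    replicas are in the `down` set."""
    needed = rf // 2 + 1
    for i in range(n):
        replicas = {(i + j) % n for j in range(rf)}
        if len(replicas & down) >= needed:
            return True
    return False

# Two *adjacent* nodes down: some token ranges lose quorum.
print(quorum_broken(144, {10, 11}))   # True
# Two far-apart nodes down: every range still has a quorum.
print(quorum_broken(144, {10, 80}))   # False
```

With vnodes, by contrast, each node owns many small ranges scattered around the ring, so any two down nodes are likely to co-own some range -- more, smaller outages rather than fewer, larger ones, which is the "two sides of the same coin" point.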

I think it is two sides of the same coin.

On Mon, Dec 10, 2012 at 7:41 AM, Richard Low <rlow@acunu.com> wrote:

Hi Tyler,

You're right, the math does assume independence, which is unlikely to be accurate.  But if you do have correlated failure modes (e.g. same power, rack, DC, etc.) then you can still use Cassandra's rack-aware or DC-aware features to ensure replicas are spread around so your cluster can survive the correlated failure mode.  So I would expect vnodes to improve uptime in all scenarios, but I haven't done the math to prove it.

Richard.