cassandra-user mailing list archives

From Thorsten von Eicken <>
Subject Re: ideal cluster size
Date Sun, 22 Jan 2012 00:18:34 GMT
Good point. One thing I'm wondering about with Cassandra is what happens
when there is a massive failure, for example if 1/3 of the nodes go down
or become unreachable. This could happen in EC2 if an AZ fails, or in a
datacenter if a whole rack or UPS goes dark. I'm not so concerned about
the period while the nodes are down. If I understand replication,
consistency, the ring, and such, I can architect things so that what
must keep running does keep running.
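To make the "architect around an AZ outage" part concrete: Cassandra's QUORUM consistency level needs floor(RF/2) + 1 live replicas. A minimal sketch, assuming replicas are spread as evenly as possible across AZs (for example one replica per AZ with RF=3 and rack/AZ-aware placement); the function names are hypothetical, and this only covers availability while nodes are down, not what happens when they come back.

```python
# Hypothetical helper: does QUORUM still succeed after some AZs fail?
# Assumes replicas are spread as evenly as possible across AZs.

def quorum(replication_factor: int) -> int:
    """Cassandra QUORUM: a majority of replicas, floor(RF/2) + 1."""
    return replication_factor // 2 + 1


def quorum_survives(replication_factor: int, num_azs: int, azs_down: int) -> bool:
    """True if, in the worst case, a quorum of replicas is still reachable."""
    base, extra = divmod(replication_factor, num_azs)
    # Replica count per AZ under an even spread (some AZs hold one extra).
    counts = [base + 1] * extra + [base] * (num_azs - extra)
    # Worst case: the AZs holding the most replicas are the ones that fail.
    lost = sum(sorted(counts, reverse=True)[:azs_down])
    return replication_factor - lost >= quorum(replication_factor)


print(quorum_survives(3, 3, 1))  # RF=3 over 3 AZs, one AZ down
print(quorum_survives(3, 2, 1))  # RF=3 over 2 AZs, the larger AZ down
```

With RF=3 over three AZs, losing one AZ leaves 2 of 3 replicas, which still meets QUORUM; with only two AZs, the worst-case AZ holds two replicas and QUORUM is lost.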

What I'm concerned about is when these nodes all come back up or
reconnect. I have a hard time figuring out what exactly happens other
than the fact that hinted handoffs get processed. Are the restarted
nodes handling reads during that time? If so, they could serve up
massive amounts of stale data, no? Do they then all start a repair, or
is this something that needs to be run manually? If many nodes run a
repair at the same time, do I effectively end up with a down cluster due
to the repair load? If no node was lost, is a repair required, or are
the hinted handoffs sufficient?

Is there a manual or wiki section that discusses some of this and I just
missed it?

On 1/21/2012 2:25 PM, Peter Schuller wrote:
>> Thanks for the responses! We'll definitely go for powerful servers to
>> reduce the total count. Beyond a dozen servers there really doesn't seem
>> to be much point in trying to increase count anymore for
> Just be aware that if "big" servers imply *lots* of data (especially
> in relation to memory size), it's not necessarily the best trade-off.
> Consider the time it takes to do repairs, streaming, node start-up,
> etc.
> If it's only about CPU resources then bigger nodes probably make more
> sense if the h/w is cost effective.
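Peter's point that data volume per node drives repair, streaming and start-up time can be put in back-of-envelope terms. A rough sketch, assuming a constant, hypothetical sustained throughput; real repair also does other work (such as building Merkle trees), so this is at best a lower bound.

```python
# Back-of-envelope: time to stream one node's data at a fixed throughput.
# The throughput figure is a hypothetical assumption, not a measurement.

def stream_hours(data_gb: float, throughput_mb_s: float) -> float:
    """Hours to move data_gb gigabytes at throughput_mb_s megabytes/second."""
    seconds = data_gb * 1000 / throughput_mb_s  # decimal units: 1 GB = 1000 MB
    return seconds / 3600


for data_gb in (250, 1000, 4000):
    print(f"{data_gb:>5} GB per node at 50 MB/s: {stream_hours(data_gb, 50):5.1f} h")
```

The time scales linearly with data per node, so fewer, bigger nodes mean each rebuild or repair takes proportionally longer.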
