cassandra-user mailing list archives

From aaron morton <>
Subject Re: ideal cluster size
Date Mon, 23 Jan 2012 08:55:44 GMT
I second Peter's point: big servers are not always the best. 

My experience (using spinning disks) is that 200 to 300 GB of live data load per node (including
replicated data) is a sweet spot. Above this, the time taken for compaction, repair, off-node
backups, node moves etc. starts to be a pain. 
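To put a rough number on why larger loads hurt, here's a back-of-envelope sketch (mine, not from the post): the time to stream a node's data during repair or replacement grows linearly with data size. The 50 MB/s effective throughput is an illustrative assumption covering disk, network, and validation overhead, not a measured figure.

```python
GiB = 1024 ** 3

def stream_hours(data_bytes, throughput_bytes_per_sec=50 * 1024 * 1024):
    """Hours to stream `data_bytes` at an assumed effective throughput
    (50 MB/s here: disk + network + validation overhead, illustrative)."""
    return data_bytes / throughput_bytes_per_sec / 3600

small = stream_hours(250 * GiB)        # a node at the "sweet spot" size
big = stream_hours(2 * 1024 * GiB)     # a 2 TiB node takes proportionally longer

print(f"250 GiB node: ~{small:.1f} h, 2 TiB node: ~{big:.1f} h")
```

At these assumed rates a 250 GiB node streams in under two hours, while a 2 TiB node ties up most of a working day, which is the kind of operational pain described above.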

Also, suffering catastrophic failure of 1 node in 100 is a better situation than 1 node in

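A quick sketch of why losing one node in a bigger cluster hurts less (my illustration, assuming an even token distribution): with replication factor RF, a single failed node holds a replica for roughly RF/N of all token ranges, so the under-replicated fraction of the ring shrinks as N grows.

```python
def under_replicated_fraction(n_nodes, rf=3):
    """Approximate fraction of token ranges that lose one replica when a
    single node fails, assuming even token distribution (illustrative)."""
    return min(1.0, rf / n_nodes)

print(under_replicated_fraction(10))   # 30% of ranges lose a replica
print(under_replicated_fraction(100))  # only 3%
```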
Finally, when you have more servers with less high performance disks you also get more memory
and more CPU cores. 

(I'm obviously ignoring all the ops side here, automate with chef or

wrt failure modes: I wrote this last year; it's about single-DC deployments but you can probably
work it out for multi-DC

Hope that helps.

Aaron Morton
Freelance Developer

On 22/01/2012, at 1:18 PM, Thorsten von Eicken wrote:

> Good point. One thing I'm wondering about Cassandra is what happens when
> there is a massive failure. For example, if 1/3 of the nodes go down or
> become unreachable. This could happen in EC2 if an AZ has a failure, or
> in a datacenter if a whole rack or UPS goes dark. I'm not so concerned
> about the time where the nodes are down. If I understand replication,
> consistency, ring, and such I can architect things such that what must
> continue running does continue.
> What I'm concerned about is when these nodes all come back up or
> reconnect. I have a hard time figuring out what exactly happens other
> than the fact that hinted handoffs get processed. Are the restarted
> nodes handling reads during that time? If so, they could serve up
> massive amounts of stale data, no? Do they then all start a repair, or
> is this something that needs to be run manually? If many do a repair at
> the same time, do I effectively end up with a down cluster due to the
> repair load? If no node was lost, is a repair required or are the hinted
> handoffs sufficient?
> Is there a manual or wiki section that discusses some of this and I just
> missed it?
> On 1/21/2012 2:25 PM, Peter Schuller wrote:
>>> Thanks for the responses! We'll definitely go for powerful servers to
>>> reduce the total count. Beyond a dozen servers there really doesn't seem
>>> to be much point in trying to increase count anymore for
>> Just be aware that if "big" servers imply *lots* of data (especially
>> in relation to memory size), it's not necessarily the best trade-off.
>> Consider the time it takes to do repairs, streaming, node start-up,
>> etc.
>> If it's only about CPU resources then bigger nodes probably make more
>> sense if the h/w is cost effective.
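On the stale-read question above: the usual answer is quorum arithmetic. This is my own illustrative sketch, not from the thread. With replication factor RF and quorum = RF // 2 + 1, using QUORUM for both reads and writes means read_quorum + write_quorum > RF, so every read set overlaps every write set in at least one replica holding the latest value; a freshly restarted, behind-on-hints node can therefore participate in reads without quorum readers seeing only stale data.

```python
def quorum(rf):
    """Quorum size for a given replication factor: a strict majority."""
    return rf // 2 + 1

for rf in (3, 5):
    r = w = quorum(rf)
    assert r + w > rf  # read and write sets must share at least one replica
    print(f"RF={rf}: quorum={r}, guaranteed overlap={r + w - rf} replica(s)")
```

This only covers read consistency; repairing the restarted replicas themselves still needs hinted handoff, read repair, or an explicit anti-entropy repair.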
