From: aaron morton <aaron@thelastpickle.com>
To: user@cassandra.apache.org
Subject: Re: ideal cluster size
Date: Mon, 23 Jan 2012 21:55:44 +1300

I second Peter's point: big servers are not always the best.

My experience (using spinning disks) is that 200 to 300 GB of live data load per node (including replicated data) is a sweet spot. Above this, the time taken for compaction, repair, off-node backups, node moves, etc. starts to be a pain.
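
To put rough numbers on that sweet spot, here is a back-of-the-envelope sketch. The 2 TB dataset, replication factor 3, and 250 GB target are made-up inputs to illustrate the arithmetic, not recommendations for any particular workload:

# Back-of-the-envelope node count for a target per-node load.
# All inputs below are illustrative assumptions; plug in your own numbers.
raw_data_gb = 2000               # unreplicated dataset size
replication_factor = 3           # copies of each row kept in the cluster
target_load_per_node_gb = 250    # middle of the 200-300 GB sweet spot above

total_load_gb = raw_data_gb * replication_factor
nodes_needed = -(-total_load_gb // target_load_per_node_gb)  # ceiling division

print("total load: %d GB, nodes needed: %d" % (total_load_gb, nodes_needed))
# total load: 6000 GB, nodes needed: 24
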
Also, suffering catastrophic failure of 1 node in 100 is a better situation than 1 node in 16.
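
To make the 1-in-100 versus 1-in-16 comparison concrete, here is a rough sketch of how much of the ring is degraded when a single node dies. It assumes evenly distributed token ranges and replication factor 3; both are assumptions for illustration, not measurements from a real cluster:

# Fraction of token ranges that lose one replica when a single node dies,
# assuming evenly distributed tokens and 3 copies of every range.
replication_factor = 3

for cluster_size in (16, 100):
    affected = float(replication_factor) / cluster_size
    print("%d-node cluster: ~%.1f%% of ranges lose a replica"
          % (cluster_size, affected * 100))
# 16-node cluster: ~18.8% of ranges lose a replica
# 100-node cluster: ~3.0% of ranges lose a replica

The larger cluster also re-replicates the dead node's data from many more peers, so the streaming load during recovery is spread more thinly.
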

My experience (using spinning disks) is = that 200 to 300 GB of live data load per node (including replicated = data) is a sweet spot. Above this the time taken for compaction, repair, = off node backups, node moves etc starts to be a = pain. 

Also, suffering catastrophic = failure of 1 node in 100 is a better situation that 1 node in = 16. 

Finally, when you have more servers with less high-performance disks, you also get more memory and more CPU cores.

(I'm obviously ignoring all the ops side here; automate with Chef or http://www.datastax.com/products/opscenter ).

wrt failure modes, I wrote this last year; it's about single-DC deployments, but you can probably work it out for multi-DC: http://thelastpickle.com/2011/06/13/Down-For-Me/

Hope that helps.

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 22/01/2012, at 1:18 PM, Thorsten von Eicken wrote:

> Good point. One thing I'm wondering about cassandra is what happens when
> there is a massive failure. For example, if 1/3 of the nodes go down or
> become unreachable. This could happen in EC2 if an AZ has a failure, or
> in a datacenter if a whole rack or UPS goes dark. I'm not so concerned
> about the time where the nodes are down. If I understand replication,
> consistency, ring, and such I can architect things such that what must
> continue running does continue.
> 
> What I'm concerned about is when these nodes all come back up or
> reconnect. I have a hard time figuring out what exactly happens other
> than the fact that hinted handoffs get processed. Are the restarted
> nodes handling reads during that time? If so, they could serve up
> massive amounts of stale data, no? Do they then all start a repair, or
> is this something that needs to be run manually? If many do a repair at
> the same time, do I effectively end up with a down cluster due to the
> repair load? If no node was lost, is a repair required or are the hinted
> handoffs sufficient?
> 
> Is there a manual or wiki section that discusses some of this and I just
> missed it?

On 1/21/2012 2:25 PM, Peter Schuller wrote:
Thanks for the responses! We'll = definitely go for powerful servers = to
reduce the total count. Beyond a dozen servers there = really doesn't seem
to be much point in trying to = increase count anymore for
Just be aware that if "big" servers imply *lots* of data = (especially
in relation to = memory size), it's not necessarily the best = trade-off.
Consider the time = it takes to do repairs, streaming, node = start-up,
etc.

If it's only = about CPU resources then bigger nodes probably make = more
sense if the h/w is cost = effective.


<= /html>= --Apple-Mail=_61AC350E-6C08-431E-AE15-6FB5600B8D8C--
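
Regarding the stale-read question quoted above: whether a just-restarted, not-yet-repaired replica can be the sole source of a read comes down to whether the read and write consistency levels overlap. A minimal sketch of that check, with replication factor 3 assumed purely for illustration:

# If read_replicas + write_replicas > replication_factor, every read must
# contact at least one replica that acknowledged the latest successful write,
# and the coordinator returns the newest value by timestamp. With no overlap
# (e.g. ONE/ONE), a restarted node can serve stale data until hinted handoff,
# read repair, or a manually run repair catches it up.
def overlap_guaranteed(read_replicas, write_replicas, replication_factor):
    return read_replicas + write_replicas > replication_factor

replication_factor = 3
for label, r, w in (("read ONE / write ONE", 1, 1),
                    ("read QUORUM / write QUORUM", 2, 2),
                    ("read ONE / write QUORUM", 1, 2)):
    print("%s -> overlap guaranteed: %s"
          % (label, overlap_guaranteed(r, w, replication_factor)))
# read ONE / write ONE -> overlap guaranteed: False
# read QUORUM / write QUORUM -> overlap guaranteed: True
# read ONE / write QUORUM -> overlap guaranteed: False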