From: aaron morton <aaron@thelastpickle.com>
To: user@cassandra.apache.org
Subject: Re: how large cassandra could scale when it need to do manual operation?
Date: Sat, 9 Jul 2011 16:57:37 -0700

> about the decommission problem, here is the link:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/how-to-decommission-two-slow-nodes-td5078455.html

The key part of that post is "and since the second node was under heavy load, and not enough ram, it was busy GCing and worked horribly slow".

> maybe I was misunderstanding the replication factor, doesn't RF=3 mean I could lose two nodes and still have one available (with 100% of the keys), once Nodes >= 3?

When you start losing replicas, the Consistency Level (CL) you use dictates whether the cluster is still up for 100% of the keys. See http://thelastpickle.com/2011/06/13/Down-For-Me/

> I have a strong urge to set RF to a very high value...

As Chris said, 3 is about normal; it means the QUORUM CL is only 2 nodes.
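To put rough numbers on that, here is a small sketch of the quorum arithmetic; it assumes nothing beyond the standard formula that a quorum is floor(RF/2) + 1 replicas, and the code is illustrative only:

    def quorum(rf):
        # A quorum is a strict majority of the RF replicas that hold a key.
        return rf // 2 + 1

    for rf in (1, 2, 3, 5):
        q = quorum(rf)
        # At CL QUORUM, reads/writes for a key keep working as long as
        # no more than rf - q of its replicas are down.
        print "RF=%d: QUORUM is %d replicas, tolerates %d down" % (rf, q, rf - q)

So raising RF on its own mostly raises the quorum size; at QUORUM it only buys roughly RF minus quorum-size extra failures per key.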
> I am also trying to deploy cassandra across two datacenters (with 20ms latency).

Look up LOCAL_QUORUM in the wiki.
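To make that concrete: with NetworkTopologyStrategy the replication factor is set per datacenter, and LOCAL_QUORUM only waits for a quorum of the replicas in the coordinator's own DC, so the 20ms WAN hop stays off the request path. A sketch only; the {DC1: 3, DC2: 3} placement is an assumed example, not anything from this thread:

    def quorum(n):
        return n // 2 + 1

    # Assumed per-datacenter replica counts, e.g. NetworkTopologyStrategy
    # configured with DC1:3, DC2:3.
    replicas = {"DC1": 3, "DC2": 3}

    total = sum(replicas.values())
    print "QUORUM       : %d of %d replicas, counted across both DCs" % (quorum(total), total)
    for dc, rf in sorted(replicas.items()):
        print "LOCAL_QUORUM : %d of %d replicas, all inside %s" % (quorum(rf), rf, dc)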
Hope that helps.

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 9 Jul 2011, at 02:01, Chris Goffinet wrote:

> As mentioned by Aaron, yes we run hundreds of Cassandra nodes across multiple clusters. We run with RF of 2 and 3 (most common).
>
> We use commodity hardware and see failure all the time at this scale. We've never had 3 nodes in the same replica set fail all at once. We mitigate risk by being rack diverse, using different vendors for our hard drives, designing workflows to make sure machines get serviced in certain time windows, and running an extensive automated burn-in process (disk, memory, drives) so we don't roll out nodes/clusters that could fail right away.
>
> On Sat, Jul 9, 2011 at 12:17 AM, Yan Chunlu <springrider@gmail.com> wrote:
> thank you very much for the reply, which gives me more confidence in cassandra.
> I will try the automation tools; the examples you've listed seem quite promising!
>
> about the decommission problem, here is the link:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/how-to-decommission-two-slow-nodes-td5078455.html
> I am also trying to deploy cassandra across two datacenters (with 20ms latency), so I am worried that the network latency will make it even worse.
>
> maybe I was misunderstanding the replication factor, doesn't RF=3 mean I could lose two nodes and still have one available (with 100% of the keys), once Nodes >= 3? besides, I am not sure what twitter's RF setting is, but it is possible to lose 3 nodes at the same time (facebook once lost photos because their RAID broke, though that rarely happens). I have a strong urge to set RF to a very high value...
>
> Thanks!
>
> On Sat, Jul 9, 2011 at 5:22 AM, aaron morton <aaron@thelastpickle.com> wrote:
> AFAIK Facebook Cassandra and Apache Cassandra diverged paths a long time ago. Twitter is a vocal supporter with a large Apache Cassandra install, e.g. "Twitter currently runs a couple hundred Cassandra nodes across a half dozen clusters." http://www.datastax.com/2011/06/chris-goffinet-of-twitter-to-speak-at-cassandra-sf-2011
>
> If you are working with a 3 node cluster, removing/rebuilding/whatever one node will affect 33% of your capacity. When you scale up, the contribution from each individual node goes down, and the impact of one node going down is less. Problems that happen with a few nodes will go away at scale, to be replaced by a whole set of new ones.
>
>> 1): the load balance needs to be manually performed on every node, according to:
>
> Yes
>
>> 2): when adding new nodes, need to perform node repair and cleanup on every node
>
> You only need to run cleanup, see http://wiki.apache.org/cassandra/Operations#Bootstrap
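That cleanup step is easy to script. A hypothetical helper along those lines, with made-up host names, assuming passwordless SSH and nodetool on each node's PATH:

    #!/usr/bin/env python
    # Run "nodetool cleanup" on every existing node after a new node has
    # bootstrapped, so data for ranges that moved away gets dropped.
    import subprocess

    NODES = ["cass1.example.com", "cass2.example.com", "cass3.example.com"]

    for host in NODES:
        print "running cleanup on %s" % host
        subprocess.check_call(["ssh", host, "nodetool", "-h", "localhost", "cleanup"])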
>> 3) when decommissioning a node, there is a chance that it slows down the entire cluster. (not sure why but I saw people ask around about it.) and the only way to fix it is to shut down the entire cluster, rsync the data, and start all nodes without the decommissioned one.

> I cannot remember any specific cases where decommission requires a full cluster stop, do you have a link? With regard to slowing down, the decommission process will stream data from the node you are removing onto the other nodes; this can slow down the target node (I think it's more intelligent now about what is moved). This will be exaggerated in a 3 node cluster, as you are removing 33% of the processing and adding some (temporary) extra load to the remaining nodes.
>
>> after all, I think there is a lot of human work to do to maintain the cluster, which makes it impossible to scale to thousands of nodes,
>
> Automation, Automation, Automation is the only way to go.
>
> Chef, Puppet, CF Engine for general config and deployment; Cloud Kick, munin, ganglia etc for monitoring. And Ops Centre (http://www.datastax.com/products/opscenter) for cassandra specific management.
>
>> I am totally wrong about all of this, currently I am serving 1 million pv every day with Cassandra and it makes me feel unsafe, I am afraid one day one node crash will corrupt the data and the whole cluster will go wrong....
>
> With RF 3 and a 3 node cluster you have room to lose one node and the cluster will be up for 100% of the keys. While better than having to worry about *the* database server, it's still entry level fault tolerance. With RF 3 in a 6 node cluster you can lose up to 2 nodes and still be up for 100% of the keys.
>
> Is there something you are specifically concerned about with your current installation?
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 8 Jul 2011, at 08:50, Yan Chunlu wrote:
>
>> hi, all:
>> I am curious about how large Cassandra can scale?
>>
>> from the information I can get, the largest usage is at facebook, which is about 150 nodes. in the mean time they are using 2000+ nodes of Hadoop, and yahoo is even using 4000 nodes of Hadoop.
>>
>> I don't understand why that is the situation; I only have a little knowledge of Cassandra and no knowledge of Hadoop at all.
>>
>> currently I am using cassandra with 3 nodes and having problems bringing one back after it got out of sync; the problems I encountered make me worry about how cassandra could scale out:
>>
>> 1): the load balance needs to be manually performed on every node, according to:
>>
>>     def tokens(nodes):
>>         for x in xrange(nodes):
>>             print 2 ** 127 / nodes * x
>>
>> 2): when adding new nodes, need to perform node repair and cleanup on every node
>>
>> 3) when decommissioning a node, there is a chance that it slows down the entire cluster. (not sure why but I saw people ask around about it.) and the only way to fix it is to shut down the entire cluster, rsync the data, and start all nodes without the decommissioned one.
>>
>> after all, I think there is a lot of human work to do to maintain the cluster, which makes it impossible to scale to thousands of nodes, but I hope I am totally wrong about all of this. currently I am serving 1 million pv every day with Cassandra and it makes me feel unsafe; I am afraid one day one node crash will corrupt the data and the whole cluster will go wrong....
>>
>> on the contrary, a relational database makes me feel safe but it does not scale well.
>>
>> thanks for any guidance here.
>
> --
> Charles
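For reference, the tokens() script quoted in item 1) of the original message, in a runnable form that also shows where the numbers end up. The six-node count is just an example, and the values assume the RandomPartitioner's 0..2**127 token space:

    # Evenly spaced initial tokens for a RandomPartitioner ring.
    def tokens(nodes):
        return [2 ** 127 / nodes * x for x in xrange(nodes)]

    for i, t in enumerate(tokens(6)):
        # Each value is set as initial_token in that node's cassandra.yaml
        # before it first starts, or applied to a running node with
        # "nodetool move <token>".
        print "node %d: initial_token = %d" % (i, t)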