From: Yan Chunlu <springrider@gmail.com>
Date: Sun, 10 Jul 2011 14:54:48 +0800
Subject: Re: how large cassandra could scale when it need to do manual operation?
To: user@cassandra.apache.org

I missed the consistency level part; thanks very much for the explanation.
That is clear enough.
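For reference, the consistency level arithmetic discussed below works out like
this; a minimal sketch in Python 2 (to match the snippet quoted later in the
thread), with the RF value chosen only for illustration:

    # Python 2, illustrative only.
    def quorum(rf):
        # replicas that must answer for a QUORUM read or write
        return rf // 2 + 1

    rf = 3                                                            # example value
    print "QUORUM with RF=%d needs %d replicas" % (rf, quorum(rf))    # -> 2
    print "replica losses survivable at QUORUM:", rf - quorum(rf)     # -> 1
    print "replica losses survivable at ONE:", rf - 1                 # -> 2

So with RF=3 a key can lose one replica and still be read and written at
QUORUM, and lose two and still be served at ONE, which is why availability
depends on the CL in use rather than on RF alone.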
On Sun, Jul 10, 2011 at 7:57 AM, aaron morton <aaron@thelastpickle.com> wrote:

> about the decommission problem, here is the link:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/how-to-decommission-two-slow-nodes-td5078455.html
>
> The key part of that post is "and since the second node was under heavy
> load, and not enough ram, it was busy GCing and worked horribly slow".
>
> maybe I was misunderstanding the replication factor; doesn't RF=3 mean
> I could lose two nodes and still have one replica available (with 100% of
> the keys), once Nodes >= 3?
>
> When you start losing replicas, the CL you use dictates whether the cluster
> is still up for 100% of the keys. See
> http://thelastpickle.com/2011/06/13/Down-For-Me/
>
> I have a strong urge to set RF to a very high value...
>
> As Chris said, 3 is about normal; it means the QUORUM CL is only 2 nodes.
>
> I am also trying to deploy cassandra across two datacenters (with 20ms
> latency).
>
> Look up LOCAL_QUORUM in the wiki.
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
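For anyone curious what LOCAL_QUORUM looks like from client code, here is a
rough sketch using pycassa (the Python client of that era). It assumes a
keyspace defined with NetworkTopologyStrategy and replicas in both
datacenters; the host, keyspace and column family names are made up for the
example, and exact import paths may differ slightly by pycassa version:

    import pycassa

    # Connect to a node in the local datacenter (names are examples).
    pool = pycassa.ConnectionPool('demo', server_list=['dc1-node1:9160'])

    # LOCAL_QUORUM only waits for a quorum of replicas in the local DC,
    # keeping the 20ms cross-DC hop off the request path.
    cf = pycassa.ColumnFamily(
        pool, 'users',
        read_consistency_level=pycassa.ConsistencyLevel.LOCAL_QUORUM,
        write_consistency_level=pycassa.ConsistencyLevel.LOCAL_QUORUM)

    cf.insert('user1', {'name': 'example'})
    print cf.get('user1')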
> On 9 Jul 2011, at 02:01, Chris Goffinet wrote:
>
> As mentioned by Aaron, yes, we run hundreds of Cassandra nodes across
> multiple clusters. We run with RF of 2 and 3 (most common).
>
> We use commodity hardware and see failure all the time at this scale.
> We've never had 3 nodes that were in the same replica set fail all at
> once. We mitigate risk by being rack diverse, using different vendors for
> our hard drives, designing workflows to make sure machines get serviced
> in certain time windows, and running an extensive automated burn-in
> process (disk, memory, drives) so we don't roll out nodes/clusters that
> could fail right away.
>
> On Sat, Jul 9, 2011 at 12:17 AM, Yan Chunlu <springrider@gmail.com> wrote:
>
>> thank you very much for the reply, which gives me more confidence in
>> cassandra. I will try the automation tools; the examples you've listed
>> seem quite promising!
>>
>> about the decommission problem, here is the link:
>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/how-to-decommission-two-slow-nodes-td5078455.html
>> I am also trying to deploy cassandra across two datacenters (with 20ms
>> latency), so I am worried the network latency will make it even worse.
>>
>> maybe I was misunderstanding the replication factor; doesn't RF=3 mean
>> I could lose two nodes and still have one replica available (with 100%
>> of the keys), once Nodes >= 3? besides, I am not sure what Twitter's RF
>> setting is, but it is possible to lose 3 nodes at the same time
>> (Facebook once lost photos because their RAID broke, though that rarely
>> happens). I have a strong urge to set RF to a very high value...
>>
>> Thanks!
>>
>> On Sat, Jul 9, 2011 at 5:22 AM, aaron morton <aaron@thelastpickle.com> wrote:
>>
>>> AFAIK Facebook Cassandra and Apache Cassandra diverged paths a long
>>> time ago. Twitter is a vocal supporter with a large Apache Cassandra
>>> install, e.g. "Twitter currently runs a couple hundred Cassandra nodes
>>> across a half dozen clusters."
>>> http://www.datastax.com/2011/06/chris-goffinet-of-twitter-to-speak-at-cassandra-sf-2011
>>>
>>> If you are working with a 3 node cluster, removing/rebuilding/whatever
>>> one node will affect 33% of your capacity. When you scale up, the
>>> contribution from each individual node goes down, and the impact of one
>>> node going down is less. Problems that happen with a few nodes will go
>>> away at scale, to be replaced by a whole set of new ones.
>>>
>>> 1): the load balancing needs to be performed manually on every node,
>>> according to:
>>>
>>> Yes
>>>
>>> 2): when adding new nodes, need to perform node repair and cleanup on
>>> every node
>>>
>>> You only need to run cleanup, see
>>> http://wiki.apache.org/cassandra/Operations#Bootstrap
>>>
>>> 3) when decommissioning a node, there is a chance that it slows down
>>> the entire cluster (not sure why, but I saw people asking about it),
>>> and the only fix is to shut down the entire cluster, rsync the data,
>>> and start all nodes without the decommissioned one.
>>>
>>> I cannot remember any specific cases where decommission requires a full
>>> cluster stop, do you have a link? With regard to slowing down, the
>>> decommission process will stream data from the node you are removing
>>> onto the other nodes; this can slow down the target nodes (I think it's
>>> more intelligent now about what is moved). This will be exaggerated in
>>> a 3 node cluster, as you are removing 33% of the processing and adding
>>> some (temporary) extra load to the remaining nodes.
>>>
>>> after all, I think there is a lot of human work needed to maintain the
>>> cluster, which makes it impossible to scale to thousands of nodes,
>>>
>>> Automation, Automation, Automation is the only way to go.
>>>
>>> Chef, Puppet, CF Engine for general config and deployment; Cloud Kick,
>>> munin, ganglia etc. for monitoring. And
>>> Ops Centre (http://www.datastax.com/products/opscenter) for Cassandra
>>> specific management.
>>>
>>> I hope I am totally wrong about all of this; currently I am serving
>>> 1 million pv every day with Cassandra and it makes me feel unsafe. I am
>>> afraid one day a node crash will corrupt the data and the whole cluster
>>> will go wrong....
>>>
>>> With RF 3 and a 3 node cluster you have room to lose one node and the
>>> cluster will be up for 100% of the keys. While better than having to
>>> worry about *the* database server, it's still entry level fault
>>> tolerance. With RF 3 in a 6 node cluster you can lose up to 2 nodes and
>>> still be up for 100% of the keys.
>>>
>>> Is there something you are specifically concerned about with your
>>> current installation?
>>>
>>> Cheers
>>>
>>> -----------------
>>> Aaron Morton
>>> Freelance Cassandra Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 8 Jul 2011, at 08:50, Yan Chunlu wrote:
>>>
>>> hi, all:
>>> I am curious how large Cassandra can scale.
>>>
>>> From the information I can find, the largest deployment is at Facebook,
>>> with about 150 nodes. Meanwhile they are using 2000+ nodes with Hadoop,
>>> and Yahoo is even using 4000 Hadoop nodes.
>>>
>>> I don't understand why that is; I have only a little knowledge of
>>> Cassandra and no knowledge of Hadoop.
>>>
>>> Currently I am using Cassandra with 3 nodes and having problems
>>> bringing one back after it went out of sync. The problems I encountered
>>> make me worry about how Cassandra could scale out:
>>>
>>> 1): the load balancing needs to be performed manually on every node,
>>> according to:
>>>
>>> # evenly spaced tokens for the RandomPartitioner (token space 0..2**127),
>>> # one per node; Python 2 syntax as posted
>>> def tokens(nodes):
>>>     for x in xrange(nodes):
>>>         print 2 ** 127 / nodes * x
>>>
>>> 2): when adding new nodes, need to perform node repair and cleanup on
>>> every node
>>>
>>> 3) when decommissioning a node, there is a chance that it slows down
>>> the entire cluster (not sure why, but I saw people asking about it),
>>> and the only fix is to shut down the entire cluster, rsync the data,
>>> and start all nodes without the decommissioned one.
>>>
>>> After all, I think there is a lot of human work needed to maintain the
>>> cluster, which would make it impossible to scale to thousands of nodes,
>>> but I hope I am totally wrong about all of this. Currently I am serving
>>> 1 million pv every day with Cassandra and it makes me feel unsafe; I am
>>> afraid one day a node crash will corrupt the data and the whole cluster
>>> will go wrong....
>>>
>>> On the contrary, a relational database makes me feel safe, but it does
>>> not scale well.
>>>
>>> thanks for any guidance here.
>>
>> --
>> Charles

--
Charles
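A worked example of the token helper quoted above, for a 3 node
RandomPartitioner ring (a sketch only; the cassandra.yaml / nodetool steps in
the comments are the usual way such tokens were applied on a 0.7/0.8 era
cluster and are stated here as an assumption about the reader's setup):

    # The helper from the original mail, wrapped so the values can be
    # printed and checked (Python 2, token space 0..2**127).
    def tokens(nodes):
        return [2 ** 127 / nodes * x for x in xrange(nodes)]

    for t in tokens(3):
        print t
    # 0
    # 56713727820156410577229101238628035242
    # 113427455640312821154458202477256070484
    #
    # Each value is normally set as initial_token in cassandra.yaml before a
    # node first starts, or applied to a running node with `nodetool move
    # <token>` followed by `nodetool cleanup` on the other nodes.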
