Date: Tue, 26 Oct 2010 15:08:31 -0500
Subject: Re: Best practice for adding new nodes to ring
From: Gary Dusbabek
Reply-To: gdusbabek@gmail.com
To: user@cassandra.apache.org

On Tue, Oct 26, 2010 at 14:56, Edward Capriolo wrote:
> On Tue, Oct 26, 2010 at 1:45 PM, Stu Hood wrote:
>> While the "adding virtual tokens/nodes to Cassandra" discussion is a good one, there are a few factors that might delay (or remove?) the necessity of adding that complexity:
>>
>> * In Cassandra 0.7, removing load from a node is fairly cheap: a bounded number of reads are used to determine which portions of the large sorted data files (sstables) to stream, followed by "sendfile" calls to deliver the data to the destination
>> * For a replication factor RF, RF nodes can send data to a new node: this means that to have all existing N nodes in your cluster participate in adding K nodes, you only need to add N / RF = K nodes per expansion: this is a much easier factor to achieve than a power of 2.
>>
>> While the added nodes will not be immediately balanced, there are some possible improvements to our existing load-balancing facilities to better handle unbalanced cases: see https://issues.apache.org/jira/browse/CASSANDRA-1418
>>
>> Finally, virtual nodes are not a panacea: reviewing the papers on https://issues.apache.org/jira/browse/CASSANDRA-192 suggests that they are significantly more difficult to implement than our current solution.
>>
>> We haven't ruled virtual nodes out, but I think many of us are leaning toward exploring improvements to our current architecture.
>>
>> Thanks,
>> Stu
>>
>> -----Original Message-----
>> From: "Greg Kim"
>> Sent: Tuesday, October 26, 2010 12:21pm
>> To: "user@cassandra.apache.org"
>> Subject: Best practice for adding new nodes to ring
>>
>> Hi,
>>
>> I have a question regarding the best practices for adding new nodes to an existing cluster.  From reading the following wiki: http://wiki.apache.org/cassandra/Operations -- I understand that when creating a brand new cluster, we can use the following to calculate the initial token for each node to achieve balance in the ring:
>>  def tokens(nodes):
>>      for i in range(1, nodes + 1):
>>          print (i * (2 ** 127 - 1) / nodes)
>>
>>
>> My question is on the best practice for adding new nodes to an existing cluster.  The wiki recommends computing new tokens for every node and assigning them manually using the nodetool command.  We're planning on running either 16GB or 32GB heaps on each of our nodes, so token re-assignment for each node in the cluster sounds like a very expensive operation, especially in situations where we're adding new nodes to handle scaling issues w/ the existing cluster.
>>
>> I'm a bit of a noob to cassandra, so wanted to see how others are currently coping w/ this.
>> One option can be to grow the cluster in powers of 2 and use bootstrapping w/ automatic token generation.  Is this an option that people are using? (but this gets exponentially expensive when you already have a large # of nodes)
>>
>> Does anyone know why cassandra doesn't use virtual tokens (e.g. one node token - creating 256 virtual node tokens in the ring)?  This way, adding new nodes to an existing cluster would significantly mitigate the imbalance in the ring.
>>
>>
>> Thanks
>> gkim
>>
>>
>
> One could implement "Virtual nodes" by running multiple instances of
> cassandra on a single machine, each binding to a different IP,
> possibly each using a different physical disk.
>
> I can imagine this would cause some overhead and waste. However since
> current JVMs do not manage large heap sizes well this would be the
> way I would imagine running cassandra on a "Big iron/mainframe"
> machine with 128GB RAM 4 processors and 48 disks

You'd just want to make sure you have the IO capacity to handle it.
Personally, I think 8- or possibly 4-way systems would be up to the
task CPU-wise, but you'd have to think long and hard about how you
would manage disk IO.

Gary.
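
The token arithmetic discussed in this thread can be sketched as follows. This is a minimal Python 3 sketch, not code from any of the messages: it follows the wiki formula quoted above (a token space of 2 ** 127 - 1), and the names `balanced_tokens`, `tokens_to_move`, and `nodes_to_add` are hypothetical helpers introduced for illustration. `nodes_to_add` assumes that Stu's N / RF = K rule rounds up when N is not a multiple of RF, which the thread does not state.

```python
RING_SIZE = 2 ** 127 - 1  # token-space size taken from the quoted wiki formula


def balanced_tokens(nodes):
    """Return evenly spaced initial tokens for a cluster of `nodes` nodes."""
    # Integer division keeps the tokens as exact ints under Python 3.
    return [i * RING_SIZE // nodes for i in range(1, nodes + 1)]


def tokens_to_move(old_nodes, new_nodes):
    """Count existing tokens that are absent from the rebalanced ring.

    Each such token belongs to an existing node that would have to be
    reassigned to restore perfect balance.
    """
    old = set(balanced_tokens(old_nodes))
    new = set(balanced_tokens(new_nodes))
    return len(old - new)


def nodes_to_add(existing_nodes, replication_factor):
    """Nodes to add so that every existing node can stream to a new one.

    Assumption (not stated in the thread): round N / RF up to a whole node.
    """
    return -(-existing_nodes // replication_factor)  # ceiling division
```

Under these assumptions, growing from 4 to 8 nodes leaves every existing token in place, since each old token is also a token of the doubled ring, while growing from 4 to 5 changes three of the four existing tokens. That difference is the reassignment cost Greg describes, and it is why doubling pairs naturally with bootstrapping.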