Date: Fri, 17 Sep 2010 16:52:39 +0100
Subject: Re: Dazed and confused with Cassandra on EC2 ...
From: Jedd Rashbrooke
To: user@cassandra.apache.org

Hi Dave,

Thank you for your response. I can clarify a couple of things here:

> 2. You grew from 2 nodes to 4, but the original 2 nodes have 200GB and
> the 2 new ones have 40 GB. What's the recommended practice for
> rebalancing (i.e., when should you do it), what's the actual procedure,
> and what's the expected impact of it?

+ is it likely to cause a problem in the short term if I don't (ie. if
I just wait until 'normal activity' to somehow even out the
distribution of data).

> 3. Cassandra nodes "disappear". (I'm not quite clear what this means.)

Nodetool reports the node as down. I'm seeing lots of machine-x is
DOWN in the logs. Flapping, actually. I don't have any swap configured
(which I've read somewhere might induce flapping).

The machine also feels like it goes on a hiatus - separately, but
typically observed at the same time. Tail -f on the Cassandra logs
delays for several minutes, pending ssh's to the box also stall until
'something' happens that releases the machine from its slumber.
Typically that something is a message in the logs that a compaction
of a hintedhandoff has completed. As I say, nmon/top show minimal
network & disk activity, and just one of the four cores flatlining
during this time. The machine *should* be more responsive.

Actually: http://pastebin.com/AeM2VgL3

All the machines referenced in there are ones that are in the
cluster now.

> 4. You took a machine offline without decommissioning it from the
> cluster. Now the machine is gone, but the other nodes (in Gossip logs)
> report that they are still looking for it. How do you stop nodes from
> looking for a removed node?

I was attempting to drain the thing first, but that was stalling, so
I stopped Cassandra then stopped the box. The storage and config were
on EBS (persistent disk) so they came back - it's just that the IP
address of the machine changed.
I typically use my own assigned hostnames (cass-01, cass-02, etc.,
say), but for proper resolution I use the EC2 'internal hostnames'.
These were updated on all four Cassandra boxes, the other three
instances of Cassandra were stopped, and then all four were brought
back up.

You say you have similar EC2-related thoughts ... have you done much
on the EC2 hardware so far? Are you seeing the same kind of thing?

cheers,
Jedd.