Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (athena.apache.org: domain of jedd.rashbrooke@imagini.net
 designates 209.85.214.44 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <AANLkTi=65c50_+V=f0vyV+WdeOJHXt1_0A2D8CTa=+db@mail.gmail.com>
References: <AANLkTim_66TEi7CbgioPLjtVSRio3Kac3rCKK0SyvtUk@mail.gmail.com>
	<AANLkTi=65c50_+V=f0vyV+WdeOJHXt1_0A2D8CTa=+db@mail.gmail.com>
Date: Fri, 17 Sep 2010 16:52:39 +0100
Message-ID: <AANLkTikR0g6zb2nPUQpe8L+KSCNrXGG+c31GnGbJ13Ss@mail.gmail.com>
Subject: Re: Dazed and confused with Cassandra on EC2 ...
From: Jedd Rashbrooke <jedd.rashbrooke@imagini.net>
To: user@cassandra.apache.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

 Hi Dave,

 Thank you for your response.

 I can clarify a couple of things here:

> 2. You grew from 2 nodes to 4, but the original 2 nodes have 200GB and the 2
> new ones have 40 GB. What's the recommended practice for rebalancing (i.e.,
> when should you do it), what's the actual procedure, and what's the expected
> impact of it?

 + Is it likely to cause a problem in the short term if I don't (i.e. if I
 just wait for 'normal activity' to somehow even out the distribution of
 data)?
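
 To make the rebalancing question concrete, here is a minimal sketch of
 what an evenly divided ring would look like for four nodes.  It assumes
 the cluster uses RandomPartitioner, whose token space runs from 0 to
 2**127; the function name and constant are mine, purely for illustration.
 Whether moving each node to one of these tokens is the recommended
 procedure is exactly the part I'm unsure about.

```python
# Sketch only: evenly spaced tokens for a ring of N nodes.
# Assumption: RandomPartitioner, with a token space of 0 .. 2**127.
RING_SIZE = 2 ** 127


def balanced_tokens(node_count):
    """Return one evenly spaced token per node (hypothetical helper)."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Integer arithmetic keeps the large token values exact.
    return [i * RING_SIZE // node_count for i in range(node_count)]


if __name__ == "__main__":
    for index, token in enumerate(balanced_tokens(4)):
        print(f"node {index}: {token}")
```

 With two nodes the tokens sit half a ring apart; with four, each node
 would own a quarter of the token space.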

> 3. Cassandra nodes "disappear". (I'm not quite clear what this means.)

 Nodetool reports the node as down.  I'm seeing lots of machine-x is DOWN
 in the logs.  Flapping, actually.  I don't have any swap configured (which
 I've read somewhere might induce flapping).

 The machine also feels like it goes on a hiatus - separately, but typically
 observed at the same time.  Tail -f on the Cassandra logs delays for several
 minutes, pending ssh's to the box also stall until 'something' happens that
 releases the machine from its slumber.  Typically that something is a
 message in the logs that a compaction of a hintedhandoff has completed.

 As I say, nmon/top show minimal network & disk activity, and just one
 of the four cores flatlining during this time.  The machine *should* be
 more responsive.

 Actually:   http://pastebin.com/AeM2VgL3

 All the machines referenced in there are ones that are in the cluster now.


> 4. You took a machine offline without decommissioning it from the cluster.
> Now the machine is gone, but the other nodes (in Gossip logs) report that
> they are still looking for it. How do you stop nodes from looking for a
> removed node?

 I was attempting to drain the thing first, but that was stalling, so I
 stopped Cassandra, then stopped the box.  The storage and config were on EBS
 (persistent disk) so they came back - it's just that the IP address of the
 machine changed.  I typically use my own assigned hostnames (cass-01,
 cass-02, etc, say) but for proper resolution I use the EC2 'internal
 hostnames'.  These were updated on all four Cassandra boxes, the other
 three instances of Cassandra were stopped, and then all four were brought
 back up.


 You say you have similar EC2-related thoughts .. have you done much on
 the EC2 hardware so far?  Are you seeing the same kind of thing?

 cheers,
 Jedd.