Date: Fri, 17 Sep 2010 15:41:33 +0100
From: Jedd Rashbrooke
To: user@cassandra.apache.org
Subject: Dazed and confused with Cassandra on EC2 ...

Howdi,

I've just landed in an experiment to get Cassandra going, fed by PHP via Thrift via Hadoop, all running on EC2.
I've been lurking a bit on the list for a couple of weeks, mostly reading any threads with the word 'performance' in them. Few people have anything polite to say about EC2, but I want to throw out some observations and get some feedback on whether what I'm seeing even approaches any kind of normal.

My background is mostly *nix and networking, with a half-way decent understanding of DBs -- but Cassandra, Hadoop, Thrift and EC2 are all fairly new to me.

We're using a four-node, decently-specced cluster (m2.2xlarge, if you're EC2-aware; 32GB, 4-core, if you're not :). I'm using Ubuntu with the Deb packages for Cassandra and Hadoop, and some fairly conservative tweaks to things like JVM memory (bumping it up to 4GB, then 16GB).

One of our insert jobs - a mapper-only process - was running pretty fast a few days ago: somewhere around a million lines of input, split into a dozen files, inserted via a Hadoop job in about half an hour. Happy times. That was when the cluster was modestly sized - 20-50GB. It's now about 200GB, and performance has dropped by an order of magnitude - perhaps 5-6 hours to do the same amount of work, using the same codebase and the same input data. I've read that reads slow down as the DB grows, but I had an expectation that writes would stay consistently snappy. How surprising is this performance drop, given the DB growth?

My 4-node cluster started off as a 2-node cluster - and nodetool ring now suggests the two original nodes hold 200GB each, while the newer two hold 40GB. Is this normal? Would a rebalance be likely to improve performance substantially? My feeling is that it would be expensive to perform.

EC2 seems to get a bad rap, and we're feeling quite a bit of pain, which is sad given the (on-paper) spec of the machines, and the cost - over US$3k/month for the cluster.
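To put numbers on the rebalance question: assuming RandomPartitioner (the default), a balanced N-node ring spaces initial tokens evenly across the 0 .. 2^127 - 1 token range. A minimal sketch of that arithmetic (the node count of 4 comes from the cluster described above; nothing else is specific to this setup):

```python
# Sketch: evenly spaced RandomPartitioner tokens for an N-node ring.
# RandomPartitioner's token space is 0 .. 2**127 - 1, so a balanced ring
# assigns node i the initial token i * 2**127 // N.

def balanced_tokens(n):
    """Return the initial token for each of n nodes in a balanced ring."""
    return [i * (2 ** 127) // n for i in range(n)]

for node, token in enumerate(balanced_tokens(4)):
    print("node %d: initial token %d" % (node, token))
```

Each token could then be applied with `nodetool move <token>` on the corresponding node (one node at a time, followed by `nodetool cleanup`) - though, as the feeling above suggests, that shuffle does stream a lot of data around.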
I've split the Cassandra commitlog, Cassandra data, Hadoop (HDFS) and tmp onto separate 'spindles'. Observations so far suggest late-'90s disk I/O speeds (15MB/s max sustained writes, one machine, one disk to another) and consistently inconsistent performance (an identical machine next to it, running the same task at the same time, was getting 28MB/s) over several hours.

Cassandra nodes seem to disappear too easily - even with just one core (out of four) maxed out by a jsvc task, and minimal disk or network activity, the machine feels very sluggish. Tailing the Cassandra logs hints that it's doing hinted handoffs and occasional compaction tasks. I've never seen this kind of behaviour before - and suspect it's more a feature of EC2.

Gossip now seems to be pining for the loss of an older machine (which I stupidly took offline briefly - EC2 gave it a new IP address when it came back). There's nothing in storage-conf.xml referring to the old address, and all four Cassandra daemons have been restarted several times since, but gossip occasionally (a day later) says it's looking for that node - and, more worryingly, that it is 'now part of the cluster'. I'm unsure whether this is just an irritation or part of the underlying problem.

What I'm going to do next is try importing some data into a local machine - it's just time-consuming to pull in our S3 data - to see if I can fake my way up to around the same capacity and watch for performance degradation. I'm also toying with the idea of going from 4 to 8 nodes, but I'm clueless on whether / how much that would help.

As I say, though, I'm keen on anyone else's observations on my observations - I'm painfully aware that I'm juggling a lot of unknown factors at the moment.

cheers,
Jedd.
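The 15MB/s-vs-28MB/s observations above came from ad-hoc copies between disks; a repeatable version of the same sequential-write measurement might look like the sketch below (the path and sizes are placeholders - run it once per volume and compare machines side by side):

```python
# Sketch: a crude sequential-write throughput probe, analogous to the
# dd-style disk-to-disk observations above. Path and sizes are placeholders.
import os
import tempfile
import time


def write_throughput_mb_s(path, total_mb=64, block_kb=64):
    """Write total_mb of zeroes in block_kb chunks; return MB/s achieved."""
    block = b"\0" * (block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    start = time.time()
    for _ in range(blocks):
        os.write(fd, block)
    os.fsync(fd)  # flush to the device, so we time the disk, not the page cache
    os.close(fd)
    elapsed = time.time() - start
    os.unlink(path)  # clean up the probe file
    return total_mb / elapsed


# Point this at each mounted volume in turn (commitlog, data, HDFS, tmp).
probe = os.path.join(tempfile.gettempdir(), "disk_probe.bin")
print("%.1f MB/s sequential write" % write_throughput_mb_s(probe))
```

Note the page cache still absorbs the writes until the final fsync, so small `total_mb` values will overstate throughput; a probe a few times larger than RAM-resident dirty-page limits gives steadier numbers.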