Date: Fri, 17 Sep 2010 15:41:33 +0100
From: Jedd Rashbrooke
To: user@cassandra.apache.org
Subject: Dazed and confused with Cassandra on EC2 ...

Howdi,

I've just landed in an experiment to get Cassandra going, fed by PHP via Thrift via Hadoop, all running on EC2.
I've been lurking a bit on the list for a couple of weeks, mostly reading any threads with the word 'performance' in them. Few people have anything polite to say about EC2, but I want to throw out some observations and get some feedback on whether what I'm seeing even approaches any kind of normal.

My background is mostly *nix and networking, with a half-way decent understanding of DBs -- but Cassandra, Hadoop, Thrift and EC2 are all fairly new to me.

We're using a four-node, decently-specced cluster (m2.2xlarge, if you're EC2-aware; 32GB, 4-core, if you're not :). I'm using Ubuntu with the Deb packages for Cassandra and Hadoop, and some fairly conservative tweaks to things like JVM memory (bumping it up to 4GB, then 16GB).

One of our insert jobs - a mapper-only process - was running pretty fast a few days ago: somewhere around a million lines of input, split into a dozen files, inserted via a Hadoop job in about half an hour. Happy times. That was when the cluster was modestly sized - 20-50GB. It's now about 200GB, and performance has dropped by an order of magnitude - perhaps 5-6 hours to do the same amount of work, using the same codebase and the same input data. I've read that reads slow down as the DB grows, but I had an expectation that writes would stay consistently snappy. How surprising is this performance drop, given the DB growth?

My 4-node cluster started off as a 2-node cluster - and nodetool ring now suggests the two original nodes hold 200GB each, while the newer two hold 40GB. Is this normal? Would a rebalance be likely to improve performance substantially? My feeling is that it would be expensive to perform.

EC2 seems to get a bad rap, and we're feeling quite a bit of pain, which is sad given the (on-paper) spec of the machines, and the cost - over US$3k/month for the cluster.
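To put numbers on the rebalance question: assuming RandomPartitioner (the default), a balanced N-node ring spaces initial tokens evenly across the 0 .. 2^127 - 1 token range. A minimal sketch of that arithmetic (the node count of 4 comes from the cluster described above; nothing else is specific to this setup):

```python
# Sketch: evenly spaced RandomPartitioner tokens for an N-node ring.
# RandomPartitioner's token space is 0 .. 2**127 - 1, so a balanced ring
# assigns node i the initial token i * 2**127 // N.

def balanced_tokens(n):
    """Return the initial token for each of n nodes in a balanced ring."""
    return [i * (2 ** 127) // n for i in range(n)]

for node, token in enumerate(balanced_tokens(4)):
    print("node %d: initial token %d" % (node, token))
```

Each token could then be applied with `nodetool move <token>` on the corresponding node (one node at a time, followed by `nodetool cleanup`) - though, as the feeling above suggests, that shuffle does stream a lot of data around.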
I've split the Cassandra commitlog, Cassandra data, Hadoop (HDFS) and tmp onto separate 'spindles'. Observations so far suggest late-'90s disk I/O speeds (15MB/s max sustained writes, one machine, one disk to another) and consistently inconsistent performance (an identical machine next to it, running the same task at the same time, was getting 28MB/s) over several hours.

Cassandra nodes seem to disappear too easily - even with just one core (out of four) maxed out by a jsvc task, and minimal disk or network activity, the machine feels very sluggish. Tailing the Cassandra logs hints that it's doing hinted handoffs and occasional compaction tasks. I've never seen this kind of behaviour before - and suspect it's more a feature of EC2.

Gossip now seems to be pining for the loss of an older machine (which I stupidly took offline briefly - EC2 gave it a new IP address when it came back). There's nothing in storage-conf.xml referring to the old address, and all four Cassandra daemons have been restarted several times since, but gossip occasionally (a day later) says it's looking for that node - and, more worryingly, that it is 'now part of the cluster'. I'm unsure whether this is just an irritation or part of the underlying problem.

What I'm going to do next is try importing some data into a local machine - it's just time-consuming to pull in our S3 data - to see if I can fake my way up to around the same capacity and watch for performance degradation. I'm also toying with the idea of going from 4 to 8 nodes, but I'm clueless on whether / how much that would help.

As I say, though, I'm keen on anyone else's observations on my observations - I'm painfully aware that I'm juggling a lot of unknown factors at the moment.

cheers,
Jedd.
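The 15MB/s-vs-28MB/s observations above came from ad-hoc copies between disks; a repeatable version of the same sequential-write measurement might look like the sketch below (the path and sizes are placeholders - run it once per volume and compare machines side by side):

```python
# Sketch: a crude sequential-write throughput probe, analogous to the
# dd-style disk-to-disk observations above. Path and sizes are placeholders.
import os
import tempfile
import time


def write_throughput_mb_s(path, total_mb=64, block_kb=64):
    """Write total_mb of zeroes in block_kb chunks; return MB/s achieved."""
    block = b"\0" * (block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    start = time.time()
    for _ in range(blocks):
        os.write(fd, block)
    os.fsync(fd)  # flush to the device, so we time the disk, not the page cache
    os.close(fd)
    elapsed = time.time() - start
    os.unlink(path)  # clean up the probe file
    return total_mb / elapsed


# Point this at each mounted volume in turn (commitlog, data, HDFS, tmp).
probe = os.path.join(tempfile.gettempdir(), "disk_probe.bin")
print("%.1f MB/s sequential write" % write_throughput_mb_s(probe))
```

Note the page cache still absorbs the writes until the final fsync, so small `total_mb` values will overstate throughput; a probe a few times larger than RAM-resident dirty-page limits gives steadier numbers.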