From: Matt Kennedy
To: user@cassandra.apache.org
Date: Wed, 9 Mar 2011 18:47:12 -0500
Subject: Understanding index builds

I'm trying to gain some insight into what happens with a cluster when indexes are being built, or when CFs with indexed columns are being written to.

Over the past couple of days we've been doing some loads into a CF with 29 indexed columns. Eventually the nodes just got overwhelmed and the client (Hector) started getting timeouts. We were using a MapReduce job to load an HDFS file into Cassandra, though we had limited the load job to one task per node. My confusion comes from how difficult it was to know that the nodes were becoming overwhelmed. The ring consistently reported that all nodes were up, and it did not appear that there were pending operations under tpstats. I also monitor this cluster with Ganglia, and at no point did any of the machine loads appear very high at all, yet our job kept failing with Hector reporting timeouts.
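
In case it helps, the write path in our map tasks boils down to a plain Hector mutator, roughly like the sketch below (simplified; the cluster, host, keyspace, column family and column names are placeholders, not our real schema):

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class RowLoadSketch {
    public static void main(String[] args) {
        // Placeholder cluster name, host and keyspace -- not our real setup.
        Cluster cluster = HFactory.getOrCreateCluster("OurCluster", "cass-node1:9160");
        Keyspace keyspace = HFactory.createKeyspace("OurKeyspace", cluster);
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());

        // Each map task batches one HDFS record's columns into a single mutation.
        // With 29 indexed columns, every row written also updates the index data
        // for each of those columns on the server side.
        mutator.addInsertion("row-key-0001", "IndexedCF",
                HFactory.createStringColumn("col_a", "value_a"));
        mutator.addInsertion("row-key-0001", "IndexedCF",
                HFactory.createStringColumn("col_b", "value_b"));
        mutator.execute();
    }
}

The health checks I'm relying on are just "nodetool -h <host> ring" and "nodetool -h <host> tpstats", plus Ganglia for machine-level load.
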
Today we decided to leave index creation until the end and just load the data using the same Hector code. We bumped up the Hadoop concurrency to two concurrent tasks per node, and everything went fine, as expected: we've done much larger loads than this using Hadoop, and as long as you don't shoot for too much concurrency, Cassandra can deal with it. So now we have the data in the column family, and I updated the column family metadata in the CLI to enable the 29 indexes. As soon as I do that, the ring starts reporting that nodes are down intermittently, and HintedHandoff tasks start to accumulate under tpstats. Ganglia is reporting very low overall load, so I'm wondering why it's taking so long for CLI and nodetool commands to return.
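
For completeness, the CLI change that kicked this off was just a column family metadata update along these lines (column family, column names and validation classes are abbreviated here; the real statement lists all 29 columns):

update column family IndexedCF with column_metadata =
    [{column_name: col_a, validation_class: UTF8Type, index_type: KEYS},
     {column_name: col_b, validation_class: UTF8Type, index_type: KEYS}];

My understanding is that once that statement is applied, each node starts building all of the new indexes in the background, which lines up with when the ring started flapping.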

I'm just trying to get a better handle on what kind of actions have a serious impact on cluster availability, and to know the right places to look to try to get ahead of those conditions.

Thanks for any insight you can provide,
Matt