From: Matt Kennedy
To: user@cassandra.apache.org
Date: Wed, 9 Mar 2011 18:47:12 -0500
Subject: Understanding index builds

I'm trying to gain some insight into what happens with a cluster when indexes are being built, or when CFs with indexed columns are being written to.

Over the past couple of days we've been doing some loads into a CF with 29 indexed columns. Eventually the nodes just got overwhelmed and the client (Hector) started getting timeouts. We were using a MapReduce job to load an HDFS file into Cassandra, though we had limited the load job to one task per node. My confusion comes from how difficult it was to know that the nodes were becoming overwhelmed. The ring consistently reported that all nodes were up, and it did not appear that there were pending operations under tpstats. I also monitor this cluster with Ganglia, and at no point did any of the machine loads appear very high at all, yet our job kept failing with Hector reporting timeouts.
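
In case it helps, the write path in our map tasks boils down to a plain Hector mutator, roughly like the sketch below (simplified; the cluster, host, keyspace, column family and column names are placeholders, not our real schema):

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class RowLoadSketch {
    public static void main(String[] args) {
        // Placeholder cluster name, host and keyspace -- not our real setup.
        Cluster cluster = HFactory.getOrCreateCluster("OurCluster", "cass-node1:9160");
        Keyspace keyspace = HFactory.createKeyspace("OurKeyspace", cluster);
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());

        // Each map task batches one HDFS record's columns into a single mutation.
        // With 29 indexed columns, every row written also updates the index data
        // for each of those columns on the server side.
        mutator.addInsertion("row-key-0001", "IndexedCF",
                HFactory.createStringColumn("col_a", "value_a"));
        mutator.addInsertion("row-key-0001", "IndexedCF",
                HFactory.createStringColumn("col_b", "value_b"));
        mutator.execute();
    }
}

The health checks I'm relying on are just "nodetool -h <host> ring" and "nodetool -h <host> tpstats", plus Ganglia for machine-level load.
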
Today we decided to leave index creation until the end and just load the data using the same Hector code. We bumped up the Hadoop concurrency to two concurrent tasks per node, and everything went fine, as expected: we've done much larger loads than this using Hadoop, and as long as you don't shoot for too much concurrency, Cassandra can deal with it. So now we have the data in the column family, and I updated the column family metadata in the CLI to enable the 29 indexes. As soon as I do that, the ring starts reporting that nodes are down intermittently, and HintedHandoff tasks start to accumulate under tpstats. Ganglia is reporting very low overall load, so I'm wondering why it's taking so long for CLI and nodetool commands to return.
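
For completeness, the CLI change that kicked this off was just a column family metadata update along these lines (column family, column names and validation classes are abbreviated here; the real statement lists all 29 columns):

update column family IndexedCF with column_metadata =
    [{column_name: col_a, validation_class: UTF8Type, index_type: KEYS},
     {column_name: col_b, validation_class: UTF8Type, index_type: KEYS}];

My understanding is that once that statement is applied, each node starts building all of the new indexes in the background, which lines up with when the ring started flapping.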

I'm just trying to get a better handle on what kind of actions have a serious impact on cluster availability, and to know the right places to look to try to get ahead of those conditions.

Thanks for any insight you can provide,
Matt