incubator-cassandra-user mailing list archives

From Matt Kennedy <stinkym...@gmail.com>
Subject Re: Understanding index builds (updated: crashed cluster)
Date Thu, 10 Mar 2011 23:51:57 GMT
Great, that worked, thanks for your time.

On Thu, Mar 10, 2011 at 4:57 PM, Jonathan Ellis <jbellis@gmail.com> wrote:

> Drop the index, then restart once more.  It shouldn't try to rebuild
> the index after that.
>
> On Thu, Mar 10, 2011 at 3:36 PM, Matt Kennedy <stinkymatt@gmail.com> wrote:
> > Sorry, I wasn't clear on the timeline of events.  I started the index build
> > and then posted this message to the list. Once I read the links you posted,
> > I did expect the cluster to crash, but I let it run until it blew up anyway,
> > since I didn't really know how to stop the index build.
> >
> > Which is sort of where I'm still stuck, I don't want to corrupt that column
> > family by issuing an "update column family" that has a smaller set of
> > indexes while the index build is going on without some encouragement from
> > the list that doing that won't wreck the column family. Is there a safe way
> > to tell an index build to stop after the cluster starts up from a crash due
> > to the index build?
> >
> > Thanks,
> > Matt
> >
> > On Thu, Mar 10, 2011 at 1:40 PM, Jonathan Ellis <jbellis@gmail.com> wrote:
> >>
> >> If you read the bugs I linked, you would see that this is expected
> >> behavior with 0.7.3 once you get more data than you can index
> >> in-memory.
> >>
> >> You should wait for the next Hudson build (which will include 2295)
> >> and use that.  Or, create your indexes before adding the data.
> >>
> >> On Thu, Mar 10, 2011 at 12:26 PM, Matt Kennedy <stinkymatt@gmail.com> wrote:
> >> > Well it looks like the index creation job crashed the cluster.  All of the
> >> > nodes were down having dumped out .hprof files.  I brought the cluster back
> >> > up and when I do "describe keyspace ks" it looks like the index build
> >> > process has started over again.  Is it safe to attempt to stop that by
> >> > running an "update column family" command with fewer indexes defined? Or is
> >> > there a better way to safely terminate this index creation process that I
> >> > assume will crash the cluster again eventually?
> >> >
> >> > Would creating the indexes one at a time help? Or will the same problem
> >> > occur once I get to a certain number of indexes on the column family?
> >> >
> >> > Thanks,
> >> > Matt
> >> >
> >> > On Wed, Mar 9, 2011 at 8:40 PM, Jonathan Ellis <jbellis@gmail.com> wrote:
> >> >>
> >> >> https://issues.apache.org/jira/browse/CASSANDRA-2294
> >> >> https://issues.apache.org/jira/browse/CASSANDRA-2295
> >> >>
> >> >> On Wed, Mar 9, 2011 at 5:47 PM, Matt Kennedy <stinkymatt@gmail.com> wrote:
> >> >> > I'm trying to gain some insight into what happens with a cluster when
> >> >> > indexes are being built, or when CFs with indexed columns are being
> >> >> > written to.
> >> >> >
> >> >> > Over the past couple of days we've been doing some loads into a CF with 29
> >> >> > indexed columns.  Eventually, the nodes just got overwhelmed and the client
> >> >> > (Hector) started getting timeouts.  We were using a MapReduce job to
> >> >> > load an HDFS file into Cassandra, though we had limited the load job to one
> >> >> > task per node.  My confusion comes from how difficult it was to know that
> >> >> > the nodes were becoming overwhelmed.  The ring consistently reported that
> >> >> > all nodes were up and it did not appear that there were pending operations
> >> >> > under tpstats.  I also monitor this cluster with Ganglia, and at no point
> >> >> > did any of the machine loads appear very high at all, yet our job kept
> >> >> > failing with Hector reporting timeouts.
> >> >> >
> >> >> > Today we decided to leave index creation until the end, and just load the
> >> >> > data using the same Hector code.  We bumped up the Hadoop concurrency to two
> >> >> > concurrent tasks per node, and everything went fine, as expected; we've done
> >> >> > much larger loads than this using Hadoop and as long as you don't shoot for
> >> >> > too much concurrency, Cassandra can deal with it.  So now we have the data
> >> >> > in the column family and I updated the column family metadata in the CLI to
> >> >> > enable the 29 indexes.  As soon as I do that, the ring starts reporting that
> >> >> > nodes are down intermittently, and HintedHandoffs are starting to accumulate
> >> >> > under tpstats. Ganglia is reporting very low overall load, so I'm wondering
> >> >> > why it's taking so long for cli and nodetool commands to return.
> >> >> >
> >> >> > I'm just trying to get a better handle on what kind of actions have a
> >> >> > serious impact on cluster availability and to know the right places to look
> >> >> > to try to get ahead of those conditions.
> >> >> >
> >> >> > Thanks for any insight you can provide,
> >> >> > Matt
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Jonathan Ellis
> >> >> Project Chair, Apache Cassandra
> >> >> co-founder of DataStax, the source for professional Cassandra support
> >> >> http://www.datastax.com
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Jonathan Ellis
> >> Project Chair, Apache Cassandra
> >> co-founder of DataStax, the source for professional Cassandra support
> >> http://www.datastax.com
> >
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>
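
For readers following the "drop the index, then restart" advice: the thread never shows the exact command, so the following is only a sketch of one way it might be done. It assumes the 0.7-era cassandra-cli behaviour where an "update column family ... with column_metadata = [...]" statement replaces the whole metadata list, so a column listed without an index_type ends up unindexed. The column family name "Events", the column names, and the validators are hypothetical, and the statement syntax should be checked against the CLI's built-in help for your version before running it.

```python
def cli_update_without_indexes(column_family, columns):
    """Build a cassandra-cli statement that redefines column_metadata
    with no index_type on any column (assumed 0.7-era CLI syntax).

    `columns` is a list of (column_name, validation_class) pairs.
    """
    entries = ", ".join(
        "{column_name: %s, validation_class: %s}" % (name, validator)
        for name, validator in columns
    )
    return "update column family %s with column_metadata = [%s];" % (
        column_family,
        entries,
    )


# Hypothetical column family and columns, for illustration only.
statement = cli_update_without_indexes(
    "Events",
    [("user_id", "UTF8Type"), ("created", "LongType")],
)
print(statement)
```

The generated statement would be pasted into the CLI before the restart. Building it from a list keeps all 29 column definitions in one place, which matters if the metadata list is replaced wholesale as assumed here.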
