cassandra-user mailing list archives

From aaron morton
Subject Re: practice failure recovery
Date Tue, 26 Apr 2011 21:09:27 GMT
In 0.7.X the CLI waits for the schema to agree before returning. You should see...

Waiting for schema agreement...
... schemas agree across the cluster

Or, if things fail:

The schema has not settled in %d seconds; further migrations are ill-advised until it does.%nVersions are %s%n
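
A script that issues several migrations in a row (dropping many keyspaces, for example) can apply the same wait itself between steps. The following is only a minimal sketch, not the CLI's actual code. It assumes a caller-supplied function, here called fetch_versions (a hypothetical name), that returns a map of schema version to the list of nodes reporting it, which is the shape returned by the Thrift describe_schema_versions() call. It also assumes that unreachable nodes are grouped under the key "UNREACHABLE" and chooses to ignore them; whether that is appropriate depends on your cluster.

```python
import time

# Assumption: key under which down nodes are grouped in the versions map.
UNREACHABLE = "UNREACHABLE"


def schemas_agree(versions):
    """Return True when all reachable nodes report exactly one schema version.

    `versions` maps schema version -> list of node addresses.
    """
    live = [v for v in versions if v != UNREACHABLE]
    return len(live) == 1


def wait_for_schema_agreement(fetch_versions, timeout=10.0, poll_interval=1.0,
                              sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_versions() until the schema settles or the timeout expires.

    Returns True on agreement, False if the schema has not settled in time.
    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        if schemas_agree(fetch_versions()):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

A cleanup script would call wait_for_schema_agreement after each drop and stop issuing further migrations as soon as it returns False.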

WRT the error, my first guess is that something in the schema has changed and it's upsetting the commit log replay.
Given all the crazy, I'd go with the nuclear option.

On 27 Apr 2011, at 07:11, William Oberman wrote:

> In my test cluster I managed to jam up a Cassandra server.  I figure the easy & failsafe
> solution is to just boot a replacement node, but I thought I'd take a minute to either figure
> out what I did, or try to figure out how to properly recover it before I lose my current state.
> The symptom = on startup I get an exception:
> ERROR 11:58:34,567 Exception encountered during startup.
> java.lang.IndexOutOfBoundsException: 6
>         at java.nio.HeapByteBuffer.get(
>         at org.apache.cassandra.db.marshal.TimeUUIDType.compareTimestampBytes(
>         at
>         at
>         at java.util.concurrent.ConcurrentSkipListMap$ComparableUsingComparator.compareTo(
>         at java.util.concurrent.ConcurrentSkipListMap.findPredecessor(
>         at java.util.concurrent.ConcurrentSkipListMap.doPut(
>         at java.util.concurrent.ConcurrentSkipListMap.putIfAbsent(
>         at org.apache.cassandra.db.ColumnFamily.addColumn(
>         at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(
>         at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(
>         at org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(
>         at org.apache.cassandra.db.commitlog.CommitLog.recover(
>         at org.apache.cassandra.db.commitlog.CommitLog.recover(
>         at org.apache.cassandra.service.AbstractCassandraDaemon.setup(
>         at org.apache.cassandra.service.AbstractCassandraDaemon.activate(
>         at org.apache.cassandra.thrift.CassandraDaemon.main(
> Where things went wrong = I had been doing various testing and unit testing, as this
> is my "proof of concept" cluster.  The unit tests in particular work by cloning a keyspace
> as "keyspace_UUID" (to get a blank slate).  Because of various bugs in my code and configuration,
> this left a fair amount of crud keyspaces by the time I got everything to pass.  So, I wrote
> a script to drop all of the test keyspaces (the script had worked on a single-node environment,
> which was my first step before the cluster).  I think the CLI doesn't wait for schema propagation,
> so the script confused the node I was talking to, as after it ran the schema UUIDs of that
> node vs. the rest of the cluster didn't agree ("describe cluster" in the CLI).  And, it wasn't
> fixing itself.  "nodetool loadbalance" said it would do a decommission/bootstrap, which I
> thought might give the bad node a kick in the pants, so I tried it.  Afterwards, I ran "nodetool
> ring" against all nodes and the problem node claimed all was "UP", but everything else listed
> the problem node as "?" and everything else as UP (sadly, I either didn't check or can't remember
> what "nodetool ring" said before loadbalance).  So, I shut down the problem node.  But, when
> I tried to restart it, I got the error you see above.
> Not sure what was the worst/dumbest thing I did, but it's definitely unhappy now!
