Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Subject: Re: practice failure recovery
From: aaron morton <aaron@thelastpickle.com>
In-Reply-To: <BANLkTi=wo6ngvW-L20zwN8+7Mdxjz-Ojjg@mail.gmail.com>
Date: Wed, 27 Apr 2011 09:09:27 +1200
Message-Id: <8654407B-C2CB-463C-9FF5-074A2A2D81DF@thelastpickle.com>
References: <BANLkTi=wo6ngvW-L20zwN8+7Mdxjz-Ojjg@mail.gmail.com>
To: user@cassandra.apache.org

In 0.7.X the CLI waits for the schema to agree before returning. You
should see:

Waiting for schema agreement...
... schemas agree across the cluster

Or, if things fail:

The schema has not settled in %d seconds; further migrations are ill-advised until it does.%nVersions are %s%n
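
For a script that issues several schema changes in a row, the same wait can be done by hand between changes. The following is only a rough sketch of that idea, not the CLI's actual code. The names wait_for_schema_agreement, drop_keyspaces, get_versions and drop_keyspace are hypothetical; get_versions is assumed to be a callable you supply that returns a mapping of schema version to the hosts reporting it (for example, built on whatever your client exposes for "describe cluster").

```python
import time
from typing import Callable, Dict, Iterable, List


def wait_for_schema_agreement(
    get_versions: Callable[[], Dict[str, List[str]]],
    timeout: float = 10.0,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until exactly one schema version is reported, or give up.

    get_versions is a caller-supplied (hypothetical) function returning
    {schema_version: [hosts reporting that version]}.
    """
    polls = max(1, int(timeout / poll_interval))
    for attempt in range(polls):
        versions = get_versions()
        if len(versions) == 1:
            return True  # every node reports the same version
        if attempt < polls - 1:
            sleep(poll_interval)
    return False


def drop_keyspaces(
    names: Iterable[str],
    drop_keyspace: Callable[[str], None],
    get_versions: Callable[[], Dict[str, List[str]]],
    **wait_kwargs,
) -> List[str]:
    """Drop keyspaces one at a time, stopping if the schema does not settle."""
    dropped = []
    for name in names:
        drop_keyspace(name)  # caller-supplied (hypothetical) drop call
        dropped.append(name)
        if not wait_for_schema_agreement(get_versions, **wait_kwargs):
            raise RuntimeError(
                "schema has not settled after dropping %r; stopping" % name
            )
    return dropped
```

The point of stopping on the first failure is the same as the CLI message above: further migrations are ill-advised until the versions agree.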

WRT the error, my first guess is that something in the schema has changed
and it's upsetting the commit log replay. Given all the crazy, I'd go with
the nuclear option.
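
To illustrate that guess: the trace fails inside the TimeUUIDType comparator on byte index 6, which would be consistent with a column name that is not a 16-byte UUID being replayed into a column family whose comparator expects one. A minimal sketch of that kind of sanity check follows, assuming you have the raw column name bytes in hand; is_time_uuid is a hypothetical helper, not anything in Cassandra.

```python
import uuid


def is_time_uuid(name: bytes) -> bool:
    """Return True if name looks like a version 1 (time-based) UUID."""
    if len(name) != 16:
        # Too short or too long to be a UUID at all; a comparator that
        # reads fixed byte offsets from it could run off the end.
        return False
    return uuid.UUID(bytes=name).version == 1


# A real TimeUUID passes; a short name or a random (v4) UUID does not.
print(is_time_uuid(uuid.uuid1().bytes))
print(is_time_uuid(b"col1"))
print(is_time_uuid(uuid.uuid4().bytes))
```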

Aaron

On 27 Apr 2011, at 07:11, William Oberman wrote:

> In my test cluster I managed to jam up a Cassandra server. I figure the
> easy and failsafe solution is to just boot a replacement node, but I
> thought I'd take a minute to either figure out what I did or figure out
> how to properly recover it before I lose my current state.
>
> The symptom = on startup I get an exception:
> ERROR 11:58:34,567 Exception encountered during startup.
> java.lang.IndexOutOfBoundsException: 6
>         at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:121)
>         at org.apache.cassandra.db.marshal.TimeUUIDType.compareTimestampBytes(TimeUUIDType.java:56)
>         at org.apache.cassandra.db.marshal.TimeUUIDType.compare(TimeUUIDType.java:45)
>         at org.apache.cassandra.db.marshal.TimeUUIDType.compare(TimeUUIDType.java:29)
>         at java.util.concurrent.ConcurrentSkipListMap$ComparableUsingComparator.compareTo(ConcurrentSkipListMap.java:606)
>         at java.util.concurrent.ConcurrentSkipListMap.findPredecessor(ConcurrentSkipListMap.java:685)
>         at java.util.concurrent.ConcurrentSkipListMap.doPut(ConcurrentSkipListMap.java:864)
>         at java.util.concurrent.ConcurrentSkipListMap.putIfAbsent(ConcurrentSkipListMap.java:1893)
>         at org.apache.cassandra.db.ColumnFamily.addColumn(ColumnFamily.java:216)
>         at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:130)
>         at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:120)
>         at org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:380)
>         at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:253)
>         at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:156)
>         at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:173)
>         at org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:314)
>         at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:79)
>
> Where things went wrong = I had been doing various testing and unit
> testing, as this is my "proof of concept" cluster. The unit tests in
> particular work by cloning a keyspace as "keyspace_UUID" (to get a blank
> slate). Because of various bugs in my code and configuration, this left
> a fair number of crud keyspaces by the time I got everything to pass.
> So I wrote a script to drop all of the test keyspaces (the script had
> worked on a single-node environment, which was my first step before the
> cluster).
>
> I think the CLI doesn't wait for schema propagation, so the script
> confused the node I was talking to: after it ran, the schema UUIDs of
> that node and the rest of the cluster didn't agree ("describe cluster"
> in the CLI), and it wasn't fixing itself. "nodetool loadbalance" said it
> would do a decommission/bootstrap, which I thought might give the bad
> node a kick in the pants, so I tried it. Afterwards, I ran "nodetool
> ring" against all nodes. The problem node claimed everything was "UP",
> but every other node listed the problem node as "?" and everything else
> as UP (sadly, I either didn't check or can't remember what "nodetool
> ring" said before loadbalance). So I shut down the problem node. But
> when I tried to restart it, I got the error you see above.
>
> Not sure what was the worst/dumbest thing I did, but it's definitely
> unhappy now!