Well, it seems I have nothing like this when I run grep "Unknown host" /var/log/cassandra/system.log.

This issue was reported against 1.2.1 and committed to trunk. It may have been fixed in 1.2.2, though I can't see the fix version in the JIRA ticket, nor can I find it in the changelog.
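One way I know to double-check whether a fix landed in a given release is to grep the CHANGES.txt shipped with it (the path below assumes the binary tarball layout; adjust to wherever your distribution puts it):

  $ grep -n 'CASSANDRA-5299' apache-cassandra-1.2.2/CHANGES.txt

If the ticket number doesn't show up under the 1.2.2 section, the fix most likely isn't in that release.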

Thanks again, even if I am still in trouble.


2013/3/14 Michal Michalski <michalm@opera.com>
Just to make it clear: this bug will occur on a single-DC configuration too.

In our case it resulted in an exception like this at the very end of node startup:

ERROR [WRITE-/<SOME-IP>] 2013-02-27 12:14:55,433 CassandraDaemon.java (line 133) Exception in thread Thread[WRITE-/<SOME-IP>,5,main]
java.lang.RuntimeException: Unknown host /0.0.0.0 with no default configured

It will happen if your rpc_address is set to 0.0.0.0.
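A quick way to check whether a node is configured this way (the config path below is the Debian package default; adjust to your layout):

  $ grep -E '^(rpc_address|listen_address)' /etc/cassandra/cassandra.yaml

If that prints rpc_address: 0.0.0.0, the node is set up the way that triggers the exception above.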

M.

On 14.03.2013 13:03, Alain RODRIGUEZ wrote:

Thanks for this pointer, but I don't think this is the source of our problem,
since we use a single data center and Ec2Snitch.



2013/3/14 Jean-Armel Luce <jaluce06@gmail.com>

Hi Alain,

Maybe it is due to https://issues.apache.org/jira/browse/CASSANDRA-5299

A patch is provided with this ticket.

Regards.

Jean Armel


2013/3/14 Alain RODRIGUEZ <arodrime@gmail.com>

Hi

We just tried to migrate our production cluster from C* 1.1.6 to 1.2.2.

This has been a disaster. I just switched one node to 1.2.2, updated its
configuration (cassandra.yaml / cassandra-env.sh) and restarted it.
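
(For anyone following along, a single-node switch like this one usually looks
something like the following; package and service names assume a Debian-style
install, so adapt as needed:)

  $ nodetool drain                        # flush memtables, stop accepting writes
  $ sudo service cassandra stop
  $ sudo apt-get install cassandra=1.2.2  # or unpack the 1.2.2 tarball
  # merge local settings into the new cassandra.yaml / cassandra-env.sh
  $ sudo service cassandra start
  $ tail -f /var/log/cassandra/system.log # watch the node come back up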

It resulted in errors on all 5 remaining 1.1.6 nodes:

ERROR [RequestResponseStage:2] 2013-03-14 09:53:25,750 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[RequestResponseStage:2,5,main]
java.io.IOError: java.io.EOFException
         at org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:71)
         at org.apache.cassandra.service.ReadCallback.response(ReadCallback.java:155)
         at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:45)
         at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
         at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.EOFException
         at java.io.DataInputStream.readFully(DataInputStream.java:180)
         at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:100)
         at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:81)
         at org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:64)
         ... 6 more

I got this many times, and my entire cluster wasn't reachable by any of our
4 clients (phpCassa, Hector, Cassie, Helenus).

I decommissioned the 1.2.2 node to get our cluster answering queries. It
worked.
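
(For the record, what I mean by "decommissioned" is the standard removal,
run on the leaving node itself:)

  $ nodetool decommission   # streams this node's ranges back to the others and leaves the ring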

Then I tried to replace this node with a new C* 1.1.6 one, using the same
token as the previously decommissioned node. The node joined the ring and,
before receiving any data, switched to normal status.
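
You can see this from any other node in the ring (the host is whichever live
node you point nodetool at):

  $ nodetool -h <any-live-node> ring
  # a joining node should stream data before flipping to Normal;
  # here the replacement showed up as Up/Normal with almost no load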

On all the other nodes I had:

ERROR [MutationStage:8] 2013-03-14 10:21:01,288 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[MutationStage:8,5,main]
java.lang.AssertionError
         at org.apache.cassandra.locator.TokenMetadata.getToken(TokenMetadata.java:304)
         at org.apache.cassandra.service.StorageProxy$5.runMayThrow(StorageProxy.java:371)
         at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
         at java.lang.Thread.run(Thread.java:662)

So I decommissioned this new 1.1.6 node, and we are now running with 5
servers, unbalanced along the ring, with no possibility of adding nodes
or of upgrading the C* version.

We are quite desperate over here.

If someone has any idea of what could have happened and how to stabilize the
cluster, it would be very much appreciated.

It's quite an emergency since we can't add nodes and are under heavy load.