Hi.

I think that "unsafeAssassinateEndpoint" was the right solution here. I was going to point you to it after reading the first part of your message.

"Does anyone know why the dead nodes still appear when we run "nodetool gossipinfo" but they don't when we run "describe cluster" from the CLI?"

That's fine. The Gossiper just keeps this information for a while (7 or 10 days by default, off the top of my head). It doesn't harm your cluster in any way, whereas having "UNREACHABLE" nodes could have been annoying. By the way, gossipinfo shows those nodes as "STATUS:LEFT", which is good. I am quite sure that this status changed when you invoked "unsafeAssassinateEndpoint" through JMX.
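
If you want to see what the last number on the "STATUS:LEFT" lines corresponds to, here is a small Python sketch. It assumes that this third field is a Unix timestamp in milliseconds, which is my guess from its magnitude and not something I have checked against the 1.0.11 source. The function name left_entries is just made up for this example.

```python
from datetime import datetime, timezone

def left_entries(gossipinfo_text):
    """Yield (token, utc_datetime) for every STATUS:LEFT line.

    Assumption: the third comma-separated field is epoch milliseconds.
    """
    for line in gossipinfo_text.splitlines():
        line = line.strip()
        if not line.startswith("STATUS:LEFT,"):
            continue
        fields = line.split(",")
        if len(fields) < 3:
            continue  # no timestamp-like field on this line
        token, millis = fields[1], int(fields[2])
        # Integer division drops the milliseconds; second precision is enough here.
        yield token, datetime.fromtimestamp(millis // 1000, tz=timezone.utc)

# Lines copied from the gossipinfo output quoted below.
sample = """\
STATUS:LEFT,42537039300520238181471502256297362072,1369471488145
STATUS:LEFT,42537092606577173116506557155915918934,1369471275829
STATUS:NORMAL,0
"""

for token, when in left_entries(sample):
    print(token, when.strftime("%Y-%m-%d %H:%M:%S UTC"))
```

Under that assumption the values you posted decode to 2013-05-25, a few days after your message, so they may well be the time at which gossip will drop the entry rather than the time the node was removed.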

"do a full cluster restart (I presume that means a rolling restart - not shut-down the entire cluster right???). "

A full restart => entire cluster down => downtime. It is precisely *not* a rolling restart.

To conclude, from what I can see your cluster seems healthy now: you have no more ghost nodes and nothing left to do. Just wait a week or so and check gossipinfo again.
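
When you check again, something like the following Python sketch could do the comparison for you. It only looks at the "STATUS:" lines in text you have already captured from "nodetool gossipinfo" and reports any that are not NORMAL. The function name non_normal_statuses is hypothetical, and the sample text is a shortened copy of your own output.

```python
def non_normal_statuses(gossipinfo_text):
    """Return the STATUS lines whose state is anything other than NORMAL."""
    leftovers = []
    for line in gossipinfo_text.splitlines():
        line = line.strip()
        if not line.startswith("STATUS:"):
            continue
        # "STATUS:LEFT,<token>,<number>" -> state is "LEFT"
        state = line[len("STATUS:"):].split(",")[0]
        if state != "NORMAL":
            leftovers.append(line)
    return leftovers

# Shortened copy of the output from the original message.
sample = """\
RELEASE_VERSION:1.0.11
STATUS:NORMAL,56713727820156410577229101238628035243
REMOVAL_COORDINATOR:REMOVER,1
STATUS:LEFT,42537092606577173116506557155915918934,1369471275829
STATUS:NORMAL,0
"""

remaining = non_normal_statuses(sample)
if remaining:
    print("Still in gossip:")
    for entry in remaining:
        print("  " + entry)
else:
    print("Only NORMAL nodes left in gossip.")
```

Once the old entries have expired, the same check on fresh output should report only NORMAL nodes.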


2013/5/22 Vasileios Vlachos <vasileiosvlachos@gmail.com>
Hello All,

A while ago we had 3 Cassandra nodes on Amazon. At some point we decided to buy some servers and deploy Cassandra there. The problem is that since then we have had a list of dead IPs shown as UNREACHABLE nodes when we run "describe cluster" in cassandra-cli.

I have seen other posts which describe similar issues, and the bottom line is "it's harmless but if you want to get rid of it do a full cluster restart" (I presume that means a rolling restart - not shut-down the entire cluster right???). Anyway... 

We also came across another solution: install "libmx4j-java", uncomment the respective line in "/etc/default/cassandra", restart the node, go to "http://cassandra_node:8081/mbean?objectname=org.apache.cassandra.net%3Atype%3DGossiper", type in the dead IP/IPs next to "unsafeAssassinateEndpoint" and invoke it. So we did that on one of the nodes for the list of dead IPs. After running "describe cluster" on the CLI on every node, we noticed that there were no UNREACHABLE nodes and everything looked OK.

However, when we run "nodetool gossipinfo" we get the following output:

RELEASE_VERSION:1.0.11
SCHEMA:b1116df0-b3dd-11e2-0000-16fe4da5dbff
LOAD:2.76851457173E11
RPC_ADDRESS:0.0.0.0
STATUS:NORMAL,56713727820156410577229101238628035243
REMOVAL_COORDINATOR:REMOVER,113427455640312821154458202477256070486
STATUS:LEFT,42537039300520238181471502256297362072,1369471488145
REMOVAL_COORDINATOR:REMOVER,1
STATUS:LEFT,42537092606577173116506557155915918934,1369471275829
RELEASE_VERSION:1.0.11
SCHEMA:b1116df0-b3dd-11e2-0000-16fe4da5dbff
LOAD:2.75649392881E11
RPC_ADDRESS:0.0.0.0
STATUS:NORMAL,85070591730234615865843651857942052863
RELEASE_VERSION:1.0.11
SCHEMA:b1116df0-b3dd-11e2-0000-16fe4da5dbff
LOAD:2.71158702006E11
RPC_ADDRESS:0.0.0.0
STATUS:NORMAL,141784319550391026443072753096570088105
RELEASE_VERSION:1.0.11
SCHEMA:b1116df0-b3dd-11e2-0000-16fe4da5dbff
LOAD:2.73163150773E11
RPC_ADDRESS:0.0.0.0
STATUS:NORMAL,113427455640312821154458202477256070486
REMOVAL_COORDINATOR:REMOVER,1
STATUS:LEFT,42537092606577173116506557155915918934,1369471567719
RELEASE_VERSION:1.0.11
SCHEMA:b1116df0-b3dd-11e2-0000-16fe4da5dbff
LOAD:2.72271268395E11
RPC_ADDRESS:0.0.0.0
STATUS:NORMAL,28356863910078205288614550619314017621
RELEASE_VERSION:1.0.11
SCHEMA:b1116df0-b3dd-11e2-0000-16fe4da5dbff
LOAD:2.71494331357E11
RPC_ADDRESS:0.0.0.0
STATUS:NORMAL,0

Does anyone know why the dead nodes still appear when we run "nodetool gossipinfo" but they don't when we run "describe cluster" from the CLI?

Thank you in advance for your help,

Vasilis