incubator-cassandra-user mailing list archives

From Brian Tarbox <tar...@cabotresearch.com>
Subject Re: Unreachable Nodes
Date Wed, 22 May 2013 12:10:54 GMT
Have to disagree with the "does no harm" comment just a tiny bit. I had a
similar situation recently and coincidentally needed to do a CF truncate.
The system rejected the request, saying that not all nodes were up.
Nodetool ring said everyone was up, but nodetool gossipinfo said there were
vestiges of dead nodes still hanging around. I ended up restarting the
entire cluster, which cleared the issue.
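
For anyone who wants to check for such vestiges before a truncate, here is a
minimal sketch that scans "nodetool gossipinfo" text for endpoints whose
STATUS starts with LEFT. The function names parse_gossipinfo and
left_endpoints are hypothetical, and the parser only assumes the layout shown
in the output quoted further down this thread (a "/address" line followed by
KEY:VALUE lines); other Cassandra versions may print something different.

```python
def parse_gossipinfo(text):
    """Map each endpoint address to a dict of its gossip states."""
    nodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("/"):
            # A line such as "/10.1.32.97" opens a new endpoint section.
            current = line[1:]
            nodes[current] = {}
        elif current is not None and ":" in line:
            # Split on the first colon only; values may contain commas.
            key, _, value = line.partition(":")
            nodes[current][key] = value
    return nodes


def left_endpoints(nodes):
    """Return the sorted endpoints whose STATUS begins with LEFT."""
    return sorted(
        endpoint
        for endpoint, states in nodes.items()
        if states.get("STATUS", "").split(",")[0] == "LEFT"
    )


# Sample lines copied from the gossipinfo output quoted in this thread.
sample = """\
/10.1.32.97
 RELEASE_VERSION:1.0.11
STATUS:NORMAL,56713727820156410577229101238628035243
/10.128.16.111
REMOVAL_COORDINATOR:REMOVER,113427455640312821154458202477256070486
STATUS:LEFT,42537039300520238181471502256297362072,1369471488145
"""

print(left_endpoints(parse_gossipinfo(sample)))
```

A non-empty result means gossip still remembers nodes that have left, even if
nodetool ring shows everyone as up.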

Brian


On Wed, May 22, 2013 at 6:46 AM, Vasileios Vlachos <
vasileiosvlachos@gmail.com> wrote:

> Hello,
>
> Thanks for your fast response. That makes sense. I'll just keep an eye on
> it then.
>
> Many thanks,
>
> Vasilis
>
>
> On Wed, May 22, 2013 at 10:54 AM, Alain RODRIGUEZ <arodrime@gmail.com> wrote:
>
>> Hi.
>>
>> I think that "unsafeAssassinateEndpoint" was the right solution here.
>> I was going to point you to it after reading the first part of
>> your message.
>>
>> "Does anyone know why the dead nodes still appear when we run "nodetool
>> gossipinfo" but they don't when we run "describe cluster" from the CLI?"
>>
>> That's a good thing. The Gossiper just keeps this information for a while
>> (7 or 10 days by default, off the top of my head). It doesn't harm your
>> cluster in any way, whereas having "UNREACHABLE" nodes could have been
>> annoying. By the way, gossipinfo shows you those nodes as "STATUS:LEFT",
>> which is good. I am quite sure that this status changed when you used the
>> JMX "unsafeAssassinateEndpoint".
>>
>> "do a full cluster restart (I presume that means a rolling restart - not
>> shut-down the entire cluster right???). "
>>
>> A full restart => entire cluster down => down time. It is precisely *not*
>> a rolling restart.
>>
>> To conclude, I would say that your cluster seems healthy now (from what I
>> can see): you have no more ghost nodes and nothing to do. Just wait a week
>> or so and check gossipinfo again.
>>
>>
>> 2013/5/22 Vasileios Vlachos <vasileiosvlachos@gmail.com>
>>
>>> Hello All,
>>>
>>> A while ago we had 3 Cassandra nodes on Amazon. At some point we decided
>>> to buy some servers and deploy Cassandra there. The problem is that since
>>> then we have had a list of dead IPs listed as UNREACHABLE nodes when we run
>>> describe cluster on cassandra-cli.
>>>
>>> I have seen other posts which describe similar issues, and the bottom
>>> line is "it's harmless but if you want to get rid of it do a full cluster
>>> restart" (I presume that means a rolling restart - not shut-down the entire
>>> cluster right???). Anyway...
>>>
>>> We also came across another solution: install "libmx4j-java", uncomment
>>> the respective line in "/etc/default/cassandra", restart the node, go to "
>>> http://cassandra_node:8081/mbean?objectname=org.apache.cassandra.net%3Atype%3DGossiper",
>>> type in the dead IP/IPs next to the "unsafeAssassinateEndpoint" and invoke
>>> it. So we did that on one of the nodes for the list of dead IPs. After
>>> running "describe cluster" on the CLI on every node, we noticed that there
>>> were no UNREACHABLE nodes and everything looked OK.
>>>
>>> However, when we run "nodetool gossipinfo" we get the following output:
>>>
>>> /10.1.32.97
>>> RELEASE_VERSION:1.0.11
>>> SCHEMA:b1116df0-b3dd-11e2-0000-16fe4da5dbff
>>> LOAD:2.76851457173E11
>>> RPC_ADDRESS:0.0.0.0
>>> STATUS:NORMAL,56713727820156410577229101238628035243
>>> /10.128.16.111
>>> REMOVAL_COORDINATOR:REMOVER,113427455640312821154458202477256070486
>>> STATUS:LEFT,42537039300520238181471502256297362072,1369471488145
>>> /10.128.16.110
>>> REMOVAL_COORDINATOR:REMOVER,1
>>> STATUS:LEFT,42537092606577173116506557155915918934,1369471275829
>>> /10.1.32.100
>>> RELEASE_VERSION:1.0.11
>>> SCHEMA:b1116df0-b3dd-11e2-0000-16fe4da5dbff
>>> LOAD:2.75649392881E11
>>> RPC_ADDRESS:0.0.0.0
>>> STATUS:NORMAL,85070591730234615865843651857942052863
>>> /10.1.32.101
>>> RELEASE_VERSION:1.0.11
>>> SCHEMA:b1116df0-b3dd-11e2-0000-16fe4da5dbff
>>> LOAD:2.71158702006E11
>>> RPC_ADDRESS:0.0.0.0
>>> STATUS:NORMAL,141784319550391026443072753096570088105
>>> /10.1.32.98
>>> RELEASE_VERSION:1.0.11
>>> SCHEMA:b1116df0-b3dd-11e2-0000-16fe4da5dbff
>>> LOAD:2.73163150773E11
>>> RPC_ADDRESS:0.0.0.0
>>> STATUS:NORMAL,113427455640312821154458202477256070486
>>> /10.128.16.112
>>> REMOVAL_COORDINATOR:REMOVER,1
>>> STATUS:LEFT,42537092606577173116506557155915918934,1369471567719
>>> /10.1.32.99
>>> RELEASE_VERSION:1.0.11
>>> SCHEMA:b1116df0-b3dd-11e2-0000-16fe4da5dbff
>>> LOAD:2.72271268395E11
>>> RPC_ADDRESS:0.0.0.0
>>> STATUS:NORMAL,28356863910078205288614550619314017621
>>> /10.1.32.96
>>> RELEASE_VERSION:1.0.11
>>> SCHEMA:b1116df0-b3dd-11e2-0000-16fe4da5dbff
>>> LOAD:2.71494331357E11
>>> RPC_ADDRESS:0.0.0.0
>>> STATUS:NORMAL,0
>>>
>>> Does anyone know why the dead nodes still appear when we run "nodetool
>>> gossipinfo" but they don't when we run "describe cluster" from the CLI?
>>>
>>> Thank you in advance for your help,
>>>
>>> Vasilis
>>>
>>
>>
>
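
The STATUS:LEFT lines in the gossipinfo output quoted above end in a 13-digit
number. Nobody in the thread says what it is; the sketch below assumes it is
a Unix timestamp in milliseconds marking when gossip will forget the node,
which is a guess based on its shape and not something confirmed here. The
helper name left_timestamp is hypothetical.

```python
from datetime import datetime, timezone


def left_timestamp(status_value):
    """Return the trailing field of a LEFT status as a UTC datetime.

    ASSUMPTION: the third comma-separated field is epoch milliseconds.
    Returns None for any status that is not LEFT or has no third field.
    """
    fields = status_value.split(",")
    if fields[0] != "LEFT" or len(fields) < 3:
        return None
    # Integer division drops the milliseconds and avoids float rounding.
    return datetime.fromtimestamp(int(fields[2]) // 1000, tz=timezone.utc)


# Status values copied from the gossipinfo output in this thread.
statuses = [
    "LEFT,42537039300520238181471502256297362072,1369471488145",
    "NORMAL,56713727820156410577229101238628035243",
]

for status in statuses:
    print(status.split(",")[0], left_timestamp(status))
```

If the assumption holds, comparing that time with the current date shows
roughly how long the LEFT entries should remain visible in gossipinfo.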
