incubator-cassandra-user mailing list archives

From Alain RODRIGUEZ <arodr...@gmail.com>
Subject Re: Unreachable Nodes
Date Wed, 22 May 2013 13:35:33 GMT
I had to face this too, but in my case "unsafeAssassinateEndpoint" is precisely
what removed the "UNREACHABLE" nodes (from describe cluster - CLI). After that,
these ghost hosts were marked as "STATUS:LEFT" in gossipinfo (nodetool) and
my truncate could run properly. But this is only my own experience, and you
might want to listen to Brian, who probably has more experience than I do, and
restart your cluster. I guess it also depends on how much you need to use
truncate and whether you can afford downtime or not.

But I really think that, at this point, you can run a truncate.
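
To check which endpoints gossip still remembers before trying the truncate, a small script along these lines could help. It is only a sketch: it assumes the "nodetool gossipinfo" text looks like the output quoted further down in this thread, and it assumes the third field of a STATUS:LEFT value is an expiry time in epoch milliseconds. The names parse_gossipinfo and left_endpoints are made up for this example, not part of Cassandra or nodetool.

```python
from datetime import datetime, timezone

# Shortened sample taken from the gossipinfo output quoted in this thread.
SAMPLE = """\
/10.1.32.97
RELEASE_VERSION:1.0.11
STATUS:NORMAL,56713727820156410577229101238628035243
/10.128.16.111
REMOVAL_COORDINATOR:REMOVER,113427455640312821154458202477256070486
STATUS:LEFT,42537039300520238181471502256297362072,1369471488145
"""


def parse_gossipinfo(text):
    """Group KEY:VALUE lines under the "/ip" line that precedes them."""
    endpoints = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("/"):
            current = line[1:]
            endpoints[current] = {}
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            endpoints[current][key] = value
    return endpoints


def left_endpoints(endpoints):
    """Return {ip: expiry} for every endpoint whose STATUS starts with LEFT.

    Assumption: the third comma-separated field is an expiry time in
    epoch milliseconds; expiry is None when that field is missing.
    """
    result = {}
    for ip, fields in endpoints.items():
        parts = fields.get("STATUS", "").split(",")
        if parts[0] == "LEFT":
            expiry = None
            if len(parts) >= 3 and parts[2].isdigit():
                expiry = datetime.fromtimestamp(int(parts[2]) // 1000, tz=timezone.utc)
            result[ip] = expiry
    return result


if __name__ == "__main__":
    for ip, expiry in left_endpoints(parse_gossipinfo(SAMPLE)).items():
        print(ip, expiry)
```

In practice you would feed it the real output of "nodetool gossipinfo" instead of SAMPLE. Endpoints that show up here are the ones gossip has not forgotten yet.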

Alain


2013/5/22 Brian Tarbox <tarbox@cabotresearch.com>

> Have to disagree with the "does no harm" comment just a tiny bit.  I had a
> similar situation recently and coincidentally needed to do a CF truncate.
>  The system rejected the request saying that not all nodes were up.
>  Nodetool ring said everyone was up but nodetool gossipinfo said there were
> vestiges of dead nodes still hanging around.  I ended up restarting the
> entire cluster which cleared the issue.
>
> Brian
>
>
> On Wed, May 22, 2013 at 6:46 AM, Vasileios Vlachos <
> vasileiosvlachos@gmail.com> wrote:
>
>> Hello,
>>
>> Thanks for your fast response. That makes sense. I'll just keep an eye on
>> it then.
>>
>> Many thanks,
>>
>> Vasilis
>>
>>
>> On Wed, May 22, 2013 at 10:54 AM, Alain RODRIGUEZ <arodrime@gmail.com> wrote:
>>
>>> Hi.
>>>
>>> I think that the "unsafeAssassinateEndpoint" was the right solution here.
>>> I was going to lead you to this solution after reading the first part of
>>> your message.
>>>
>>> "Does anyone know why the dead nodes still appear when we run "nodetool
>>> gossipinfo" but they don't when we run "describe cluster" from the CLI?"
>>>
>>> That's a good thing. The Gossiper just keeps this information for a while (7
>>> or 10 days by default, off the top of my head). This doesn't harm your
>>> cluster in any way, whereas having "UNREACHABLE" nodes could have been
>>> annoying. By the way, gossipinfo shows you those nodes as "STATUS:LEFT",
>>> which is good. I am quite sure that this status changed when you used the
>>> JMX "unsafeAssassinateEndpoint".
>>>
>>> "do a full cluster restart (I presume that means a rolling restart - not
>>> shut-down the entire cluster right???). "
>>>
>>> A full restart => entire cluster down => down time. It is precisely
>>> *not* a rolling restart.
>>>
>>> To conclude, I would say that your cluster seems healthy now (from what I
>>> can see): you have no more ghost nodes and nothing to do. Just wait a week
>>> or so and check gossipinfo again.
>>>
>>>
>>> 2013/5/22 Vasileios Vlachos <vasileiosvlachos@gmail.com>
>>>
>>>> Hello All,
>>>>
>>>> A while ago we had 3 cassandra nodes on Amazon. At some point we
>>>> decided to buy some servers and deploy cassandra there. The problem is that
>>>> since then we have a list of dead IPs listed as UNREACHABLE nodes when we
>>>> run describe cluster on cassandra-cli.
>>>>
>>>> I have seen other posts which describe similar issues, and the bottom
>>>> line is "it's harmless but if you want to get rid of it do a full cluster
>>>> restart" (I presume that means a rolling restart - not shut-down the entire
>>>> cluster right???). Anyway...
>>>>
>>>> We also came across another solution: Install "libmx4j-java", uncomment
>>>> the respective line on "/etc/default/cassandra", restart the node, go to
>>>> "http://cassandra_node:8081/mbean?objectname=org.apache.cassandra.net%3Atype%3DGossiper",
>>>> type in the dead IP/IPs next to the "unsafeAssassinateEndpoint" and invoke
>>>> it. So we did that on one of the nodes for the list of dead IPs. After
>>>> running "describe cluster" on the CLI on every node, we noticed that there
>>>> were no UNREACHABLE nodes and everything looked OK.
>>>>
>>>> However, when we run "nodetool gossipinfo" we get the following output:
>>>>
>>>> /10.1.32.97
>>>> RELEASE_VERSION:1.0.11
>>>> SCHEMA:b1116df0-b3dd-11e2-0000-16fe4da5dbff
>>>> LOAD:2.76851457173E11
>>>> RPC_ADDRESS:0.0.0.0
>>>> STATUS:NORMAL,56713727820156410577229101238628035243
>>>> /10.128.16.111
>>>> REMOVAL_COORDINATOR:REMOVER,113427455640312821154458202477256070486
>>>> STATUS:LEFT,42537039300520238181471502256297362072,1369471488145
>>>> /10.128.16.110
>>>> REMOVAL_COORDINATOR:REMOVER,1
>>>> STATUS:LEFT,42537092606577173116506557155915918934,1369471275829
>>>> /10.1.32.100
>>>> RELEASE_VERSION:1.0.11
>>>> SCHEMA:b1116df0-b3dd-11e2-0000-16fe4da5dbff
>>>> LOAD:2.75649392881E11
>>>> RPC_ADDRESS:0.0.0.0
>>>> STATUS:NORMAL,85070591730234615865843651857942052863
>>>> /10.1.32.101
>>>> RELEASE_VERSION:1.0.11
>>>> SCHEMA:b1116df0-b3dd-11e2-0000-16fe4da5dbff
>>>> LOAD:2.71158702006E11
>>>> RPC_ADDRESS:0.0.0.0
>>>> STATUS:NORMAL,141784319550391026443072753096570088105
>>>> /10.1.32.98
>>>> RELEASE_VERSION:1.0.11
>>>> SCHEMA:b1116df0-b3dd-11e2-0000-16fe4da5dbff
>>>> LOAD:2.73163150773E11
>>>> RPC_ADDRESS:0.0.0.0
>>>> STATUS:NORMAL,113427455640312821154458202477256070486
>>>> /10.128.16.112
>>>> REMOVAL_COORDINATOR:REMOVER,1
>>>> STATUS:LEFT,42537092606577173116506557155915918934,1369471567719
>>>> /10.1.32.99
>>>> RELEASE_VERSION:1.0.11
>>>> SCHEMA:b1116df0-b3dd-11e2-0000-16fe4da5dbff
>>>> LOAD:2.72271268395E11
>>>> RPC_ADDRESS:0.0.0.0
>>>> STATUS:NORMAL,28356863910078205288614550619314017621
>>>> /10.1.32.96
>>>> RELEASE_VERSION:1.0.11
>>>> SCHEMA:b1116df0-b3dd-11e2-0000-16fe4da5dbff
>>>> LOAD:2.71494331357E11
>>>> RPC_ADDRESS:0.0.0.0
>>>> STATUS:NORMAL,0
>>>>
>>>> Does anyone know why the dead nodes still appear when we run "nodetool
>>>> gossipinfo" but they don't when we run "describe cluster" from the CLI?
>>>>
>>>> Thank you in advance for your help,
>>>>
>>>> Vasilis
>>>>
>>>
>>>
>>
>
