cassandra-commits mailing list archives

From "Didier (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-10371) Decommissioned nodes can remain in gossip
Date Tue, 22 Dec 2015 14:40:46 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068186#comment-15068186 ]

Didier commented on CASSANDRA-10371:
------------------------------------

Hi Stefania,

You are perfectly right! I had just fixed my issue when you wrote your answer. My problem was
that a lot of nodes were in fact affected by this mess, not just one (multi-DC cluster, Europe / US).



I set up these entries in log4j-server.properties on one node:

{code}
log4j.logger.org.apache.cassandra.gms.GossipDigestSynVerbHandler=TRACE
log4j.logger.org.apache.cassandra.gms.FailureDetector=TRACE
{code}

With this trick I found the culprit nodes by simply tailing system.log and filtering on the
phantom node's IP (192.168.136.28 in this example). I just ran:

tail -f system.log | grep "TRACE" | grep -A 10 -B 10 "192.168.136.28"

{code}
TRACE [GossipStage:1] 2015-12-22 14:25:10,262 GossipDigestSynVerbHandler.java (line 40) Received a GossipDigestSynMessage from /10.0.2.110
TRACE [GossipStage:1] 2015-12-22 14:25:10,262 GossipDigestSynVerbHandler.java (line 71) Gossip syn digests are : /10.10.102.97:1448271725:7650177 /10.10.2.23:1450793863:1377 /10.0.102.190:1448275278:7636527 /10.0.2.36:1450792729:4816 /192.168.136.28:1449485228:258388
{code}
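
For anyone who has many nodes to check, the grep above could be automated. Below is a minimal Python sketch, not something I ran on the cluster. It assumes the two TRACE line shapes shown in the sample above, one log entry per line as in the real system.log, and IPv4 addresses only. The function name find_gossiping_senders and the regexes are my own illustration, not part of Cassandra.

```python
import re

# Both patterns are assumptions derived from the TRACE sample above.
SENDER_RE = re.compile(r"Received a GossipDigestSynMessage from /(\S+)")
# One digest looks like /<ipv4>:<generation>:<max version>
DIGEST_RE = re.compile(r"/(\d{1,3}(?:\.\d{1,3}){3}):(\d+):(\d+)")


def find_gossiping_senders(lines, phantom_ip):
    """Return the set of sender IPs whose syn digests mention phantom_ip."""
    senders = set()
    current_sender = None
    for line in lines:
        match = SENDER_RE.search(line)
        if match:
            # Remember who sent the syn; its digest line is expected next.
            current_sender = match.group(1)
            continue
        if "Gossip syn digests are" in line and current_sender is not None:
            digest_ips = {ip for ip, _gen, _ver in DIGEST_RE.findall(line)}
            if phantom_ip in digest_ips:
                senders.add(current_sender)
            current_sender = None
    return senders


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python find_phantom.py system.log 192.168.136.28
    with open(sys.argv[1]) as log:
        for sender in sorted(find_gossiping_senders(log, sys.argv[2])):
            print(sender)
```

Comparing whole addresses rather than substrings means 192.168.136.28 does not also match an address such as 192.168.136.280, which a plain grep on the IP would.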

Every time I found a phantom node IP in the gossip syn digests, I ran this on the affected
node, that is, the one that sent the digests (10.0.2.110 in this example):

{code}
nodetool drain && /etc/init.d/cassandra restart
{code}

After doing this on 15 nodes, I checked whether system.log still showed any entries for the
phantom nodes... and voilà! No more phantom nodes.

Thanks for your help ;)

Didier

> Decommissioned nodes can remain in gossip
> -----------------------------------------
>
>                 Key: CASSANDRA-10371
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10371
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Distributed Metadata
>            Reporter: Brandon Williams
>            Assignee: Stefania
>            Priority: Minor
>
> This may apply to other dead states as well.  Dead states should be expired after 3 days.
> In the case of decom we attach a timestamp to let the other nodes know when it should be
> expired.  It has been observed that sometimes a subset of nodes in the cluster never expire
> the state, and through heap analysis of these nodes it is revealed that the epstate.isAlive
> check returns true when it should return false, which would allow the state to be evicted.
> This may have been affected by CASSANDRA-8336.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
