cassandra-commits mailing list archives

From "Didier (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-10371) Decommissioned nodes can remain in gossip
Date Mon, 21 Dec 2015 16:42:47 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15066681#comment-15066681
] 

Didier commented on CASSANDRA-10371:
------------------------------------

Hi Stefania,

Thanks for your quick answer.

Here is the TRACE log for the phantom node 192.168.128.28:

3614313:TRACE [GossipStage:2] 2015-12-21 17:21:19,984 Gossiper.java (line 1155) requestAll
for /192.168.128.28
3616877:TRACE [GossipStage:2] 2015-12-21 17:21:20,123 FailureDetector.java (line 205) reporting
/192.168.128.28
3616881:TRACE [GossipStage:2] 2015-12-21 17:21:20,124 Gossiper.java (line 986) Adding endpoint
state for /192.168.128.28
3616892:DEBUG [GossipStage:2] 2015-12-21 17:21:20,124 Gossiper.java (line 999) Not marking
/192.168.128.28 alive due to dead state
3616897:TRACE [GossipStage:2] 2015-12-21 17:21:20,125 Gossiper.java (line 958) marking as
down /192.168.128.28
3616908: INFO [GossipStage:2] 2015-12-21 17:21:20,125 Gossiper.java (line 962) InetAddress
/192.168.128.28 is now DOWN
3616912:DEBUG [GossipStage:2] 2015-12-21 17:21:20,126 MessagingService.java (line 397) Resetting
pool for /192.168.128.28
3616937:DEBUG [GossipStage:2] 2015-12-21 17:21:20,128 StorageService.java (line 1370) Ignoring
state change for dead or unknown endpoint: /192.168.128.28
3616955:DEBUG [GossipStage:2] 2015-12-21 17:21:20,128 StorageService.java (line 1370) Ignoring
state change for dead or unknown endpoint: /192.168.128.28
3616956:DEBUG [GossipStage:2] 2015-12-21 17:21:20,129 StorageService.java (line 1370) Ignoring
state change for dead or unknown endpoint: /192.168.128.28
3616958:DEBUG [GossipStage:2] 2015-12-21 17:21:20,129 StorageService.java (line 1370) Ignoring
state change for dead or unknown endpoint: /192.168.128.28
3616976:DEBUG [GossipStage:2] 2015-12-21 17:21:20,129 StorageService.java (line 1370) Ignoring
state change for dead or unknown endpoint: /192.168.128.28
3616977:DEBUG [GossipStage:2] 2015-12-21 17:21:20,130 StorageService.java (line 1370) Ignoring
state change for dead or unknown endpoint: /192.168.128.28
3616979:DEBUG [GossipStage:2] 2015-12-21 17:21:20,130 StorageService.java (line 1370) Ignoring
state change for dead or unknown endpoint: /192.168.128.28
3616992:DEBUG [GossipStage:2] 2015-12-21 17:21:20,130 StorageService.java (line 1370) Ignoring
state change for dead or unknown endpoint: /192.168.128.28
3616993:DEBUG [GossipStage:2] 2015-12-21 17:21:20,131 StorageService.java (line 1370) Ignoring
state change for dead or unknown endpoint: /192.168.128.28
3616995:DEBUG [GossipStage:2] 2015-12-21 17:21:20,131 StorageService.java (line 1370) Ignoring
state change for dead or unknown endpoint: /192.168.128.28
3617008:DEBUG [GossipStage:2] 2015-12-21 17:21:20,131 StorageService.java (line 1370) Ignoring
state change for dead or unknown endpoint: /192.168.128.28
3617317:DEBUG [GossipStage:2] 2015-12-21 17:21:20,143 StorageService.java (line 1699) Node
/192.168.128.28 state left, tokens [100310405581336885248896672411729131592, ....... , 99937615223192795414082780446763257757,
99975703478103230193804512094895677044]
3617321:DEBUG [GossipStage:2] 2015-12-21 17:21:20,144 Gossiper.java (line 1463) adding expire
time for endpoint : /192.168.128.28 (1449830784335)
3617337: INFO [GossipStage:2] 2015-12-21 17:21:20,145 StorageService.java (line 1781) Removing
tokens [100310405581336885248896672411729131592, 100598580285540169800869916837708042668,
....., 99743016911284542884064313061048682083, 99937615223192795414082780446763257757, 99975703478103230193804512094895677044]
for /192.168.128.28
3617362:DEBUG [GossipStage:2] 2015-12-21 17:21:20,146 MessagingService.java (line 795) Resetting
version for /192.168.128.28
3617367:DEBUG [GossipStage:2] 2015-12-21 17:21:20,147 Gossiper.java (line 410) removing endpoint
/192.168.128.28
3631829:TRACE [GossipTasks:1] 2015-12-21 17:21:20,964 Gossiper.java (line 492) Gossip Digests
are : /10.10.102.96:1448271659:7409547 /10.0.102.190:1448275278:7395730 /10.10.102.94:1448271818:7409091
/192.168.128.23:1450707984:20939 /10.10.102.8:1448271443:7409972 /10.0.2.97:1448276012:7395072
/10.0.102.93:1448274183:7401036 /192.168.136.26:1450708061:20700 /192.168.136.23:1450708062:20695
/10.10.2.239:1448533274:6614346 /10.0.102.206:1448273613:7402527 /10.0.102.92:1448274024:7401356
/10.0.2.143:1448275597:7396779 /10.10.2.11:1448270678:7412474 /10.10.2.145:1448271264:7410576
/192.168.128.32:1449151772:4740947 /10.0.2.5:1449149504:4746745 /192.168.128.26:1450707983:20947
/192.168.136.22:1450708061:20700 /10.0.102.94:1448274372:7400487 /10.0.2.109:1448276688:7393112
/10.10.2.18:1448271203:7410982 /10.10.102.49:1448271974:7408616 /10.10.102.192:1448271561:7409839
/192.168.128.31:1449151700:4741174 /10.0.102.90:1448273911:7401771 /192.168.128.21:1450714541:1013
/10.0.102.138:1448273504:7402737 /10.0.2.107:1448276554:7393892 /10.0.2.105:1448276464:7393834
/10.10.2.10:1448270541:7412796 /10.10.2.13:1448270948:7411786 /10.10.102.95:1448271895:7408758
/192.168.128.30:1450427261:872385 /10.0.2.142:1448275345:7397252 /10.0.102.113:1448274816:7398949
/10.10.102.97:1448271725:7409279 /10.10.2.23:1448271352:7410212 /192.168.136.21:1450708063:20699
/192.168.136.25:1450708061:20699 /192.168.136.24:1450708064:20688 /10.0.2.110:1448276759:7393030
/192.168.128.25:1450707984:20942 /10.0.102.125:1448275195:7397877 /10.0.2.36:1448276280:7394606
/10.10.2.4:1448271033:7410975 /10.0.2.4:1448275709:7396295 /192.168.128.28:1449485330:259526
/10.10.102.66:1448271505:7409736 /192.168.128.22:1450707985:20936 /10.10.102.29:1448951289:5348480
/10.10.2.121:1448271104:7410985 /10.0.2.108:1448276619:7393387 /10.0.102.247:1448275119:7398016
/10.0.2.226:1448276163:7394860 /10.0.102.95:1448274450:7400161 /192.168.128.29:1449151797:4740847
/10.0.102.32:1448274522:7398608 /10.0.102.88:1448273810:7402146 /10.0.2.166:1448276372:7394409
/10.10.102.38:1448961691:5316954 /192.168.128.24:1450707985:20932
3632204:DEBUG [GossipTasks:1] 2015-12-21 17:21:20,983 Gossiper.java (line 741) time is expiring
for endpoint : /192.168.128.28 (1449830784335)
3632208:DEBUG [GossipTasks:1] 2015-12-21 17:21:20,985 Gossiper.java (line 383) evicting /192.168.128.28
from gossip
3832305:TRACE [ReadStage:319] 2015-12-21 17:21:08,855 ColumnFamilyStore.java (line 1652) scanned
192.168.128.28
3853098:TRACE [ReadStage:322] 2015-12-21 17:21:09,978 ColumnFamilyStore.java (line 1652) scanned
192.168.128.28
3973963:DEBUG [GossipTasks:1] 2015-12-21 17:21:05,096 Gossiper.java (line 755) 60000 elapsed,
/192.168.128.28 gossip quarantine over


I can see the culprit IPs in the GossipDigestSynVerbHandler: 192.168.128.28 and 192.168.136.28
(the two others, 192.168.128.27 and 192.168.136.27, are missing).
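
To make the "Gossip Digests are" trace line easier to read, here is a minimal Python sketch that splits one entry into its parts. It assumes each entry has the form /endpoint:generation:maxVersion and that the generation is a Unix timestamp in seconds; both are assumptions about the log format, not something confirmed in this ticket. The function name parse_digest is hypothetical.

```python
from datetime import datetime, timezone


def parse_digest(entry: str):
    """Split one digest entry into (endpoint, generation, max_version).

    Assumed layout (not confirmed here): /endpoint:generation:maxVersion
    """
    # Split from the right so the endpoint part is left untouched.
    endpoint, generation, max_version = entry.rsplit(":", 2)
    return endpoint.lstrip("/"), int(generation), int(max_version)


# Entry for the phantom node, copied from the trace line above.
endpoint, generation, max_version = parse_digest("/192.168.128.28:1449485330:259526")

# Assumption: the generation is epoch seconds.
started = datetime.fromtimestamp(generation, tz=timezone.utc)
print(endpoint, started.isoformat(), max_version)
# prints: 192.168.128.28 2015-12-07T10:48:50+00:00 259526
```

Under those assumptions, the phantom node's generation dates from 7 December 2015, two weeks before this log was captured.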

I have checked system.peers on every node in each DC of our cluster, and none of these
IPs is still present. NTP seems to be OK and we have no clock desynchronisation.

The node 192.168.128.28 is in gossip quarantine, and every n seconds something tries
to remove it without success. The node seems to have reached a time limit (time is expiring
for endpoint : /192.168.128.28 (1449830784335)).
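
The number in parentheses looks like a Unix timestamp in milliseconds. Assuming that is what it is, this small Python sketch converts it to a readable UTC date (the helper name epoch_ms_to_utc is hypothetical):

```python
from datetime import datetime, timedelta, timezone


def epoch_ms_to_utc(ms: int) -> datetime:
    """Convert epoch milliseconds to an aware UTC datetime.

    Integer arithmetic is used so no float rounding is involved.
    """
    seconds, millis = divmod(ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


# Expire time copied from the "time is expiring for endpoint" log line.
expire_at = epoch_ms_to_utc(1449830784335)
print(expire_at.isoformat())
# prints: 2015-12-11T10:46:24.335000+00:00
```

If the assumption holds, the expire time was already about ten days in the past when the log above was written on 2015-12-21.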

I have tried assassinating it via JMX and a rolling restart of one DC (we have 4 DCs in
this cluster). I also tried JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false", but
nothing worked.

If you have any advice, I'm in!

Best regards,

Didier



> Decommissioned nodes can remain in gossip
> -----------------------------------------
>
>                 Key: CASSANDRA-10371
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10371
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Distributed Metadata
>            Reporter: Brandon Williams
>            Assignee: Stefania
>            Priority: Minor
>
> This may apply to other dead states as well.  Dead states should be expired after 3 days.
> In the case of decom we attach a timestamp to let the other nodes know when it should be
> expired.  It has been observed that sometimes a subset of nodes in the cluster never expire
> the state, and through heap analysis of these nodes it is revealed that the epstate.isAlive
> check returns true when it should return false, which would allow the state to be evicted.
> This may have been affected by CASSANDRA-8336.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
