cassandra-commits mailing list archives

From "Jason Harvey (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-14424) Gossip EchoMessages not being handled somewhere after node restart
Date Sat, 28 Apr 2018 00:21:00 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-14424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Harvey updated CASSANDRA-14424:
-------------------------------------
    Description: 
Noticing this behaviour on a brand new 3.11.2 ring:
 # Restart a random node in the ring.
 # When that node comes back up, around 30% of the time it sees a single other node as down. No other node in the ring sees that node as down.
 # After 10-20 minutes, the DOWN node suddenly appears UP to the restarted node.

 

After digging through tracing logs, here's what I know:

 

The node seen as DOWN has not gone down, but simply hasn't been seen as UP yet. The restarted
node is attempting to `markAlive()` the target node. Relevant logs from the restarted node's
POV:

 

{{INFO [GossipStage:1] 2018-04-27 14:03:50,950 Gossiper.java:1053 - Node /10.0.225.147 has restarted, now UP}}
{{INFO [GossipStage:1] 2018-04-27 14:03:50,969 StorageService.java:2292 - Node /10.0.225.147 state jump to NORMAL}}
{{INFO [HANDSHAKE-/10.0.225.147] 2018-04-27 14:03:50,976 OutboundTcpConnection.java:560 - Handshaking version with /10.0.225.147}}
{{INFO [GossipStage:1] 2018-04-27 14:03:50,977 TokenMetadata.java:479 - Updating topology for /10.0.225.147}}
{{INFO [GossipStage:1] 2018-04-27 14:03:50,977 TokenMetadata.java:479 - Updating topology for /10.0.225.147}}

 

(Note that although gossip logged the DOWN node as 'UP', nodetool status still shows it as 'DOWN': markAlive has not completed, and the node will not actually be seen as 'UP' for 20 more minutes.)

 

The restarted node is repeatedly sending Echo messages to the DOWN node as part of the `markAlive()` call. The DOWN node receives them and claims to send a response; however, the restarted node never marks the DOWN node UP, even after the Echo response is sent.
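
For reference, here is a minimal sketch of the liveness flow as I understand it from the 3.11 source. The names mirror the code paths in the logs (Gossiper's markAlive/realMarkAlive and the ECHO verb), but this is my own simplified, self-contained model, not the actual implementation. The key point: liveness only flips inside the callback that runs when the Echo response is handled, so a response that is sent but never handled leaves the peer DOWN indefinitely.

{code:java}
import java.net.InetAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal stand-in for the messaging layer: deliver an ECHO and invoke the
// callback if and when the peer's REQUEST_RESPONSE comes back.
interface EchoTransport
{
    void sendEcho(InetAddress to, Runnable onResponse);
}

class GossiperModel
{
    private final Map<InetAddress, Boolean> alive = new ConcurrentHashMap<>();
    private final EchoTransport transport;

    GossiperModel(EchoTransport transport)
    {
        this.transport = transport;
    }

    // markAlive() does not flip liveness itself: it sends an ECHO and
    // registers a callback, and only that callback marks the endpoint UP.
    // This is why "has restarted, now UP" can be logged while nodetool
    // status keeps showing the node DOWN.
    void markAlive(InetAddress peer)
    {
        alive.put(peer, false);                        // presumed dead until echoed
        transport.sendEcho(peer, () -> realMarkAlive(peer));
    }

    // Runs on the REQUEST_RESPONSE path; this is where
    // "InetAddress x is now UP" would finally be logged.
    void realMarkAlive(InetAddress peer)
    {
        alive.put(peer, true);
    }

    boolean isAlive(InetAddress peer)
    {
        return alive.getOrDefault(peer, false);
    }
}
{code}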

 

Relevant logs from the restarted node's POV:

 

{{TRACE [GossipStage:1] 2018-04-27 14:11:28,792 MessagingService.java:945 - 10.0.103.45 sending ECHO to 99248@/10.0.225.147}}
{{TRACE [GossipTasks:1] 2018-04-27 14:11:29,792 MessagingService.java:945 - 10.0.103.45 sending GOSSIP_DIGEST_SYN to 99631@/10.0.225.147}}
{{TRACE [GossipStage:1] 2018-04-27 14:11:29,792 MessagingService.java:945 - 10.0.103.45 sending ECHO to 99632@/10.0.225.147}}
{{TRACE [GossipStage:1] 2018-04-27 14:11:29,793 MessagingService.java:945 - 10.0.103.45 sending GOSSIP_DIGEST_ACK2 to 99633@/10.0.225.147}}
{{TRACE [GossipStage:1] 2018-04-27 14:11:29,793 MessagingService.java:945 - 10.0.103.45 sending ECHO to 99635@/10.0.225.147}}
{{TRACE [GossipStage:1] 2018-04-27 14:11:31,794 MessagingService.java:945 - 10.0.103.45 sending ECHO to 100348@/10.0.225.147}}
{{TRACE [GossipStage:1] 2018-04-27 14:11:33,750 MessagingService.java:945 - 10.0.103.45 sending ECHO to 101157@/10.0.225.147}}
{{TRACE [GossipStage:1] 2018-04-27 14:11:35,412 MessagingService.java:945 - 10.0.103.45 sending ECHO to 101753@/10.0.225.147}}

 

Relevant logs from the DOWN node's POV:

 

{{TRACE [GossipStage:1] 2018-04-27 14:18:16,500 EchoVerbHandler.java:39 - Sending a EchoMessage reply /10.0.103.45}}
{{TRACE [GossipStage:1] 2018-04-27 14:18:16,500 MessagingService.java:945 - 10.0.225.147 sending REQUEST_RESPONSE to 328389@/10.0.103.45}}
{{TRACE [GossipStage:1] 2018-04-27 14:18:17,679 EchoVerbHandler.java:39 - Sending a EchoMessage reply /10.0.103.45}}
{{TRACE [GossipStage:1] 2018-04-27 14:18:17,679 MessagingService.java:945 - 10.0.225.147 sending REQUEST_RESPONSE to 329412@/10.0.103.45}}
{{TRACE [GossipStage:1] 2018-04-27 14:18:18,680 EchoVerbHandler.java:39 - Sending a EchoMessage reply /10.0.103.45}}
{{TRACE [GossipStage:1] 2018-04-27 14:18:18,680 MessagingService.java:945 - 10.0.225.147 sending REQUEST_RESPONSE to 330185@/10.0.103.45}}

 

The metrics on the restarted node show that the MessagingService has a large number of TimeoutsPerHost for the DOWN node, while all other nodes have 0 timeouts.
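
For anyone who wants to check those counters on their own ring, here is a rough sketch of pulling them over JMX. I'm assuming the MBean name org.apache.cassandra.net:type=MessagingService and its TimeoutsPerHost attribute as they exist in 3.11, plus the default JMX port 7199; adjust to taste.

{code:java}
import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class TimeoutsPerHostDump
{
    public static void main(String[] args) throws Exception
    {
        String host = args.length > 0 ? args[0] : "localhost";

        // Default Cassandra JMX endpoint; change the port if yours differs.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");

        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try
        {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            ObjectName name = new ObjectName("org.apache.cassandra.net:type=MessagingService");

            @SuppressWarnings("unchecked")
            Map<String, Integer> timeouts =
                (Map<String, Integer>) mbs.getAttribute(name, "TimeoutsPerHost");

            // One peer with a large count while every other peer sits at 0
            // matches the behaviour described above.
            timeouts.forEach((peer, count) -> System.out.println(peer + " -> " + count));
        }
        finally
        {
            jmxc.close();
        }
    }
}
{code}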

 

 

Eventually, the restarted node finally sees the DOWN node come up, and it spams several UP messages when this happens:

 

{{INFO [RequestResponseStage-7] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
{{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
{{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
{{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
{{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
{{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
{{INFO [RequestResponseStage-12] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
{{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
{{INFO [RequestResponseStage-12] 2018-04-27 14:19:27,211 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}

 

A tcpdump shows no packet loss or other oddities between these two hosts. The restarted node
is sending Echo messages, and the DOWN node is ACKing them.

 

The fact that a burst comes through at the very end suggests to me that perhaps the Echo messages
are getting queued up somewhere.
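
To make that concrete, here is a toy model (entirely my own illustration, not Cassandra code) of the behaviour I suspect: responses accumulate behind something wedged, then the whole backlog drains at once, producing minutes of silence followed by a burst of near-identical "is now UP" lines, exactly like the log above.

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

public class BacklogBurst
{
    public static void main(String[] args)
    {
        Queue<Runnable> backlog = new ArrayDeque<>();

        // Echo responses trickle in over many minutes, but whatever would
        // run the "now UP" callback is wedged, so they only accumulate.
        for (int i = 1; i <= 9; i++)
        {
            int n = i;
            backlog.add(() -> System.out.println(
                "InetAddress /10.0.225.147 is now UP (response " + n + ")"));
        }

        // The wedge finally clears, and the entire backlog drains at once:
        // a burst of "is now UP" lines with timestamps milliseconds apart.
        Runnable r;
        while ((r = backlog.poll()) != null)
            r.run();
    }
}
{code}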

 

The issue happens roughly 30% of the time a given node in the ring is restarted.


  was:
Noticing this behaviour on a brand new 3.11.2 ring:
 # Restart a random node in the ring.
 # When that node comes back up, around 30% of the time it sees a single other node as down. No other node in the ring sees that node as down.
 # After 10-20 minutes, the DOWN node suddenly appears UP to the restarted node.

 

After digging through tracing logs, here's what I know:

 

The node seen as DOWN has not gone down, but simply hasn't been seen as UP yet. The restarted
node is attempting to `markAlive()` the target node. Relevant logs from the restarted node's
POV:

 

{{INFO [GossipStage:1] 2018-04-27 14:03:50,950 Gossiper.java:1053 - Node /10.0.225.147 has restarted, now UP}}
{{INFO [GossipStage:1] 2018-04-27 14:03:50,969 StorageService.java:2292 - Node /10.0.225.147 state jump to NORMAL}}
{{INFO [HANDSHAKE-/10.0.225.147] 2018-04-27 14:03:50,976 OutboundTcpConnection.java:560 - Handshaking version with /10.0.225.147}}
{{INFO [GossipStage:1] 2018-04-27 14:03:50,977 TokenMetadata.java:479 - Updating topology for /10.0.225.147}}
{{INFO [GossipStage:1] 2018-04-27 14:03:50,977 TokenMetadata.java:479 - Updating topology for /10.0.225.147}}

 

(Note that although gossip logged the DOWN node as 'UP', nodetool status still shows it as 'DOWN': markAlive has not completed, and the node will not actually be seen as 'UP' for 20 more minutes.)

 

The restarted node is repeatedly sending Echo messages to the DOWN node as part of the `markAlive()` call. The DOWN node receives them and claims to send a response; however, the restarted node never marks the DOWN node UP, even after the Echo response is sent.

 

Relevant logs from the restarted node's POV:

 

{{TRACE [GossipStage:1] 2018-04-27 14:11:28,792 MessagingService.java:945 - cass-r2-bravo-0d090fef242998a23.reddit.ue1.snooguts.net/10.0.103.45 sending ECHO to 99248@/10.0.225.147}}
{{TRACE [GossipTasks:1] 2018-04-27 14:11:29,792 MessagingService.java:945 - cass-r2-bravo-0d090fef242998a23.reddit.ue1.snooguts.net/10.0.103.45 sending GOSSIP_DIGEST_SYN to 99631@/10.0.225.147}}
{{TRACE [GossipStage:1] 2018-04-27 14:11:29,792 MessagingService.java:945 - cass-r2-bravo-0d090fef242998a23.reddit.ue1.snooguts.net/10.0.103.45 sending ECHO to 99632@/10.0.225.147}}
{{TRACE [GossipStage:1] 2018-04-27 14:11:29,793 MessagingService.java:945 - cass-r2-bravo-0d090fef242998a23.reddit.ue1.snooguts.net/10.0.103.45 sending GOSSIP_DIGEST_ACK2 to 99633@/10.0.225.147}}
{{TRACE [GossipStage:1] 2018-04-27 14:11:29,793 MessagingService.java:945 - cass-r2-bravo-0d090fef242998a23.reddit.ue1.snooguts.net/10.0.103.45 sending ECHO to 99635@/10.0.225.147}}
{{TRACE [GossipStage:1] 2018-04-27 14:11:31,794 MessagingService.java:945 - cass-r2-bravo-0d090fef242998a23.reddit.ue1.snooguts.net/10.0.103.45 sending ECHO to 100348@/10.0.225.147}}
{{TRACE [GossipStage:1] 2018-04-27 14:11:33,750 MessagingService.java:945 - cass-r2-bravo-0d090fef242998a23.reddit.ue1.snooguts.net/10.0.103.45 sending ECHO to 101157@/10.0.225.147}}
{{TRACE [GossipStage:1] 2018-04-27 14:11:35,412 MessagingService.java:945 - cass-r2-bravo-0d090fef242998a23.reddit.ue1.snooguts.net/10.0.103.45 sending ECHO to 101753@/10.0.225.147}}

 

Relevant logs from the DOWN node's POV:

 

{{TRACE [GossipStage:1] 2018-04-27 14:18:16,500 EchoVerbHandler.java:39 - Sending a EchoMessage reply /10.0.103.45}}
{{TRACE [GossipStage:1] 2018-04-27 14:18:16,500 MessagingService.java:945 - cass-r2-bravo-01a9db9f868b4130d.reddit.ue1.snooguts.net/10.0.225.147 sending REQUEST_RESPONSE to 328389@/10.0.103.45}}
{{TRACE [GossipStage:1] 2018-04-27 14:18:17,679 EchoVerbHandler.java:39 - Sending a EchoMessage reply /10.0.103.45}}
{{TRACE [GossipStage:1] 2018-04-27 14:18:17,679 MessagingService.java:945 - cass-r2-bravo-01a9db9f868b4130d.reddit.ue1.snooguts.net/10.0.225.147 sending REQUEST_RESPONSE to 329412@/10.0.103.45}}
{{TRACE [GossipStage:1] 2018-04-27 14:18:18,680 EchoVerbHandler.java:39 - Sending a EchoMessage reply /10.0.103.45}}
{{TRACE [GossipStage:1] 2018-04-27 14:18:18,680 MessagingService.java:945 - cass-r2-bravo-01a9db9f868b4130d.reddit.ue1.snooguts.net/10.0.225.147 sending REQUEST_RESPONSE to 330185@/10.0.103.45}}

 

The metrics on the restarted node show that the MessagingService has a large number of TimeoutsPerHost for the DOWN node, while all other nodes have 0 timeouts.

 

Eventually, the restarted node finally sees the DOWN node come up, and it spams several UP messages when this happens:

 

{{INFO [RequestResponseStage-7] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
{{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
{{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
{{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
{{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
{{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
{{INFO [RequestResponseStage-12] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
{{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
{{INFO [RequestResponseStage-12] 2018-04-27 14:19:27,211 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}

 

A tcpdump shows no packet loss or other oddities between these two hosts. The restarted node
is sending Echo messages, and the DOWN node is ACKing them.

 

The fact that a burst comes through at the very end suggests to me that perhaps the Echo messages
are getting queued up somewhere.

 

The issue happens roughly 30% of the time a given node in the ring is restarted.


> Gossip EchoMessages not being handled somewhere after node restart
> ------------------------------------------------------------------
>
>                 Key: CASSANDRA-14424
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14424
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: cassandra 3.11.2 - brand new ring.
> ubuntu 16.04
> AWS
>            Reporter: Jason Harvey
>            Priority: Major
>             Fix For: 3.11.x, 4.x
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


