cassandra-commits mailing list archives

From "Joel Knighton (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-10969) long-running cluster sees bad gossip generation when a node restarts
Date Tue, 12 Jan 2016 18:12:39 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15094418#comment-15094418 ]

Joel Knighton commented on CASSANDRA-10969:
-------------------------------------------

Sorry - I missed your reply.

I suspect the issue was that on restart, N2 first gossiped with N3 or N4, which still held the
old generation. This would have contaminated N2 and put it in the same state as before.

If N4 restarted and first gossiped with N1, it would have received the new generation. The
odds are then much better for N2 or N3 to gossip with a node with the correct generation on
restart.

It now seems clear that rolling restarts will eventually solve the issue, depending on which
node each restarted node gossips with first, but a single rolling restart may not be
sufficient. My apologies if my initial advice caused any pain.

The planned patch will remove the need for a rolling restart in the first place, solving the
issue. I'm testing it now.

Thanks for the detailed reports; they make debugging the issue much easier.

> long-running cluster sees bad gossip generation when a node restarts
> --------------------------------------------------------------------
>
>                 Key: CASSANDRA-10969
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10969
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>         Environment: 4-node Cassandra 2.1.1 cluster, each node running on a Linux 2.6.32-431.20.3.dl6.x86_64 VM
>            Reporter: T. David Hudson
>            Assignee: Joel Knighton
>            Priority: Minor
>
> One of the nodes in a long-running Cassandra 2.1.1 cluster (not under my control) restarted.
> The remaining nodes are logging errors like this:
>     "received an invalid gossip generation for peer xxx.xxx.xxx.xxx; local generation = 1414613355, received generation = 1450978722"
> The gap between the local and received generation numbers exceeds the one-year threshold
> added for CASSANDRA-8113.  The system clocks are up-to-date for all nodes.
> If this is a bug, the latest released Gossiper.java code in 2.1.x, 2.2.x, and 3.0.x seems
> not to have changed the behavior that I'm seeing.
> I presume that restarting the remaining nodes will clear up the problem, whence the minor
> priority.
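
The arithmetic behind the quoted error can be checked with a short sketch. It is not the Gossiper.java code. It assumes that generations are Unix timestamps in seconds and that the "one-year threshold" is 365 days expressed in seconds. The function and constant names are hypothetical.

```python
# Assumption: the one-year threshold is 365 days in seconds.
ONE_YEAR_SECONDS = 365 * 24 * 60 * 60


def generation_gap_exceeds_threshold(local_generation, received_generation,
                                     threshold=ONE_YEAR_SECONDS):
    """Return True if the received generation is more than `threshold`
    seconds ahead of the locally stored generation (hypothetical helper)."""
    return received_generation - local_generation > threshold


# Values taken from the error message in the report.
local_generation = 1414613355
received_generation = 1450978722

gap = received_generation - local_generation
print("gap in seconds:", gap)
print("threshold in seconds:", ONE_YEAR_SECONDS)
print("rejected:", generation_gap_exceeds_threshold(local_generation,
                                                    received_generation))
```

With the reported values the gap is 36365367 seconds (roughly 421 days), which is larger than 31536000, so under these assumptions the remaining nodes would treat the restarted node's new generation as invalid.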



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
