cassandra-commits mailing list archives

From "Joel Knighton (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-10969) long-running cluster sees bad gossip generation when a node restarts
Date Wed, 13 Jan 2016 17:33:40 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15096604#comment-15096604 ]

Joel Knighton edited comment on CASSANDRA-10969 at 1/13/16 5:32 PM:
--------------------------------------------------------------------

I've pushed a patch that uses the local node's time, rather than the stored generation for the remote node, when deciding whether a remote node's generation has jumped too far ahead. The patch implicitly depends on the fact that the generation is initialized from the current time, but we already rely on that assumption elsewhere, so we're no worse off than before the change. Moreover, if the method of generation selection changed, many of the tests would detect the incompatibility.
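
For illustration only, here's a minimal sketch of the old check versus the new one. It assumes the one-year threshold introduced by CASSANDRA-8113 (called MAX_GENERATION_DIFFERENCE below); the class and method names are placeholders, not the actual Gossiper.java code:

{code:java}
// Placeholder sketch of the generation sanity check; not the real Gossiper.java diff.
public class GenerationCheckSketch
{
    // One-year threshold in seconds, following CASSANDRA-8113.
    static final int MAX_GENERATION_DIFFERENCE = 86400 * 365;

    // Old behaviour (roughly): reject if the announced generation is more than a year
    // ahead of the generation we have *stored* for that peer. The stored value dates
    // from the peer's previous start, so on a cluster that has been up for more than
    // a year, every legitimate restart eventually looks invalid.
    static boolean rejectedByStoredGeneration(int storedRemoteGeneration, int remoteGeneration)
    {
        return remoteGeneration > storedRemoteGeneration + MAX_GENERATION_DIFFERENCE;
    }

    // New behaviour: compare against the local node's clock instead, relying on the
    // fact that generations are initialized from the current time.
    static boolean rejectedByLocalTime(int remoteGeneration)
    {
        int localTime = (int) (System.currentTimeMillis() / 1000);
        return remoteGeneration > localTime + MAX_GENERATION_DIFFERENCE;
    }

    public static void main(String[] args)
    {
        int now = (int) (System.currentTimeMillis() / 1000);
        int storedTwoYearsAgo = now - 2 * 365 * 86400; // peer last restarted two years ago
        System.out.println(rejectedByStoredGeneration(storedTwoYearsAgo, now)); // true  (spurious rejection)
        System.out.println(rejectedByLocalTime(now));                           // false (restart accepted)
    }
}
{code}

In this formulation a restart is only rejected if the announced generation is more than a year ahead of the local clock, which is far looser than the clock synchronization we already assume.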

It is clear that using the stored generation doesn't work for long-running clusters, and we have no closer approximation of the remote node's time than our own local time. Since we already depend on clocks being reasonably well synchronized across the cluster, this seems safe enough to me.
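
To make the failure mode concrete, the snippet below plugs in the timestamps from the report quoted at the bottom of this message; the gap works out to about 420 days, which trips the one-year threshold even though the restart is legitimate:

{code:java}
// Arithmetic behind the failure, using the generation numbers from the error message below.
public class GenerationGapExample
{
    public static void main(String[] args)
    {
        long storedGeneration   = 1414613355L; // generation stored locally for the peer (its previous start, late Oct 2014)
        long receivedGeneration = 1450978722L; // generation gossiped after the peer restarted (late Dec 2015)
        long gapSeconds = receivedGeneration - storedGeneration;
        System.out.println(gapSeconds);         // 36365367
        System.out.println(gapSeconds / 86400); // 420 days, well past the 365-day limit
    }
}
{code}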

Ideally we'd have a better way to prevent this, but I think this issue should be fixed in 2.1/2.2, and this is the only non-intrusive solution that occurs to me.

||branch||testall||dtest||
|[10969-2.1|https://github.com/jkni/cassandra/tree/10969-2.1]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-10969-2.1-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-10969-2.1-dtest]|
|[10969-2.2|https://github.com/jkni/cassandra/tree/10969-2.2]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-10969-2.2-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-10969-2.2-dtest]|
|[10969-3.0|https://github.com/jkni/cassandra/tree/10969-3.0]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-10969-3.0-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-10969-3.0-dtest]|
|[10969-3.3|https://github.com/jkni/cassandra/tree/10969-3.3]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-10969-3.3-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-10969-3.3-dtest]|
|[10969-trunk|https://github.com/jkni/cassandra/tree/10969-trunk]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-10969-trunk-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-10969-trunk-dtest]|

EDIT: CI looks as clean as can be expected.

EDIT 2: Also, I pushed separate branches to get better CI coverage. The 2.1 branch is its own patch; the 2.2 patch should merge cleanly up through 2.2->3.0->3.3->trunk.



> long-running cluster sees bad gossip generation when a node restarts
> --------------------------------------------------------------------
>
>                 Key: CASSANDRA-10969
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10969
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>         Environment: 4-node Cassandra 2.1.1 cluster, each node running on a Linux 2.6.32-431.20.3.dl6.x86_64 VM
>            Reporter: T. David Hudson
>            Assignee: Joel Knighton
>            Priority: Minor
>             Fix For: 3.3, 2.1.x, 2.2.x, 3.0.x
>
>
> One of the nodes in a long-running Cassandra 2.1.1 cluster (not under my control) restarted. The remaining nodes are logging errors like this:
>     "received an invalid gossip generation for peer xxx.xxx.xxx.xxx; local generation = 1414613355, received generation = 1450978722"
> The gap between the local and received generation numbers exceeds the one-year threshold added for CASSANDRA-8113. The system clocks are up-to-date for all nodes.
> If this is a bug, the latest released Gossiper.java code in 2.1.x, 2.2.x, and 3.0.x seems not to have changed the behavior that I'm seeing.
> I presume that restarting the remaining nodes will clear up the problem, whence the minor priority.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
