cassandra-commits mailing list archives

From "Richard Low (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-8336) Add shutdown gossip state to prevent timeouts during rolling restarts
Date Mon, 11 May 2015 17:17:00 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538198#comment-14538198
] 

Richard Low commented on CASSANDRA-8336:
----------------------------------------

How big is your largest QA cluster? I did extensive manual tests to verify this fixes the
issue in a large cluster.

> Add shutdown gossip state to prevent timeouts during rolling restarts
> ---------------------------------------------------------------------
>
>                 Key: CASSANDRA-8336
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Brandon Williams
>            Assignee: Brandon Williams
>             Fix For: 2.0.15, 2.1.5
>
>         Attachments: 8336-v2.txt, 8336-v3.txt, 8336-v4.txt, 8336.txt, 8366-v5.txt
>
>
> In CASSANDRA-3936 we added a gossip shutdown announcement. The problem is that this
> isn't sufficient: you can still get TimedOutExceptions (TOEs) and have to wait on the
> failure detector (FD) to figure things out. This happens because of gossip propagation
> time and variance: if node X shuts down and sends the message to Y, but Z has a greater
> gossip version than Y for X and has not yet received the message, Z can initiate gossip
> with Y and thus mark X alive again. I propose quarantining to solve this; however, I
> feel it should be a -D parameter you have to specify, so as not to destroy current dev
> and test practices, since this will mean a node that shuts down will not be able to
> restart until the quarantine expires.
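
The race described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Cassandra's gossip code: the `View` class, the function names, and the version numbers are all hypothetical. It assumes that an out-of-band shutdown notice marks X down without changing the version Y holds for X, and that a gossip merge lets the higher version overwrite local liveness. The second scenario loosely follows the idea in the issue title, recording the shutdown as versioned state so that a stale "alive" view cannot win the merge.

```python
from dataclasses import dataclass


@dataclass
class View:
    """One node's view of endpoint X (hypothetical toy structure)."""
    version: int
    alive: bool


def receive_shutdown_notice(view: View) -> None:
    # Assumption: the announcement marks X down but leaves the version as-is.
    view.alive = False


def record_shutdown_state(view: View, new_version: int) -> None:
    # Assumption: shutdown is stored as ordinary versioned state instead.
    view.version = new_version
    view.alive = False


def gossip_merge(local: View, remote: View) -> None:
    # The view with the higher version overwrites the local one.
    if remote.version > local.version:
        local.version = remote.version
        local.alive = remote.alive


def x_alive_on_y_after_gossip(versioned_shutdown: bool) -> bool:
    y = View(version=5, alive=True)   # Y's view of X
    z = View(version=7, alive=True)   # Z is ahead of Y and missed the notice
    if versioned_shutdown:
        record_shutdown_state(y, new_version=8)
    else:
        receive_shutdown_notice(y)
    gossip_merge(y, z)                # Z initiates gossip with Y
    return y.alive


if __name__ == "__main__":
    print(x_alive_on_y_after_gossip(versioned_shutdown=False))
    print(x_alive_on_y_after_gossip(versioned_shutdown=True))
```

In the first scenario Y ends up believing X is alive again, because Z's newer version overwrites the unversioned "down" mark. In the second, Y's shutdown record carries the highest version, so the merge leaves it in place.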



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
