cassandra-commits mailing list archives

From "Brandon Williams (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message
Date Mon, 06 Apr 2015 18:24:13 GMT


Brandon Williams commented on CASSANDRA-8336:

There is one last wrinkle with this: if a bootstrap is started but then aborted by the operator,
the shutdown message makes the node look like part of the ring, because the node gets persisted
to system.peers, which then confuses clients.  I believe the same will happen with an aborted
replace_address, or with any other non-normal state that is aborted and then sends the shutdown
state.  One solution might be to have Gossiper's stop() examine the node's own state and compare
it against the dead states and the joining state to decide whether to send the shutdown state.
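
A minimal sketch of that check, assuming the node's own gossip status is available as a plain
string.  The class name, method name, and status strings here are hypothetical stand-ins for
illustration, not the actual Gossiper code or the attached patches:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical sketch: decides whether a stopping node should announce
 * shutdown over gossip, based only on its own current status.
 */
final class ShutdownAnnouncementPolicy {
    // Illustrative status names (assumed, not taken from the patch).
    static final Set<String> DEAD_STATES =
            new HashSet<>(Arrays.asList("removing", "removed", "LEFT", "hibernate"));
    static final String JOINING_STATE = "BOOT";

    private ShutdownAnnouncementPolicy() {}

    static boolean shouldAnnounceShutdown(String ownStatus) {
        // A node that never published a status has nothing to retract.
        if (ownStatus == null || ownStatus.isEmpty()) {
            return false;
        }
        // An aborted bootstrap must not announce shutdown, otherwise peers
        // would record a node that never became a ring member.
        if (JOINING_STATE.equals(ownStatus)) {
            return false;
        }
        // Dead states are likewise skipped; every other status announces.
        return !DEAD_STATES.contains(ownStatus);
    }
}
```

Under these assumptions, stop() would send the shutdown state only when
shouldAnnounceShutdown(...) returns true for the local node's status.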

> Quarantine nodes after receiving the gossip shutdown message
> ------------------------------------------------------------
>                 Key: CASSANDRA-8336
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Brandon Williams
>            Assignee: Brandon Williams
>             Fix For: 2.0.15
>         Attachments: 8336-v2.txt, 8336-v3.txt, 8336-v4.txt, 8336.txt
> In CASSANDRA-3936 we added a gossip shutdown announcement.  The problem here is that
> this isn't sufficient; you can still get TimedOutExceptions (TOEs) and have to wait on the
> failure detector (FD) to figure things out.  This happens due to gossip propagation time and
> variance; if node X shuts down and sends the message to Y, but Z has a greater gossip version
> than Y for X and has not yet received the message, it can initiate gossip with Y and thus mark
> X alive again.  I propose quarantining to solve this; however, I feel it should be a -D
> parameter you have to specify, so as not to destroy current dev and test practices, since this
> will mean a node that shuts down will not be able to restart until the quarantine expires.
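
The quarantine idea described above can be sketched roughly as follows.  This is an
illustration under stated assumptions, not the attached patch: the class, its methods, and the
property name example.shutdown_quarantine_ms are all hypothetical, and endpoints are plain
strings for brevity.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch: after a shutdown message is received for an endpoint,
 * refuse to mark it alive again until a quarantine window has passed.
 */
final class ShutdownQuarantine {
    private final long quarantineMillis;
    private final Map<String, Long> expiryByEndpoint = new ConcurrentHashMap<>();

    ShutdownQuarantine(long quarantineMillis) {
        this.quarantineMillis = quarantineMillis;
    }

    // Opt-in via a -D system property; 0 (the default) disables quarantining.
    // The property name is made up for this example.
    static ShutdownQuarantine fromSystemProperty() {
        return new ShutdownQuarantine(Long.getLong("example.shutdown_quarantine_ms", 0L));
    }

    // Called when a gossip shutdown message for the endpoint arrives.
    void onShutdownReceived(String endpoint, long nowMillis) {
        if (quarantineMillis > 0) {
            expiryByEndpoint.put(endpoint, nowMillis + quarantineMillis);
        }
    }

    // Called before stale gossip from another peer would mark the endpoint alive.
    boolean mayMarkAlive(String endpoint, long nowMillis) {
        Long expiry = expiryByEndpoint.get(endpoint);
        if (expiry == null) {
            return true;
        }
        if (nowMillis >= expiry) {
            expiryByEndpoint.remove(endpoint);
            return true;
        }
        return false;
    }
}
```

With a window of 1000 ms, a shutdown received for X at t=100 blocks Z's stale state from
reviving X at t=500, and also blocks X itself from rejoining until t=1100, which is the
restart cost mentioned above.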

This message was sent by Atlassian JIRA
