cassandra-commits mailing list archives

From "Brandon Williams (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message
Date Fri, 20 Feb 2015 18:11:12 GMT


Brandon Williams commented on CASSANDRA-8336:

bq. The shutting down node might as well set the version of the shutdown state to Integer.MAX_VALUE
since receiving nodes will blindly use that.

Well, as I explained in an earlier comment, this isn't really much of an optimization, and
if the nodes receive the RPC first, we have to modify it on the receiver anyway, so it seems
cleaner to reuse markAsShutdown for both.

bq. Why does it increment the generation number? We call Gossiper.instance.start with a new
generation number set to the current time so it would make sense to use that.

Because start calls maybeInitializeLocalState, which won't actually add a heartbeat state
carrying the current-time generation: as the method name says, it only adds the new state if
the gossiper has never been started before (meaning we don't know our own state).
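
To illustrate why the generation passed to start has no effect on an already-initialized node, here is a toy model of "initialize only if absent" semantics. It is a sketch under assumptions, not the Gossiper code: the class LocalStateSketch, its methods, and the use of a plain map of generations are all hypothetical stand-ins.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for the local endpoint state map; not Cassandra's class.
class LocalStateSketch {
    // endpoint name -> generation number
    private final ConcurrentMap<String, Integer> generations = new ConcurrentHashMap<>();

    // Adds the state only if none exists yet, so a later call with a
    // newer generation leaves the existing entry untouched.
    void maybeInitialize(String endpoint, int generation) {
        generations.putIfAbsent(endpoint, generation);
    }

    // Explicitly bumps the generation of an endpoint that is already known.
    void bumpGeneration(String endpoint) {
        generations.computeIfPresent(endpoint, (name, generation) -> generation + 1);
    }

    // Returns -1 when the endpoint is unknown.
    int generationOf(String endpoint) {
        return generations.getOrDefault(endpoint, -1);
    }

    public static void main(String[] args) {
        LocalStateSketch state = new LocalStateSketch();
        state.maybeInitialize("local", 100); // first start: state is added
        state.maybeInitialize("local", 200); // restart: existing state is kept
        System.out.println("after restart: " + state.generationOf("local"));
        state.bumpGeneration("local");       // explicit increment takes effect
        System.out.println("after bump: " + state.generationOf("local"));
    }
}
```

In this model the second maybeInitialize call is a no-op, which is why an explicit increment is needed for peers to see a newer generation.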

bq. If we hit 'Unable to gossip with any seeds' on replace, it shuts down the gossiper. This
throws an AssertionError in addLocalApplicationState since the local epState is null.

Hmm, probably the best thing to do there is change it from stop to stopForLeaving (though
that method needs a better name now) since there's no point in sending shutdown notifications
for a node that isn't a member.

> Quarantine nodes after receiving the gossip shutdown message
> ------------------------------------------------------------
>                 Key: CASSANDRA-8336
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Brandon Williams
>            Assignee: Brandon Williams
>             Fix For: 2.0.13
>         Attachments: 8336-v2.txt, 8336-v3.txt, 8336.txt
> In CASSANDRA-3936 we added a gossip shutdown announcement.  The problem here is that
this isn't sufficient; you can still get TOEs (TimedOutExceptions) and have to wait on the
FD (failure detector) to figure things out.  This happens due to gossip propagation time and
variance; if node X shuts down and sends the message to Y, but Z has a greater gossip version
than Y for X and has not yet received the message, Z can initiate gossip with Y and thus mark
X alive again.  I propose quarantining to solve this; however, I feel it should be a -D
parameter you have to specify, so as not to destroy current dev and test practices, since
this will mean a node that shuts down will not be able to restart until the quarantine expires.
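
The race described above can be reproduced in a toy model. This is a rough sketch under stated assumptions, not the patch: GossipNodeSketch, its methods, the single integer version per endpoint, and the quarantine length are all hypothetical, and real gossip exchanges far more state than this.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of one node's view of its peers; not Cassandra's Gossiper.
final class GossipNodeSketch {
    private final Map<String, Integer> versionOf = new HashMap<>();
    private final Map<String, Boolean> aliveOf = new HashMap<>();
    private final Map<String, Long> quarantinedUntil = new HashMap<>();
    private final long quarantineMillis; // 0 disables quarantine

    GossipNodeSketch(long quarantineMillis) {
        this.quarantineMillis = quarantineMillis;
    }

    // Record an endpoint as alive at the given gossip version.
    void learn(String endpoint, int version) {
        versionOf.put(endpoint, version);
        aliveOf.put(endpoint, true);
    }

    // Shutdown announcement: mark the endpoint down and, if enabled, quarantine it.
    void onShutdownMessage(String endpoint, long now) {
        aliveOf.put(endpoint, false);
        if (quarantineMillis > 0) {
            quarantinedUntil.put(endpoint, now + quarantineMillis);
        }
    }

    // A peer gossips its view of an endpoint; a greater version wins
    // unless the endpoint is still quarantined locally.
    void onGossipFrom(GossipNodeSketch peer, String endpoint, long now) {
        Long until = quarantinedUntil.get(endpoint);
        if (until != null && now < until) {
            return; // quarantined: ignore stale state from peers
        }
        Integer remote = peer.versionOf.get(endpoint);
        int local = versionOf.getOrDefault(endpoint, -1);
        if (remote != null && remote > local) {
            versionOf.put(endpoint, remote);
            aliveOf.put(endpoint, true); // this is the unwanted resurrection
        }
    }

    boolean isAlive(String endpoint) {
        return aliveOf.getOrDefault(endpoint, false);
    }

    public static void main(String[] args) {
        GossipNodeSketch z = new GossipNodeSketch(0);
        z.learn("X", 7); // Z is ahead of Y and has not heard the shutdown

        for (long quarantine : new long[] {0L, 30_000L}) {
            GossipNodeSketch y = new GossipNodeSketch(quarantine);
            y.learn("X", 5);
            y.onShutdownMessage("X", 1_000L);
            y.onGossipFrom(z, "X", 1_001L);
            System.out.println("quarantine=" + quarantine + "ms, X alive on Y: " + y.isAlive("X"));
        }
    }
}
```

The same check that protects Y from Z's stale state also rejects fresh state from X itself until the window closes, which is the restart cost the description mentions.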
