cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chris Goffinet (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-3626) Nodes can get stuck in UP state forever, despite being DOWN
Date Tue, 13 Dec 2011 23:21:30 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Chris Goffinet updated CASSANDRA-3626:
--------------------------------------

             Reviewer: lenn0x
    Affects Version/s: 0.8.8
                       1.0.5
    
> Nodes can get stuck in UP state forever, despite being DOWN
> -----------------------------------------------------------
>
>                 Key: CASSANDRA-3626
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3626
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.8.8, 1.0.5
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>
> This is a proposed phrasing for an upstream ticket named "Newly discovered nodes that
are down get stuck in UP state forever" (will edit w/ feedback until done):
> We have a observed a problem with gossip which, when you are bootstrapping a new node
(or replacing using the replace_token support), any node in the cluster which is Down at the
time the node is started, will be assumed to be Up and then *never ever* flapped back to Down
until you restart the node.
> This has at least two implications to replacing or bootstrapping new nodes when there
are nodes down in the ring:
> * If the new node happens to select a node listed as (UP but in reality is DOWN) as a
stream source, streaming will sit there hanging forever.
> * If that doesn't happen (by picking another host), it will instead finish bootstrapping
correctly, and begin servicing requests all the while thinking DOWN nodes are UP, and thus
routing requests to them, generating timeouts.
> The way to get out of this is to restart the node(s) that you bootstrapped.
> I have tested and confirmed the symptom (that the bootstrapped node things other nodes
are Up) using a fairly recent 1.0. The main debugging effort happened on 0.8 however, so all
details below refer to 0.8 but are probably similar in 1.0.
> Steps to reproduce:
> * Bring up a cluster of >= 3 nodes. *Ensure RF is < N*, so that the cluster is
operative with one node removed.
> * Pick two random nodes A, and B. Shut them *both* off.
> * Wait for everyone to realize they are both off (for good measure).
> * Now, take node A and nuke it's data directories and re-start it, such that it comes
up w/ normal bootstrap (or use replace_token; didn't test that but should not affect it).
> * Watch how node A starts up, all the while believing node B is down, even though all
other nodes in the cluster agree that B is down and B is in fact still turned off.
> The mechanism by which it initially goes into Up state is that the node receives a gossip
response from any other node in the cluster, and GossipDigestAck2VerbHandler.doVerb() calls
Gossiper.applyStateLocally().
> Gossiper.applyStateLocally() doesn't have any local endpoint state for the cluster, so
the else statement at the end ("it's a new node") gets triggered and handleMajorStateChange()
is called. handleMajorStateChange() always calls markAlive(), unless the state is a dead state
(but "dead" here does not mean "not up", but refers to joining/hibernate etc).
> So at this point the node is up in the mind of the node you just bootstrapped.
> Now, in each gossip round doStatusCheck() is called, which iterates over all nodes (including
the one falsly Up) and among other things, calls FailureDetector.interpret() on each node.
> FailureDetector.interpret() is meant to update its sense of Phi for the node, and potentially
convict it. However there is a short-circuit at the top, whereby if we do not yet have any
arrival window for the node, we simply return immediately.
> Arrival intervals are only added as a result of a FailureDetector.report() call, which
never happens in this case because the initial endpoint state we added, which came from a
remote node that was up, had the latest version of the gossip state (so Gossiper.reportFailureDetector()
will never call report()).
> The result is that the node can never ever be convicted.
> Now, let's ignore for a moment the problem that a node that is actually Down will be
thought to be Up temporarily for a little while. That is sub-optimal, but let's aim for a
fix to the more serious problem in this ticket - which is that is stays up forever.
> Considered solutions:
> * When interpret() gets called and there is no arrival window, we could add a faked arrival
window far back in time to cause the node to have history and be marked down. This "works"
in the particular test case. The problem is that since we are not ourselves actively trying
to gossip to these nodes with any particular speed, it might take a significant time before
we get any kind of confirmation from someone else that it's actually Up in cases where the
node actually *is* Up, so it's not clear that this is a good idea.
> * When interpret() gets called and there is no arrival window, we can simply convict
it immediately. This has roughly similar behavior as the previous suggestion.
> * When interpret() gets called and there is no arrival window, we can add a faked arrival
window at the current time, which will allow it to be treated as Up until the usual time has
passed before we exceed the Phi conviction threshold.
> * When interpret() gets called and there is no arrival window, we can immediately convict
it, *and* schedule it for immediate gossip on the next round in order to try to ensure they
go Up quickly if they are indeed up. This has an effect of O(n) gossip traffic, as a special
case once during node start-up. While theoretically a problem, I personally thing we can ignore
it for now since it won't be a significant problem any time soon. However, this is more complicated
since the way we queue up messages is asynchronously to background connection attempts. We'd
have to make sure the initial gossip message actually gets sent on an open TCP connection
(I haven't confirmed whether this will be the case or not).
> The first three are simple to implement, possibly the fourth. But in all cases, I am
worried about potential negative consequences that I am not seeing.
> Thoughts?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message