cassandra-commits mailing list archives

From "Peter Schuller (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-2603) node stuck in 'Down' in nodetool ring, until disablegossip/enablegossip flapped it back into submission
Date Wed, 04 May 2011 20:34:03 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-2603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028942#comment-13028942 ]

Peter Schuller commented on CASSANDRA-2603:
-------------------------------------------

Ok, I don't have a good overall view of how gossip fits together, but I did take a look for a potential
race. Take the following with a grain of salt, as a stream-of-consciousness kind of thing which may
or may not help.

So NodeCmd's criterion for showing "Down" is that the endpoint (1) is not in the set of live
nodes obtained from the probe, and (2) is in fact contained in the dead nodes from probe.getUnreachableNodes()
(otherwise ? is printed).
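
In rough Java terms, the decision is something like the following (a simplified sketch of my reading
of NodeCmd, not the actual code; the probe methods hand back stringified addresses):

    import java.util.Collection;

    // Sketch of NodeCmd's status decision. liveNodes/unreachableNodes stand in for
    // the collections fetched via probe.getLiveNodes()/probe.getUnreachableNodes().
    static String statusOf(String endpoint, Collection<String> liveNodes,
                           Collection<String> unreachableNodes)
    {
        if (liveNodes.contains(endpoint))
            return "Up";
        if (unreachableNodes.contains(endpoint))
            return "Down";
        return "?"; // known to the ring, but in neither collection
    }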

getUnreachableNodes() seems to translate, via stringification, to Gossiper.instance.getUnreachableMembers(),
which is just unreachableEndpoints_.getSet(). So other than the potential for incidental races
between nodetool and the gossiper, which is not the issue here, it seems to me that the
'Down' should be reflecting the actual existence of the node in unreachableEndpoints_.

This is where I'm starting to be unsure of how things fit together. On the one hand, it seems
there's an attempt to do all the gossiping logic in the GossipTask which runs on StorageService.scheduledTasks.
For example, doStatusCheck() calls FailureDetector.instance.interpret() instead of having
the failure detector call the Gossiper.
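
Schematically I picture that periodic task as something like this (simplified sketch; doGossipRound()
and allKnownEndpoints() are placeholder names of mine, only FailureDetector.instance.interpret() is
the real call):

    import java.net.InetAddress;

    // Sketch of one gossip round: gossip first, then let the failure detector
    // reinterpret each known endpoint; interpret() may convict() back into the Gossiper.
    public void run()
    {
        doGossipRound();                            // placeholder: send digests to a random endpoint
        for (InetAddress ep : allKnownEndpoints())  // placeholder for the set of known endpoints
            FailureDetector.instance.interpret(ep);
    }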

However, if I'm tracing things right, there do seem to be some calls into the gossiper directly.
GossipDigestAckVerbHandler calls Gossiper.instance.{notifyFailureDetector,applyStateLocally}.

I went through (but not very carefully) the sites where unreachableEndpoints_ got modified,
and markAlive() is one caller that can result in a node being put in unreachable via isAlive().
applyStateLocally() is indeed a caller of markAlive(). markAlive is also where the "is now
UP" message is printed that I had in syslog in this case.

A node being downed as part of doStatusCheck() should be marked as such by interpret() on
the failure detector, which in turn does convict() on the gossiper, which calls isAlive()
with false. In the alive case, it's added to live endpoints and then removed from unreachable.
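
Schematically, the two transitions read to me like this (simplified sketch, not the actual Gossiper
code; the real unreachableEndpoints_ structure may differ):

    import java.net.InetAddress;

    // Sketch of the live/unreachable transitions driven by isAlive().
    void isAlive(InetAddress addr, boolean alive)
    {
        if (alive)
        {
            liveEndpoints_.add(addr);            // added to live first...
            unreachableEndpoints_.remove(addr);  // ...then removed from unreachable
        }
        else
        {
            liveEndpoints_.remove(addr);         // briefly in neither collection at this point
            unreachableEndpoints_.add(addr);
        }
    }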


While this seems subject to races, I can't quite see how it ever actually matters. A node
will be briefly in neither live nor unreachable, but I don't see how it can ever end up in
a state where a node is somehow permanently in unreachable but never recovers. Gossiping happens
at random to those in unreachable, so if we accept that the nodetool ring output proves it
was in unreachable we should be good.

So, again without having a good overall understanding, it seems more likely that the node
is in fact in the unreachable list but that something is causing it never to become marked
as live again (for real, not just in nodetool output).

That in turn I suppose boils down to applyStateLocally(), which is what calls markAlive(),
either not being called or somehow not reaching the point of calling markAlive(). I'm not
up on how the generations work/are intended to work though. What about a disable/enable gossip
round might trigger a different path than regular gossip in applyStateLocally()?
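
For what it's worth, my loose mental model of the generation handling is roughly the following,
though this is an assumption on my part and not something I've verified against applyStateLocally();
method names like handleMajorStateChange() and applyNewerVersions() are guesses:

    // Assumed shape of the generation check in applyStateLocally(), not verified.
    int remoteGen = remoteState.getHeartBeatState().getGeneration();
    int localGen  = localState.getHeartBeatState().getGeneration();

    if (remoteGen > localGen)
        handleMajorStateChange(ep, remoteState);          // e.g. node restarted; should end up calling markAlive()
    else if (remoteGen == localGen)
        applyNewerVersions(ep, localState, remoteState);  // may or may not reach markAlive()
    // remoteGen < localGen: remote state is stale and gets ignored

If a disable/enable round ends up bumping the generation (or otherwise forcing a full state exchange),
that would push things through the first branch even if the regular path was somehow stuck, which
might be why the flap fixed it.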


> node stuck in 'Down' in nodetool ring, until disablegossip/enablegossip flapped it back
into submission
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2603
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2603
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.7.4
>            Reporter: Peter Schuller
>            Assignee: Brandon Williams
>
> Cluster with 0.7.4 and 9 machines. I was doing rolling restarts so nodes were expected
to have flapped up/down a bit.
> After cleanup, I noticed that one of the nodes 'nodetool ring' claimed that another node
was Down. I'll call the node that considered the *other* one to be down "UpNode" and the node
that was considered *down* "DownNode".
> DownNode was the next successor on the ring relative to UpNode. Only UpNode thought it
was down; all other members of the cluster agreed it was up. This stayed the case for almost
24 hours.
> In system.log on UpNode, it is clearly visible that DownNode flapped to state UP recently
with no notification of flapping to state DOWN afterwards. Yet 'nodetool ring' reported Down.
> Today, I did disablegossip+wait-for-a-bit+enablegossip on DownNode. This caused 'nodetool
ring' on UpNode to again reflect reality that DownNode is in fact up.
> I do not have a reproducible test case but wanted to file it since I don't remember seeing,
and didn't easily find, a JIRA issue indicating that a bug with this effect has recently been fixed.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
