cassandra-commits mailing list archives

From "Brandon Williams (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-3273) FailureDetector can take a very long time to mark a host down
Date Thu, 29 Sep 2011 21:05:46 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams updated CASSANDRA-3273:
----------------------------------------

    Attachment: 3273.txt

bq. What if we reset the intervals when we get a node back-from-the-dead?

That makes sense if we're observing a generation change: the node either rebooted or was
taken over by a new machine, so relearning the network characteristics is a good idea.

If only the heartbeat changed, that indicates something bad happened (most likely in the
network), and we should remember that for next time to avoid flapping. However, in the case
of a long partition, where the generation won't change, we don't want to record the partition
time as an interval: if the partition recurs soon, it will take us a very long time to mark
the host down again.

This patch clears the intervals on a generation change and handles the long partition case
by defining a reasonable maximum interval to record, in this case the rpc timeout. Adapting
beyond that rather than failing quickly doesn't make much sense that I can think of, but I'll
entertain a higher hard-set default if anyone disagrees.
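
To illustrate the two rules described above, here is a minimal sketch. It is not the code in
3273.txt: the class name, method names, and window capacity are all hypothetical. It also
assumes that an interval above the maximum is dropped; clamping it to the maximum would be
another reading of "a reasonable maximum to record".

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of an arrival window; NOT the actual 3273.txt patch. */
class ArrivalWindowSketch {
    private final Deque<Long> intervalsMs = new ArrayDeque<>();
    private final long maxIntervalMs; // assumed cap, e.g. the rpc timeout
    private final int capacity;       // assumed bound on the window size
    private long lastArrivalMs = -1;
    private int lastGeneration = -1;

    ArrivalWindowSketch(long maxIntervalMs, int capacity) {
        this.maxIntervalMs = maxIntervalMs;
        this.capacity = capacity;
    }

    /** Record a heartbeat seen at nowMs from a node with the given generation. */
    void report(int generation, long nowMs) {
        if (generation != lastGeneration) {
            // Generation change: the node restarted or was replaced,
            // so forget everything learned and start over.
            intervalsMs.clear();
            lastGeneration = generation;
            lastArrivalMs = nowMs;
            return;
        }
        long interval = nowMs - lastArrivalMs;
        lastArrivalMs = nowMs;
        if (interval > maxIntervalMs) {
            // Long partition with no generation change: do not record the
            // outage itself as an interval (assumption: drop, not clamp).
            return;
        }
        if (intervalsMs.size() == capacity) {
            intervalsMs.removeFirst(); // evict the oldest sample
        }
        intervalsMs.addLast(interval);
    }

    /** Mean of the recorded intervals, or 0.0 if none are recorded. */
    double mean() {
        return intervalsMs.stream().mapToLong(Long::longValue).average().orElse(0.0);
    }

    int size() {
        return intervalsMs.size();
    }
}
```

With a cap of 10 seconds, a heartbeat that arrives an hour after the previous one leaves the
window unchanged, while a heartbeat carrying a new generation empties it.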
                
> FailureDetector can take a very long time to mark a host down
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-3273
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3273
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Brandon Williams
>            Assignee: Brandon Williams
>         Attachments: 3273.txt
>
>
> There are two ways to trigger this:
> * Bring a node up very briefly in a mixed-version cluster and then terminate it
> * Bring a node up, terminate it for a very long time, then bring it back up and take it down again
> In the first case, a very short arrival interval can be recorded, because the versioning logic requires reconnecting, which can happen very quickly. This can easily be solved by rejecting any intervals below a reasonable bound, for instance the gossiper interval.
> The second case is harder to solve: an extremely large interval is recorded, namely the time the node was left dead the first time. This throws off the mean of the intervals and causes it to take much longer than it should to mark the node down the second time.
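
The effect on the mean described in the issue can be seen with a little arithmetic. The sketch
below uses made-up numbers: heartbeats about 1000 ms apart, one 5 ms interval from a quick
reconnect, and one hour-long outage. The bounds (1000 ms and 10000 ms) and all names are
assumptions chosen for illustration, not values taken from Cassandra.

```java
import java.util.Arrays;

/** Hypothetical illustration of how outlier intervals skew the mean. */
class MeanSkewSketch {
    /** Arithmetic mean of the intervals, or 0.0 for an empty array. */
    static double mean(long[] intervalsMs) {
        return Arrays.stream(intervalsMs).average().orElse(0.0);
    }

    /** Keep only the intervals inside [minMs, maxMs]. */
    static long[] withinBounds(long[] intervalsMs, long minMs, long maxMs) {
        return Arrays.stream(intervalsMs)
                     .filter(i -> i >= minMs && i <= maxMs)
                     .toArray();
    }

    public static void main(String[] args) {
        // Five normal heartbeats, one 5 ms reconnect, one hour-long outage.
        long[] observed = {1000, 1000, 1000, 5, 1000, 3_600_000, 1000};

        // Assumed bounds: gossiper interval as the floor, rpc timeout as the cap.
        long[] filtered = withinBounds(observed, 1000, 10_000);

        System.out.println("mean of all intervals:      " + mean(observed));
        System.out.println("mean of bounded intervals:  " + mean(filtered));
    }
}
```

With these numbers the unfiltered mean is well over 500,000 ms, while the bounded mean is
1000 ms, so a detector that judges silence relative to the mean would wait far longer in the
first case before marking the node down.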

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
