hadoop-common-issues mailing list archives

From "Todd Lipcon (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8220) ZKFailoverController doesn't handle failure to become active correctly
Date Wed, 28 Mar 2012 22:07:27 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240776#comment-13240776 ]

Todd Lipcon commented on HADOOP-8220:

Yep, your updated description of the "tight loop" is exactly right. Sorry, I didn't note the
fact that becomeActive() throws an exception in this scenario.

New draft of the patch attached.

- Added a true unit test for the new changes, in addition to the functional test from the
prior revision (TestActiveStandbyElector#testFailToBecomeActive)
- Changed the control flow so that the success and error cases are kept near each other (suggested
by Bikas above)
- Changed the sleep calls to be wrapped in a {{sleepFor(ms)}} function, so it's easy to disable
the sleeping behavior in the unit tests. Otherwise the tests ran longer for no good reason.
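The {{sleepFor(ms)}} idea can be sketched roughly as below. This is a minimal illustration and not the code in the attached patch: the class names, the {{retryAfterFailure}} method, and the 1000 ms constant are all hypothetical. The point is only that routing every pause through one overridable method lets a test subclass skip the real sleep.

```java
// Hypothetical stand-in for an elector that pauses before retrying.
class RetryingElector {
    // Assumed value for illustration only.
    static final int SLEEP_AFTER_FAILURE_MS = 1000;

    int attempts = 0;

    // Single seam for all pauses; subclasses may override it.
    protected void sleepFor(int sleepMs) {
        if (sleepMs <= 0) {
            return;
        }
        try {
            Thread.sleep(sleepMs);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can observe it.
            Thread.currentThread().interrupt();
        }
    }

    // Hypothetical failure path: count the attempt, then back off.
    void retryAfterFailure() {
        attempts++;
        sleepFor(SLEEP_AFTER_FAILURE_MS);
    }
}

// Test variant: records the requested sleep time instead of sleeping.
class NoSleepElector extends RetryingElector {
    long requestedMs = 0;

    @Override
    protected void sleepFor(int sleepMs) {
        requestedMs += sleepMs;
    }
}
```

A unit test can then drive {{NoSleepElector}} through many failures without waiting, and still check how much sleep was requested.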

In response to a couple comments above that got lost in the discussion:
> 2. becomeActive() should be protected by a timeout also. If NN is taking far too long to return,
> FC should declare failure and give up the lock. Otherwise, it is a deadlock.
This is really difficult to do reliably, since there's no good way to 'cancel' the callback.
The {{transitionToActive}} RPC itself should have a timeout attached -- it's much more straightforward
to do that than to try to make ActiveStandbyElector guard against arbitrary code running too
long in the callback. I added a note to the javadoc indicating this.
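To illustrate why guarding the callback from the outside is unreliable, here is a small sketch using plain {{java.util.concurrent}}; it is not Hadoop code, and the class and method names are hypothetical. The caller stops waiting after a timeout and calls {{cancel(true)}}, but a callback that swallows the interrupt keeps running anyway.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

class CallbackTimeoutSketch {
    // Returns true if the callback was still running after the caller
    // timed out and cancelled it, and only finished once released.
    static boolean callbackOutlivesCancel() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        AtomicBoolean finished = new AtomicBoolean(false);

        // Stand-in for arbitrary callback code that ignores interruption.
        Future<?> callback = pool.submit(() -> {
            boolean released = false;
            while (!released) {
                try {
                    release.await();
                    released = true;
                } catch (InterruptedException ignored) {
                    // Swallow the interrupt and keep waiting.
                }
            }
            finished.set(true);
        });

        boolean outlived = false;
        try {
            callback.get(50, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The caller gives up, but cancel(true) can only interrupt.
            callback.cancel(true);
            outlived = callback.isCancelled() && !finished.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }

        // Let the callback finish so the pool can shut down cleanly.
        release.countDown();
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return outlived && finished.get();
    }
}
```

The future reports itself as cancelled while the work is still in progress, which is the situation a timeout on the RPC itself avoids.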

> Do you really want to commit the logs added to ActiveStandbyTestUtil?
Yes, I found that when I had a test failure due to timeout, it was difficult to debug, since
I couldn't easily tell which node had the lock at the time the test timed out. I rate-limited
the logging to only two per second, so it shouldn't make the logs too noisy, while retaining
the advantage of seeing what's going on better if there is a timeout.
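The rate limiting described above could look something like the following sketch. It is an assumption about the general shape, not the code added to ActiveStandbyTestUtil; the class name and the injected clock value are hypothetical. A 500 ms minimum interval corresponds to at most two messages per second.

```java
// Hypothetical helper: allows a log message at most once per interval.
class RateLimitedLogSketch {
    private final long minIntervalNanos;
    private long lastLogNanos;
    private boolean hasLogged = false;

    RateLimitedLogSketch(long minIntervalMs) {
        this.minIntervalNanos = minIntervalMs * 1_000_000L;
    }

    // The caller passes the current time (e.g. System.nanoTime()), which
    // keeps the helper deterministic under test.
    synchronized boolean shouldLog(long nowNanos) {
        if (!hasLogged || nowNanos - lastLogNanos >= minIntervalNanos) {
            hasLogged = true;
            lastLogNanos = nowNanos;
            return true;
        }
        return false;
    }
}
```

A polling loop in a test would call {{shouldLog(System.nanoTime())}} on each iteration and only print which node holds the lock when it returns true.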

> ZKFailoverController doesn't handle failure to become active correctly
> ----------------------------------------------------------------------
>                 Key: HADOOP-8220
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8220
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: auto-failover, ha
>    Affects Versions: 0.23.3, 0.24.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>         Attachments: hadoop-8220.txt, hadoop-8220.txt
> The ZKFC doesn't properly handle the case where the monitored service fails to become
> active. Currently, it catches the exception and logs a warning, but then continues on, after
> calling quitElection(). This causes an NPE when it later tries to use the same zkClient instance
> while handling that same request. There is a test case, but the test case doesn't ensure that
> the node that had the failure is later able to recover properly.
