hadoop-common-issues mailing list archives

From "Bikas Saha (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8220) ZKFailoverController doesn't handle failure to become active correctly
Date Tue, 27 Mar 2012 21:42:26 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239976#comment-13239976 ]

Bikas Saha commented on HADOOP-8220:

Do you mean?
1. Succeed in getting lock
2. Call becomeActive()
3. ZKFC fails to become active. Call quitElection.
4. drop ZK session (lock disappears)
5. reconnect to ZK
6. Go to 1

Otherwise I don't see why the lock disappears.
If yes, then this might be OK, since by design we are deciding to sleep and let someone else
take a shot at becoming active because we are having trouble doing so. Could you please add
this to the comments so that the sleeping is explained?
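
To make the six steps above concrete, here is a minimal Java sketch of the cycle. It is not the ZKFC code: the class name ElectionCycleSketch, the run() method, the event strings, and the maxAttempts bound are all hypothetical, and the ZooKeeper interaction is replaced by recorded events so the flow can be read and tested in isolation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for the cycle in steps 1-6; not the real ZKFC code. */
final class ElectionCycleSketch {
    /** Events recorded in order so the flow can be inspected afterwards. */
    final List<String> events = new ArrayList<>();

    /**
     * Repeats the cycle until becomeActive succeeds or maxAttempts is used up.
     * Returns the attempt number that succeeded, or -1 if none did.
     */
    int run(BooleanSupplier becomeActive, int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            events.add("lock acquired");           // step 1
            if (becomeActive.getAsBoolean()) {     // step 2
                events.add("active");
                return attempt;
            }
            events.add("quit election");           // step 3
            events.add("session dropped");         // step 4: the lock disappears
            try {
                Thread.sleep(sleepMillis);         // let another node take a shot
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
            events.add("reconnected");             // step 5, then back to step 1
        }
        return -1;
    }
}
```

The sleep sits between dropping the session and reconnecting, which is the window in which another node can acquire the lock.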

Now that I am looking at the patch in a less sleepy state, I think the following might be a
better flow, because the actions on failure and success are in one place:

if (becomeActive()) {
  // success handling
} else {
  LOG.warn("Failed to become active. Rejoin election to try again but sleep before that to let someone else try.");
  // sleep, then rejoin the election
}
This puts the failure/success handling of becomeActive() in one place and avoids
becomeActive() having the side effect of also calling rejoinElection() on failure.
What do you think?
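
A self-contained sketch of that flow, for illustration only. It assumes becomeActive() reports success as a boolean and does nothing else; the class BecomeActiveFlowSketch, the Service interface, onLockAcquired(), and the in-memory log are hypothetical and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the proposed flow; not the patch itself. */
final class BecomeActiveFlowSketch {
    /** Stand-in for the monitored service (assumed interface). */
    interface Service {
        void transitionToActive() throws Exception;
    }

    /** In-memory log so the sequence of actions can be checked. */
    final List<String> log = new ArrayList<>();
    private final Service service;
    private final long sleepMillis;

    BecomeActiveFlowSketch(Service service, long sleepMillis) {
        this.service = service;
        this.sleepMillis = sleepMillis;
    }

    /** Only attempts the transition; it has no election side effects. */
    boolean becomeActive() {
        try {
            service.transitionToActive();
            return true;
        } catch (Exception e) {
            log.add("transition failed: " + e.getMessage());
            return false;
        }
    }

    /** Success and failure of becomeActive() are both handled here. */
    void onLockAcquired() {
        if (becomeActive()) {
            log.add("active");
        } else {
            log.add("WARN failed to become active; sleeping before rejoining");
            try {
                Thread.sleep(sleepMillis);   // let someone else try first
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            rejoinElection();
        }
    }

    /** Placeholder: a real implementation would re-enter the ZK election. */
    void rejoinElection() {
        log.add("rejoined election");
    }
}
```

With this shape, a caller that only wants to attempt the transition can call becomeActive() without triggering a rejoin.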
> ZKFailoverController doesn't handle failure to become active correctly
> ----------------------------------------------------------------------
>                 Key: HADOOP-8220
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8220
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 0.23.3, 0.24.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>         Attachments: hadoop-8220.txt
> The ZKFC doesn't properly handle the case where the monitored service fails to become
> active. Currently, it catches the exception and logs a warning, but then continues on, after
> calling quitElection(). This causes a NPE when it later tries to use the same zkClient instance
> while handling that same request. There is a test case, but the test case doesn't ensure that
> the node that had the failure is later able to recover properly.
