hadoop-common-issues mailing list archives

From "Hari Mankude (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8220) ZKFailoverController doesn't handle failure to become active correctly
Date Tue, 27 Mar 2012 17:16:26 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239653#comment-13239653 ]

Hari Mankude commented on HADOOP-8220:


A couple of comments on the design rather than the code changes:

1. becomeActive() could return either of two values (true or false) to the appClient. For
example, the NN might decide that it cannot become active because it does not have access to
the resources it needs. We would need to handle both return values. If the return value is
false, the FC will give up the ephemeral znode for one "timeout" iteration to allow the other
NN's FC to take over the znode.

2. becomeActive() should also be protected by a timeout. If the NN is taking far too long to
return, the FC should declare failure and give up the lock. Otherwise, it is a deadlock.
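
A rough sketch of how the two points above could fit together, in Java. This is not the ZKFC code or its API: the class name, the Outcome enum, tryBecomeActive(), and shouldReleaseLock() are all made up for illustration, and the Callable stands in for whatever call the FC makes to the NN. It only shows the idea of treating "false" and "took too long" as failures that cause the FC to give up the lock.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; none of these names come from Hadoop.
class BecomeActiveSketch {

    /** Possible results of one attempt to make the local service active. */
    enum Outcome { ACTIVE, REFUSED, FAILED, TIMED_OUT }

    /**
     * Runs the (assumed) becomeActive call on a helper thread and waits
     * at most timeoutMs for it to return.
     */
    static Outcome tryBecomeActive(Callable<Boolean> becomeActive, long timeoutMs) {
        // Daemon thread so a hung call cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "become-active");
            t.setDaemon(true);
            return t;
        });
        Future<Boolean> result = executor.submit(becomeActive);
        try {
            Boolean ok = result.get(timeoutMs, TimeUnit.MILLISECONDS);
            // Point 1: a "false" return is a refusal, not a success.
            return Boolean.TRUE.equals(ok) ? Outcome.ACTIVE : Outcome.REFUSED;
        } catch (TimeoutException e) {
            // Point 2: the call took too long, so declare failure.
            result.cancel(true);
            return Outcome.TIMED_OUT;
        } catch (ExecutionException e) {
            // The call itself threw an exception.
            return Outcome.FAILED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            result.cancel(true);
            return Outcome.FAILED;
        } finally {
            executor.shutdownNow();
        }
    }

    /** Anything other than ACTIVE means the FC should give up the lock. */
    static boolean shouldReleaseLock(Outcome outcome) {
        return outcome != Outcome.ACTIVE;
    }

    public static void main(String[] args) {
        Outcome refused = tryBecomeActive(() -> false, 1000);
        System.out.println(refused + " -> release lock: " + shouldReleaseLock(refused));

        Outcome slow = tryBecomeActive(() -> { Thread.sleep(5000); return true; }, 100);
        System.out.println(slow + " -> release lock: " + shouldReleaseLock(slow));
    }
}
```

The sketch leaves out the part where the FC stays out of the election for one "timeout" iteration after releasing the znode; that would sit in the caller, wherever shouldReleaseLock() returns true.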

As you commented on the other JIRA, it might be useful to create a separate branch for automatic
failover. There will be lots of corner cases to deal with.

> ZKFailoverController doesn't handle failure to become active correctly
> ----------------------------------------------------------------------
>                 Key: HADOOP-8220
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8220
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 0.23.3, 0.24.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>         Attachments: hadoop-8220.txt
> The ZKFC doesn't properly handle the case where the monitored service fails to become
> active. Currently, it catches the exception and logs a warning, but then continues on, after
> calling quitElection(). This causes a NPE when it later tries to use the same zkClient instance
> while handling that same request. There is a test case, but the test case doesn't ensure that
> the node that had the failure is later able to recover properly.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

