hadoop-common-issues mailing list archives

From "Bikas Saha (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8220) ZKFailoverController doesn't handle failure to become active correctly
Date Tue, 27 Mar 2012 17:54:27 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239725#comment-13239725 ]

Bikas Saha commented on HADOOP-8220:

bq. The new behavior avoids this, since the error handling patch is in ActiveStandbyElector
itself. This makes it easier to get the right semantics.
Ah, now I get it: the elector should be robust against its client code (the ZKFC in this case). I
like Hari's proposal of using a return value to report whether becoming active succeeded or failed.
I am not that familiar with standard practice in Java: are return values or exceptions preferred here?
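
To make the return-value versus exception question concrete, here is a minimal sketch of the two callback shapes side by side. Every name in it (`CallbackStyles`, `ReturnValueCallback`, `ThrowingCallback`, `BecomeActiveFailedException`, and the two handler methods) is hypothetical and chosen only for illustration; it is not the ActiveStandbyElector API and not the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; these are NOT the real ActiveStandbyElector types. */
final class CallbackStyles {

    /** Style 1: the callback reports success or failure through its return value. */
    interface ReturnValueCallback {
        boolean becomeActive();
    }

    /** Style 2: the callback signals failure with a checked exception. */
    interface ThrowingCallback {
        void becomeActive() throws BecomeActiveFailedException;
    }

    /** Hypothetical checked exception that carries the reason for the failure. */
    static final class BecomeActiveFailedException extends Exception {
        BecomeActiveFailedException(String reason) {
            super(reason);
        }
    }

    /** Elector-side handling for style 1. Returns true if the lock should be kept. */
    static boolean handleReturnValue(ReturnValueCallback cb, List<String> log) {
        if (cb.becomeActive()) {
            return true;
        }
        // A bare boolean cannot say why the transition failed.
        log.add("becomeActive failed (no reason available)");
        return false;
    }

    /** Elector-side handling for style 2. Returns true if the lock should be kept. */
    static boolean handleException(ThrowingCallback cb, List<String> log) {
        try {
            cb.becomeActive();
            return true;
        } catch (BecomeActiveFailedException e) {
            // The exception carries the reason, and the compiler forces the caller to handle it.
            log.add("becomeActive failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        handleReturnValue(() -> false, log);
        handleException(() -> {
            throw new BecomeActiveFailedException("NN unreachable");
        }, log);
        log.forEach(System.out::println);
    }
}
```

In either style the error handling lives in the elector, so a misbehaving client cannot leave it holding the lock by accident. The practical difference in this sketch is that a checked exception cannot be silently ignored by the caller and can carry a reason for the log.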

bq. This generates a lot of log spew and has little actual benefit. If instead we retry only
once a second, then (a) the logs are more readable, and (b) if there is another StandbyNode
in the cluster, it will get a chance to try to become active.
I do not understand where the tight loop is. Do you mean (Elector gets lock <-> ZKFC
fails to become active)?
I do not have any data on the trade-off between 1) letting the last active become active again,
with log spew, and 2) letting another standby become active by making the last active sleep. But
for argument's sake I would prefer 1). IMO, continuity in the active node would reduce
the overhead of client/datanode failover etc.
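
For reference, the "retry only once a second" behavior under discussion could look roughly like the sketch below. It is an assumption about the shape of the loop, not code from the patch: `PacedRetry`, `retry`, and `realSleep` are hypothetical names, and the sleeper is injected only so the pacing can be exercised without real waiting.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

/** Hypothetical sketch of a paced retry loop; not taken from the HADOOP-8220 patch. */
final class PacedRetry {

    /**
     * Calls tryBecomeActive up to maxAttempts times, pausing delayMillis after
     * each failure except the last. Returns the 1-based number of the attempt
     * that succeeded, or -1 if every attempt failed.
     */
    static int retry(BooleanSupplier tryBecomeActive, int maxAttempts,
                     long delayMillis, LongConsumer sleeper) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (tryBecomeActive.getAsBoolean()) {
                return attempt;
            }
            if (attempt < maxAttempts) {
                // Assumption: the lock has been given up by now, so another
                // standby may win the election during this pause.
                sleeper.accept(delayMillis);
            }
        }
        return -1;
    }

    /** Real sleeper for production-style use; restores the interrupt flag if interrupted. */
    static void realSleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With a delay of 0 this degenerates into the tight loop (option 1 above, with one log line per attempt); with a delay of 1000 it is option 2. A caller would pass `PacedRetry::realSleep` as the sleeper.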

bq. becomeActive() should be protected by a timeout also. If NN is taking far too long to
return, FC should declare failure and give up the lock. Otherwise, it is a deadlock.
Hari, this seems similar to the alternative proposed in HADOOP-8205 about trying to make sure
that the transition to active is short.
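
A timeout around becomeActive() could be sketched with the standard `java.util.concurrent` classes as below. `TimedTransition` and `becomeActiveWithTimeout` are hypothetical names, and the sketch only bounds how long the caller waits; it makes no claim about how the ZKFC or the NN actually implement this.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch of bounding a becomeActive() call with a timeout. */
final class TimedTransition {

    /**
     * Runs the transition on a worker thread. Returns true only if it
     * completes within timeoutMillis without throwing.
     */
    static boolean becomeActiveWithTimeout(Runnable transition, long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Future<?> result = worker.submit(transition);
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            result.cancel(true); // interrupt the stuck call; the caller should then give up the lock
            return false;
        } catch (ExecutionException e) {
            return false; // the transition itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            result.cancel(true);
            return false;
        } finally {
            worker.shutdownNow();
        }
    }
}
```

Note that a caller-side timeout like this does not undo a transition that the remote side eventually completes, which is one reason keeping the transition itself short is an attractive alternative.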

> ZKFailoverController doesn't handle failure to become active correctly
> ----------------------------------------------------------------------
>                 Key: HADOOP-8220
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8220
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 0.23.3, 0.24.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>         Attachments: hadoop-8220.txt
> The ZKFC doesn't properly handle the case where the monitored service fails to become
> active. Currently, it catches the exception and logs a warning, but then continues on, after
> calling quitElection(). This causes an NPE when it later tries to use the same zkClient instance
> while handling that same request. There is a test case, but the test case doesn't ensure that
> the node that had the failure is later able to recover properly.
