hadoop-common-issues mailing list archives

From "Todd Lipcon (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8220) ZKFailoverController doesn't handle failure to become active correctly
Date Tue, 27 Mar 2012 21:12:26 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239957#comment-13239957 ]

Todd Lipcon commented on HADOOP-8220:

bq. Ah. Now I get it. The elector should be robust against client code (ZKFC in this case).
I like Hari's proposal of using a return value to inform about fail/success of becoming active.
I am not that familiar with standard practices in Java - are return values preferred or exceptions?

You got it. Exceptions are generally preferred for cases like this, since we have to handle
the error condition regardless of whether it's an expected error or something like
an NPE or other truly exceptional condition. So even with a boolean return type, we'd still need
a try/catch clause. Does that make sense? (I had originally made it return a boolean, but
then changed it to throw an exception.)
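
To illustrate the point about exceptions versus a boolean return, here is a minimal sketch of how the elector side might guard the callback. All names here (BecomeActiveSketch, BecomeActiveFailedException, Callback, tryBecomeActive) are hypothetical and chosen for illustration; they are not the actual Hadoop classes or the contents of the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; every name here is hypothetical, not the Hadoop API. */
class BecomeActiveSketch {

    /** Hypothetical checked exception for an expected failure to become active. */
    static class BecomeActiveFailedException extends Exception {
        BecomeActiveFailedException(String message) {
            super(message);
        }
    }

    /** Hypothetical callback the elector invokes once it holds the lock. */
    interface Callback {
        void becomeActive() throws BecomeActiveFailedException;
    }

    /**
     * Invokes the callback and reports whether the node became active.
     * A single try/catch covers both the expected failure and unexpected
     * runtime errors such as an NPE thrown by client code.
     */
    static boolean tryBecomeActive(Callback callback, List<String> log) {
        try {
            callback.becomeActive();
            return true;
        } catch (BecomeActiveFailedException e) {
            // Expected failure: the service reported it could not go active.
            log.add("expected failure: " + e.getMessage());
        } catch (RuntimeException e) {
            // Unexpected failure: the elector must survive buggy client code.
            log.add("unexpected failure: " + e.getClass().getSimpleName());
        }
        // In either failure case the caller should give up the lock.
        return false;
    }
}
```

Had the callback returned a boolean instead, the RuntimeException branch would still be needed, so the exception-based signature keeps all failure handling in one place.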

bq. I did not understand where the tight loop is? Do you mean (Elector gets lock<->ZKFC
fails to become active)?
Yep. In my test I saw that the standby would retry in a tight loop like that:
# Succeed in getting lock
# Call becomeActive()
# drop ZK session (lock disappears)
# reconnect to ZK
# Goto 1

I simply inserted a sleep between dropping the connection and reconnecting. This gives the
old active a better chance to become active again (or if there is a third node in the future,
it would have a chance to take the lock). In the future we may want to add some random jitter
and exponential backoff, but at this point let's keep it simple.
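
As a rough sketch of the idea above: a fixed pause between dropping the session and reconnecting breaks the tight loop, and a jittered exponential delay is one possible shape for the future refinement. The class name, method names, and all parameter values below are assumptions for illustration; they are not taken from the patch.

```java
import java.util.Random;

/** Illustrative sketch only; names and values are hypothetical, not from the patch. */
class ReconnectDelaySketch {

    /**
     * Drops the session, pauses, then reconnects. The pause gives another
     * node a chance to take the lock before this node rejoins the election.
     * Returns false if the thread was interrupted before reconnecting.
     */
    static boolean rejoinAfterFailure(Runnable dropSession, Runnable reconnect, long sleepMs) {
        dropSession.run();
        try {
            Thread.sleep(sleepMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
        reconnect.run();
        return true;
    }

    /**
     * Possible future refinement: exponential backoff with random jitter.
     * The ceiling doubles with each attempt up to maxMs, and the returned
     * delay is uniform in [ceiling / 2, ceiling]. Assumes baseMs is small
     * enough that shifting it left by up to 20 bits does not overflow.
     */
    static long backoffMs(int attempt, long baseMs, long maxMs, Random rng) {
        int shift = Math.min(Math.max(attempt, 0), 20);
        long ceiling = Math.min(maxMs, baseMs << shift);
        long half = ceiling / 2;
        return half + (long) (rng.nextDouble() * (ceiling - half + 1));
    }
}
```

The jitter keeps several standbys from retrying in lockstep, which matters more once there are more than two nodes competing for the lock.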

> ZKFailoverController doesn't handle failure to become active correctly
> ----------------------------------------------------------------------
>                 Key: HADOOP-8220
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8220
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 0.23.3, 0.24.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>         Attachments: hadoop-8220.txt
> The ZKFC doesn't properly handle the case where the monitored service fails to become
> active. Currently, it catches the exception and logs a warning, but then continues on after
> calling quitElection(). This causes an NPE when it later tries to use the same zkClient instance
> while handling that same request. There is a test case, but the test case doesn't ensure that
> the node that had the failure is later able to recover properly.
