hadoop-hdfs-issues mailing list archives

From "Hari Mankude (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3192) Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for timeout secs
Date Wed, 04 Apr 2012 23:44:22 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13246852#comment-13246852 ]

Hari Mankude commented on HDFS-3192:

bq. Can you explain why it has to restart, instead of just transitioning to standby? What do
you mean by "in limbo" here?

"in limbo" implies that NN1 thinks that it is active even though NN2 has taken over since
it has not yet tried to access the edit logs. So it is not behaving as a standby and keeping
up with the active. Are you suggesting that ZKFC1 calls transitionToStandby() when it loses
the znode? On an active NN, there is a high probability that it might abort. Also, does
transitionToStandby() guarantee that all the active-state threads have quiesced?

bq. Before issuing an "uncontrolled abort", the ZKFC2 will always try to do a "graceful fence"
– i.e., ask it to self-resign via an RPC. See the tryGracefulFence function in the FailoverController

I don't think that doing tryGracefulFence() from NN2 to NN1 is safe. First of all, this
opens up one more channel of communication between NN1 and NN2, and that channel is subject to
various race sequences, split-brain, etc. I think self-resign is much safer than tryGracefulFence().
So far, I don't see a lack-of-correctness argument in our discussion. Is my description correct?
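
To make the self-resign idea concrete, here is a minimal sketch of the kind of watchdog the issue title describes: the active NN records each health-check RPC from its ZKFC and gives up the active role if none arrives within the timeout. This is not Hadoop code. The class name SelfResignWatchdog, its methods, the injected clock, and the onExpire callback are all hypothetical names chosen for illustration; in a real NN the callback would presumably abort the process, and recordHealthCheck() would be called from the handler of the ZKFC status RPC.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical watchdog sketch; not part of Hadoop. */
final class SelfResignWatchdog {
    private final long timeoutMillis;
    private final LongSupplier clock;        // assumed monotonic time source, in ms
    private final Runnable onExpire;         // assumed action, e.g. abort the process
    private final AtomicLong lastCheckMillis;
    private final AtomicBoolean fired = new AtomicBoolean(false);

    SelfResignWatchdog(long timeoutMillis, LongSupplier clock, Runnable onExpire) {
        if (timeoutMillis <= 0) {
            throw new IllegalArgumentException("timeoutMillis must be positive");
        }
        this.timeoutMillis = timeoutMillis;
        this.clock = clock;
        this.onExpire = onExpire;
        // Start the countdown at construction time.
        this.lastCheckMillis = new AtomicLong(clock.getAsLong());
    }

    /** Call whenever a health-check RPC from the ZKFC is received. */
    void recordHealthCheck() {
        lastCheckMillis.set(clock.getAsLong());
    }

    /**
     * Returns true if no health check has been seen for at least the timeout.
     * The onExpire action runs at most once, even if this is called repeatedly.
     */
    boolean checkOnce() {
        long silentFor = clock.getAsLong() - lastCheckMillis.get();
        if (silentFor < timeoutMillis) {
            return false;
        }
        if (fired.compareAndSet(false, true)) {
            onExpire.run();
        }
        return true;
    }
}
```

In a running process, checkOnce() could be driven periodically by a ScheduledExecutorService, with a clock such as () -> System.nanoTime() / 1_000_000. The point of the sketch is that the decision to resign is made locally by NN1 from its own timer, so it needs no additional channel between NN1 and NN2.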

> Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for
timeout secs
> --------------------------------------------------------------------------------------------------
>                 Key: HDFS-3192
>                 URL: https://issues.apache.org/jira/browse/HDFS-3192
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha, name-node
>            Reporter: Hari Mankude
>            Assignee: Hari Mankude

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

