hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3192) Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for timeout secs
Date Wed, 04 Apr 2012 22:56:21 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13246818#comment-13246818 ]

Todd Lipcon commented on HDFS-3192:

bq. So in step #6, irrespective of when ZKFC1 gets the notification, ZKFC1 has to restart
NN1. Otherwise, we don't know as to how long NN1 will stay in limbo.

Can you explain why it has to restart, instead of just transitioning to standby? What do you
mean by "in limbo" here?

bq. Also, NN1 could resign much earlier without having go through uncontrolled abort via fencing

Before issuing an "uncontrolled abort", ZKFC2 will always try a "graceful fence" first,
i.e. ask NN1 to resign via an RPC. See the {{tryGracefulFence}} function in {{FailoverController}}.

Having the other node ask NN1 to resign is better than having NN1 ask itself to resign,
because this is the only way the other node can be sure that it's "in the clear"
to start writing to the logs (a "self-resignation" might come too late). Since the other
node always has to verify the resignation before it starts to write, nothing extra is
gained by having NN1 resign itself first. It's just a redundancy.
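
To make the ordering concrete, here is a minimal sketch of "graceful fence first, uncontrolled abort second", with the new active gated on the outcome it observed itself. This is not the actual {{FailoverController}} code: {{Target}}, {{HardFencer}}, {{Outcome}}, {{fenceOldActive}} and {{mayBecomeActive}} are hypothetical names used only for illustration.

```java
/** Sketch only: hypothetical types, not the real Hadoop HA classes. */
class FenceSketch {
    /** Hypothetical handle to the old active NameNode (NN1). */
    interface Target {
        /** Ask the node to step down over RPC; throws if it is unreachable or refuses. */
        void transitionToStandby() throws Exception;
    }

    /** Hypothetical forcible fencing method (the "uncontrolled abort"). */
    interface HardFencer {
        /** Returns true only if the target is known to be fenced. */
        boolean fence(Target target);
    }

    enum Outcome { GRACEFUL, FORCED, FAILED }

    /** Run by the node that wants to take over (the ZKFC2 side). */
    static Outcome fenceOldActive(Target oldActive, HardFencer fencer) {
        try {
            // Graceful fence: the RPC returning normally is the verification
            // that the old active has stepped down.
            oldActive.transitionToStandby();
            return Outcome.GRACEFUL;
        } catch (Exception e) {
            // No confirmation arrived, so fall back to forcible fencing.
        }
        return fencer.fence(oldActive) ? Outcome.FORCED : Outcome.FAILED;
    }

    /** The new active may write to the logs only after a verified fence. */
    static boolean mayBecomeActive(Outcome outcome) {
        return outcome != Outcome.FAILED;
    }
}
```

In this sketch a self-resignation by NN1 would not change what the other node does: it still has to see one of the two fencing steps succeed before {{mayBecomeActive}} returns true.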

> Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for timeout secs
> --------------------------------------------------------------------------------------------------
>                 Key: HDFS-3192
>                 URL: https://issues.apache.org/jira/browse/HDFS-3192
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha, name-node
>            Reporter: Hari Mankude
>            Assignee: Hari Mankude
