hadoop-hdfs-issues mailing list archives

From "Aaron T. Myers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
Date Sat, 18 Aug 2012 00:04:38 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437160#comment-13437160 ]

Aaron T. Myers commented on HDFS-3561:
--------------------------------------

That's a good point, Vinay, that the method will only ever be called once, but I still think
that creating a copy of the conf object in the FailoverController constructor makes the code
a little clearer. I don't think the increased memory usage from having one extra copy of the
conf object will be an issue at all. It will also be good from a future-proofing perspective
to make sure that any mutations to the passed-in Configuration object don't affect the behavior
of a long-lived FailoverController object. Does that make sense? Note that I don't feel super
strongly about this; it's just my preference. If you disagree, we can go with what you have
here.
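
To illustrate the defensive-copy point, here is a minimal sketch. It uses java.util.Properties as a stand-in for the Hadoop Configuration class, and the Controller class and the demo.fence.connect.retries key are hypothetical names made up for this example; it is not the code from the patch.

```java
import java.util.Properties;

public class DefensiveCopyDemo {

    /** Hypothetical stand-in for a long-lived controller; not the real FailoverController. */
    static final class Controller {
        private final Properties conf;
        private final int connectRetries;

        Controller(Properties passedIn) {
            // Copy the settings up front so that later changes to the
            // caller's object cannot alter this controller's behavior.
            this.conf = new Properties();
            this.conf.putAll(passedIn);
            // Read the value inline in the constructor; no separate
            // accessor is needed if nothing else calls it.
            this.connectRetries =
                Integer.parseInt(conf.getProperty("demo.fence.connect.retries", "1"));
        }

        int connectRetries() {
            return connectRetries;
        }

        String get(String key) {
            return conf.getProperty(key);
        }
    }

    public static void main(String[] args) {
        Properties shared = new Properties();
        shared.setProperty("demo.fence.connect.retries", "1");

        Controller controller = new Controller(shared);

        // The caller mutates its own object after construction.
        shared.setProperty("demo.fence.connect.retries", "45");

        // The controller still sees the value it was constructed with.
        System.out.println(controller.get("demo.fence.connect.retries")); // prints 1
    }
}
```

The cost is one extra copy of the settings per controller, in exchange for isolation from whatever the caller does with its object afterwards.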

Two little nits I noticed while taking another look at this patch:

# There's no need for the new getGracefulFenceConnectRetries function, since it's only ever
called from the constructor of this class. The other two similar methods are necessary because
they're called from the ZKFailoverController class. At the very least, the function should
be made private.
# There's no need for these to be on separate lines:
{code}
+    newConf
+        .setInt(
{code}
                
> ZKFC retries for 45 times to connect to other NN during fencing when network between
NNs broken and standby Nn will not take over as active 
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3561
>                 URL: https://issues.apache.org/jira/browse/HDFS-3561
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: auto-failover, ha
>    Affects Versions: 2.1.0-alpha, 3.0.0
>            Reporter: suja s
>            Assignee: Vinay
>            Priority: Critical
>         Attachments: HDFS-3561-2.patch, HDFS-3561.patch
>
>
> Scenario:
> Active NN on machine1
> Standby NN on machine2
> Machine1 is isolated from the network (machine1 network cable unplugged)
> After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there.
> ZKFC tries to failover NN2 as active.
> As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence
technique configured)
> This connection retry happens 45 times (as it takes ipc.client.connect.max.socket.retries)
> Also after that standby NN is not able to take over as active (because of fencing failure).
> Suggestion: If ZKFC is not able to reach the other NN within a specified time or number of retries, it
can consider that NN dead and instruct the other NN to take over as active, as there is
no chance of the other NN (NN1) retaining its state as active after the zk session timeout when
it is isolated from the network
> From ZKFC log:
> {noformat}
> 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s).
> 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s).
> 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s).
> 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s).
> 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s).
> 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s).
> 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s).
> 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s).
> 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s).
> 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s).
> {noformat}
>  
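
For a rough sense of the delay described in the report, the log entries above are 21 seconds apart. The sketch below multiplies that interval by the 45 retries mentioned in the description. It assumes the interval stays constant across all attempts, which the log excerpt does not confirm; the class and method names are made up for the example.

```java
import java.time.Duration;

public class RetryDelayEstimate {

    /** Upper-bound estimate: every attempt takes the same amount of time. */
    static Duration worstCaseDelay(int maxRetries, Duration perAttempt) {
        return perAttempt.multipliedBy(maxRetries);
    }

    public static void main(String[] args) {
        // 45 retries (from the report), 21 s between log lines (from the excerpt).
        Duration total = worstCaseDelay(45, Duration.ofSeconds(21));
        System.out.println(total.getSeconds() + " s"); // prints 945 s
    }
}
```

Under that assumption, fencing alone would block failover for a little under 16 minutes before giving up.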

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
