Date: Mon, 25 Jun 2012 14:00:43 +0000 (UTC)
From: "Vinay (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-3561) ZKFC retries 45 times to connect to the other NN during fencing when the network between NNs is broken, and the standby NN will not take over as active

    [ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13400487#comment-13400487 ]

Vinay commented on HDFS-3561:
-----------------------------

During a failover transition, the old active NN has to be fenced. Before the configured fencing method is actually invoked, a graceful fence is attempted first: ZKFC tries to obtain an RPC proxy to the other machine's NameNode. Since the network is down, the connection cannot be established, and the IPC client retries 45 times, as configured by *ipc.client.connect.max.retries.on.timeouts*:

{code}
LOG.info("Should fence: " + target);
boolean gracefulWorked = new FailoverController(conf,
    RequestSource.REQUEST_BY_ZKFC).tryGracefulFence(target);
if (gracefulWorked) {
  // It's possible that it's in standby but just about to go into active,
  // no? Is there some race here?
  LOG.info("Successfully transitioned " + target + " to standby " +
      "state without fencing");
  return;
}
{code}

I think in the ZKFC case we can reduce the number of retries. Any thoughts?
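One possible direction, as a minimal sketch only (not a committed fix): give the graceful-fence path its own Configuration copy with a much smaller retry count, so an unreachable NN fails fast instead of stalling failover. The retry value 3 below is purely illustrative:

{code}
// Sketch: use a dedicated Configuration for graceful fencing so the global
// 45-retry default does not stall ZKFC when the other NN is unreachable.
Configuration fenceConf = new Configuration(conf);
// The default for this key is 45; 3 is an illustrative value only.
fenceConf.setInt("ipc.client.connect.max.retries.on.timeouts", 3);
boolean gracefulWorked = new FailoverController(fenceConf,
    RequestSource.REQUEST_BY_ZKFC).tryGracefulFence(target);
{code}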
> ZKFC retries 45 times to connect to the other NN during fencing when the network between NNs is broken, and the standby NN will not take over as active
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3561
>                 URL: https://issues.apache.org/jira/browse/HDFS-3561
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: auto-failover
>            Reporter: suja s
>            Assignee: Vinay
>
> Scenario:
> Active NN on machine1
> Standby NN on machine2
> Machine1 is isolated from the network (machine1's network cable is unplugged).
> After the ZK session timeout, ZKFC on machine2 is notified that NN1 is gone.
> ZKFC tries to fail over, making NN2 active.
> As part of this, during fencing it tries to connect to machine1 and kill NN1 (the sshfence technique is configured).
> This connection retry happens 45 times (as governed by ipc.client.connect.max.retries.on.timeouts).
> Even after that, the standby NN is not able to take over as active (because of the fencing failure).
> Suggestion: if ZKFC cannot reach the other NN within a specified time or number of retries, it can consider that NN dead and instruct its own NN to take over as active, since there is no chance of the other NN (NN1) retaining its active state after the ZK session timeout while it is isolated from the network.
> From the ZKFC log:
> {noformat}
> 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s).
> 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s).
> 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s).
> 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s).
> 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s).
> 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s).
> 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s).
> 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s).
> 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s).
> 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s).
> {noformat}
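Regarding the suggestion in the description: besides lowering the retry count, the graceful-fence attempt could be bounded by wall-clock time, so ZKFC treats an unreachable NN as dead and falls through to the configured fencing method. This is a hypothetical sketch only; the GRACEFUL_FENCE_TIMEOUT_MS constant and executor wiring are illustrative, not existing ZKFC code, and it assumes conf, target, and LOG from the surrounding ZKFC context shown above:

{code}
// Hypothetical sketch: cap the graceful-fence attempt with a timeout instead
// of waiting out 45 IPC retries. GRACEFUL_FENCE_TIMEOUT_MS is illustrative.
final long GRACEFUL_FENCE_TIMEOUT_MS = 30000;
ExecutorService executor = Executors.newSingleThreadExecutor();
Future<Boolean> attempt = executor.submit(new Callable<Boolean>() {
  public Boolean call() {
    return new FailoverController(conf, RequestSource.REQUEST_BY_ZKFC)
        .tryGracefulFence(target);
  }
});
boolean gracefulWorked = false;
try {
  gracefulWorked = attempt.get(GRACEFUL_FENCE_TIMEOUT_MS, TimeUnit.MILLISECONDS);
} catch (TimeoutException e) {
  attempt.cancel(true); // give up on graceful fencing; fall back to sshfence
} catch (InterruptedException e) {
  Thread.currentThread().interrupt();
} catch (ExecutionException e) {
  LOG.warn("Graceful fencing attempt failed", e.getCause());
} finally {
  executor.shutdownNow();
}
{code}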