Date: Wed, 19 Mar 2014 06:23:42 +0000 (UTC)
From: "Jing Zhao (JIRA)"
To: hdfs-issues@hadoop.apache.org
Reply-To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended

    [ https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940229#comment-13940229 ]

Jing Zhao commented on HDFS-6089:
---------------------------------

Hi Andrew, thanks for the explanation. I think I understand your concern now: rolling on the ANN based only on the number of edits may cause issues in some scenarios. If no further operations arrive, the SBN may wait a long time before it can tail the edits that sit in an in-progress segment.

bq. 
Checkpointing combines the edit log with the fsimage, and we purge unnecessary log segments afterwards.

But I'm still a little confused about this part. I fail to see the difference between time-based rolling triggered by the SBN and time-based rolling done by the ANN itself. In the current code, the SBN still triggers rolling through an RPC to the ANN. This also does not affect checkpointing and purging: when the SBN does a checkpoint, both the SBN and the ANN purge old edits in their own storage (the SBN purges before uploading the checkpoint, and the ANN purges after receiving the new fsimage).

So I guess a possible solution may be to simply let the ANN roll every 2 minutes. I think this can achieve almost the same effect as the current mechanism, without delaying the failover. Or do you see any counterexamples with this change?

Back to the solution of changing the RPC timeout. It looks like we have not set a timeout for this NN-->NN RPC right now (correct me if I'm wrong). Setting a timeout (e.g., 20s, like the default timeout from client to NN) can of course improve the failover time in our test case, but I still prefer the above solution because it makes the rolling behavior simpler and more predictable (in particular, it removes the RPC call from the SBN to the ANN).

> Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-6089
>                 URL: https://issues.apache.org/jira/browse/HDFS-6089
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.0
>            Reporter: Arpit Gupta
>            Assignee: Jing Zhao
>         Attachments: HDFS-6089.000.patch, HDFS-6089.001.patch
>
>
> The following scenario was tested:
> * Determine Active NN and suspend the process (kill -19)
> * Wait about 60s to let the standby transition to active
> * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to active.
> It was noticed that sometimes the call to get the service state of nn2 got a socket timeout exception.

--
This message was sent by Atlassian JIRA (v6.2#6252)
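
Illustrative sketch (not part of the original message): the rolling proposal in the comment above, where the ANN rolls on its own every 2 minutes in addition to rolling on edit count, could be modeled roughly as follows. This is a minimal Python model, not Hadoop code. The names SegmentState and should_roll, the max_edits value, and the choice to skip empty segments are all assumptions made for illustration; only the 2-minute period comes from the comment.

```python
from dataclasses import dataclass


@dataclass
class SegmentState:
    """Hypothetical snapshot of the ANN's in-progress edit log segment."""
    edits_in_segment: int       # transactions written since the last roll
    seconds_since_roll: float   # age of the in-progress segment


def should_roll(state: SegmentState,
                max_edits: int = 1_000_000,      # assumed threshold, not a Hadoop default
                max_age_seconds: float = 120.0,  # "every 2min" from the comment
                ) -> bool:
    """Return True if the ANN should finalize the current segment."""
    if state.edits_in_segment == 0:
        # Assumption: an empty segment holds nothing for the SBN to tail,
        # so rolling it would only create an empty finalized segment.
        return False
    if state.edits_in_segment >= max_edits:
        return True  # edit-count trigger
    # Time trigger: bounds how long edits stay in an in-progress segment
    # without requiring any RPC from the SBN to the ANN.
    return state.seconds_since_roll >= max_age_seconds
```

With a check like this running locally on the ANN, a few edits followed by an idle period would still be finalized after about 2 minutes, which is the case the edit-count trigger alone does not cover.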