Date: Tue, 19 Apr 2016 00:49:25 +0000 (UTC)
From: "Haibo Chen (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Commented] (MAPREDUCE-6657) job history server can fail on startup when NameNode is in start phase

[ https://issues.apache.org/jira/browse/MAPREDUCE-6657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246860#comment-15246860 ]

Haibo Chen commented on MAPREDUCE-6657:
---------------------------------------

Thanks a lot for your comments, [~templedf]. I have added a brief javadoc and set the timeout to 500. Let me know if 500 looks reasonable to you.
Also, the test method now uses the existing dfs cluster instead of a new local one. The only method in TestHistoryManager that uses both dfs clusters is testCreateDirsWithAdditionalFileSystem(), so maybe it makes more sense to move that method out?

When the name node is in safe mode, the JHS throws a YarnRuntimeException with a timeout message. I think the assert message is actually in line with the expected behavior.

> job history server can fail on startup when NameNode is in start phase
> ----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6657
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6657
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver
>            Reporter: Haibo Chen
>            Assignee: Haibo Chen
>         Attachments: mapreduce6657.001.patch, mapreduce6657.002.patch
>
>
> Job history server will try to create a history directory in HDFS on startup. When the NameNode is in safe mode, it will keep retrying for a configurable time period. However, it should also keep retrying if the name node is still in its startup phase, because safe mode does not begin until the NN is out of the startup phase.
> A RetriableException with the text "NameNode still not started" is thrown when the NN is in its internal service startup phase. We should add a check for this specific exception in isBecauseSafeMode() to account for that.
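
For readers following the thread, the check proposed in the description could be sketched roughly as below. This is a standalone illustration, not the attached patch: the class name StartupRetrySketch, the helpers isRetriable() and retryUntil(), and the idea of matching on the exception's string form are assumptions made for the example. Only the message text "NameNode still not started" comes from the issue description, and the sketch uses no Hadoop classes.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical sketch of retrying while the NameNode is in safe mode or still starting. */
final class StartupRetrySketch {
    // Message text quoted in the issue description.
    static final String NN_NOT_STARTED = "NameNode still not started";
    // Assumption: safe-mode failures are recognisable by this name in the exception text.
    static final String SAFE_MODE = "SafeModeException";

    /** Returns true if any exception in the cause chain mentions safe mode or NN startup. */
    static boolean isRetriable(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            String text = cur.toString();
            if (text.contains(SAFE_MODE) || text.contains(NN_NOT_STARTED)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Runs the action, retrying on retriable IOExceptions until timeoutMs elapses.
     * Non-retriable failures are rethrown immediately; a timeout becomes a RuntimeException.
     */
    static <T> T retryUntil(Callable<T> action, long timeoutMs, long intervalMs)
            throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            try {
                return action.call();
            } catch (IOException e) {
                if (!isRetriable(e)) {
                    throw e;
                }
                if (System.currentTimeMillis() >= deadline) {
                    throw new RuntimeException("Timed out waiting for the NameNode", e);
                }
                Thread.sleep(intervalMs);
            }
        }
    }
}
```

Under these assumptions, a caller would wrap the directory-creation step in retryUntil(), so that both the safe-mode case and the "still not started" case are retried until the configured period runs out.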