Date: Mon, 18 Apr 2016 18:37:26 +0000 (UTC)
From: "Daniel Templeton (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Commented] (MAPREDUCE-6657) job history server can fail on startup when NameNode is in start phase

[ https://issues.apache.org/jira/browse/MAPREDUCE-6657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246282#comment-15246282 ]

Daniel Templeton commented on MAPREDUCE-6657:
---------------------------------------------

Thanks for the patch, [~haibochen]. I hate that HDFS expects you to parse the text of their exceptions to figure out what's going on.
Wanna look into whether the API would allow you to throw a properly typed exception? Maybe just file a followup JIRA?

In your test code, it would be nice to add a javadoc header that explains what you're testing.

I don't love that you're running two mini-clusters and ignoring one of them. Is there any way to do the test with the existing mini-cluster without disrupting the other tests? If not, I'd consider creating a new test class so that you don't have two mini-clusters running.

Is 2000ms the shortest reasonable duration for the timeout? Seems long to me...

{code}
Assert.assertEquals("Job History Server is expected to time out.",
{code}

Your assert message is misleading. It should instead say that it didn't get the expected error message.

> job history server can fail on startup when NameNode is in start phase
> ----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6657
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6657
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver
>            Reporter: Haibo Chen
>            Assignee: Haibo Chen
>         Attachments: mapreduce6657.001.patch, mapreduce6657.002.patch
>
>
> Job history server will try to create a history directory in HDFS on startup. When NameNode is in safe mode, it will keep retrying for a configurable time period. However, it should also keep retrying if the NameNode is still in its startup phase. Safe mode does not happen until the NN is out of the startup phase.
> A RetriableException with the text "NameNode still not started" is thrown when the NN is in its internal service startup phase. We should add the check for this specific exception in isBecauseSafeMode() to account for that.
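
For illustration only, the text-based check proposed in the issue description might look roughly like the sketch below. It is not the code from the attached patches. The class name StartupRetryCheck, the method name isRetriableStartupFailure, and the "SafeModeException" marker string are assumptions made for this example; only the "NameNode still not started" text comes from the issue. The sketch avoids any Hadoop dependency and simply walks the cause chain looking for either marker in the exception text.

```java
import java.io.IOException;

// Hypothetical helper for illustration; not the actual JobHistoryServer code.
final class StartupRetryCheck {
    // Exception text quoted in the issue description.
    static final String NN_NOT_STARTED = "NameNode still not started";
    // Assumed marker for safe mode: the exception class name appearing in the text.
    static final String SAFE_MODE = "SafeModeException";

    private StartupRetryCheck() {
    }

    // Returns true if any exception in the cause chain mentions safe mode
    // or the NameNode startup phase, i.e. the caller should keep retrying.
    static boolean isRetriableStartupFailure(Throwable ex) {
        for (Throwable t = ex; t != null; t = t.getCause()) {
            String text = t.toString();
            if (text.contains(SAFE_MODE) || text.contains(NN_NOT_STARTED)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable startup = new IOException(NN_NOT_STARTED);
        Throwable wrapped = new RuntimeException("mkdir failed", startup);
        Throwable other = new IOException("Connection refused");

        System.out.println(isRetriableStartupFailure(startup));
        System.out.println(isRetriableStartupFailure(wrapped));
        System.out.println(isRetriableStartupFailure(other));
    }
}
```

Matching on message text is exactly the fragility the review comment objects to; a properly typed exception would let the caller replace the string comparison with an instanceof check.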