Date: Wed, 30 Mar 2016 13:59:25 +0000 (UTC)
From: "Fang Xie (JIRA)"
To: yarn-issues@hadoop.apache.org
Reply-To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-4892) Job will be hung and can not be finished after resource manager restarting and enabling recovery

    [ https://issues.apache.org/jira/browse/YARN-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217996#comment-15217996 ]

Fang Xie commented on YARN-4892:
--------------------------------

I don't know whether you have reproduced this defect, so here is more detailed information.

1. This defect only reproduces when running a distributedshell job, for example:

    hadoop jar /data/hadoop-2.7.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar org.apache.hadoop.yarn.applications.distributedshell.Client -jar /data/hadoop-2.7.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar -shell_command sleep -shell_args 2 -num_containers 10 -container_memory 512 -master_memory 512

   I cannot reproduce this issue when running a MapReduce job.

2. Detailed reproduction steps:
   1> Start the RM and NM.
   2> Run a distributedshell job using the command shown in item 1.
   3> When the job progress reaches around 60%, kill the RM process.
   4> Start the RM again.
   5> The job finishes successfully from the CLI, but the YARN GUI (http://rmaddress:8088) still shows a job in running status.
   6> No matter how many times the RM is restarted, this job remains in running status in the GUI.

3. This issue cannot be reproduced in Hadoop 2.6.1 or lower versions.

4. My investigation result:

Before Hadoop 2.7.x, when the RM was killed, the Application Master process was stopped. In Hadoop 2.7.x, when the RM is killed, the Application Master stays alive.

In org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.java:

    protected boolean finish() {
      // wait for completion.
      while (!done && (numCompletedContainers.get() != numTotalContainers)) {
        try {
          Thread.sleep(200);
        } catch (InterruptedException ex) {}
      }
      if(timelineClient != null) {
        publishApplicationAttemptEvent(timelineClient, appAttemptID.toString(),
            DSEvent.DS_APP_ATTEMPT_END, domainId, appSubmitterUgi);
      }
      // (rest of the method omitted)

The job cannot finish because it never leaves this loop:

    while (!done && (numCompletedContainers.get() != numTotalContainers)) {
      try {
        Thread.sleep(200);
      } catch (InterruptedException ex) {}
    }

The value of numCompletedContainers.get() is larger than the real number of tasks (numTotalContainers), so the != test never becomes false. As a result, the Application Master cannot send the job finish message to the RM.

> Job will be hung and can not be finished after resource manager restarting and enabling recovery
> ------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4892
>                 URL: https://issues.apache.org/jira/browse/YARN-4892
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.1
>            Reporter: Fang Xie
>            Priority: Critical
>
> Enable resourcemanager recovery, set properties as below:
> <property>
>   <description>Enable RM to recover state after starting. If true, then
>   yarn.resourcemanager.store.class must be specified.</description>
>   <name>yarn.resourcemanager.recovery.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.store.class</name>
>   <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.fs.state-store.uri</name>
>   <value>hdfs://apple02:9000/rmstore</value>
> </property>
> run a distributedshell job, when job running, kill resourcemanager, and then restart resourcemanager, this job can not be finished and will be hung.
> Both fair-share and capacity scheduler have such issue.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
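
For illustration, the loop condition described in the investigation can be reproduced in isolation. The following is a minimal standalone sketch, not Hadoop code: the class name FinishLoopSketch, its two helper methods, the poll cap, and the completed count of 11 are all hypothetical. It leaves out the done flag and does not explain why the count overshoots. It only shows that a != test never exits once the completed count has passed the total, whereas a < test does. Whether such a change is the right fix for YARN-4892 is not established here.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical standalone sketch; not part of the Hadoop source tree.
public class FinishLoopSketch {

    // Same shape as the "!=" wait condition quoted in the comment, but capped
    // at maxPolls iterations so the sketch itself always terminates.
    // Returns true if the loop exited by itself, false if the cap was hit.
    public static boolean exitsWithNotEquals(AtomicInteger completed, int total, int maxPolls) {
        int polls = 0;
        while (completed.get() != total) {
            if (++polls > maxPolls) {
                return false; // would still be spinning
            }
        }
        return true;
    }

    // Variant that waits only while fewer than `total` containers have completed.
    public static boolean exitsWithLessThan(AtomicInteger completed, int total, int maxPolls) {
        int polls = 0;
        while (completed.get() < total) {
            if (++polls > maxPolls) {
                return false; // would still be spinning
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int total = 10; // matches -num_containers 10 in the reproduction command
        // Assumed overshoot: one more completion counted than containers requested.
        AtomicInteger completed = new AtomicInteger(11);

        System.out.println("!= exits: " + exitsWithNotEquals(completed, total, 5));
        System.out.println("<  exits: " + exitsWithLessThan(completed, total, 5));
    }
}
```

With a completed count of 11 and a total of 10, the first call reports false (the loop only stops because of the cap) and the second reports true.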