Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-issues@hadoop.apache.org
Date: Wed, 30 Mar 2016 13:59:25 +0000 (UTC)
From: "Fang Xie (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12954290.1459258808000.88198.1459346365479@Atlassian.JIRA>
In-Reply-To: <JIRA.12954290.1459258808000@Atlassian.JIRA>
References: <JIRA.12954290.1459258808000@Atlassian.JIRA>
 <JIRA.12954290.1459258808776@arcas>
Subject: [jira] [Commented] (YARN-4892) Job will be hung and can not be
 finished after resource manager restarting and enabling recovery
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/YARN-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217996#comment-15217996 ] 

Fang Xie commented on YARN-4892:
--------------------------------

I don't know whether you have reproduced this defect, so here is more detailed information.
1. This defect is reproduced only when running a distributedshell job as follows:
hadoop jar /data/hadoop-2.7.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar org.apache.hadoop.yarn.applications.distributedshell.Client  -jar /data/hadoop-2.7.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar -shell_command sleep -shell_args 2 -num_containers 10 -container_memory 512 -master_memory 512

I cannot reproduce this issue when running a MapReduce job.

2. Detailed steps to reproduce:
    1> Start the RM and NM.
    2> Run a distributedshell job as follows:
hadoop jar /data/hadoop-2.7.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar org.apache.hadoop.yarn.applications.distributedshell.Client  -jar /data/hadoop-2.7.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar -shell_command sleep -shell_args 2 -num_containers 10 -container_memory 512 -master_memory 512
    3> When the job progress reaches around 60%, kill the RM process.
    4> Start the RM again.
    5> The job finishes successfully from the CLI, but the YARN GUI (http://rmaddress:8088) still shows a job in running status.
    6> No matter how many times you restart the RM, the GUI still shows this job in running status.

3. This issue cannot be reproduced in Hadoop 2.6.1 or lower versions.
4. My investigation results:
    Before Hadoop 2.7.x, the Application Master process was stopped when the RM was killed. In Hadoop 2.7.x, the Application Master stays alive when the RM is killed. The relevant code is in org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.java:
  protected boolean finish() {
    // wait for completion.
    while (!done
        && (numCompletedContainers.get() != numTotalContainers)) {
      try {
        Thread.sleep(200);
      } catch (InterruptedException ex) {}
    }

    if(timelineClient != null) {
      publishApplicationAttemptEvent(timelineClient, appAttemptID.toString(),
          DSEvent.DS_APP_ATTEMPT_END, domainId, appSubmitterUgi);
    }

The job cannot finish because the Application Master never leaves this loop:
    while (!done
        && (numCompletedContainers.get() != numTotalContainers)) {
      try {
        Thread.sleep(200);
      } catch (InterruptedException ex) {}
    }

The value of numCompletedContainers.get() is larger than the real number of tasks (numTotalContainers), so the != condition stays true and the loop does not exit.
As a result, the Application Master cannot send the job-finished message to the RM.
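
To make the loop condition concrete, here is a minimal standalone sketch (not the actual ApplicationMaster code, and not a proposed patch). It assumes only what is observed above: the completed counter ends up larger than numTotalContainers. The class name, the helper methods, and the value 12 are made up for illustration, and the "tolerant" comparison is a hypothetical alternative shown only for contrast.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CompletionLoopSketch {

    // Mirrors the condition quoted above: keep waiting unless the two
    // counts are exactly equal (or done has been set).
    static boolean keepWaitingStrict(boolean done, int completed, int total) {
        return !done && completed != total;
    }

    // Hypothetical alternative for comparison: stop waiting as soon as the
    // completed count reaches or exceeds the total.
    static boolean keepWaitingTolerant(boolean done, int completed, int total) {
        return !done && completed < total;
    }

    public static void main(String[] args) {
        AtomicInteger numCompletedContainers = new AtomicInteger(0);
        int numTotalContainers = 10;

        // Assumed scenario: 12 completions get counted for 10 containers.
        for (int i = 0; i < 12; i++) {
            numCompletedContainers.incrementAndGet();
        }

        int completed = numCompletedContainers.get();
        // The strict check never turns false once the counter has overshot.
        System.out.println("strict keeps waiting:   "
            + keepWaitingStrict(false, completed, numTotalContainers));
        // The tolerant check lets the loop exit in the same situation.
        System.out.println("tolerant keeps waiting: "
            + keepWaitingTolerant(false, completed, numTotalContainers));
    }
}
```

With the counter at 12 and the total at 10, the strict check prints true and the tolerant check prints false. This only illustrates why the loop spins; it does not explain why the counter overshoots after the RM restart.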


> Job will be hung and can not be finished after resource manager restarting and enabling recovery
> ------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4892
>                 URL: https://issues.apache.org/jira/browse/YARN-4892
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.1
>            Reporter: Fang Xie
>            Priority: Critical
>
> Enable resourcemanager recovery by setting the properties below:
> <property>
>     <description>Enable RM to recover state after starting. If true, then
>     yarn.resourcemanager.store.class must be specified. </description>
>    <name>yarn.resourcemanager.recovery.enabled</name>
>    <value>true</value>
> </property>
> <property>
>     <description> </description>
>     <name>yarn.resourcemanager.store.class</name>
>     <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
> </property>
> <property>
>     <description> </description>
>     <name>yarn.resourcemanager.fs.state-store.uri</name>
>     <value>hdfs://apple02:9000/rmstore</value>
> </property>
> Run a distributedshell job. While the job is running, kill the resourcemanager and then restart it. The job cannot finish and hangs.
> Both the fair scheduler and the capacity scheduler have this issue.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)