hadoop-yarn-issues mailing list archives

From "Fang Xie (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4892) Job will be hung and can not be finished after resource manager restarting and enabling recovery
Date Wed, 30 Mar 2016 13:59:25 GMT

    [ https://issues.apache.org/jira/browse/YARN-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217996#comment-15217996
] 

Fang Xie commented on YARN-4892:
--------------------------------

I don't know whether you have been able to reproduce this defect, so here is more detailed information.
1. This defect is reproduced only when running a distributedshell job, as follows:
hadoop jar /data/hadoop-2.7.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar \
    org.apache.hadoop.yarn.applications.distributedshell.Client \
    -jar /data/hadoop-2.7.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar \
    -shell_command sleep -shell_args 2 -num_containers 10 -container_memory 512 -master_memory 512

I could not reproduce this issue when running a MapReduce job.

2. Detailed steps to reproduce:
    1> Start the RM and NM.
    2> Run a distributedshell job:
hadoop jar /data/hadoop-2.7.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar \
    org.apache.hadoop.yarn.applications.distributedshell.Client \
    -jar /data/hadoop-2.7.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar \
    -shell_command sleep -shell_args 2 -num_containers 10 -container_memory 512 -master_memory 512
    3> When the job progress reaches around 60%, kill the RM process.
    4> Start the RM again.
    5> The job finishes successfully from the CLI, but the YARN GUI (http://rmaddress:8088) still shows a job in running status.
    6> No matter how many times the RM is restarted, the GUI still shows this job in running status.
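
To confirm step 5 without the GUI, the RM's REST interface can be queried for applications that it still considers RUNNING. The following is only a minimal sketch: the class and method names are made up for illustration, it assumes the RM web services are reachable on the same host and port as the GUI and that Java 11 or later is available, and it simply prints the raw JSON response.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of Hadoop: lists applications the RM reports as RUNNING.
public class RunningAppsCheck {

    // Builds the Cluster Applications REST URL, filtered to the RUNNING state.
    static URI runningAppsUri(String rmHostPort) {
        return URI.create("http://" + rmHostPort + "/ws/v1/cluster/apps?states=RUNNING");
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            // No RM address given: only show the URL that would be queried.
            System.out.println(runningAppsUri("rmaddress:8088"));
            return;
        }
        HttpRequest request = HttpRequest.newBuilder(runningAppsUri(args[0]))
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // Print the HTTP status and the raw JSON body for manual inspection.
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Running it with the RM address as the only argument (for example rmaddress:8088) after step 4 should show whether the distributedshell application is still listed once the CLI has reported success.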

3. This issue cannot be reproduced in Hadoop 2.6.1 or lower versions.
4. My investigation result:
    Before Hadoop 2.7.x, when the RM was killed, the Application Master process was stopped as well. In Hadoop 2.7.x,
the Application Master stays alive when the RM is killed. In org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.java:

  protected boolean finish() {
    // wait for completion.
    while (!done
        && (numCompletedContainers.get() != numTotalContainers)) {
      try {
        Thread.sleep(200);
      } catch (InterruptedException ex) {}
    }

    if(timelineClient != null) {
      publishApplicationAttemptEvent(timelineClient, appAttemptID.toString(),
          DSEvent.DS_APP_ATTEMPT_END, domainId, appSubmitterUgi);
    }

The job cannot finish because it stays in this loop forever:
    while (!done
        && (numCompletedContainers.get() != numTotalContainers)) {
      try {
        Thread.sleep(200);
      } catch (InterruptedException ex) {}
    }

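The value of numCompletedContainers.get() is larger than the real number of tasks (numTotalContainers),
so the != check stays true and, with done still false, the loop never exits.
As a result, the Application Master cannot send the job-finished message to the RM.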
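
The effect of an overshooting counter on the loop condition can be shown in isolation. This is a minimal sketch, not the Hadoop code itself: the class and method names are hypothetical, the numbers (12 completions for 10 containers) are assumed for illustration, and the "defensive" variant is just one possible change, not necessarily the fix adopted upstream.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-alone sketch of the wait condition in finish().
public class CompletionLoopSketch {

    // Same shape as the quoted check: keep waiting while the counts differ.
    static boolean keepWaitingStrict(boolean done, int completed, int total) {
        return !done && (completed != total);
    }

    // Assumed defensive variant: stop waiting once completed reaches or passes total.
    static boolean keepWaitingDefensive(boolean done, int completed, int total) {
        return !done && (completed < total);
    }

    public static void main(String[] args) {
        int numTotalContainers = 10;
        AtomicInteger numCompletedContainers = new AtomicInteger(0);

        // Simulate the reported situation: more completions counted than containers requested.
        for (int i = 0; i < 12; i++) {
            numCompletedContainers.incrementAndGet();
        }

        int completed = numCompletedContainers.get();
        // With != the condition is still true, so the real loop would keep sleeping.
        System.out.println("strict keeps waiting:    "
                + keepWaitingStrict(false, completed, numTotalContainers));
        // With < the condition is false, so the loop would exit.
        System.out.println("defensive keeps waiting: "
                + keepWaitingDefensive(false, completed, numTotalContainers));
    }
}
```

A comparison that tolerates overshoot only hides the symptom; why the completed count exceeds numTotalContainers after the RM restart would still need to be explained.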



      




> Job will be hung and can not be finished after resource manager restarting and enabling recovery
> ------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4892
>                 URL: https://issues.apache.org/jira/browse/YARN-4892
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.1
>            Reporter: Fang Xie
>            Priority: Critical
>
> Enable resourcemanager recovery, set properties as below:
> <property>
>     <description>Enable RM to recover state after starting. If true, then
>     yarn.resourcemanager.store.class must be specified. </description>
>    <name>yarn.resourcemanager.recovery.enabled</name>
>    <value>true</value>
> </property>
> <property>
>     <description> </description>
>     <name>yarn.resourcemanager.store.class</name>
>     <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
> </property>
> <property>
>     <description> </description>
>     <name>yarn.resourcemanager.fs.state-store.uri</name>
>     <value>hdfs://apple02:9000/rmstore</value>
> </property>
> Run a distributedshell job. While the job is running, kill the resourcemanager and then restart it.
> The job cannot finish and hangs.
> Both the fair-share and capacity schedulers have this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
