hadoop-yarn-issues mailing list archives

From "Oleg Zhurakousky (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1847) YARN application always exits with FAILED state
Date Tue, 18 Mar 2014 14:45:43 GMT

    [ https://issues.apache.org/jira/browse/YARN-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13939294#comment-13939294
] 

Oleg Zhurakousky commented on YARN-1847:
----------------------------------------

No, it's not expiring. It's finishing with a CONTAINER_FINISHED event, and that is when the
transition to FAILED occurs.
You can easily reproduce it by modifying one of the existing tests, _TestAMRMClient_:
just change the command from "sleep 100" to "ls -a" and you'll see the same
SUCCESS turning into FAILURE ;)
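
To make the reported behaviour concrete, here is a minimal standalone sketch. It is not the actual _RMAppAttemptImpl_ code and does not depend on Hadoop. Only the names ExpiredTransition, FAILED and CONTAINER_FINISHED come from the issue; the class AttemptStateSketch, the FinalTransition base class, the enums and finalStateFor(..) are hypothetical stand-ins. It also assumes, as the report suggests, that the same fixed-state transition ends up handling the container-finished event.

```java
public class AttemptStateSketch {

    // Simplified stand-ins for the ResourceManager's state and event enums (assumed).
    enum AttemptState { RUNNING, FINISHED, FAILED }
    enum AttemptEvent { EXPIRE, CONTAINER_FINISHED }

    // Hypothetical base class: the final state is chosen once, at construction time.
    static class FinalTransition {
        private final AttemptState finalAttemptState;

        FinalTransition(AttemptState finalAttemptState) {
            this.finalAttemptState = finalAttemptState;
        }

        // The event and the container exit code are accepted but never consulted.
        AttemptState transition(AttemptEvent event, int containerExitCode) {
            return finalAttemptState;
        }
    }

    // Mirrors the constructor quoted in the issue: the final state is always FAILED.
    static class ExpiredTransition extends FinalTransition {
        ExpiredTransition() {
            super(AttemptState.FAILED);
        }
    }

    private static final ExpiredTransition EXPIRED_TRANSITION = new ExpiredTransition();

    // Hypothetical helper: resolve the final attempt state for an event.
    static AttemptState finalStateFor(AttemptEvent event, int containerExitCode) {
        return EXPIRED_TRANSITION.transition(event, containerExitCode);
    }

    public static void main(String[] args) {
        // A container that exits with code 0 is still reported as FAILED.
        System.out.println(finalStateFor(AttemptEvent.CONTAINER_FINISHED, 0));
        // A genuine expiry is also FAILED, so the two cases are indistinguishable.
        System.out.println(finalStateFor(AttemptEvent.EXPIRE, 1));
    }
}
```

Under these assumptions, a clean exit (code 0) and an expiry both resolve to FAILED, which matches the last line of the log in the issue.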

> YARN application always exits with FAILED state
> -----------------------------------------------
>
>                 Key: YARN-1847
>                 URL: https://issues.apache.org/jira/browse/YARN-1847
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>            Reporter: Oleg Zhurakousky
>            Priority: Critical
>
> The _RMAppAttemptImpl_ creates an instance of ExpiredTransition which always sets the
_finalAttemptState_ to FAILED.
> {code}
> private static final ExpiredTransition EXPIRED_TRANSITION =
>       new ExpiredTransition();
> . . .
>     public ExpiredTransition() {
>       super(RMAppAttemptState.FAILED);
>     }
> {code}
> So, when my container finishes successfully, regardless of the state (e.g., CONTAINER_FINISHED in my case), _RMAppAttemptImpl.transition(..)_ does a switch on the _finalAttemptState_ and transitions to FAILED no matter what.
> Here are the related logs for more info:
> {code}
> 21:06:01,615  INFO AsyncDispatcher event handler container.Container:878 - Container
container_1395104684413_0001_01_000001 transitioned from RUNNING to EXITED_WITH_SUCCESS
> 21:06:01,615  INFO AsyncDispatcher event handler launcher.ContainerLaunch:341 - Cleaning
up container container_1395104684413_0001_01_000001
> 21:06:01,644  INFO DeletionService #0 nodemanager.DefaultContainerExecutor:369 - Deleting
absolute path : /Users/oleg/HADOOP_DEV/yarn-tutorial/target/oz.hadoop.StandAloneWithMiniYarnCluster/oz.hadoop.StandAloneWithMiniYarnCluster-localDir-nm-0_0/usercache/oleg/appcache/application_1395104684413_0001/container_1395104684413_0001_01_000001
> 21:06:01,646  INFO AsyncDispatcher event handler nodemanager.NMAuditLogger:89 - USER=oleg
OPERATION=Container Finished - Succeeded	TARGET=ContainerImpl	RESULT=SUCCESS	APPID=application_1395104684413_0001
CONTAINERID=container_1395104684413_0001_01_000001
> 21:06:01,649  INFO AsyncDispatcher event handler container.Container:878 - Container
container_1395104684413_0001_01_000001 transitioned from EXITED_WITH_SUCCESS to DONE
> 21:06:01,649  INFO AsyncDispatcher event handler application.Application:339 - Removing
container_1395104684413_0001_01_000001 from application application_1395104684413_0001
> 21:06:01,649  INFO AsyncDispatcher event handler monitor.ContainersMonitorImpl:159 -
ResourceCalculatorPlugin is unavailable on this system. org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
is disabled.
> 21:06:01,649  INFO AsyncDispatcher event handler containermanager.AuxServices:175 - Got
event CONTAINER_STOP for appId application_1395104684413_0001
> 21:06:02,143  INFO Node Status Updater nodemanager.NodeStatusUpdaterImpl:374 - Removed
completed container container_1395104684413_0001_01_000001
> 21:06:02,146  INFO ResourceManager Event Processor rmcontainer.RMContainerImpl:220 -
container_1395104684413_0001_01_000001 Container Transitioned from ACQUIRED to COMPLETED
> 21:06:02,146  INFO ResourceManager Event Processor fica.FiCaSchedulerApp:91 - Completed
container: container_1395104684413_0001_01_000001 in state: COMPLETED event:FINISHED
> 21:06:02,146  INFO ResourceManager Event Processor resourcemanager.RMAuditLogger:98 -
USER=oleg	OPERATION=AM Released Container	TARGET=SchedulerApp	RESULT=SUCCESS	APPID=application_1395104684413_0001
CONTAINERID=container_1395104684413_0001_01_000001
> 21:06:02,146  INFO ResourceManager Event Processor fica.FiCaSchedulerNode:164 - Released
container container_1395104684413_0001_01_000001 of capacity <memory:1024, vCores:1>
on host 192.168.19.1:50787, which currently has 0 containers, <memory:0, vCores:0> used
and <memory:4096, vCores:8> available, release resources=true
> 21:06:02,146  INFO ResourceManager Event Processor fifo.FifoScheduler:790 - Application
appattempt_1395104684413_0001_000001 released container container_1395104684413_0001_01_000001
on node: host: 192.168.19.1:50787 #containers=0 available=4096 used=0 with event: FINISHED
> 21:06:02,146  INFO AsyncDispatcher event handler attempt.RMAppAttemptImpl:960 - Updating
application attempt appattempt_1395104684413_0001_000001 with final state: FAILED
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
