hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1847) YARN application always exits with FAILED state
Date Tue, 18 Mar 2014 20:56:43 GMT

    [ https://issues.apache.org/jira/browse/YARN-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13939772#comment-13939772
] 

Jason Lowe commented on YARN-1847:
----------------------------------

bq. Again, the real issue is the API which does not help at all. If I could submit any command
then I should expect the lifecycle to be handled by the framework.

While the ApplicationSubmissionContext allows a free-form command to be listed to run as
the AM, that doesn't mean that just any command listed there will work.  YARN is not a distributed
shell capable of running arbitrary shell commands as applications.  (The DistributedShell
example application in YARN is closer to this.)   Each application must provide an ApplicationMaster
that properly communicates with the RM via the ApplicationMasterProtocol in order to participate
within a YARN cluster.
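
To make the point above concrete, here is a minimal sketch of why a bare command ends up FAILED while a cooperating ApplicationMaster does not. It is a toy model, not the RMAppAttemptImpl logic: the class names, the states, and the 1000 ms liveness window are all assumptions made up for illustration. In a real AM, the register and unregister steps would correspond to the AMRMClient calls registerApplicationMaster and unregisterApplicationMaster.

```java
/**
 * Toy model of an application attempt as a resource manager might track it.
 * All names here are hypothetical; this is NOT the RMAppAttemptImpl code.
 */
final class ToyAttempt {
    enum State { LAUNCHED, RUNNING, FINISHED, FAILED }

    private final long expiryMs;   // assumed liveness window
    private long lastContactMs;    // last time the AM talked to the RM
    private State state = State.LAUNCHED;

    ToyAttempt(long launchedAtMs, long expiryMs) {
        this.lastContactMs = launchedAtMs;
        this.expiryMs = expiryMs;
    }

    /** The AM announces itself to the RM. */
    void register(long nowMs) {
        if (state == State.LAUNCHED) {
            state = State.RUNNING;
            lastContactMs = nowMs;
        }
    }

    /** Periodic heartbeat from a registered AM. */
    void heartbeat(long nowMs) {
        if (state == State.RUNNING) {
            lastContactMs = nowMs;
        }
    }

    /** The AM reports that it is done; only a registered AM can do this. */
    void unregister() {
        if (state == State.RUNNING) {
            state = State.FINISHED;
        }
    }

    /** Liveness check: an attempt that stayed silent too long is failed. */
    void checkLiveness(long nowMs) {
        boolean terminal = state == State.FINISHED || state == State.FAILED;
        if (!terminal && nowMs - lastContactMs > expiryMs) {
            state = State.FAILED;
        }
    }

    State state() {
        return state;
    }
}

final class ToyAttemptDemo {
    /** A bare shell command: it runs and exits, but never talks to the RM. */
    static ToyAttempt.State bareCommand() {
        ToyAttempt attempt = new ToyAttempt(0, 1_000);
        attempt.checkLiveness(1_500);  // silence exceeded the window
        return attempt.state();
    }

    /** A process that follows the protocol from start to finish. */
    static ToyAttempt.State properMaster() {
        ToyAttempt attempt = new ToyAttempt(0, 1_000);
        attempt.register(100);
        attempt.heartbeat(900);
        attempt.unregister();
        attempt.checkLiveness(5_000);  // a terminal state is left untouched
        return attempt.state();
    }

    public static void main(String[] args) {
        System.out.println("bare command:  " + bareCommand());
        System.out.println("proper master: " + properMaster());
    }
}
```

In this sketch the exit code of the launched process plays no role at all: the only thing that decides the outcome is whether the process spoke the protocol before it went away.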

I'm sorry that you expected arbitrary shell commands to work in an ApplicationSubmissionContext,
and the TestAMRMClient code is a terrible example of a proper YARN application.  Indeed, it
doesn't even run properly without "help" from the test methods to emulate what a real AM must
do.

Resolving this as Invalid since YARN applications do not always exit as FAILED and applications
are not normally going through ExpiredTransition within the RM.  If you have concrete follow-up
suggestions for how to make the AM-RM API better beyond what the AMRMClient abstraction already
does, that would be great.  Feel free to discuss them on the [yarn-dev@|http://hadoop.apache.org/mailing_lists.html#Developers-N10174]
list or under a separate JIRA.

> YARN application always exits with FAILED state
> -----------------------------------------------
>
>                 Key: YARN-1847
>                 URL: https://issues.apache.org/jira/browse/YARN-1847
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>            Reporter: Oleg Zhurakousky
>            Priority: Critical
>
> The _RMAppAttemptImpl_ creates an instance of ExpiredTransition which always sets the
_finalAttemptState_ to FAILED.
> {code}
> private static final ExpiredTransition EXPIRED_TRANSITION =
>       new ExpiredTransition();
> . . .
>     public ExpiredTransition() {
>       super(RMAppAttemptState.FAILED);
>     }
> {code}
> So, when my container successfully finishes regardless of the state (e.g., CONTAINER_FINISHED
in my case), the _RMAppAttemptImpl.transition(..)_ does a switch on the _finalAttemptState_
and transitions to FAILED no matter what.
> Here is the related logs for more info:
> {code}
> 21:06:01,615  INFO AsyncDispatcher event handler container.Container:878 - Container
container_1395104684413_0001_01_000001 transitioned from RUNNING to EXITED_WITH_SUCCESS
> 21:06:01,615  INFO AsyncDispatcher event handler launcher.ContainerLaunch:341 - Cleaning
up container container_1395104684413_0001_01_000001
> 21:06:01,644  INFO DeletionService #0 nodemanager.DefaultContainerExecutor:369 - Deleting
absolute path : /Users/oleg/HADOOP_DEV/yarn-tutorial/target/oz.hadoop.StandAloneWithMiniYarnCluster/oz.hadoop.StandAloneWithMiniYarnCluster-localDir-nm-0_0/usercache/oleg/appcache/application_1395104684413_0001/container_1395104684413_0001_01_000001
> 21:06:01,646  INFO AsyncDispatcher event handler nodemanager.NMAuditLogger:89 - USER=oleg
OPERATION=Container Finished - Succeeded	TARGET=ContainerImpl	RESULT=SUCCESS	APPID=application_1395104684413_0001
CONTAINERID=container_1395104684413_0001_01_000001
> 21:06:01,649  INFO AsyncDispatcher event handler container.Container:878 - Container
container_1395104684413_0001_01_000001 transitioned from EXITED_WITH_SUCCESS to DONE
> 21:06:01,649  INFO AsyncDispatcher event handler application.Application:339 - Removing
container_1395104684413_0001_01_000001 from application application_1395104684413_0001
> 21:06:01,649  INFO AsyncDispatcher event handler monitor.ContainersMonitorImpl:159 -
ResourceCalculatorPlugin is unavailable on this system. org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
is disabled.
> 21:06:01,649  INFO AsyncDispatcher event handler containermanager.AuxServices:175 - Got
event CONTAINER_STOP for appId application_1395104684413_0001
> 21:06:02,143  INFO Node Status Updater nodemanager.NodeStatusUpdaterImpl:374 - Removed
completed container container_1395104684413_0001_01_000001
> 21:06:02,146  INFO ResourceManager Event Processor rmcontainer.RMContainerImpl:220 -
container_1395104684413_0001_01_000001 Container Transitioned from ACQUIRED to COMPLETED
> 21:06:02,146  INFO ResourceManager Event Processor fica.FiCaSchedulerApp:91 - Completed
container: container_1395104684413_0001_01_000001 in state: COMPLETED event:FINISHED
> 21:06:02,146  INFO ResourceManager Event Processor resourcemanager.RMAuditLogger:98 -
USER=oleg	OPERATION=AM Released Container	TARGET=SchedulerApp	RESULT=SUCCESS	APPID=application_1395104684413_0001
CONTAINERID=container_1395104684413_0001_01_000001
> 21:06:02,146  INFO ResourceManager Event Processor fica.FiCaSchedulerNode:164 - Released
container container_1395104684413_0001_01_000001 of capacity <memory:1024, vCores:1>
on host 192.168.19.1:50787, which currently has 0 containers, <memory:0, vCores:0> used
and <memory:4096, vCores:8> available, release resources=true
> 21:06:02,146  INFO ResourceManager Event Processor fifo.FifoScheduler:790 - Application
appattempt_1395104684413_0001_000001 released container container_1395104684413_0001_01_000001
on node: host: 192.168.19.1:50787 #containers=0 available=4096 used=0 with event: FINISHED
> 21:06:02,146  INFO AsyncDispatcher event handler attempt.RMAppAttemptImpl:960 - Updating
application attempt appattempt_1395104684413_0001_000001 with final state: FAILED
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
