hadoop-yarn-issues mailing list archives

From "Wangda Tan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1842) InvalidApplicationMasterRequestException raised during AM-requested shutdown
Date Wed, 23 Apr 2014 06:18:21 GMT

    [ https://issues.apache.org/jira/browse/YARN-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977886#comment-13977886 ]

Wangda Tan commented on YARN-1842:
----------------------------------

Took a look at this. I'm wondering if it's caused by the following case:
1) The client asks to kill the application.
2) After the RM transitions the application's state to KILLED, but before the AM container is actually killed by the NM, the AM asks to finish the application.
Since the RMAppAttempt has already called AMS.unregisterAttempt, the attempt is cleaned from the cache, and thus the InvalidApplicationMasterRequestException is raised.
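To make the suspected race concrete, here is a minimal sketch of the attempt-cache interaction; the class, field, and method bodies below are hypothetical stand-ins for illustration, not the actual ApplicationMasterService source:
{code}
// Hypothetical sketch of the suspected race, NOT the real YARN source:
// the client-driven kill path and the AM's finish request race on a
// shared cache of registered attempts.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
import org.apache.hadoop.yarn.exceptions.InvalidApplicationMasterRequestException;
import org.apache.hadoop.yarn.exceptions.YarnException;

public class AttemptCacheRaceSketch {
  private final ConcurrentMap<ApplicationAttemptId, Object> responseMap =
      new ConcurrentHashMap<ApplicationAttemptId, Object>();

  // Kill path: when the attempt reaches its final state, it is removed
  // from the cache.
  public void unregisterAttempt(ApplicationAttemptId attemptId) {
    responseMap.remove(attemptId);
  }

  // AM path: if the kill won the race, the lookup fails and the
  // exception propagates back to the AM.
  public void finishApplicationMaster(ApplicationAttemptId attemptId)
      throws YarnException {
    if (responseMap.get(attemptId) == null) {
      throw new InvalidApplicationMasterRequestException(
          "Application doesn't exist in cache " + attemptId);
    }
    // ... normal finish handling ...
  }
}
{code}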

This is my guess after reading the log uploaded by [~keyki].
Everything still looks normal in the following log:
{code}
2014-03-18 19:36:50,802 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1395167286771_0002 State change from ACCEPTED to RUNNING
2014-03-18 19:36:52,534 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1395167286771_0002_01_000002 Container Transitioned from NEW to ALLOCATED
2014-03-18 19:36:52,534 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger:
USER=keyki	OPERATION=AM Allocated Container	TARGET=SchedulerApp	RESULT=SUCCESS	APPID=application_1395167286771_0002
CONTAINERID=container_1395167286771_0002_01_000002
2014-03-18 19:36:52,534 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode:
Assigned container container_1395167286771_0002_01_000002 of capacity <memory:1024, vCores:1>
on host localhost:56214, which currently has 2 containers, <memory:2048, vCores:2> used
and <memory:6144, vCores:6> available
2014-03-18 19:36:52,534 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
assignedContainer application=application_1395167286771_0002 container=Container: [ContainerId:
container_1395167286771_0002_01_000002, NodeId: localhost:56214, NodeHttpAddress: localhost:8042,
Resource: <memory:1024, vCores:1>, Priority: 1, Token: Token { kind: ContainerToken,
service: 127.0.0.1:56214 }, ] containerId=container_1395167286771_0002_01_000002 queue=default:
capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:1024, vCores:1>usedCapacity=0.125,
absoluteUsedCapacity=0.125, numApps=1, numContainers=1 usedCapacity=0.125 absoluteUsedCapacity=0.125
used=<memory:1024, vCores:1> cluster=<memory:8192, vCores:8>
2014-03-18 19:36:52,534 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
Re-sorting assigned queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0,
usedResources=<memory:2048, vCores:2>usedCapacity=0.25, absoluteUsedCapacity=0.25, numApps=1,
numContainers=2
2014-03-18 19:36:52,535 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
assignedContainer queue=root usedCapacity=0.25 absoluteUsedCapacity=0.25 used=<memory:2048,
vCores:2> cluster=<memory:8192, vCores:8>
2014-03-18 19:36:52,961 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1395167286771_0002_01_000002 Container Transitioned from ALLOCATED to ACQUIRED
2014-03-18 19:36:53,536 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1395167286771_0002_01_000002 Container Transitioned from ACQUIRED to RUNNING
{code}

The client then asked to kill the application; AMS.unregisterAttempt was called and the attempt was removed from the AMS cache:
{code}
2014-03-18 19:38:50,427 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger:
USER=keyki	IP=37.139.29.192	OPERATION=Kill Application Request	TARGET=ClientRMService	RESULT=SUCCESS
APPID=application_1395167286771_0002
2014-03-18 19:38:50,427 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
Removing info for app: application_1395167286771_0002
2014-03-18 19:38:50,427 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1395167286771_0002 State change from RUNNING to KILLED
2014-03-18 19:38:50,428 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
Unregistering app attempt : appattempt_1395167286771_0002_000001
{code}

After that, the AM asked to finish the application, but unfortunately the attempt had already been removed from the cache:
{code}
2014-03-18 19:38:51,397 ERROR org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
AppAttemptId doesnt exist in cache appattempt_1395167286771_0002_000001
2014-03-18 19:38:52,415 ERROR org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
Application doesn't exist in cache appattempt_1395167286771_0002_000001
{code}
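If this is indeed the cause, one possible mitigation on the application side is to treat the exception during unregistration as benign, matching the "AM could just swallow this and exit" option from the description. A sketch only, assuming the AM uses AMRMClient; the class and method names here are made up:
{code}
// Hypothetical AM-side handling, not Hoya's actual code: swallow the
// exception when the RM has already unregistered the attempt (e.g. the
// client's kill won the race) and proceed with a clean shutdown.
import java.io.IOException;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.exceptions.InvalidApplicationMasterRequestException;
import org.apache.hadoop.yarn.exceptions.YarnException;

public class GracefulAmUnregister {
  public static void unregisterQuietly(AMRMClient<ContainerRequest> client)
      throws YarnException, IOException {
    try {
      client.unregisterApplicationMaster(
          FinalApplicationStatus.SUCCEEDED, "AM-requested shutdown", "");
    } catch (InvalidApplicationMasterRequestException e) {
      // The attempt is already gone from the RM cache; there is nothing
      // more the AM can do, so log and continue with its own exit.
    }
  }
}
{code}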

I'm not sure whether this is possible in the current Hoya design; please correct me if I'm wrong.

> InvalidApplicationMasterRequestException raised during AM-requested shutdown
> ----------------------------------------------------------------------------
>
>                 Key: YARN-1842
>                 URL: https://issues.apache.org/jira/browse/YARN-1842
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>            Reporter: Steve Loughran
>            Priority: Minor
>         Attachments: hoyalogs.tar.gz
>
>
> Report of the RM raising a stack trace [https://gist.github.com/matyix/9596735] during
AM-initiated shutdown. The AM could just swallow this and exit, but it could be a sign of
a race condition YARN-side, or maybe just in the RM client code/AM dual signalling the shutdown.

> I haven't replicated this myself; maybe the stack will help track down the problem. Otherwise:
what is the policy YARN apps should adopt for AMs handling errors on shutdown? Go straight
to an exit(-1)?



