hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5476) Not existed application reported as ACCEPTED state by YarnClientImpl
Date Tue, 09 Aug 2016 18:55:20 GMT

    [ https://issues.apache.org/jira/browse/YARN-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414043#comment-15414043 ]

Junping Du commented on YARN-5476:
----------------------------------

After discussing this with Yesha, we found the root cause:
1. During application submission, the yarn client loops until it gets ACCEPTED status from
getApplicationReport(). If getApplicationReport() throws an ApplicationNotFound exception, the
client goes ahead and resubmits the application.
2. The call to getApplicationReport() first checks the RM. If the RM returns ApplicationNotFound,
the RM has no info about this application. There are basically two possibilities:
a. the app has finished and the RM no longer tracks it; b. the app info had not been persisted
to RMStateStore before the RM failed over or restarted. This issue is case b.
3. Although the app info has not been persisted to RMStateStore yet, the app event has already
been sent to ATS for handling, so ATS records this app with its initial state, ACCEPTED.
getApplicationReport() therefore returns ACCEPTED, and the yarn client exits the submit loop
even though the RM has already forgotten the app.
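
The sequence in steps 1 to 3 can be illustrated with a small self-contained simulation. It is
only a sketch and does not use the real Hadoop classes: LookupSketch, rmApps, timelineApps,
getState and failoverBeforePersist are hypothetical names, and the fallback from the RM to the
timeline store is modeled only as far as the steps above imply it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class LookupSketch {
    // Hypothetical stand-ins: what the RM tracks vs. what ATS has recorded.
    static final Map<String, String> rmApps = new HashMap<>();
    static final Map<String, String> timelineApps = new HashMap<>();

    /** Ask the RM first; if the RM does not know the app, fall back to the timeline store. */
    static Optional<String> getState(String appId) {
        String state = rmApps.get(appId);
        if (state != null) {
            return Optional.of(state);
        }
        return Optional.ofNullable(timelineApps.get(appId));
    }

    /** Simulate case b: the event reaches ATS, then the RM fails over before persisting. */
    static Optional<String> failoverBeforePersist(String appId) {
        rmApps.clear();
        timelineApps.clear();
        rmApps.put(appId, "NEW");            // app submitted, not yet saved
        timelineApps.put(appId, "ACCEPTED"); // ATS already recorded the app
        rmApps.clear();                      // failover: nothing was persisted, RM forgets the app
        return getState(appId);
    }

    public static void main(String[] args) {
        // The client sees ACCEPTED although the RM no longer knows the app.
        System.out.println(failoverBeforePersist("application_1470379565464_0001"));
    }
}
```

In this model the client's submit loop would stop on the stale ACCEPTED answer instead of
resubmitting, which matches the stuck application in the report.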

As a quick solution, we should delay the RM's notification to ATS until the application has at
least reached the NEW_SAVING state, so that the RM state store has already persisted it.
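
A minimal sketch of why the ordering matters, again with hypothetical names (PublishOrderSketch,
submit, hasOrphan) rather than the actual RM code. It collapses the RM state transitions into a
single "save" step, which is an assumption made for brevity.

```java
import java.util.HashSet;
import java.util.Set;

final class PublishOrderSketch {
    final Set<String> stateStore = new HashSet<>(); // stands in for RMStateStore
    final Set<String> timeline = new HashSet<>();   // stands in for ATS

    /** Submit an app; optionally simulate a failover that hits before the save completes. */
    void submit(String appId, boolean saveFirst, boolean failoverBeforeSave) {
        if (!saveFirst) {
            timeline.add(appId);   // ordering described in the root cause: notify ATS early
        }
        if (failoverBeforeSave) {
            return;                // RM goes away before the app is persisted
        }
        stateStore.add(appId);
        if (saveFirst) {
            timeline.add(appId);   // proposed ordering: notify ATS only after the save
        }
    }

    /** True if ATS knows an app that the state store does not. */
    boolean hasOrphan() {
        return !stateStore.containsAll(timeline);
    }

    static boolean orphanAfterFailover(boolean saveFirst) {
        PublishOrderSketch rm = new PublishOrderSketch();
        rm.submit("application_1470379565464_0001", saveFirst, true);
        return rm.hasOrphan();
    }

    public static void main(String[] args) {
        System.out.println("notify first: " + orphanAfterFailover(false)); // prints true
        System.out.println("save first:   " + orphanAfterFailover(true));  // prints false
    }
}
```

With the save-first ordering, a failover before persistence leaves no record in either store,
so the client would keep seeing ApplicationNotFound and resubmit.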

> Not existed application reported as ACCEPTED state by YarnClientImpl
> --------------------------------------------------------------------
>
>                 Key: YARN-5476
>                 URL: https://issues.apache.org/jira/browse/YARN-5476
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Yesha Vora
>            Assignee: Junping Du
>            Priority: Critical
>
> Steps To reproduce: 
> * Create a cluster with RM HA enabled
> * Start a yarn application
> * When yarn application is in NEW state, do RM failover. 
> In this case, the application gets an "ApplicationNotFound" exception from YARN,
> then goes to ACCEPTED state and gets stuck.
> At this point, if yarn application -status <appId> is run, it says that the application is in ACCEPTED state.
> This state is misleading. 
> {code}
> hrt_qa@xxx:/root> yarn application -status application_1470379565464_0001
> 16/08/05 17:24:29 INFO impl.TimelineClientImpl: Timeline service address: https://xxx:8190/ws/v1/timeline/
> 16/08/05 17:24:30 INFO client.AHSProxy: Connecting to Application History server at xxx/xxx:10200
> 16/08/05 17:24:31 WARN retry.RetryInvocationHandler: Exception while invoking ApplicationClientProtocolPBClientImpl.getApplicationReport over rm1. Not retrying because try once and fail.
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1470379565464_0001' doesn't exist in RM.
> 	at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:331)
> 	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
> 	at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> 	at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> 	at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101)
> 	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:194)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:278)
> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:194)
> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:176)
> 	at com.sun.proxy.$Proxy18.getApplicationReport(Unknown Source)
> 	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:436)
> 	at org.apache.hadoop.yarn.client.cli.ApplicationCLI.printApplicationReport(ApplicationCLI.java:481)
> 	at org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:160)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
> 	at org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:83)
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException): Application with id 'application_1470379565464_0001' doesn't exist in RM.
> 	at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:331)
> 	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
> 	at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
> 	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1552)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1496)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1396)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> 	at com.sun.proxy.$Proxy17.getApplicationReport(Unknown Source)
> 	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:191)
> 	... 14 more
> Application Report : 
> 	Application-Id : application_1470379565464_0001
> 	Application-Name : null
> 	Application-Type : null
> 	User : null
> 	Queue : null
> 	Application Priority : null
> 	Start-Time : 0
> 	Finish-Time : 0
> 	Progress : 0%
> 	State : ACCEPTED
> 	Final-State : UNDEFINED
> 	Tracking-URL : N/A
> 	RPC Port : -1
> 	AM Host : N/A
> 	Aggregate Resource Allocation : N/A
> 	Log Aggregation Status : N/A
> 	Diagnostics : 
> 	Unmanaged Application : false
> 	Application Node Label Expression : null{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

