hadoop-yarn-issues mailing list archives

From "Sunil G (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6207) Move application can fail when attempt add event is delayed
Date Mon, 20 Feb 2017 06:34:44 GMT

    [ https://issues.apache.org/jira/browse/YARN-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15874113#comment-15874113 ]

Sunil G commented on YARN-6207:

An app null check may not fix the race condition correctly; it can still leave a corner case open. I
would like to continue the discussion on attempt states as well.
Let's take [~bibinchundatt]'s scenario itself, where the app attempt events are still queued in the async
dispatcher (delayed). The same scenario can also occur after the first attempt fails (the 2nd attempt's
add event is delayed).

+What will happen in the scheduler:+
In any case, we can assume that the {{SchedulerApplication}} has been created inside the scheduler (this
holds because we check in ClientRMService that the app state is ACCEPTED/RUNNING). CS's and FS's
*moveApplication* invokes {{getApplicationAttempt}} to get the app attempt object. {{AbsScheduler#getApplicationAttempt}}
can return null in 2 cases: a) when the application itself is not there; b) when the current attempt
is null. As mentioned in the first line, the app cannot be null, but the attempt may still be null.
In case of a first attempt failure, {{SchedulerApplication.getCurrentAppAttempt}} could return
the old object until the 2nd attempt is set via {{APP_ATTEMPT_ADDED}}.
Hence the app null check will not help in either scheduler (FS has only a null check for the app,
not for the attempt). Even an attempt null check won't help in the case of a first AM failure, as
the scheduler still holds the old, stale attempt object in stopped form.
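
To make the cases above concrete, here is a minimal, self-contained sketch. It does not use the real Hadoop classes: {{Attempt}}, {{SchedulerApp}}, {{describe}} and the string labels are hypothetical stand-ins, and the {{stopped}} flag is only an assumption about how a stale attempt could be recognised.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for the scheduler-side objects; these are NOT the real Hadoop classes.
public class AttemptLookupSketch {

    // Hypothetical attempt object; "stopped" marks a failed attempt the scheduler still holds.
    static final class Attempt {
        final int id;
        boolean stopped;
        Attempt(int id) { this.id = id; }
    }

    // Hypothetical per-application record; currentAttempt only changes when the
    // (possibly delayed) attempt-added event is processed.
    static final class SchedulerApp {
        Attempt currentAttempt;
    }

    // Returns null when (a) the app is unknown or (b) no attempt has been set yet.
    static Attempt getApplicationAttempt(Map<String, SchedulerApp> apps, String appId) {
        SchedulerApp app = apps.get(appId);
        return app == null ? null : app.currentAttempt;
    }

    // Classifies what a move request would see at this instant.
    static String describe(Map<String, SchedulerApp> apps, String appId) {
        if (!apps.containsKey(appId)) return "no-app";
        Attempt attempt = getApplicationAttempt(apps, appId);
        if (attempt == null) return "no-attempt";
        return attempt.stopped ? "stale-attempt" : "live-attempt";
    }

    public static void main(String[] args) {
        Map<String, SchedulerApp> apps = new HashMap<>();
        SchedulerApp app = new SchedulerApp();
        apps.put("app_1", app);
        System.out.println(describe(apps, "app_1")); // attempt-added event still queued

        app.currentAttempt = new Attempt(1);
        System.out.println(describe(apps, "app_1")); // first attempt added

        app.currentAttempt.stopped = true;
        System.out.println(describe(apps, "app_1")); // attempt 1 failed, attempt 2 not added yet

        app.currentAttempt = new Attempt(2);
        System.out.println(describe(apps, "app_1")); // second attempt-added event processed
    }
}
```

In this sketch the third lookup returns a non-null but stopped attempt, which is the case that neither an app null check nor an attempt null check would catch.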

There could also be another corner case. Assume move app is called when the 1st attempt has failed
and the 2nd attempt is still in its init states. It could potentially push 2 attempts to the target queue.
Ideally, if we fix this in ClientRMService, we need not worry about changes across schedulers. If the
attempt state is anywhere from ACCEPTED to RUNNING, we are sure that the new attempt has been added to the scheduler.
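
A rough sketch of such a client-side guard might look like the following. Everything here is an assumption for illustration: the {{AttemptState}} enum uses state names taken from the wording of this comment rather than from the real YARN attempt-state enum, and {{canMove}} / {{moveApplication}} are hypothetical helpers, not the actual ClientRMService code.

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative guard only; none of these names come from the real ClientRMService.
public class MoveGuardSketch {

    // State names follow the wording of the comment, NOT the real YARN enum.
    enum AttemptState { INIT, ACCEPTED, RUNNING, FAILED }

    // States in which the current attempt is assumed to be known to the scheduler.
    static final Set<AttemptState> MOVABLE =
            EnumSet.of(AttemptState.ACCEPTED, AttemptState.RUNNING);

    // True only when the current attempt exists and is in a movable state.
    static boolean canMove(AttemptState currentAttemptState) {
        return currentAttemptState != null && MOVABLE.contains(currentAttemptState);
    }

    // Rejects the request before it reaches any scheduler if the attempt is not ready.
    static String moveApplication(String appId, AttemptState state, String targetQueue) {
        if (!canMove(state)) {
            throw new IllegalStateException(
                    "Cannot move " + appId + ": current attempt state is " + state);
        }
        // A real implementation would hand off to the scheduler here.
        return appId + " -> " + targetQueue;
    }

    public static void main(String[] args) {
        System.out.println(moveApplication("app_1", AttemptState.RUNNING, "queueB"));
        try {
            moveApplication("app_1", AttemptState.INIT, "queueB");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Rejecting the request up front keeps the check in one place, so under this assumption neither CS nor FS would need its own attempt-state handling.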

> Move application can fail when attempt add event is delayed
> ------------------------------------------------------------
>                 Key: YARN-6207
>                 URL: https://issues.apache.org/jira/browse/YARN-6207
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>         Attachments: YARN-6207.001.patch, YARN-6207.002.patch
> *Steps to reproduce*
> 1. Submit application and delay attempt add to Scheduler
> (Simulate using debug at EventDispatcher for SchedulerEventDispatcher)
> 2. Call move application to destination queue.
> {noformat}
> Caused by: org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): java.lang.NullPointerException
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.preValidateMoveApplication(CapacityScheduler.java:2086)
> 	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.moveApplicationAcrossQueue(RMAppManager.java:669)
> 	at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.moveApplicationAcrossQueues(ClientRMService.java:1231)
> 	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.moveApplicationAcrossQueues(ApplicationClientProtocolPBServiceImpl.java:388)
> 	at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:537)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:522)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> 	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:867)
> 	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:813)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1892)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2659)
> 	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1483)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1429)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1339)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:115)
> 	at com.sun.proxy.$Proxy7.moveApplicationAcrossQueues(Unknown Source)
> 	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.moveApplicationAcrossQueues(ApplicationClientProtocolPBClientImpl.java:398)
> 	... 16 more
> {noformat}
