hadoop-yarn-issues mailing list archives

From "Prabhu Joseph (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5933) ATS stale entries in active directory causes ApplicationNotFoundException in RM
Date Sat, 26 Nov 2016 15:55:58 GMT

    [ https://issues.apache.org/jira/browse/YARN-5933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15698124#comment-15698124 ]

Prabhu Joseph commented on YARN-5933:
-------------------------------------

Hi [~sunilg] [~gtCarrera9], below are some ways to fix this issue, assuming that an application
which is not found in the RM at the first getApplicationReport call will never be in one of the
APP_FINAL_STATES at a subsequent getApplicationReport call.

1. Once the AppState is Unknown, the appDir can be removed from activePath immediately. I am not
sure why the code waits for unknownActiveMillis and only then marks the app as completed. If we
choose to remove the appDir immediately, the unknownActiveMillis handling code is no longer
needed.
2. If an app in the unknown state also needs to be moved to the done directory, then the appDir
can be moved immediately, without waiting for unknownActiveMillis.
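
To make the two options concrete, here is a minimal sketch of what each would do to a stale appDir. It is only an illustration: the class, enum, and method names are hypothetical, and it uses java.nio.file on a local temp directory rather than the HDFS FileSystem API that EntityGroupFSTimelineStore works with.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for how the store could treat an app the RM does not know.
// None of these names come from EntityGroupFSTimelineStore.
class UnknownAppCleanupSketch {

    enum Policy { DELETE_IMMEDIATELY, MOVE_TO_DONE }

    /** Applies the policy to appDir; returns the new location, or null if it was deleted. */
    static Path handleUnknownApp(Path appDir, Path doneRoot, Policy policy) throws IOException {
        if (policy == Policy.DELETE_IMMEDIATELY) {
            // Option 1: drop the stale directory right away, no unknownActiveMillis wait.
            deleteRecursively(appDir);
            return null;
        }
        // Option 2: move the directory to the done root right away.
        Files.createDirectories(doneRoot);
        return Files.move(appDir, doneRoot.resolve(appDir.getFileName()));
    }

    static void deleteRecursively(Path root) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order so that children are deleted before their parents.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("ats-sketch");
        Path active = root.resolve("active");
        Path done = root.resolve("done");

        Path app1 = Files.createDirectories(active.resolve("application_1479897867169_0005"));
        handleUnknownApp(app1, done, Policy.DELETE_IMMEDIATELY);
        System.out.println("option 1, still in active: " + Files.exists(app1));

        Path app2 = Files.createDirectories(active.resolve("application_1479897867169_0005"));
        Path moved = handleUnknownApp(app2, done, Policy.MOVE_TO_DONE);
        System.out.println("option 2, moved to: " + moved);
    }
}
```

Either way the directory leaves activePath on the first ApplicationNotFoundException, so later scans have nothing left to ask the RM about.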

Please share your comments.

> ATS stale entries in active directory causes ApplicationNotFoundException in RM
> -------------------------------------------------------------------------------
>
>                 Key: YARN-5933
>                 URL: https://issues.apache.org/jira/browse/YARN-5933
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.3
>            Reporter: Prabhu Joseph
>            Assignee: Prabhu Joseph
>
> On a secure cluster where ATS is down, a submitted Tez job fails while getting the TIMELINE_DELEGATION_TOKEN, with the exception below:
> {code}
> 0: jdbc:hive2://kerberos-2.openstacklocal:100> select csmallint from alltypesorc group by csmallint;
> INFO  : Session is already open
> INFO  : Dag name: select csmallint from alltypesor...csmallint(Stage-1)
> INFO  : Tez session was closed. Reopening...
> ERROR : Failed to execute tez graph.
> java.lang.RuntimeException: Failed to connect to timeline server. Connection retries limit exceeded. The posted timeline event may be missing
> 	at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:266)
> 	at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:590)
> 	at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.getDelegationToken(TimelineClientImpl.java:506)
> 	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getTimelineDelegationToken(YarnClientImpl.java:349)
> 	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.addTimelineDelegationToken(YarnClientImpl.java:330)
> 	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:250)
> 	at org.apache.tez.client.TezYarnClient.submitApplication(TezYarnClient.java:72)
> 	at org.apache.tez.client.TezClient.start(TezClient.java:409)
> 	at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:196)
> 	at org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.closeAndOpen(TezSessionPoolManager.java:311)
> 	at org.apache.hadoop.hive.ql.exec.tez.TezTask.submit(TezTask.java:453)
> 	at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:180)
> 	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
> 	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89)
> 	at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1728)
> 	at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1485)
> 	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1262)
> 	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1126)
> 	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1121)
> 	at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:154)
> 	at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:71)
> 	at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:206)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
> 	at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:218)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
> By this point the Tez YarnClient has already received an applicationId from the RM. When ATS is restarted, it tries to get the application report from the RM, and the RM throws ApplicationNotFoundException. ATS keeps repeating the request, which floods the RM.
> {code}
> RM logs:
> 2016-11-23 13:53:57,345 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new applicationId: 5
> 2016-11-23 14:05:04,936 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8050, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 172.26.71.120:37699 Call#26 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1479897867169_0005' doesn't exist in RM.
> 	at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:328)
> 	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
> 	at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2206)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2200)
> {code}
> There is a stale application entry inside the /ats/active directory. ATS stops sending requests once we remove this directory.
> [hive@kerberos-2 bin]$ hadoop fs -ls /ats/active
> drwxrwx---   - hive hadoop          0 2016-11-23 13:54 /ats/active/application_1479897867169_0005
> This ATS issue is exposed by Tez jobs because Tez uses the putDomain method. The call chain TimelineClientImpl#putDomain() -> writeDomain() -> getAppAttemptDir() -> createApplicationDir() creates an application directory inside the ATS activePath. After the Tez job has created this directory, it fails because it cannot connect to ATS. When ATS comes back, it scans activePath every 60 seconds (yarn.timeline-service.entity-group-fs-store.scan-interval-seconds) and calls getApplicationReport, which leads to ApplicationNotFoundException in the RM.
> For this negative case, we can delete the appDirectory inside activePath from ATS EntityGroupFSTimelineStore#getAppState() once the RM throws ApplicationNotFoundException.
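
The effect of the periodic scan described above can be sketched with a small simulation. This is not ATS code: the class and method names are hypothetical, the RM is reduced to a predicate, and the scan is modelled as one lookup per active directory per pass. It only shows why a single stale directory keeps generating lookups, and how deleting it on the first not-found answer stops that.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical model of the active-directory scan; names are illustrative only.
class ActiveScanSketch {

    /**
     * Runs the given number of scan passes over the active directories and
     * returns how many RM lookups were issued in total.
     */
    static int countRmLookups(Set<String> activeDirs, Predicate<String> rmKnowsApp,
                              int scans, boolean deleteOnNotFound) {
        Set<String> active = new HashSet<>(activeDirs);
        int rmLookups = 0;
        for (int pass = 0; pass < scans; pass++) {
            // Iterate over a copy so that entries can be removed during the pass.
            for (String appId : new ArrayList<>(active)) {
                rmLookups++; // stands in for one getApplicationReport call
                if (!rmKnowsApp.test(appId) && deleteOnNotFound) {
                    active.remove(appId); // proposed fix: drop the stale dir
                }
            }
        }
        return rmLookups;
    }

    public static void main(String[] args) {
        Set<String> stale = Collections.singleton("application_1479897867169_0005");
        Predicate<String> rmKnowsNothing = appId -> false;
        int scansPerDay = 24 * 60 * 60 / 60; // one pass every 60 seconds

        System.out.println(countRmLookups(stale, rmKnowsNothing, scansPerDay, false)); // 1440
        System.out.println(countRmLookups(stale, rmKnowsNothing, scansPerDay, true));  // 1
    }
}
```

With the 60-second interval, one stale directory costs 1440 failed lookups per day in this model, against a single lookup when the directory is removed on the first ApplicationNotFoundException.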




