Date: Tue, 29 Nov 2016 11:06:58 +0000 (UTC)
From: "Prabhu Joseph (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-5933) ATS stale entries in active directory causes ApplicationNotFoundException in RM

    [ https://issues.apache.org/jira/browse/YARN-5933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15704980#comment-15704980 ]

Prabhu Joseph commented on YARN-5933:
-------------------------------------

Thanks [~gtCarrera9], it looks like directly removing the unknown appDir is not a simple change. Assume 10 Tez jobs failed while ATS was down: there will then be 10 * unknownActiveSecs / scanIntervalSecs = 14400 ApplicationNotFoundException stack traces in the RM logs over that entire day. If there is no impact other than flooding the RM logs, would it be better to change the ApplicationNotFoundException stack trace into a single WARN message?

> ATS stale entries in active directory causes ApplicationNotFoundException in RM
> -------------------------------------------------------------------------------
>
>                 Key: YARN-5933
>                 URL: https://issues.apache.org/jira/browse/YARN-5933
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.3
>            Reporter: Prabhu Joseph
>            Assignee: Prabhu Joseph
>
> On a secure cluster where ATS is down, a submitted Tez job fails while getting TIMELINE_DELEGATION_TOKEN with the exception below:
> {code}
> 0: jdbc:hive2://kerberos-2.openstacklocal:100> select csmallint from alltypesorc group by csmallint;
> INFO : Session is already open
> INFO : Dag name: select csmallint from alltypesor...csmallint(Stage-1)
> INFO : Tez session was closed. Reopening...
> ERROR : Failed to execute tez graph.
> java.lang.RuntimeException: Failed to connect to timeline server. Connection retries limit exceeded. The posted timeline event may be missing
> 	at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:266)
> 	at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:590)
> 	at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.getDelegationToken(TimelineClientImpl.java:506)
> 	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getTimelineDelegationToken(YarnClientImpl.java:349)
> 	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.addTimelineDelegationToken(YarnClientImpl.java:330)
> 	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:250)
> 	at org.apache.tez.client.TezYarnClient.submitApplication(TezYarnClient.java:72)
> 	at org.apache.tez.client.TezClient.start(TezClient.java:409)
> 	at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:196)
> 	at org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.closeAndOpen(TezSessionPoolManager.java:311)
> 	at org.apache.hadoop.hive.ql.exec.tez.TezTask.submit(TezTask.java:453)
> 	at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:180)
> 	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
> 	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89)
> 	at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1728)
> 	at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1485)
> 	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1262)
> 	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1126)
> 	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1121)
> 	at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:154)
> 	at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:71)
> 	at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:206)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
> 	at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:218)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
> By this point the Tez YarnClient has already received an applicationID from RM. When ATS is restarted, it tries to get the application report from RM, and RM throws ApplicationNotFoundException. ATS keeps repeating the request, which floods the RM logs.
> {code}
> RM logs:
> 2016-11-23 13:53:57,345 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new applicationId: 5
> 2016-11-23 14:05:04,936 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8050, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 172.26.71.120:37699 Call#26 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1479897867169_0005' doesn't exist in RM.
> 	at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:328)
> 	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
> 	at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2206)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2200)
> {code}
> There is a stale application entry inside the /ats/active directory. ATS stops sending requests once this directory is removed.
> [hive@kerberos-2 bin]$ hadoop fs -ls /ats/active
> drwxrwx--- - hive hadoop 0 2016-11-23 13:54 /ats/active/application_1479897867169_0005
> Tez jobs expose this ATS issue because Tez uses the putDomain method. The call chain TimelineClientImpl#putDomain() -> writeDomain() -> getAppAttemptDir() -> createApplicationDir() creates an application directory inside the ATS activePath. After creating this directory, the Tez job fails because it cannot connect to ATS. When ATS comes back, it scans activePath every 60 seconds (yarn.timeline-service.entity-group-fs-store.scan-interval-seconds) and calls GetApplicationReport, which leads to ApplicationNotFoundException in RM.
> For this negative case, we can delete the appDirectory inside activePath from ATS EntityGroupFSTimelineStore#getAppState() once the RM throws ApplicationNotFoundException.
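
The estimate of 14400 stack traces in the comment follows from the scan behaviour described in the issue: each scan of activePath triggers one failed application report lookup per stale directory. A minimal sketch in Python, assuming unknownActiveSecs is 86400 (the value implied by the comment's arithmetic and its reference to an entire day) and that stale entries are rescanned for that whole period. The function names are hypothetical and nothing here calls a Hadoop API.

```python
# Illustrative arithmetic only; no Hadoop or YARN API is used.
# Function and parameter names are hypothetical.

def expected_failed_lookups(stale_apps, unknown_active_secs, scan_interval_secs):
    """Closed form from the comment: apps * unknownActiveSecs / scanIntervalSecs."""
    return stale_apps * (unknown_active_secs // scan_interval_secs)


def simulate_scans(stale_apps, unknown_active_secs, scan_interval_secs):
    """Count failed lookups by stepping through each scan of the active directory.

    Assumes every stale directory causes exactly one failed lookup (and so one
    ApplicationNotFoundException in RM) per scan until it ages out.
    """
    failed = 0
    for _scan_time in range(0, unknown_active_secs, scan_interval_secs):
        failed += stale_apps
    return failed


if __name__ == "__main__":
    # 10 failed Tez jobs, assumed 86400 s retention, 60 s scan interval
    print(expected_failed_lookups(10, 86400, 60))
    print(simulate_scans(10, 86400, 60))
```

With 10 stale directories, an assumed 86400-second window, and the 60-second scan interval, both functions return 14400, matching the figure in the comment.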


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org