spark-issues mailing list archives

From "Saisai Shao (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-23361) Driver restart fails if it happens after 7 days from app submission
Date Fri, 23 Mar 2018 06:00:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-23361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saisai Shao reassigned SPARK-23361:
-----------------------------------

    Assignee: Marcelo Vanzin

> Driver restart fails if it happens after 7 days from app submission
> -------------------------------------------------------------------
>
>                 Key: SPARK-23361
>                 URL: https://issues.apache.org/jira/browse/SPARK-23361
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.1.0
>            Reporter: Marcelo Vanzin
>            Assignee: Marcelo Vanzin
>            Priority: Major
>
> If you submit an app that is supposed to run for > 7 days (so using --principal /
--keytab in cluster mode), and there's a failure that causes the driver to restart after
7 days (that being the default token lifetime for HDFS), the new driver will fail with an
error like the following:
> {noformat}
> Exception in thread "main" org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
token (lots of uninteresting token info) can't be found in cache
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1409)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
> 	at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
> 	at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source)
> 	at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2123)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1253)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1249)
> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1249)
> 	at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$6.apply(ApplicationMaster.scala:160)
> {noformat}
> Note: lines may not align with actual Apache code because that's our internal build.
> This happens because, as part of the app submission, the launcher provides delegation
tokens to be used by the AM (=driver in this case), and those tokens have already expired
by the time the driver restarts.
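
The timing behind this failure can be illustrated with a minimal sketch. It is only a model of the description above, not Spark or Hadoop code: the function names and the sample dates are hypothetical, the 7-day lifetime is the HDFS default mentioned in the report, and token renewal is ignored.

```python
from datetime import datetime, timedelta

# Assumption: 7 days, the default HDFS token lifetime cited in the report.
TOKEN_MAX_LIFETIME = timedelta(days=7)


def token_expired(issued_at: datetime, now: datetime,
                  max_lifetime: timedelta = TOKEN_MAX_LIFETIME) -> bool:
    """Return True if a token issued at `issued_at` has outlived `max_lifetime` at `now`."""
    return now - issued_at >= max_lifetime


def driver_restart_succeeds(submitted_at: datetime, restarted_at: datetime) -> bool:
    """Model of the bug: the restarted driver reuses the tokens created at submission."""
    return not token_expired(submitted_at, restarted_at)


# Hypothetical submission time, chosen only for illustration.
submitted = datetime(2018, 3, 1, 12, 0)

print(driver_restart_succeeds(submitted, submitted + timedelta(days=3)))  # True
print(driver_restart_succeeds(submitted, submitted + timedelta(days=8)))  # False
```

In this model, a restart on day 3 still finds a usable token, while a restart on day 8 does not, which corresponds to the InvalidToken error shown in the trace.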



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

