spark-issues mailing list archives

From "Chun Chen (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-24266) Spark client terminates while driver is still running
Date Mon, 14 May 2018 02:30:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chun Chen updated SPARK-24266:
------------------------------
    Description: 

{code}
Warning: Ignoring non-spark config property: Default=system properties included when running spark-submit.
18/05/11 14:50:12 WARN Config: Error reading service account token from: [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf)
18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. Mounting Hadoop specific files
18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
	 pod name: spark-64-293-980-1526021412180-driver
	 namespace: tione-603074457
	 labels: network -> FLOATINGIP, spark-app-selector -> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
	 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
	 creation time: 2018-05-11T06:50:17Z
	 service account name: default
	 volumes: spark-local-dir-0-spark-local, spark-init-properties, download-jars-volume, download-files, spark-init-secret, hadoop-properties, default-token-xvjt9
	 node name: N/A
	 start time: N/A
	 container images: N/A
	 phase: Pending
	 status: []
18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
	 pod name: spark-64-293-980-1526021412180-driver
	 namespace: tione-603074457
	 labels: network -> FLOATINGIP, spark-app-selector -> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
	 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
	 creation time: 2018-05-11T06:50:17Z
	 service account name: default
	 volumes: spark-local-dir-0-spark-local, spark-init-properties, download-jars-volume, download-files, spark-init-secret, hadoop-properties, default-token-xvjt9
	 node name: tbds-100-98-45-69
	 start time: N/A
	 container images: N/A
	 phase: Pending
	 status: []
18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
	 pod name: spark-64-293-980-1526021412180-driver
	 namespace: tione-603074457
	 labels: network -> FLOATINGIP, spark-app-selector -> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
	 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
	 creation time: 2018-05-11T06:50:17Z
	 service account name: default
	 volumes: spark-local-dir-0-spark-local, spark-init-properties, download-jars-volume, download-files, spark-init-secret, hadoop-properties, default-token-xvjt9
	 node name: tbds-100-98-45-69
	 start time: 2018-05-11T06:50:17Z
	 container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
	 phase: Pending
	 status: [ContainerStatus(containerID=null, image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=null, waiting=ContainerStateWaiting(message=null, reason=PodInitializing, additionalProperties={}), additionalProperties={}), additionalProperties={})]
18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to finish...
18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
	 pod name: spark-64-293-980-1526021412180-driver
	 namespace: tione-603074457
	 labels: network -> FLOATINGIP, spark-app-selector -> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
	 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
	 creation time: 2018-05-11T06:50:17Z
	 service account name: default
	 volumes: spark-local-dir-0-spark-local, spark-init-properties, download-jars-volume, download-files, spark-init-secret, hadoop-properties, default-token-xvjt9
	 node name: tbds-100-98-45-69
	 start time: 2018-05-11T06:50:17Z
	 container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
	 phase: Pending
	 status: [ContainerStatus(containerID=null, image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=null, waiting=ContainerStateWaiting(message=null, reason=PodInitializing, additionalProperties={}), additionalProperties={}), additionalProperties={})]
18/05/11 14:50:27 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
	 pod name: spark-64-293-980-1526021412180-driver
	 namespace: tione-603074457
	 labels: network -> FLOATINGIP, spark-app-selector -> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
	 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
	 creation time: 2018-05-11T06:50:17Z
	 service account name: default
	 volumes: spark-local-dir-0-spark-local, spark-init-properties, download-jars-volume, download-files, spark-init-secret, hadoop-properties, default-token-xvjt9
	 node name: tbds-100-98-45-69
	 start time: 2018-05-11T06:50:17Z
	 container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
	 phase: Pending
	 status: [ContainerStatus(containerID=null, image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=null, waiting=ContainerStateWaiting(message=null, reason=PodInitializing, additionalProperties={}), additionalProperties={}), additionalProperties={})]
18/05/11 14:50:28 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
	 pod name: spark-64-293-980-1526021412180-driver
	 namespace: tione-603074457
	 labels: network -> FLOATINGIP, spark-app-selector -> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
	 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
	 creation time: 2018-05-11T06:50:17Z
	 service account name: default
	 volumes: spark-local-dir-0-spark-local, spark-init-properties, download-jars-volume, download-files, spark-init-secret, hadoop-properties, default-token-xvjt9
	 node name: tbds-100-98-45-69
	 start time: 2018-05-11T06:50:17Z
	 container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
	 phase: Running
	 status: [ContainerStatus(containerID=docker://e2dbfbe5dd1e9c0e0e3ac045f0e17c5c0c80e684b273ca01daf80d908a1e1368, image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=docker://dfa88515def0ade6874ba13ce40a7073fbdb84eab23b47b3237853e93bde9a02, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=true, restartCount=0, state=ContainerState(running=ContainerStateRunning(startedAt=Time(time=2018-05-11T06:50:27Z, additionalProperties={}), additionalProperties={}), terminated=null, waiting=null, additionalProperties={}), additionalProperties={})]
18/05/11 15:40:28 WARN WatchConnectionManager: Exec Failure
java.io.EOFException
	at okio.RealBufferedSource.require(RealBufferedSource.java:60)
	at okio.RealBufferedSource.readByte(RealBufferedSource.java:73)
	at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:113)
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:97)
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:262)
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:201)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
18/05/11 15:40:28 INFO WatchConnectionManager: Current reconnect backoff is 1000 milliseconds (T0)
18/05/11 16:30:30 WARN WatchConnectionManager: Exec Failure
java.io.EOFException
	at okio.RealBufferedSource.require(RealBufferedSource.java:60)
	at okio.RealBufferedSource.readByte(RealBufferedSource.java:73)
	at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:113)
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:97)
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:262)
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:201)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
18/05/11 16:30:30 INFO WatchConnectionManager: Current reconnect backoff is 1000 milliseconds (T0)
18/05/11 16:30:31 INFO LoggingPodStatusWatcherImpl: Container final statuses:
	 Container name: spark-kubernetes-driver
	 Container image: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
	 Container state: Running
	 Container started at: 2018-05-11T06:50:27Z
18/05/11 16:30:31 INFO Client: Application spark-64-293-980 finished.
18/05/11 16:30:31 INFO ShutdownHookManager: Shutdown hook called
18/05/11 16:30:31 INFO ShutdownHookManager: Deleting directory /tmp/uploaded-files-ee69f033-7db5-4803-ba35-835f5e3caaeb
18/05/11 16:30:31 INFO ShutdownHookManager: Deleting directory /tmp/uploaded-jars-4ab0fbfd-be75-40e3-a3fd-49a7f20b544f
{code}

This is the client log from when we encountered the issue: the Spark client terminates while the driver is still running. See the last few lines of the log:
{code}
18/05/11 16:30:30 INFO WatchConnectionManager: Current reconnect backoff is 1000 milliseconds (T0)
18/05/11 16:30:31 INFO LoggingPodStatusWatcherImpl: Container final statuses:
	 Container name: spark-kubernetes-driver
	 Container image: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
	 Container state: Running
	 Container started at: 2018-05-11T06:50:27Z
{code}

While debugging, I added some log statements to both Spark and kubernetes-client:

{code}
# spark patch

--- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/LoggingPodStatusWatcher.scala
+++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/LoggingPodStatusWatcher.scala
@@ -69,6 +69,7 @@ private[k8s] class LoggingPodStatusWatcherImpl(
 
   override def eventReceived(action: Action, pod: Pod): Unit = {
     this.pod = Option(pod)
+    logInfo(s"action $action")
     action match {
       case Action.DELETED =>
         podFinalStatus = POD_FAILED
@@ -92,7 +93,7 @@ private[k8s] class LoggingPodStatusWatcherImpl(
   }
 
   override def onClose(e: KubernetesClientException): Unit = {
-    logDebug(s"Stopping watching application $appId with last-observed phase $phase")
+    logInfo(s"Stopping watching application $appId with last-observed phase $phase, exception $e")
     closeWatch()
   }

# kubernetes-client

diff --git a/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java b/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java
index b82b64e6..20587ef0 100644
--- a/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java
+++ b/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java
@@ -129,6 +129,7 @@ public class WatchConnectionManager<T extends HasMetadata, L extends KubernetesR
         httpUrlBuilder.addQueryParameter("fieldSelector", fieldQueryString);
       }
     }
+    logger.info("resourceVersion " + resourceVersion.get());
 
     if (this.resourceVersion.get() != null) {
       httpUrlBuilder.addQueryParameter("resourceVersion", this.resourceVersion.get());
@@ -214,12 +215,14 @@ public class WatchConnectionManager<T extends HasMetadata, L extends KubernetesR
         try {
           WatchEvent event = readWatchEvent(message);
           Object object = event.getObject();
+          logger.info("resourceVersion " + resourceVersion.get());
           if (object instanceof HasMetadata) {
             @SuppressWarnings("unchecked")
             T obj = (T) object;
             // Dirty cast - should always be valid though
             String currentResourceVersion = resourceVersion.get();
             String newResourceVersion = ((HasMetadata) obj).getMetadata().getResourceVersion();
+            logger.info("currentResourceVersion " + currentResourceVersion + " newResourceVersion
" + newResourceVersion);
             if (currentResourceVersion == null || currentResourceVersion.compareTo(newResourceVersion) < 0) {
               resourceVersion.compareAndSet(currentResourceVersion, newResourceVersion);
             }
@@ -244,6 +247,7 @@ public class WatchConnectionManager<T extends HasMetadata, L extends KubernetesR
             }
           } else if (object instanceof Status) {
             Status status = (Status) object;
+            logger.info("status " + status.toString());
 
             // The resource version no longer exists - this has to be handled by the caller.
             if (status.getCode() == HTTP_GONE) {

{code}

The added logs clearly print the root error below. The problem lies in kubernetes-client: it simply throws the "too old resource version" exception (https://github.com/fabric8io/kubernetes-client/blob/5b1a57b64c7dcc7ebeba3a7024e8615c91afaedb/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java#L259-L266) to the caller, but the Spark client does not handle it.

{code}
18/05/13 13:32:48 INFO WatchConnectionManager: Current reconnect backoff is 32000 milliseconds (T5)
18/05/13 13:33:20 INFO WatchConnectionManager: resourceVersion 21648111
18/05/13 13:33:20 INFO WatchConnectionManager: status Status(apiVersion=v1, code=410, details=null, kind=Status, message=too old resource version: 21648111 (21653211), metadata=ListMeta(resourceVersion=null, selfLink=null, additionalProperties={}), reason=Gone, status=Failure, additionalProperties={})
18/05/13 13:33:20 INFO LoggingPodStatusWatcherImpl: Stopping watching application spark-6bb7b853297a4892b9fb95a3a0ac0b3d with last-observed phase Running, exception io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 21648111 (21653211)
18/05/13 13:33:20 INFO LoggingPodStatusWatcherImpl: Container final statuses:
	 Container name: spark-kubernetes-driver
	 Container image: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
	 Container state: Running
	 Container started at: 2018-05-13T05:21:01Z
18/05/13 13:33:20 INFO Client: Application spark-64-293-980 finished.
18/05/13 13:33:20 INFO ShutdownHookManager: Shutdown hook called
18/05/13 13:33:20 INFO ShutdownHookManager: Deleting directory /tmp/uploaded-jars-9a02d858-73e9-476b-9063-2fcdd9b083a2
18/05/13 13:33:20 INFO ShutdownHookManager: Deleting directory /tmp/uploaded-files-f82b5887-6edf-4707-8703-1ccae2114c6f
{code}
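
In other words, when the watch is closed with HTTP 410 (Gone), Spark's watcher treats it like a normal completion. A minimal sketch of how the submission side could instead resume watching — this is not the actual Spark code; the wrapper class and its parameters are assumptions for illustration:

{code}
import java.net.HttpURLConnection.HTTP_GONE

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClientException, Watch, Watcher}

// Hypothetical wrapper: when the watch closes with HTTP 410 ("too old
// resource version"), re-create the watch instead of treating the close as
// application completion. A fresh watch subscribes without a cached
// resourceVersion, i.e. from the API server's current state.
class ResumingPodWatcher(
    startWatch: Watcher[Pod] => Watch,
    delegate: Watcher[Pod]) extends Watcher[Pod] {

  @volatile private var watch: Watch = _

  def start(): Unit = {
    watch = startWatch(this)
  }

  override def eventReceived(action: Watcher.Action, pod: Pod): Unit =
    delegate.eventReceived(action, pod)

  override def onClose(e: KubernetesClientException): Unit = {
    if (e != null && e.getCode == HTTP_GONE) {
      watch = startWatch(this) // resubscribe; do not report the app as finished
    } else {
      delegate.onClose(e) // genuine close: let the client finish normally
    }
  }
}
{code}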

I worked around the issue by hacking kubernetes-client so that it re-watches without specifying the resource version. I'm still thinking about a better solution.
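
From the caller's side, that workaround is roughly equivalent to re-creating the watch on a 410 close, since a new watch starts with no resourceVersion. Hypothetical wiring with the sketch above — kubernetesClient, namespace, driverPodName, and loggingPodStatusWatcher stand in for the real objects in the submission client:

{code}
// Placeholder names, not the actual Spark client code.
val watcher = new ResumingPodWatcher(
  startWatch = w => kubernetesClient
    .pods()
    .inNamespace(namespace)
    .withName(driverPodName)
    .watch(w),
  delegate = loggingPodStatusWatcher)
watcher.start()
{code}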



> Spark client terminates while driver is still running
> -----------------------------------------------------
>
>                 Key: SPARK-24266
>                 URL: https://issues.apache.org/jira/browse/SPARK-24266
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.3.0
>            Reporter: Chun Chen
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

