Date: Mon, 14 May 2018 02:30:00 +0000 (UTC)
From: "Chun Chen (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Updated] (SPARK-24266) Spark client terminates while driver is still running

    [ https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chun Chen updated SPARK-24266:
------------------------------
    Description: 

{code}
Warning: Ignoring non-spark config property: Default=system properties included when running spark-submit.
18/05/11 14:50:12 WARN Config: Error reading service account token from: [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf)
18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. Mounting Hadoop specific files
18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state:
     pod name: spark-64-293-980-1526021412180-driver
     namespace: tione-603074457
     labels: network -> FLOATINGIP, spark-app-selector -> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
     pod uid: 90558303-54e7-11e8-9e64-525400da65d8
     creation time: 2018-05-11T06:50:17Z
     service account name: default
     volumes: spark-local-dir-0-spark-local, spark-init-properties, download-jars-volume, download-files, spark-init-secret, hadoop-properties, default-token-xvjt9
     node name: N/A
     start time: N/A
     container images: N/A
     phase: Pending
     status: []
18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state:
     pod name: spark-64-293-980-1526021412180-driver
     namespace: tione-603074457
     labels: network -> FLOATINGIP, spark-app-selector -> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
     pod uid: 90558303-54e7-11e8-9e64-525400da65d8
     creation time: 2018-05-11T06:50:17Z
     service account name: default
     volumes: spark-local-dir-0-spark-local, spark-init-properties, download-jars-volume, download-files, spark-init-secret, hadoop-properties, default-token-xvjt9
     node name: tbds-100-98-45-69
     start time: N/A
     container images: N/A
     phase: Pending
     status: []
18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state:
     pod name: spark-64-293-980-1526021412180-driver
     namespace: tione-603074457
     labels: network -> FLOATINGIP, spark-app-selector -> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
     pod uid: 90558303-54e7-11e8-9e64-525400da65d8
     creation time: 2018-05-11T06:50:17Z
     service account name: default
     volumes: spark-local-dir-0-spark-local, spark-init-properties, download-jars-volume, download-files, spark-init-secret, hadoop-properties, default-token-xvjt9
     node name: tbds-100-98-45-69
     start time: 2018-05-11T06:50:17Z
     container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
     phase: Pending
     status: [ContainerStatus(containerID=null, image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=null, waiting=ContainerStateWaiting(message=null, reason=PodInitializing, additionalProperties={}), additionalProperties={}), additionalProperties={})]
18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to finish...
18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state:
     pod name: spark-64-293-980-1526021412180-driver
     namespace: tione-603074457
     labels: network -> FLOATINGIP, spark-app-selector -> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
     pod uid: 90558303-54e7-11e8-9e64-525400da65d8
     creation time: 2018-05-11T06:50:17Z
     service account name: default
     volumes: spark-local-dir-0-spark-local, spark-init-properties, download-jars-volume, download-files, spark-init-secret, hadoop-properties, default-token-xvjt9
     node name: tbds-100-98-45-69
     start time: 2018-05-11T06:50:17Z
     container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
     phase: Pending
     status: [ContainerStatus(containerID=null, image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=null, waiting=ContainerStateWaiting(message=null, reason=PodInitializing, additionalProperties={}), additionalProperties={}), additionalProperties={})]
18/05/11 14:50:27 INFO LoggingPodStatusWatcherImpl: State changed, new state:
     pod name: spark-64-293-980-1526021412180-driver
     namespace: tione-603074457
     labels: network -> FLOATINGIP, spark-app-selector -> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
     pod uid: 90558303-54e7-11e8-9e64-525400da65d8
     creation time: 2018-05-11T06:50:17Z
     service account name: default
     volumes: spark-local-dir-0-spark-local, spark-init-properties, download-jars-volume, download-files, spark-init-secret, hadoop-properties, default-token-xvjt9
     node name: tbds-100-98-45-69
     start time: 2018-05-11T06:50:17Z
     container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
     phase: Pending
     status: [ContainerStatus(containerID=null, image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=null, waiting=ContainerStateWaiting(message=null, reason=PodInitializing, additionalProperties={}), additionalProperties={}), additionalProperties={})]
18/05/11 14:50:28 INFO LoggingPodStatusWatcherImpl: State changed, new state:
     pod name: spark-64-293-980-1526021412180-driver
     namespace: tione-603074457
     labels: network -> FLOATINGIP, spark-app-selector -> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
     pod uid: 90558303-54e7-11e8-9e64-525400da65d8
     creation time: 2018-05-11T06:50:17Z
     service account name: default
     volumes: spark-local-dir-0-spark-local, spark-init-properties, download-jars-volume, download-files, spark-init-secret, hadoop-properties, default-token-xvjt9
     node name: tbds-100-98-45-69
     start time: 2018-05-11T06:50:17Z
     container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
     phase: Running
     status: [ContainerStatus(containerID=docker://e2dbfbe5dd1e9c0e0e3ac045f0e17c5c0c80e684b273ca01daf80d908a1e1368, image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=docker://dfa88515def0ade6874ba13ce40a7073fbdb84eab23b47b3237853e93bde9a02, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=true, restartCount=0, state=ContainerState(running=ContainerStateRunning(startedAt=Time(time=2018-05-11T06:50:27Z, additionalProperties={}), additionalProperties={}), terminated=null, waiting=null, additionalProperties={}), additionalProperties={})]
18/05/11 15:40:28 WARN WatchConnectionManager: Exec Failure
java.io.EOFException
    at okio.RealBufferedSource.require(RealBufferedSource.java:60)
    at okio.RealBufferedSource.readByte(RealBufferedSource.java:73)
    at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:113)
    at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:97)
    at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:262)
    at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:201)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
18/05/11 15:40:28 INFO WatchConnectionManager: Current reconnect backoff is 1000 milliseconds (T0)
18/05/11 16:30:30 WARN WatchConnectionManager: Exec Failure
java.io.EOFException
    at okio.RealBufferedSource.require(RealBufferedSource.java:60)
    at okio.RealBufferedSource.readByte(RealBufferedSource.java:73)
    at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:113)
    at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:97)
    at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:262)
    at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:201)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
18/05/11 16:30:30 INFO WatchConnectionManager: Current reconnect backoff is 1000 milliseconds (T0)
18/05/11 16:30:31 INFO LoggingPodStatusWatcherImpl: Container final statuses:
     Container name: spark-kubernetes-driver
     Container image: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
     Container state: Running
     Container started at: 2018-05-11T06:50:27Z
18/05/11 16:30:31 INFO Client: Application spark-64-293-980 finished.
18/05/11 16:30:31 INFO ShutdownHookManager: Shutdown hook called
18/05/11 16:30:31 INFO ShutdownHookManager: Deleting directory /tmp/uploaded-files-ee69f033-7db5-4803-ba35-835f5e3caaeb
18/05/11 16:30:31 INFO ShutdownHookManager: Deleting directory /tmp/uploaded-jars-4ab0fbfd-be75-40e3-a3fd-49a7f20b544f
{code}

This is the client log from when we encountered the issue: the Spark client terminates while the driver is still running. See the last few lines of the log:

{code}
18/05/11 16:30:30 INFO WatchConnectionManager: Current reconnect backoff is 1000 milliseconds (T0)
18/05/11 16:30:31 INFO LoggingPodStatusWatcherImpl: Container final statuses:
     Container name: spark-kubernetes-driver
     Container image: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
     Container state: Running
     Container started at: 2018-05-11T06:50:27Z
{code}

After some debugging, I added extra logging to both Spark and kubernetes-client.
{code}
# spark patch
--- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/LoggingPodStatusWatcher.scala
+++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/LoggingPodStatusWatcher.scala
@@ -69,6 +69,7 @@ private[k8s] class LoggingPodStatusWatcherImpl(

   override def eventReceived(action: Action, pod: Pod): Unit = {
     this.pod = Option(pod)
+    logInfo(s"action $action")
     action match {
       case Action.DELETED =>
         podFinalStatus = POD_FAILED
@@ -92,7 +93,7 @@ private[k8s] class LoggingPodStatusWatcherImpl(
   }

   override def onClose(e: KubernetesClientException): Unit = {
-    logDebug(s"Stopping watching application $appId with last-observed phase $phase")
+    logInfo(s"Stopping watching application $appId with last-observed phase $phase, exception $e")
     closeWatch()
   }

# kubernetes-client
diff --git a/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java b/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java
index b82b64e6..20587ef0 100644
--- a/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java
+++ b/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java
@@ -129,6 +129,7 @@ public class WatchConnectionManager
         httpUrlBuilder.addQueryParameter("fieldSelector", fieldQueryString);
       }
     }
+    logger.info("resourceVersion " + resourceVersion.get());

     if (this.resourceVersion.get() != null) {
       httpUrlBuilder.addQueryParameter("resourceVersion", this.resourceVersion.get());
@@ -214,12 +215,14 @@ public class WatchConnectionManager
         try {
           WatchEvent event = readWatchEvent(message);
           Object object = event.getObject();
+          logger.info("resourceVersion " + resourceVersion.get());
           if (object instanceof HasMetadata) {
             @SuppressWarnings("unchecked")
             T obj = (T) object;
             // Dirty cast - should always be valid though
             String currentResourceVersion = resourceVersion.get();
             String newResourceVersion = ((HasMetadata) obj).getMetadata().getResourceVersion();
+            logger.info("currentResourceVersion " + currentResourceVersion + " newResourceVersion " + newResourceVersion);
             if (currentResourceVersion == null || currentResourceVersion.compareTo(newResourceVersion) < 0) {
               resourceVersion.compareAndSet(currentResourceVersion, newResourceVersion);
             }
@@ -244,6 +247,7 @@ public class WatchConnectionManager
           }
         } else if (object instanceof Status) {
           Status status = (Status) object;
+          logger.info("status " + status.toString());

           // The resource version no longer exists - this has to be handled by the caller.
           if (status.getCode() == HTTP_GONE) {
{code}

This clearly shows the root error. The issue lies in kubernetes-client: it simply throws the "too old resource version" exception (https://github.com/fabric8io/kubernetes-client/blob/5b1a57b64c7dcc7ebeba3a7024e8615c91afaedb/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java#L259-L266) to the client, but the Spark client does not handle it.

{code}
18/05/13 13:32:48 INFO WatchConnectionManager: Current reconnect backoff is 32000 milliseconds (T5)
18/05/13 13:33:20 INFO WatchConnectionManager: resourceVersion 21648111
18/05/13 13:33:20 INFO WatchConnectionManager: status Status(apiVersion=v1, code=410, details=null, kind=Status, message=too old resource version: 21648111 (21653211), metadata=ListMeta(resourceVersion=null, selfLink=null, additionalProperties={}), reason=Gone, status=Failure, additionalProperties={})
18/05/13 13:33:20 INFO LoggingPodStatusWatcherImpl: Stopping watching application spark-6bb7b853297a4892b9fb95a3a0ac0b3d with last-observed phase Running, exception io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 21648111 (21653211)
18/05/13 13:33:20 INFO LoggingPodStatusWatcherImpl: Container final statuses:
     Container name: spark-kubernetes-driver
     Container image: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
     Container state: Running
     Container started at: 2018-05-13T05:21:01Z
18/05/13 13:33:20 INFO Client: Application spark-64-293-980 finished.
18/05/13 13:33:20 INFO ShutdownHookManager: Shutdown hook called
18/05/13 13:33:20 INFO ShutdownHookManager: Deleting directory /tmp/uploaded-jars-9a02d858-73e9-476b-9063-2fcdd9b083a2
18/05/13 13:33:20 INFO ShutdownHookManager: Deleting directory /tmp/uploaded-files-f82b5887-6edf-4707-8703-1ccae2114c6f
{code}

I worked around the issue by hacking kubernetes-client so that it watches without specifying the resource version. I'm still thinking about a better solution.
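For illustration, here is a minimal sketch of how the submission client's watcher could distinguish "watch expired with 410 Gone" from "application finished". The Watcher interface, Watcher.Action and KubernetesClientException.getCode() are the fabric8 APIs already used above; the class name, constructor parameters and callbacks are hypothetical and stand in for members of the real LoggingPodStatusWatcherImpl, not a proposed patch.

{code}
import java.net.HttpURLConnection.HTTP_GONE

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClientException, Watcher}

// Hypothetical sketch: a driver-pod watcher that treats a 410 close as
// "watch expired", not as "application finished".
class ResilientDriverPodWatcher(
    appId: String,
    restartWatch: () => Unit,   // re-create the watch, e.g. via the helper sketched below
    markFinished: () => Unit)   // what the current watcher effectively does on close
  extends Watcher[Pod] {

  override def eventReceived(action: Watcher.Action, pod: Pod): Unit = {
    // Track the pod phase here, as the real watcher does (omitted in this sketch).
  }

  override def onClose(e: KubernetesClientException): Unit = {
    if (e != null && e.getCode == HTTP_GONE) {
      // The stored resourceVersion is too old; the driver may still be running,
      // so re-establish the watch instead of reporting the application as finished.
      restartWatch()
    } else {
      markFinished()
    }
  }
}
{code}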
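And a rough sketch of the workaround described above, expressed against the public fabric8 DSL instead of by patching WatchConnectionManager: a freshly created watch carries no resourceVersion, so it resumes from the pod's current state rather than the expired version behind the 410. The object name and parameters are placeholders, not actual Spark code.

{code}
import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{DefaultKubernetesClient, Watch, Watcher}

object DriverPodWatchRestart {
  // Hypothetical helper: re-create the driver-pod watch from scratch.
  def restart(
      client: DefaultKubernetesClient,
      namespace: String,
      driverPodName: String,
      watcher: Watcher[Pod]): Watch = {
    client.pods()
      .inNamespace(namespace)
      .withName(driverPodName)
      .watch(watcher)
  }
}
{code}

Events between the expiry and the restart are lost either way (that is what 410 means), but for this use case that is arguably acceptable: the submission client only needs the driver pod's eventual terminal state.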
> Spark client terminates while driver is still running
> -----------------------------------------------------
>
>                 Key: SPARK-24266
>                 URL: https://issues.apache.org/jira/browse/SPARK-24266
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.3.0
>            Reporter: Chun Chen
>            Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)