From: Holden Karau
Date: Tue, 15 Aug 2017 17:09:32 +0000
Subject: Re: SPIP: Spark on Kubernetes
To: William Benton <willb@redhat.com>, dev <dev@spark.apache.org>

+1 (non-binding)

I (personally) think that Kubernetes as a scheduler backend should
eventually get merged in, and there is clearly a community interested in the work required to maintain it.

On Tue, Aug 15, 2017 at 9:51 AM William Benton <willb@redhat.com> wrote:

> +1 (non-binding)
>
> On Tue, Aug 15, 2017 at 10:32 AM, Anirudh Ramanathan <foxish@google.com.invalid> wrote:
>
>> The Spark on Kubernetes effort has been developed separately in a fork, and linked back from the Apache Spark project as an experimental backend. We're ~6 months in and have had 5 releases.
>>
>> - 2 Spark versions maintained (2.1 and 2.2)
>> - Extensive integration testing and refactoring efforts to maintain code quality
>> - Developer- and user-facing documentation
>> - 10+ consistent code contributors from different organizations involved in actively maintaining and using the project, with several more members involved in testing and providing feedback
>> - The community has delivered several talks on Spark-on-Kubernetes, generating lots of feedback from users
>> - In addition to these, we've seen efforts spawn off such as:
>>   - HDFS on Kubernetes with Locality and Performance Experiments
>>   - Kerberized access to HDFS from Spark running on Kubernetes
>>
>> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>>
>> - +1: Yeah, let's go forward and implement the SPIP.
>> - +0: Don't really care.
>> - -1: I don't think this is a good idea because of the following technical reasons.
>>
>> If any further clarification is desired, on the design or the implementation, please feel free to ask questions or provide feedback.
>>
>> SPIP: Kubernetes as a Native Cluster Manager
>>
>> Full Design Doc: link
>>
>> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>>
>> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>>
>> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt Cheah, Ilan Filonenko, Sean Suchter, Kimoon Kim
>>
>> Background and Motivation
>>
>> Containerization and cluster management technologies are constantly evolving in the cluster computing world. Apache Spark currently implements support for Apache Hadoop YARN and Apache Mesos, in addition to providing its own standalone cluster manager. In 2014, Google announced the development of Kubernetes, which has its own unique feature set and differentiates itself from YARN and Mesos. Since its debut, it has seen contributions from over 1300 contributors and over 50000 commits. Kubernetes has cemented itself as a core player in the cluster computing world, and cloud-computing providers such as Google Container Engine, Google Compute Engine, Amazon Web Services, and Microsoft Azure support running Kubernetes clusters.
>>
>> This document outlines a proposal for integrating Apache Spark with Kubernetes in a first-class way, adding Kubernetes to the list of cluster managers that Spark can be used with. Doing so would allow users to share their computing resources and containerization framework between their existing applications on Kubernetes and their computational Spark applications. Although there is existing support for running a Spark standalone cluster on Kubernetes, there are still major advantages to and significant interest in having native execution support. For example, this integration provides better support for multi-tenancy and dynamic resource allocation. It also allows users to run applications of different Spark versions of their choice in the same cluster.
>>
>> The feature is being developed in a separate fork in order to minimize risk to the main project during development. Since the start of development in November of 2016, it has received over 100 commits from over 20 contributors and supports two releases based on Spark 2.1 and 2.2 respectively. Documentation is also being actively worked on, both in the main project repository and in the repository https://github.com/apache-spark-on-k8s/userdocs. Regarding real-world use cases, we have seen cluster setups that use 1000+ cores. We are also seeing growing interest in this project from more and more organizations.
>>
>> While it is easy to bootstrap the project in a forked repository, it is hard to maintain in the long run because of the tricky process of rebasing onto upstream and the lack of awareness in the larger Spark community. It would be beneficial to both the Spark and Kubernetes communities to see this feature merged upstream. On one hand, it gives Spark users the option of running their Spark workloads along with other workloads that may already be running on Kubernetes, enabling better resource sharing and isolation, and better cluster administration. On the other hand, it gives Kubernetes a leap forward in the area of large-scale data processing by being an officially supported cluster manager for Spark. The risk of merging into upstream is low because most of the changes are purely incremental, i.e., new Kubernetes-aware implementations of existing interfaces/classes in Spark core are introduced. The development is also concentrated in a single place, at resource-managers/kubernetes. The risk is further reduced by a comprehensive integration test framework, and an active and responsive community of future maintainers.
>>
>> Target Personas
>>
>> DevOps engineers, data scientists, data engineers, application developers, and anyone who can benefit from having Kubernetes as a native cluster manager for Spark.
>>
>> Goals
>>
>> - Make Kubernetes a first-class cluster manager for Spark, alongside Spark Standalone, YARN, and Mesos.
>> - Support both client and cluster deployment modes.
>> - Support dynamic resource allocation.
>> - Support Spark Java/Scala, PySpark, and SparkR applications.
>> - Support secure HDFS access.
>> - Allow running applications of different Spark versions in the same cluster through the ability to specify the driver and executor Docker images on a per-application basis.
>> - Support specification and enforcement of limits on both CPU cores and memory.
>>
>> Non-Goals
>>
>> - Support cluster resource scheduling and sharing beyond the capabilities offered natively by the Kubernetes per-namespace resource quota model.
>>
>> Proposed API Changes
>>
>> Most API changes are purely incremental, i.e., new Kubernetes-aware implementations of existing interfaces/classes in Spark core are introduced. Detailed changes are as follows.
>>
>> - A new cluster manager option, KUBERNETES, is introduced, and some changes are made to SparkSubmit to make it aware of this option.
>> - A new implementation of CoarseGrainedSchedulerBackend, namely KubernetesClusterSchedulerBackend, is responsible for managing the creation and deletion of executor Pods through the Kubernetes API.
>> - A new implementation of TaskSchedulerImpl, namely KubernetesTaskSchedulerImpl, and a new implementation of TaskSetManager, namely KubernetesTaskSetManager, are introduced for Kubernetes-aware task scheduling.
>> - When dynamic resource allocation is enabled, a new implementation of ExternalShuffleService, namely KubernetesExternalShuffleService, is introduced.
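For concreteness, here is a minimal sketch of what selecting the proposed KUBERNETES cluster manager could look like from the application side. It is illustrative only: the k8s:// master URL form and the spark.kubernetes.namespace property follow the publicly documented Spark-on-Kubernetes options, the namespace and input path are made up, and the fork at this point primarily targets cluster mode via spark-submit rather than building the session in-process like this.

    import org.apache.spark.sql.SparkSession

    object KubernetesWordCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          // k8s:// points Spark at a Kubernetes API server instead of a
          // YARN ResourceManager or a standalone master.
          .master("k8s://https://kubernetes.default.svc:443")
          .appName("k8s-wordcount")
          // Executor Pods are created in this namespace, so its resource
          // quota applies to the application (illustrative value).
          .config("spark.kubernetes.namespace", "spark-jobs")
          .config("spark.executor.instances", "4")
          .getOrCreate()

        val counts = spark.sparkContext
          .textFile("hdfs:///data/articles")   // hypothetical input path
          .flatMap(_.split("\\s+"))
          .map((_, 1L))
          .reduceByKey(_ + _)

        counts.take(10).foreach(println)
        spark.stop()
      }
    }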
>>
>> Design Sketch
>>
>> Below we briefly describe the design. For more details on the design and architecture, please refer to the architecture documentation. The main idea of this design is to run the Spark driver and executors inside Kubernetes Pods. Pods are a co-located and co-scheduled group of one or more containers run in a shared context. The driver is responsible for creating and destroying executor Pods through the Kubernetes API, while Kubernetes is fully responsible for scheduling the Pods to run on available nodes in the cluster. In cluster mode, the driver also runs in a Pod in the cluster, created through the Kubernetes API by a Kubernetes-aware submission client called by the spark-submit script. Because the driver runs in a Pod, it is reachable by the executors in the cluster using its Pod IP. In client mode, the driver runs outside the cluster and calls the Kubernetes API to create and destroy executor Pods. The driver must be routable from within the cluster for the executors to communicate with it.
>>
>> The main component running in the driver is the KubernetesClusterSchedulerBackend, an implementation of CoarseGrainedSchedulerBackend, which manages allocating and destroying executors via the Kubernetes API, as instructed by Spark core via calls to the methods doRequestTotalExecutors and doKillExecutors, respectively. Within the KubernetesClusterSchedulerBackend, a separate kubernetes-pod-allocator thread handles the creation of new executor Pods with appropriate throttling and monitoring. Throttling is achieved using a feedback loop that makes decisions about submitting new requests for executors based on whether previous executor Pod creation requests have completed. This indirection is necessary because the Kubernetes API server accepts requests for new Pods optimistically, with the anticipation of eventually being able to schedule them to run. However, it is undesirable to have a very large number of Pods that cannot be scheduled and stay pending within the cluster. The throttling mechanism gives us control over how fast an application scales up (which can be configured), and helps prevent Spark applications from DoS-ing the Kubernetes API server with too many Pod creation requests. The executor Pods simply run the CoarseGrainedExecutorBackend class from a pre-built Docker image that contains a Spark distribution.
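To make the pod-allocator behavior above concrete, here is a small, self-contained model of the throttling feedback loop. The names (PodLauncher, ThrottledPodAllocator, maybeAllocate) are illustrative, not the fork's actual internals; in the real backend this logic is driven by the target count passed through doRequestTotalExecutors.

    import scala.collection.mutable

    // Abstracts the two facts the allocator cares about: submitting a Pod
    // creation request, and observing whether a previously requested Pod
    // has actually been scheduled (hypothetical interface).
    trait PodLauncher {
      def requestExecutorPod(id: Int): Unit
      def isPodRunning(id: Int): Boolean
    }

    final class ThrottledPodAllocator(launcher: PodLauncher, batchSize: Int) {
      private val requested = mutable.Set.empty[Int]
      private var nextId = 0

      // Called periodically by an allocator thread. A new batch of executor
      // Pods is requested only once every previously requested Pod has been
      // scheduled, so pending-but-unschedulable Pods never pile up and the
      // API server is not flooded with creation requests.
      def maybeAllocate(targetTotal: Int): Unit = {
        val outstanding = requested.count(id => !launcher.isPodRunning(id))
        if (outstanding == 0 && requested.size < targetTotal) {
          val toLaunch = math.min(batchSize, targetTotal - requested.size)
          (0 until toLaunch).foreach { _ =>
            launcher.requestExecutorPod(nextId)
            requested += nextId
            nextId += 1
          }
        }
      }
    }

In this model the ramp-up speed is governed by batchSize and the polling interval, which corresponds to the configurable scale-up control described above.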
>>
>> There are two auxiliary and optional components: the ResourceStagingServer and the KubernetesExternalShuffleService, which serve the specific purposes described below.
>>
>> The ResourceStagingServer serves as a file store (in the absence of a persistent storage layer in Kubernetes) for application dependencies uploaded from the submission client machine, which then get downloaded from the server by the init-containers in the driver and executor Pods. It is a Jetty server with JAX-RS and has two endpoints, for uploading and downloading files respectively. Security tokens are returned in the responses for file uploads and must be carried in the requests for downloading the files. The ResourceStagingServer is deployed as a Kubernetes Service backed by a Deployment in the cluster, and multiple instances may be deployed in the same cluster. Spark applications specify which ResourceStagingServer instance to use through a configuration property.
>>
>> The KubernetesExternalShuffleService is used to support dynamic resource allocation, with which the number of executors of a Spark application can change at runtime based on resource needs. It provides an additional endpoint for drivers that allows the shuffle service to detect driver termination and clean up the shuffle files associated with the corresponding application. There are two ways of deploying the KubernetesExternalShuffleService: running a shuffle service Pod on each node in the cluster (or a subset of the nodes) using a DaemonSet, or running a shuffle service container in each of the executor Pods. In the first option, each shuffle service container mounts a hostPath volume. The same hostPath volume is also mounted by each of the executor containers, which must also have the environment variable SPARK_LOCAL_DIRS point to the hostPath. In the second option, a shuffle service container is co-located with an executor container in each of the executor Pods, and the two containers share an emptyDir volume where the shuffle data gets written. There may be multiple instances of the shuffle service deployed in a cluster, to be used for different versions of Spark or for different priority levels with different resource quotas.
>>
>> New Kubernetes-specific configuration options are also introduced to facilitate specification and customization of driver and executor Pods and related Kubernetes resources. For example, driver and executor Pods can be created in a particular Kubernetes namespace and on a particular set of the nodes in the cluster. Users are allowed to apply labels and annotations to the driver and executor Pods.
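As a rough illustration of that configuration surface, the sketch below sets a namespace, Pod labels/annotations, and resource limits for one application. The property names follow the publicly documented spark.kubernetes.* options; the exact names available in the fork at this point may differ, so treat them as indicative rather than definitive, and the values are made up.

    import org.apache.spark.SparkConf

    object TeamAConf {
      // Indicative configuration only; see the caveat above.
      val conf: SparkConf = new SparkConf()
        .setAppName("team-a-etl")
        .setMaster("k8s://https://kubernetes.default.svc:443")
        // Driver and executor Pods are created in this namespace, so the
        // namespace's resource quota bounds the application.
        .set("spark.kubernetes.namespace", "team-a")
        // Labels and annotations applied to the Pods for selection/bookkeeping.
        .set("spark.kubernetes.driver.label.team", "a")
        .set("spark.kubernetes.executor.annotation.billing", "batch")
        // CPU and memory requests/limits enforced by Kubernetes.
        .set("spark.executor.memory", "4g")
        .set("spark.executor.cores", "2")
        .set("spark.kubernetes.executor.limit.cores", "2")
    }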
>>
>> Additionally, secure HDFS support is being actively worked on following the design here. Both short-running jobs and long-running jobs that need periodic delegation token refresh are supported, leveraging built-in Kubernetes constructs like Secrets. Please refer to the design doc for details.
>>
>> Rejected Designs
>>
>> Resource Staging by the Driver
>>
>> A first implementation effectively included the ResourceStagingServer in the driver container itself. The driver container ran a custom command that opened an HTTP endpoint and waited for the submission client to send resources to it. The server would then run the driver code after it had received the resources from the submission client machine. The problem with this approach is that the submission client needs to deploy the driver in such a way that the driver itself is reachable from outside of the cluster, but it is difficult for an automated framework that is not aware of the cluster's configuration to expose an arbitrary Pod in a generic way. The Service-based design chosen allows a cluster administrator to expose the ResourceStagingServer in a manner that makes sense for their cluster, such as with an Ingress or with a NodePort service.
>>
>> Kubernetes External Shuffle Service
>>
>> Several alternatives were considered for the design of the shuffle service. The first design postulated the use of long-lived executor Pods with sidecar containers in them running the shuffle service. The advantage of this model was that it would let us use emptyDir for sharing, as opposed to using node-local storage, which guarantees better lifecycle management of storage by Kubernetes. The apparent disadvantage was that it would be a departure from the traditional Spark methodology of keeping executors for only as long as required in dynamic allocation mode. It would additionally use up more resources than strictly necessary during the course of long-running jobs, partially losing the advantage of dynamic scaling.
>>
>> Another alternative considered was to use a separate shuffle service manager as a nameserver. This design has a few drawbacks. First, it means another component that needs authentication/authorization management and maintenance. Second, this separate component needs to be kept in sync with the Kubernetes cluster. Last but not least, most of the functionality of this separate component can be performed by a combination of the in-cluster shuffle service and the Kubernetes API server.
>>
>> Pluggable Scheduler Backends
>>
>> Fully pluggable scheduler backends were considered as a more generalized solution, and remain interesting as a possible avenue for future-proofing against new scheduling targets. For the purposes of this project, adding a new specialized scheduler backend for Kubernetes was chosen as the approach due to its very low impact on the core Spark code; making the scheduler fully pluggable would be a high-impact, high-risk modification to Spark's core libraries. The pluggable scheduler backends effort is being tracked in SPARK-19700.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau