Mailing-List: contact dev-help@spark.apache.org; run by ezmlm
Precedence: bulk
MIME-Version: 1.0
In-Reply-To: <CAMcCKecEL8HcHw=bdqCw1+TWaxsbTpkybAeOFcO01HQEBommog@mail.gmail.com>
References: <CADDfBqAMQOigu3LpnX8otwpohpaFcWmax7No0tYqRBMLNfgMNw@mail.gmail.com>
 <CAMcCKecEL8HcHw=bdqCw1+TWaxsbTpkybAeOFcO01HQEBommog@mail.gmail.com>
From: Erik Erlandson <eerlands@redhat.com>
Date: Tue, 15 Aug 2017 11:11:40 -0700
Message-ID: <CAMcCKefMAD_fDG9zUHz3B8EnLODjFPhJbuezLL7A7cBsGxoZqg@mail.gmail.com>
Subject: Re: SPIP: Spark on Kubernetes
To: dev@spark.apache.org
Content-Type: multipart/alternative; boundary="001a1142dfa414aeb60556ceb62f"
archived-at: Tue, 15 Aug 2017 18:11:55 -0000

--001a1142dfa414aeb60556ceb62f
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Kubernetes has evolved into an important container orchestration platform;
it has a large and growing user base and an active ecosystem.  Users of
Apache Spark who are also deploying applications on Kubernetes (or are
planning to) will have convergence-related motivations for migrating their
Spark applications to Kubernetes as well. Doing so avoids the need to deploy
separate cluster infrastructure for Spark workloads and allows Spark
applications to take full advantage of inhabiting the same orchestration
environment as other applications.  In this respect, native Kubernetes
support for Spark represents a way to optimize uptake and retention of
Apache Spark among the members of the expanding Kubernetes community.

On Tue, Aug 15, 2017 at 8:43 AM, Erik Erlandson <eerlands@redhat.com> wrote:

> +1 (non-binding)
>
>
> On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan <foxish@google.com>
> wrote:
>
>> The Spark on Kubernetes effort has been developed separately in a fork, and
>> linked back from the Apache Spark project as an experimental backend
>> <http://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types>.
>> We're ~6 months in and have had 5 releases
>> <https://github.com/apache-spark-on-k8s/spark/releases>.
>>
>>    - 2 Spark versions maintained (2.1, and 2.2)
>>    - Extensive integration testing and refactoring efforts to maintain
>>    code quality
>>    - Developer
>>    <https://github.com/apache-spark-on-k8s/spark#getting-started> and
>>    user-facing <https://apache-spark-on-k8s.github.io/userdocs/>
>>    documentation
>>    - 10+ consistent code contributors from different organizations
>>    <https://apache-spark-on-k8s.github.io/userdocs/contribute.html#project-contributions>
>>    involved in actively maintaining and using the project, with several
>>    more members involved in testing and providing feedback.
>>    - The community has delivered several talks on Spark-on-Kubernetes
>>    generating lots of feedback from users.
>>    - In addition to these, we've seen efforts spawn off such as:
>>       - HDFS on Kubernetes
>>       <https://github.com/apache-spark-on-k8s/kubernetes-HDFS> with
>>       Locality and Performance Experiments
>>       - Kerberized access
>>       <https://docs.google.com/document/d/1RBnXD9jMDjGonOdKJ2bA1lN4AAV_1RwpU_ewFuCNWKg/edit>
>>       to HDFS from Spark running on Kubernetes
>>
>> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>>
>>    - +1: Yeah, let's go forward and implement the SPIP.
>>    - +0: Don't really care.
>>    - -1: I don't think this is a good idea because of the following
>>    technical reasons.
>>
>> If there is any further clarification desired, on the design or the
>> implementation, please feel free to ask questions or provide feedback.
>>
>>
>> SPIP: Kubernetes as A Native Cluster Manager
>>
>> Full Design Doc: link
>> <https://issues.apache.org/jira/secure/attachment/12881586/SPARK-18278%20Spark%20on%20Kubernetes%20Design%20Proposal%20Revision%202%20%281%29.pdf>
>>
>> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>>
>> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>>
>> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
>> Cheah, Ilan Filonenko, Sean Suchter, Kimoon Kim
>>
>> Background and Motivation
>>
>> Containerization and cluster management technologies are constantly
>> evolving in the cluster computing world. Apache Spark currently implements
>> support for Apache Hadoop YARN and Apache Mesos, in addition to providing
>> its own standalone cluster manager. In 2014, Google announced development
>> of Kubernetes <https://kubernetes.io/>, which has its own unique feature
>> set and differentiates itself from YARN and Mesos. Since its debut, it has
>> seen contributions from over 1300 contributors with over 50000 commits.
>> Kubernetes has cemented itself as a core player in the cluster computing
>> world, and cloud-computing providers such as Google Container Engine,
>> Google Compute Engine, Amazon Web Services, and Microsoft Azure support
>> running Kubernetes clusters.
>>
>> This document outlines a proposal for integrating Apache Spark with
>> Kubernetes in a first-class way, adding Kubernetes to the list of cluster
>> managers that Spark can be used with. Doing so would allow users to share
>> their computing resources and containerization framework between their
>> existing applications on Kubernetes and their computational Spark
>> applications. Although there is existing support for running a Spark
>> standalone cluster on Kubernetes
>> <https://github.com/kubernetes/examples/blob/master/staging/spark/README.md>,
>> there are still major advantages to, and significant interest in, having
>> native execution support. For example, this integration provides better
>> support for multi-tenancy and dynamic resource allocation. It also allows
>> users to run applications of different Spark versions of their choice in
>> the same cluster.
>>
>> The feature is being developed in a separate fork
>> <https://github.com/apache-spark-on-k8s/spark> in order to minimize risk
>> to the main project during development. Since the start of development
>> in November of 2016, it has received over 100 commits from over 20
>> contributors and supports two releases based on Spark 2.1 and 2.2
>> respectively. Documentation is also being actively worked on, both in the
>> main project repository and in the repository
>> https://github.com/apache-spark-on-k8s/userdocs. Regarding real-world
>> use cases, we have seen cluster setups that use 1000+ cores. We are also
>> seeing growing interest in this project from more and more organizations.
>>
>> While it is easy to bootstrap the project in a forked repository, it is
>> hard to maintain it in the long run because of the tricky process of
>> rebasing onto the upstream and the lack of awareness in the large Spark
>> community. It would be beneficial to both the Spark and Kubernetes
>> communities to see this feature merged upstream. On one hand, it gives
>> Spark users the option of running their Spark workloads along with other
>> workloads that may already be running on Kubernetes, enabling better
>> resource sharing and isolation, and better cluster administration. On the
>> other hand, it gives Kubernetes a leap forward in the area of large-scale
>> data processing by being an officially supported cluster manager for Spark.
>> The risk of merging into upstream is low because most of the changes are
>> purely incremental, i.e., new Kubernetes-aware implementations of existing
>> interfaces/classes in Spark core are introduced. The development is also
>> concentrated in a single place at resource-managers/kubernetes
>> <https://github.com/apache-spark-on-k8s/spark/tree/branch-2.2-kubernetes/resource-managers/kubernetes>.
>> The risk is further reduced by a comprehensive integration test framework,
>> and an active and responsive community of future maintainers.
>>
>> Target Personas
>>
>> Devops, data scientists, data engineers, application developers, anyone
>> who can benefit from having Kubernetes
>> <https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/> as a
>> native cluster manager for Spark.
>>
>> Goals
>>
>>    - Make Kubernetes a first-class cluster manager for Spark, alongside
>>    Spark Standalone, YARN, and Mesos.
>>    - Support both client and cluster deployment modes.
>>    - Support dynamic resource allocation
>>    <http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation>.
>>    - Support Spark Java/Scala, PySpark, and SparkR applications.
>>    - Support secure HDFS access.
>>    - Allow running applications of different Spark versions in the same
>>    cluster through the ability to specify the driver and executor Docker
>>    images on a per-application basis.
>>    - Support specification and enforcement of limits on both CPU cores and
>>    memory.
>>
>> Non-Goals
>>
>>    - Support cluster resource scheduling and sharing beyond capabilities
>>    offered natively by the Kubernetes per-namespace resource quota model.
>>
>> Proposed API Changes
>>
>> Most API changes are purely incremental, i.e., new Kubernetes-aware
>> implementations of existing interfaces/classes in Spark core are
>> introduced. Detailed changes are as follows.
>>
>>    - A new cluster manager option KUBERNETES is introduced and some
>>    changes are made to SparkSubmit to make it aware of this option.
>>    - A new implementation of CoarseGrainedSchedulerBackend, namely
>>    KubernetesClusterSchedulerBackend, is responsible for managing the
>>    creation and deletion of executor Pods through the Kubernetes API.
>>    - A new implementation of TaskSchedulerImpl, namely
>>    KubernetesTaskSchedulerImpl, and a new implementation of
>>    TaskSetManager, namely KubernetesTaskSetManager, are introduced for
>>    Kubernetes-aware task scheduling.
>>    - When dynamic resource allocation is enabled, a new implementation of
>>    ExternalShuffleService, namely KubernetesExternalShuffleService, is
>>    introduced.
>>
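
To make the SparkSubmit change in the list above a bit more concrete, here
is a minimal Python sketch of how a master URL could be mapped to a cluster
manager option. It is not code from the fork: the `k8s://` scheme, the enum
and the function name are assumptions for illustration, since the proposal
only states that a KUBERNETES option is added.

```python
from enum import Enum


class ClusterManager(Enum):
    LOCAL = "local"
    STANDALONE = "standalone"
    YARN = "yarn"
    MESOS = "mesos"
    KUBERNETES = "kubernetes"  # the new option described in the proposal


def cluster_manager_for(master: str) -> ClusterManager:
    """Map a master URL to a cluster manager (illustrative only)."""
    if master == "yarn":
        return ClusterManager.YARN
    if master.startswith("spark://"):
        return ClusterManager.STANDALONE
    if master.startswith("mesos://"):
        return ClusterManager.MESOS
    if master.startswith("k8s://"):  # assumed scheme, not stated in this email
        return ClusterManager.KUBERNETES
    if master.startswith("local"):
        return ClusterManager.LOCAL
    raise ValueError(f"unrecognized master URL: {master}")
```

The point of the sketch is only that the change is additive: existing
branches stay as they are and one new branch is introduced.
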
>> Design Sketch
>>
>> Below we briefly describe the design. For more details on the design and
>> architecture, please refer to the architecture documentation
>> <https://github.com/apache-spark-on-k8s/spark/tree/branch-2.2-kubernetes/resource-managers/kubernetes/architecture-docs>.
>> The main idea of this design is to run the Spark driver and executors inside
>> Kubernetes Pods <https://kubernetes.io/docs/concepts/workloads/pods/pod/>.
>> A Pod is a co-located and co-scheduled group of one or more containers run
>> in a shared context. The driver is responsible for creating and destroying
>> executor Pods through the Kubernetes API, while Kubernetes is fully
>> responsible for scheduling the Pods to run on available nodes in the
>> cluster. In cluster mode, the driver also runs in a Pod in the cluster,
>> created through the Kubernetes API by a Kubernetes-aware submission client
>> called by the spark-submit script. Because the driver runs in a Pod, it
>> is reachable by the executors in the cluster using its Pod IP. In
>> client mode, the driver runs outside the cluster and calls the Kubernetes
>> API to create and destroy executor Pods. The driver must be routable from
>> within the cluster for the executors to communicate with it.
>>
>> The main component running in the driver is the
>> KubernetesClusterSchedulerBackend, an implementation of
>> CoarseGrainedSchedulerBackend, which manages allocating and destroying
>> executors via the Kubernetes API, as instructed by Spark core via calls to
>> the methods doRequestTotalExecutors and doKillExecutors, respectively.
>> Within the KubernetesClusterSchedulerBackend, a separate
>> kubernetes-pod-allocator thread handles the creation of new executor
>> Pods with appropriate throttling and monitoring. Throttling is achieved
>> using a feedback loop that decides whether to submit new requests for
>> executors based on whether previous executor Pod creation requests have
>> completed. This indirection is necessary because the Kubernetes API server
>> accepts requests for new Pods optimistically, with the anticipation of
>> being able to eventually schedule them to run. However, it is undesirable
>> to have a very large number of Pods that cannot be scheduled and stay
>> pending within the cluster. The throttling mechanism gives us control over
>> how fast an application scales up (which can be configured), and helps
>> prevent Spark applications from DoS-ing the Kubernetes API server with too
>> many Pod creation requests. The executor Pods simply run the
>> CoarseGrainedExecutorBackend class from a pre-built Docker image that
>> contains a Spark distribution.
>>
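
The throttling feedback loop described above can be sketched roughly as
follows. This is a simplified Python model under stated assumptions, not the
actual allocator thread: `FakeKubernetesApi`, `PodAllocator`, `batch_size`
and `tick` are hypothetical names, and the real implementation talks to the
Kubernetes API rather than to counters.

```python
class FakeKubernetesApi:
    """Stand-in for the Kubernetes API: Pods start Pending, then get scheduled."""

    def __init__(self):
        self.pending = 0
        self.running = 0

    def create_pod(self):
        # The API server accepts the request optimistically.
        self.pending += 1

    def schedule_pending(self, capacity):
        """Simulate the Kubernetes scheduler placing up to `capacity` Pods."""
        placed = min(capacity, self.pending)
        self.pending -= placed
        self.running += placed


class PodAllocator:
    """Requests executor Pods in batches, gated on earlier requests completing."""

    def __init__(self, api, batch_size):
        self.api = api
        self.batch_size = batch_size  # hypothetical knob for scale-up speed
        self.target = 0

    def request_total_executors(self, total):
        # Plays the role of doRequestTotalExecutors in this sketch.
        self.target = total

    def tick(self):
        """One pass of the allocator loop; returns the number of Pods requested."""
        if self.api.pending > 0:
            return 0  # feedback: earlier requests have not completed yet
        missing = self.target - self.api.running
        to_create = max(0, min(self.batch_size, missing))
        for _ in range(to_create):
            self.api.create_pod()
        return to_create


api = FakeKubernetesApi()
allocator = PodAllocator(api, batch_size=5)
allocator.request_total_executors(12)
allocator.tick()         # requests a first batch of Pods
allocator.tick()         # requests nothing while that batch is still pending
api.schedule_pending(5)  # the cluster catches up
allocator.tick()         # now the next batch is requested
```

Gating each batch on the previous one keeps the number of unschedulable
pending Pods bounded by the batch size, which is the property the paragraph
above is after.
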
>> There are auxiliary and optional components: ResourceStagingServer and
>> KubernetesExternalShuffleService, which serve specific purposes
>> described below. The ResourceStagingServer serves as a file store (in
>> the absence of a persistent storage layer in Kubernetes) for application
>> dependencies uploaded from the submission client machine, which then get
>> downloaded from the server by the init-containers in the driver and
>> executor Pods. It is a Jetty server with JAX-RS and has two endpoints for
>> uploading and downloading files, respectively. Security tokens are returned
>> in the responses for file uploading and must be carried in the requests for
>> downloading the files. The ResourceStagingServer is deployed as a
>> Kubernetes Service
>> <https://kubernetes.io/docs/concepts/services-networking/service/>
>> backed by a Deployment
>> <https://kubernetes.io/docs/concepts/workloads/controllers/deployment/>
>> in the cluster and multiple instances may be deployed in the same cluster.
>> Spark applications specify which ResourceStagingServer instance to use
>> through a configuration property.
>>
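
The upload/download token contract described above might look like this in
miniature. This is an in-memory Python model, not the Jetty/JAX-RS server:
the class name, the method names and the token format are assumptions made
for illustration only.

```python
import hmac
import secrets


class ResourceStagingStore:
    """In-memory stand-in for the two staging endpoints (upload and download)."""

    def __init__(self):
        self._resources = {}  # resource_id -> (token, payload)

    def upload(self, payload: bytes):
        """Store a dependency bundle; the response carries a security token."""
        resource_id = secrets.token_hex(8)
        token = secrets.token_hex(16)
        self._resources[resource_id] = (token, payload)
        return resource_id, token

    def download(self, resource_id: str, token: str) -> bytes:
        """Return the bundle only if the request carries the matching token."""
        entry = self._resources.get(resource_id)
        if entry is None:
            raise KeyError(resource_id)
        expected, payload = entry
        # Constant-time comparison, so timing does not leak token prefixes.
        if not hmac.compare_digest(expected.encode(), token.encode()):
            raise PermissionError("invalid token for resource")
        return payload
```

In this model the submission client would call `upload` and pass the
returned id and token to the driver and executor Pods, whose init-containers
would then call `download`.
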
>> The KubernetesExternalShuffleService is used to support dynamic resource
>> allocation, with which the number of executors of a Spark application can
>> change at runtime based on the resource needs. It provides an additional
>> endpoint for drivers that allows the shuffle service to detect driver
>> termination and clean up the shuffle files associated with the
>> corresponding application. There are two ways of deploying the
>> KubernetesExternalShuffleService: running a shuffle service Pod on each
>> node in the cluster or a subset of the nodes using a DaemonSet
>> <https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/>,
>> or running a shuffle service container in each of the executor Pods. In the
>> first option, each shuffle service container mounts a hostPath
>> <https://kubernetes.io/docs/concepts/storage/volumes/#hostpath> volume.
>> The same hostPath volume is also mounted by each of the executor
>> containers, which must also have the environment variable
>> SPARK_LOCAL_DIRS point to the hostPath. In the second option, a shuffle
>> service container is co-located with an executor container in each of the
>> executor Pods. The two containers share an emptyDir
>> <https://kubernetes.io/docs/concepts/storage/volumes/#emptydir> volume
>> where the shuffle data gets written to. There may be multiple instances of
>> the shuffle service deployed in a cluster that may be used for different
>> versions of Spark, or for different priority levels with different resource
>> quotas.
>>
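
As a rough illustration of the two deployment options above, the following
Python sketch builds the relevant fragments of the Pod specs as plain
dictionaries. The field names (`volumes`, `volumeMounts`, `hostPath`,
`emptyDir`, `env`) follow the Kubernetes Pod spec; the container names, the
volume name and the paths are hypothetical, and images and other required
fields are left out.

```python
def daemonset_option(host_dir):
    """Option 1: node-level shuffle service sharing a hostPath with executors."""
    volume = {"name": "shuffle-dir", "hostPath": {"path": host_dir}}
    mount = {"name": "shuffle-dir", "mountPath": host_dir}
    shuffle_pod = {
        "volumes": [volume],
        "containers": [{"name": "shuffle-service", "volumeMounts": [mount]}],
    }
    executor_pod = {
        "volumes": [volume],
        "containers": [{
            "name": "executor",
            "volumeMounts": [mount],
            # Executors must write their local files under the shared hostPath.
            "env": [{"name": "SPARK_LOCAL_DIRS", "value": host_dir}],
        }],
    }
    return shuffle_pod, executor_pod


def sidecar_option(mount_dir):
    """Option 2: shuffle service container co-located in each executor Pod."""
    volume = {"name": "shuffle-dir", "emptyDir": {}}
    mount = {"name": "shuffle-dir", "mountPath": mount_dir}
    return {
        "volumes": [volume],
        "containers": [
            {"name": "executor", "volumeMounts": [mount]},
            {"name": "shuffle-service", "volumeMounts": [mount]},
        ],
    }
```

The difference that matters is the volume type: a hostPath outlives the
executor Pod on the node, whereas an emptyDir is tied to the lifetime of the
Pod that declares it.
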
>> New Kubernetes-specific configuration options are also introduced to
>> facilitate specification and customization of driver and executor Pods and
>> related Kubernetes resources. For example, driver and executor Pods can be
>> created in a particular Kubernetes namespace and on a particular set of the
>> nodes in the cluster. Users are allowed to apply labels and annotations to
>> the driver and executor Pods.
>>
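
One way such options could translate into Pod settings is sketched below in
Python. The `example.kubernetes.*` property names are placeholders invented
for this illustration and are not the names used by the fork; only the
target fields (`metadata.namespace`, `metadata.labels`,
`metadata.annotations`, `spec.nodeSelector`) are real Kubernetes fields.

```python
def _with_prefix(conf, prefix):
    """Collect properties under `prefix`, keyed by the remainder of the name."""
    return {k[len(prefix):]: v for k, v in conf.items() if k.startswith(prefix)}


def pod_settings(conf, role):
    """Derive namespace, labels, annotations and node selector for a Pod role."""
    return {
        "metadata": {
            "namespace": conf.get("example.kubernetes.namespace", "default"),
            "labels": _with_prefix(conf, f"example.kubernetes.{role}.label."),
            "annotations": _with_prefix(
                conf, f"example.kubernetes.{role}.annotation."
            ),
        },
        "spec": {
            "nodeSelector": _with_prefix(conf, "example.kubernetes.node.selector."),
        },
    }


conf = {
    "example.kubernetes.namespace": "analytics",
    "example.kubernetes.driver.label.team": "data",
    "example.kubernetes.executor.annotation.owner": "etl",
    "example.kubernetes.node.selector.disktype": "ssd",
}
driver = pod_settings(conf, "driver")
executor = pod_settings(conf, "executor")
```
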
>> Additionally, secure HDFS support is being actively worked on following
>> the design here
>> <https://docs.google.com/document/d/1RBnXD9jMDjGonOdKJ2bA1lN4AAV_1RwpU_ewFuCNWKg/edit>.
>> Both short-running jobs and long-running jobs that need periodic delegation
>> token refresh are supported, leveraging built-in Kubernetes constructs like
>> Secrets. Please refer to the design doc for details.
>>
>> Rejected Designs
>>
>> Resource Staging by the Driver
>>
>> A first implementation effectively included the ResourceStagingServer in
>> the driver container itself. The driver container ran a custom command that
>> opened an HTTP endpoint and waited for the submission client to send
>> resources to it. The server would then run the driver code after it had
>> received the resources from the submission client machine. The problem with
>> this approach is that the submission client needs to deploy the driver in
>> such a way that the driver itself would be reachable from outside of the
>> cluster, but it is difficult for an automated framework which is not aware
>> of the cluster's configuration to expose an arbitrary pod in a generic way.
>> The Service-based design chosen allows a cluster administrator to expose
>> the ResourceStagingServer in a manner that makes sense for their
>> cluster, such as with an Ingress or with a NodePort service.
>>
>> Kubernetes External Shuffle Service
>>
>> Several alternatives were considered for the design of the shuffle
>> service. The first design postulated the use of long-lived executor pods
>> and sidecar containers in them running the shuffle service. The advantage
>> of this model was that it would let us use emptyDir for sharing, which
>> guarantees better lifecycle management of storage by Kubernetes than
>> node-local storage does. The apparent disadvantage was that it would be a
>> departure from the traditional Spark methodology of keeping executors for
>> only as long as required in dynamic allocation mode. It would additionally
>> use up more resources than strictly necessary during the course of
>> long-running jobs, partially losing the advantage of dynamic scaling.
>>
>> Another alternative considered was to use a separate shuffle service
>> manager as a nameserver. This design has a few drawbacks. First, this means
>> another component that needs authentication/authorization management and
>> maintenance. Second, this separate component needs to be kept in sync with
>> the Kubernetes cluster. Last but not least, most of the functionality of
>> this separate component can be performed by a combination of the in-cluster
>> shuffle service and the Kubernetes API server.
>>
>> Pluggable Scheduler Backends
>>
>> Fully pluggable scheduler backends were considered as a more generalized
>> solution, and remain interesting as a possible avenue for future-proofing
>> against new scheduling targets.  For the purposes of this project, adding a
>> new specialized scheduler backend for Kubernetes was chosen as the approach
>> due to its very low impact on the core Spark code; making the scheduler
>> fully pluggable would be a high-impact, high-risk modification to Spark’s
>> core libraries. The pluggable scheduler backends effort is being tracked in
>> SPARK-19700 <https://issues.apache.org/jira/browse/SPARK-19700>.
>>
>>
>

--001a1142dfa414aeb60556ceb62f
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><div><br></div>Kubernetes has evolved into an important co=
ntainer orchestration platform; it has a large and growing user base and an=
 active ecosystem.=C2=A0 Users of Apache Spark who are also deploying appli=
cations on Kubernetes (or are planning to) will have convergence-related mo=
tivations for migrating their Spark applications to Kubernetes as well. It =
avoids the need for deploying separate cluster infra for Spark workloads an=
d allows Spark applications to take full advantage of inhabiting the same o=
rchestration environment as other applications.=C2=A0 In this respect, nati=
ve Kubernetes support for Spark represents a way to optimize uptake and ret=
ention of Apache Spark among the members of the expanding Kubernetes commun=
ity.<br></div><div class=3D"gmail_extra"><br><div class=3D"gmail_quote">On =
Tue, Aug 15, 2017 at 8:43 AM, Erik Erlandson <span dir=3D"ltr">&lt;<a href=
=3D"mailto:eerlands@redhat.com" target=3D"_blank">eerlands@redhat.com</a>&g=
t;</span> wrote:<br><blockquote class=3D"gmail_quote" style=3D"margin:0 0 0=
 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir=3D"ltr">+1 (non=
-binding)<div><div class=3D"h5"><br><div><div class=3D"gmail_extra"><br><di=
v class=3D"gmail_quote">On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan=
 <span dir=3D"ltr">&lt;<a href=3D"mailto:foxish@google.com" target=3D"_blan=
k">foxish@google.com</a>&gt;</span> wrote:<br><blockquote class=3D"gmail_qu=
ote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex=
"><div dir=3D"ltr"><span id=3D"m_4641530978377140058m_137737923653125097gma=
il-docs-internal-guid-01df7394-e67d-51c9-76d1-b89ba11594cf"><div>Spark on K=
ubernetes effort has been developed separately in a fork, and linked back f=
rom the Apache Spark project as=C2=A0<a href=3D"http://spark.apache.org/doc=
s/latest/cluster-overview.html#cluster-manager-types" target=3D"_blank">an =
experimental backend</a>. We&#39;re ~6 months in, have had=C2=A0<a href=3D"=
https://github.com/apache-spark-on-k8s/spark/releases" target=3D"_blank">5 =
releases</a>.=C2=A0<br></div><div><ul><li style=3D"margin-left:15px">2 Spar=
k versions maintained (2.1, and 2.2)</li><li style=3D"margin-left:15px">Ext=
ensive integration testing and refactoring efforts to maintain code quality=
</li><li style=3D"margin-left:15px"><a href=3D"https://github.com/apache-sp=
ark-on-k8s/spark#getting-started" target=3D"_blank">Developer</a>=C2=A0and=
=C2=A0<a href=3D"https://apache-spark-on-k8s.github.io/userdocs/" target=3D=
"_blank">user-facing</a>=C2=A0docu<wbr>mentation</li><li style=3D"margin-le=
ft:15px">10+ consistent code contributors from=C2=A0<a href=3D"https://apac=
he-spark-on-k8s.github.io/userdocs/contribute.html#project-contributions" t=
arget=3D"_blank">different organizations</a>=C2=A0involved in actively main=
taining and using the project, with several more members involved in testin=
g and providing feedback.</li><li style=3D"margin-left:15px">The community =
has delivered several talks on Spark-on-Kubernetes generating lots of feedb=
ack from users.</li><li style=3D"margin-left:15px">In addition to these, we=
&#39;ve seen efforts spawn off such as:<br></li><ul><li style=3D"margin-lef=
t:15px"><a href=3D"https://github.com/apache-spark-on-k8s/kubernetes-HDFS" =
target=3D"_blank">HDFS on Kubernetes</a>=C2=A0with Locality and Performance=
 Experiments<br></li><li style=3D"margin-left:15px"><a href=3D"https://docs=
.google.com/document/d/1RBnXD9jMDjGonOdKJ2bA1lN4AAV_1RwpU_ewFuCNWKg/edit" t=
arget=3D"_blank">Kerberized access</a>=C2=A0to HDFS from Spark running on K=
ubernetes</li></ul></ul></div><p dir=3D"ltr" style=3D"line-height:1.38;marg=
in-top:0pt;margin-bottom:3pt"><span style=3D"font-size:26pt;font-family:Ari=
al;color:rgb(0,0,0);background-color:transparent;vertical-align:baseline;wh=
ite-space:pre-wrap"></span></p><div><div><b><span style=3D"font-size:12.8px=
">Following the=C2=A0</span><span class=3D"m_4641530978377140058m_137737923=
653125097gmail-m_4908977744468987281gmail-m_5091327022349668319gmail-il" st=
yle=3D"font-size:12.8px">SPIP</span><span style=3D"font-size:12.8px">=C2=A0=
process, I&#39;m putting this=C2=A0</span><span class=3D"m_4641530978377140=
058m_137737923653125097gmail-m_4908977744468987281gmail-m_50913270223496683=
19gmail-il" style=3D"font-size:12.8px">SPIP</span><span style=3D"font-size:=
12.8px">=C2=A0up for a vote.</span></b><br></div><span id=3D"m_464153097837=
7140058m_137737923653125097gmail-m_4908977744468987281gmail-m_5091327022349=
668319gmail-docs-internal-guid-5bf68d19-e227-2783-86db-9bf86db1abe7"><ul st=
yle=3D"font-size:12.8px"><li style=3D"margin-left:15px">+1: Yeah, let&#39;s=
 go forward and implement the SPIP.<br></li><li style=3D"margin-left:15px">=
+0: Don&#39;t really care.<br></li><li style=3D"margin-left:15px">-1: I don=
&#39;t think this is a good idea because of the following technical reasons=
.</li></ul><div style=3D"font-size:12.8px">If there is any further clarific=
ation desired, on the design or the implementation, please feel free to ask=
 questions or provide feedback.</div></span></div><p dir=3D"ltr" style=3D"l=
ine-height:1.38;margin-top:0pt;margin-bottom:3pt"><br></p><p dir=3D"ltr" st=
yle=3D"line-height:1.38;margin-top:0pt;margin-bottom:3pt"><span style=3D"fo=
nt-size:26pt;font-family:Arial;color:rgb(0,0,0);background-color:transparen=
t;vertical-align:baseline;white-space:pre-wrap">SPIP: Kubernetes as A Nativ=
e Cluster Manager</span></p><br><p dir=3D"ltr" style=3D"text-align:left;lin=
e-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style=3D"font-size:11=
pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;vertical=
-align:baseline;white-space:pre-wrap"><span style=3D"background-color:trans=
parent;font-size:11pt;vertical-align:baseline">Full Design Doc: </span><a h=
ref=3D"https://issues.apache.org/jira/secure/attachment/12881586/SPARK-1827=
8%20Spark%20on%20Kubernetes%20Design%20Proposal%20Revision%202%20%281%29.pd=
f" style=3D"font-family:arial,sans-serif;font-size:small;white-space:normal=
;text-decoration-line:none" target=3D"_blank"><span style=3D"font-size:11pt=
;font-family:Arial;background-color:transparent;text-decoration-line:underl=
ine;vertical-align:baseline;white-space:pre-wrap">link</span></a><br></span=
></p><p dir=3D"ltr" style=3D"text-align:left;line-height:1.38;margin-top:0p=
t;margin-bottom:0pt"><span style=3D"font-size:11pt;font-family:Arial;color:=
rgb(0,0,0);background-color:transparent;vertical-align:baseline;white-space=
:pre-wrap">JIRA: </span><a href=3D"https://issues.apache.org/jira/browse/SP=
ARK-18278" style=3D"text-decoration-line:none" target=3D"_blank"><span styl=
e=3D"font-size:11pt;font-family:Arial;background-color:transparent;text-dec=
oration-line:underline;vertical-align:baseline;white-space:pre-wrap">https:=
//issues.apache.org/jira<wbr>/browse/SPARK-18278</span></a></p><p dir=3D"lt=
r" style=3D"text-align:left;line-height:1.38;margin-top:0pt;margin-bottom:0=
pt"><span style=3D"font-size:11pt;font-family:Arial;color:rgb(0,0,0);backgr=
ound-color:transparent;vertical-align:baseline;white-space:pre-wrap">Kubern=
etes Issue: </span><span style=3D"text-decoration-line:underline;font-size:=
11pt;font-family:Arial;background-color:transparent;vertical-align:baseline=
;white-space:pre-wrap"><a href=3D"https://github.com/kubernetes/kubernetes/=
issues/34377" style=3D"text-decoration-line:none" target=3D"_blank">https:/=
/github.com/kubernetes/<wbr>kubernetes/issues/34377</a></span></p><br><p di=
r=3D"ltr" style=3D"text-align:left;line-height:1.38;margin-top:0pt;margin-b=
ottom:0pt"><span style=3D"font-size:11pt;font-family:Arial;color:rgb(0,0,0)=
;background-color:transparent;vertical-align:baseline;white-space:pre-wrap"=
>Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt Ch=
eah,</span></p><p dir=3D"ltr" style=3D"text-align:left;line-height:1.38;mar=
gin-top:0pt;margin-bottom:0pt"><span style=3D"font-size:11pt;font-family:Ar=
ial;color:rgb(0,0,0);background-color:transparent;vertical-align:baseline;w=
hite-space:pre-wrap">Ilan Filonenko, Sean Suchter, Kimoon Kim</span></p><h1=
 dir=3D"ltr" style=3D"line-height:1.38;margin-top:20pt;margin-bottom:6pt"><=
span style=3D"font-size:20pt;font-family:Arial;color:rgb(0,0,0);background-=
color:transparent;font-weight:400;vertical-align:baseline;white-space:pre-w=
rap">Background and Motivation</span></h1><p dir=3D"ltr" style=3D"line-heig=
ht:1.38;margin-top:0pt;margin-bottom:0pt"><span style=3D"font-size:11pt;fon=
t-family:Arial;color:rgb(0,0,0);background-color:transparent;vertical-align=
:baseline;white-space:pre-wrap">Containerization and cluster management tec=
hnologies are constantly evolving in the cluster computing world. Apache Sp=
ark currently implements support for Apache Hadoop YARN and Apache Mesos, i=
n addition to providing its own standalone cluster manager. In 2014, Google=
 announced development of </span><a href=3D"https://kubernetes.io/" style=
=3D"text-decoration-line:none" target=3D"_blank"><span style=3D"font-size:1=
1pt;font-family:Arial;background-color:transparent;text-decoration-line:und=
erline;vertical-align:baseline;white-space:pre-wrap">Kubernetes</span></a><=
span style=3D"font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-=
color:transparent;vertical-align:baseline;white-space:pre-wrap"> which has =
its own unique feature set and differentiates itself from YARN and Mesos. S=
ince its debut, it has seen contributions from over 1300 contributors with =
over 50000 commits. Kubernetes has cemented itself as a core player in the =
cluster computing world, and cloud-computing providers such as Google Conta=
iner Engine, Google Compute Engine, Amazon Web Services, and Microsoft Azur=
e support running Kubernetes clusters.</span></p><br><p dir=3D"ltr" style=
=3D"line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style=3D"font-=
size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;v=
ertical-align:baseline;white-space:pre-wrap">This document outlines a propo=
sal for integrating Apache Spark with Kubernetes in a first class way, addi=
ng Kubernetes to the list of cluster managers that Spark can be used with. =
Doing so would allow users to share their computing resources and container=
ization framework between their existing applications on Kubernetes and the=
ir computational Spark applications. Although there is existing support for=
 </span><a href=3D"https://github.com/kubernetes/examples/blob/master/stagi=
ng/spark/README.md" style=3D"text-decoration-line:none" target=3D"_blank"><=
span style=3D"font-size:11pt;font-family:Arial;background-color:transparent=
;text-decoration-line:underline;vertical-align:baseline;white-space:pre-wra=
p">running a Spark standalone cluster on Kubernetes</span></a><span style=
=3D"font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:tran=
sparent;vertical-align:baseline;white-space:pre-wrap">, there are still maj=
or advantages and significant interest in having native execution support. =
For example, this integration provides better support for multi-tenancy and=
 dynamic resource allocation. It also allows users to run applications of d=
ifferent Spark versions of their choices in the same cluster. </span></p><b=
r><p dir=3D"ltr" style=3D"line-height:1.38;margin-top:0pt;margin-bottom:0pt=
"><span style=3D"font-size:11pt;font-family:Arial;color:rgb(0,0,0);backgrou=
nd-color:transparent;vertical-align:baseline;white-space:pre-wrap">The feat=
ure is being developed in a </span><a href=3D"https://github.com/apache-spa=
rk-on-k8s/spark" style=3D"text-decoration-line:none" target=3D"_blank"><spa=
n style=3D"font-size:11pt;font-family:Arial;background-color:transparent;te=
xt-decoration-line:underline;vertical-align:baseline;white-space:pre-wrap">=
separate fork</span></a><span style=3D"font-size:11pt;font-family:Arial;col=
or:rgb(0,0,0);background-color:transparent;vertical-align:baseline;white-sp=
ace:pre-wrap"> in order to minimize risk to the main project during develop=
ment. Since the start of the development in November of 2016, it has receiv=
ed over 100 commits from over 20 contributors and supports two releases bas=
ed on Spark 2.1 and 2.2 respectively. Documentation is also being actively =
worked on both in the main project repository and also in the repository </=
span><a href=3D"https://github.com/apache-spark-on-k8s/userdocs" style=3D"t=
ext-decoration-line:none" target=3D"_blank"><span style=3D"font-size:11pt;f=
ont-family:Arial;background-color:transparent;text-decoration-line:underlin=
e;vertical-align:baseline;white-space:pre-wrap">https://github.com/apache-s=
par<wbr>k-on-k8s/userdocs</span></a><span style=3D"font-size:11pt;font-fami=
ly:Arial;color:rgb(0,0,0);background-color:transparent;vertical-align:basel=
ine;white-space:pre-wrap">. Regarding real-world use cases, we have seen cl=
uster setups that use 1000+ cores. We are also seeing growing interest in =
this project from more and more organizations.</span></p><br><p dir=3D"ltr"=
 style=3D"line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style=3D=
"font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transpa=
rent;vertical-align:baseline;white-space:pre-wrap">While it is easy to boot=
strap the project in a forked repository, it is hard to maintain it in the =
long run because of the tricky process of rebasing onto upstream and the l=
ack of awareness in the larger Spark community. It would benefit both the =
Spark and Kubernetes communities to see this feature merged upstream. On o=
ne hand, it gives Spark users the option of running their Spark wo=
rkloads along with other workloads that may already be running on Kubernete=
s, enabling better resource sharing and isolation, and better cluster admin=
istration. On the other hand, it gives Kubernetes a leap forward in the are=
a of large-scale data processing by being an officially supported cluster m=
anager for Spark. The risk of merging into upstream is low because most of =
the changes are purely incremental, i.e., new Kubernetes-aware implementati=
ons of existing interfaces/classes in Spark core are introduced. The develo=
pment is also concentrated in a single place at </span><a href=3D"https://g=
ithub.com/apache-spark-on-k8s/spark/tree/branch-2.2-kubernetes/resource-man=
agers/kubernetes" style=3D"text-decoration-line:none" target=3D"_blank"><sp=
an style=3D"font-size:11pt;font-family:Arial;background-color:transparent;t=
ext-decoration-line:underline;vertical-align:baseline;white-space:pre-wrap"=
>resource-managers/kubernetes</span></a><span style=3D"font-size:11pt;font-=
family:Arial;color:rgb(0,0,0);background-color:transparent;vertical-align:b=
aseline;white-space:pre-wrap">. The risk is further reduced by a comprehens=
ive integration test framework, and an active and responsive community of f=
uture maintainers.</span></p><h1 dir=3D"ltr" style=3D"line-height:1.38;marg=
in-top:20pt;margin-bottom:6pt"><span style=3D"font-size:20pt;font-family:Ar=
ial;color:rgb(0,0,0);background-color:transparent;font-weight:400;vertical-=
align:baseline;white-space:pre-wrap">Target Personas</span></h1><p dir=3D"l=
tr" style=3D"line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style=
=3D"font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:tran=
sparent;vertical-align:baseline;white-space:pre-wrap">Devops, data scientis=
ts, data engineers, application developers, anyone who can benefit from hav=
ing </span><a href=3D"https://kubernetes.io/docs/concepts/overview/what-is-=
kubernetes/" style=3D"text-decoration-line:none" target=3D"_blank"><span st=
yle=3D"font-size:11pt;font-family:Arial;background-color:transparent;text-d=
ecoration-line:underline;vertical-align:baseline;white-space:pre-wrap">Kube=
rnetes</span></a><span style=3D"font-size:11pt;font-family:Arial;color:rgb(=
0,0,0);background-color:transparent;vertical-align:baseline;white-space:pre=
-wrap"> as a native cluster manager for Spark. </span></p><h1 dir=3D"ltr" s=
tyle=3D"line-height:1.38;margin-top:20pt;margin-bottom:6pt"><span style=3D"=
font-size:20pt;font-family:Arial;color:rgb(0,0,0);background-color:transpar=
ent;font-weight:400;vertical-align:baseline;white-space:pre-wrap">Goals</sp=
an></h1><ul style=3D"margin-top:0pt;margin-bottom:0pt"><li dir=3D"ltr" styl=
e=3D"list-style-type:disc;font-size:11pt;font-family:Arial;color:rgb(0,0,0)=
;background-color:transparent;vertical-align:baseline"><p dir=3D"ltr" style=
=3D"line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style=3D"font-=
size:11pt;background-color:transparent;vertical-align:baseline;white-space:=
pre-wrap">Make Kubernetes a first-class cluster manager for Spark, alongsid=
e Spark Standalone, YARN, and Mesos.</span></p></li><li dir=3D"ltr" style=
=3D"list-style-type:disc;font-size:11pt;font-family:Arial;color:rgb(0,0,0);=
background-color:transparent;vertical-align:baseline"><p dir=3D"ltr" style=
=3D"line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style=3D"font-=
size:11pt;background-color:transparent;vertical-align:baseline;white-space:=
pre-wrap">Support both client and cluster deployment modes.</span></p>=
</li><li dir=3D"ltr" style=3D"list-style-type:disc;font-size:11pt;font-fami=
ly:Ari=
al;color:rgb(0,0,0);background-color:transparent;vertical-align:baseline"><=
p dir=3D"ltr" style=3D"line-height:1.38;margin-top:0pt;margin-bottom:0pt"><=
span style=3D"font-size:11pt;background-color:transparent;vertical-align:ba=
seline;white-space:pre-wrap">Support </span><a href=3D"http://spark.apache.=
org/docs/latest/job-scheduling.html#dynamic-resource-allocation" style=3D"t=
ext-decoration-line:none" target=3D"_blank"><span style=3D"font-size:11pt;b=
ackground-color:transparent;text-decoration-line:underline;vertical-align:b=
aseline;white-space:pre-wrap">dynamic resource allocation</span></a><span s=
tyle=3D"font-size:11pt;background-color:transparent;vertical-align:baseline=
;white-space:pre-wrap">.</span></p></li><li dir=3D"ltr" style=3D"list-style=
-type:disc;font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-col=
or:transparent;vertical-align:baseline"><p dir=3D"ltr" style=3D"line-height=
:1.38;margin-top:0pt;margin-bottom:0pt"><span style=3D"font-size:11pt;backg=
round-color:transparent;vertical-align:baseline;white-space:pre-wrap">Suppo=
rt Spark Java/Scala, PySpark, and SparkR applications.</span></p></li><li =
dir=3D"ltr" style=3D"list-style-type:disc;font-size:11pt;font-family:Arial;=
color:rgb(0,0,0);background-color:transparent;vertical-align:baseline"><p d=
ir=3D"ltr" style=3D"line-height:1.38;margin-top:0pt;margin-bottom:0pt"><spa=
n style=3D"font-size:11pt;background-color:transparent;vertical-align:basel=
ine;white-space:pre-wrap">Support secure HDFS access.</span></p></li><li di=
r=3D"ltr" style=3D"list-style-type:disc;font-size:11pt;font-family:Arial;co=
lor:rgb(0,0,0);background-color:transparent;vertical-align:baseline"><p dir=
=3D"ltr" style=3D"line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span =
style=3D"font-size:11pt;background-color:transparent;vertical-align:baselin=
e;white-space:pre-wrap">Allow running applications of different Spark versi=
ons in the same cluster through the ability to specify the driver and execu=
tor Docker images on a per-application basis.</span></p></li><li dir=3D"ltr=
" style=3D"list-style-type:disc;font-size:11pt;font-family:Arial;color:rgb(=
0,0,0);background-color:transparent;vertical-align:baseline"><p dir=3D"ltr"=
 style=3D"line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style=3D=
"font-size:11pt;background-color:transparent;vertical-align:baseline;white-=
space:pre-wrap">Support specification and enforcement of limits on both CPU=
 cores and memory.</span></p></li></ul><h1 dir=3D"ltr" style=3D"line-height=
:1.38;margin-top:20pt;margin-bottom:6pt"><span style=3D"font-size:20pt;font=
-family:Arial;color:rgb(0,0,0);background-color:transparent;font-weight:400=
;vertical-align:baseline;white-space:pre-wrap">Non-Goals</span></h1><ul sty=
le=3D"margin-top:0pt;margin-bottom:0pt"><li dir=3D"ltr" style=3D"list-style=
-type:disc;font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-col=
or:transparent;vertical-align:baseline"><p dir=3D"ltr" style=3D"line-height=
:1.38;margin-top:0pt;margin-bottom:0pt"><span style=3D"font-size:11pt;backg=
round-color:transparent;vertical-align:baseline;white-space:pre-wrap">Suppo=
rt cluster resource scheduling and sharing beyond capabilities offered nati=
vely by the Kubernetes per-namespace resource quota model.</span></p></li><=
/ul><h1 dir=3D"ltr" style=3D"line-height:1.38;margin-top:20pt;margin-bottom=
:6pt"><span style=3D"font-size:20pt;font-family:Arial;color:rgb(0,0,0);back=
ground-color:transparent;font-weight:400;vertical-align:baseline;white-spac=
e:pre-wrap">Proposed API Changes</span></h1><p dir=3D"ltr" style=3D"line-he=
ight:1.38;margin-top:0pt;margin-bottom:0pt"><span style=3D"font-size:11pt;f=
ont-family:Arial;color:rgb(0,0,0);background-color:transparent;vertical-ali=
gn:baseline;white-space:pre-wrap">Most API changes are purely incremental, =
i.e., new Kubernetes-aware implementations of existing interfaces/classes i=
n Spark core are introduced. Detailed changes are as follows.</span></p><ul=
 style=3D"margin-top:0pt;margin-bottom:0pt"><li dir=3D"ltr" style=3D"list-s=
tyle-type:disc;font-size:11pt;font-family:Arial;color:rgb(0,0,0);background=
-color:transparent;vertical-align:baseline"><p dir=3D"ltr" style=3D"line-he=
ight:1.38;margin-top:0pt;margin-bottom:0pt"><span style=3D"font-size:11pt;b=
ackground-color:transparent;vertical-align:baseline;white-space:pre-wrap">A=
 new cluster manager option </span><span style=3D"font-size:11pt;font-famil=
y:&quot;Courier New&quot;;background-color:transparent;vertical-align:basel=
ine;white-space:pre-wrap">KUBERNETES</span><span style=3D"font-size:11pt;ba=
ckground-color:transparent;vertical-align:baseline;white-space:pre-wrap"> i=
s introduced and some changes are made to </span><span style=3D"font-size:1=
1pt;font-family:&quot;Courier New&quot;;background-color:transparent;vertic=
al-align:baseline;white-space:pre-wrap">SparkSubmit</span><span style=3D"fo=
nt-size:11pt;background-color:transparent;vertical-align:baseline;white-spa=
ce:pre-wrap"> to make it aware of this option. </span></p></li><li dir=
=3D"ltr" style=3D"list-style-type:disc;font-size:11pt;font-family:Arial;col=
or:rgb(0,0,0);background-color:transparent;vertical-align:baseline"><p dir=
=3D"ltr" style=3D"line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span =
style=3D"font-size:11pt;background-color:transparent;vertical-align:baselin=
e;white-space:pre-wrap">A new implementation of </span><span style=3D"font-=
size:11pt;font-family:&quot;Courier New&quot;;background-color:transparent;=
vertical-align:baseline;white-space:pre-wrap">CoarseGrainedSchedulerBackend=
</span><span style=3D"font-size:11pt;background-color:transparent;vertical-=
align:baseline;white-space:pre-wrap">, namely </span><span style=3D"font-si=
ze:11pt;font-family:&quot;Courier New&quot;;background-color:transparent;ve=
rtical-align:baseline;white-space:pre-wrap">KubernetesClusterSchedulerBack<=
wbr>end</span><span style=3D"font-size:11pt;background-color:transparent;ve=
rtical-align:baseline;white-space:pre-wrap"> is responsible for managing th=
e creation and deletion of executor Pods through the Kubernetes API.</span>=
</p></li><li dir=3D"ltr" style=3D"list-style-type:disc;font-size:11pt;font-=
family:Arial;color:rgb(0,0,0);background-color:transparent;vertical-align:b=
aseline"><p dir=3D"ltr" style=3D"line-height:1.38;margin-top:0pt;margin-bot=
tom:0pt"><span style=3D"font-size:11pt;background-color:transparent;vertica=
l-align:baseline;white-space:pre-wrap">A new implementation of </span><span=
 style=3D"font-size:11pt;font-family:&quot;Courier New&quot;;background-col=
or:transparent;vertical-align:baseline;white-space:pre-wrap">TaskSchedulerI=
mpl</span><span style=3D"font-size:11pt;background-color:transparent;vertic=
al-align:baseline;white-space:pre-wrap">, namely </span><span style=3D"font=
-size:11pt;font-family:&quot;Courier New&quot;;background-color:transparent=
;vertical-align:baseline;white-space:pre-wrap">KubernetesTaskSchedulerImpl<=
/span><span style=3D"font-size:11pt;background-color:transparent;vertical-a=
lign:baseline;white-space:pre-wrap">, and a new implementation of </span><s=
pan style=3D"font-size:11pt;font-family:&quot;Courier New&quot;;background-=
color:transparent;vertical-align:baseline;white-space:pre-wrap">TaskSetMana=
ger</span><span style=3D"font-size:11pt;background-color:transparent;vertic=
al-align:baseline;white-space:pre-wrap">, namely </span><span style=3D"font=
-size:11pt;font-family:&quot;Courier New&quot;;background-color:transparent=
;vertical-align:baseline;white-space:pre-wrap">KubernetesTaskSetManager</s=
pan><span style=3D"font-size:11pt;background-color:transparent;vertical-ali=
gn:baseline;white-space:pre-wrap">, are introduced for Kubernetes-aware tas=
k scheduling.</span></p></li><li dir=3D"ltr" style=3D"list-style-type:disc;=
font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transpar=
ent;vertical-align:baseline"><p dir=3D"ltr" style=3D"line-height:1.38;margi=
n-top:0pt;margin-bottom:0pt"><span style=3D"font-size:11pt;background-color=
:transparent;vertical-align:baseline;white-space:pre-wrap">When dynamic res=
ource allocation is enabled, a new implementation of </span><span style=3D"=
font-size:11pt;font-family:&quot;Courier New&quot;;background-color:transpa=
rent;vertical-align:baseline;white-space:pre-wrap">ExternalShuffleService</=
span><span style=3D"font-size:11pt;background-color:transparent;vertical-al=
ign:baseline;white-space:pre-wrap">, namely </span><span style=3D"font-size=
:11pt;font-family:&quot;Courier New&quot;;background-color:transparent;vert=
ical-align:baseline;white-space:pre-wrap">KubernetesExternalShuffleServi<wb=
r>ce</span><span style=3D"font-size:11pt;background-color:transparent;verti=
cal-align:baseline;white-space:pre-wrap"> is introduced.</span></p></li></u=
l><h1 dir=3D"ltr" style=3D"line-height:1.38;margin-top:20pt;margin-bottom:6=
pt"><span style=3D"font-size:20pt;font-family:Arial;color:rgb(0,0,0);backgr=
ound-color:transparent;font-weight:400;vertical-align:baseline;white-space:=
pre-wrap">Design Sketch</span></h1><p dir=3D"ltr" style=3D"line-height:1.38=
;margin-top:0pt;margin-bottom:0pt"><span style=3D"font-size:11pt;font-famil=
y:Arial;color:rgb(0,0,0);background-color:transparent;vertical-align:baseli=
ne;white-space:pre-wrap">Below we briefly describe the design. For more det=
ails on the design and architecture, please refer to the architecture </spa=
n><a href=3D"https://github.com/apache-spark-on-k8s/spark/tree/branch-2.2-k=
ubernetes/resource-managers/kubernetes/architecture-docs" style=3D"text-dec=
oration-line:none" target=3D"_blank"><span style=3D"font-size:11pt;font-fam=
ily:Arial;background-color:transparent;text-decoration-line:underline;verti=
cal-align:baseline;white-space:pre-wrap">documentation</span></a><span styl=
e=3D"font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:tra=
nsparent;vertical-align:baseline;white-space:pre-wrap">. The main idea of t=
his design is to run Spark driver and executors inside Kubernetes </span><a=
 href=3D"https://kubernetes.io/docs/concepts/workloads/pods/pod/" style=3D"=
text-decoration-line:none" target=3D"_blank"><span style=3D"font-size:11pt;=
font-family:Arial;background-color:transparent;text-decoration-line:underli=
ne;vertical-align:baseline;white-space:pre-wrap">Pods</span></a><span style=
=3D"font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:tran=
sparent;vertical-align:baseline;white-space:pre-wrap">. A Pod is a co-locat=
ed and co-scheduled group of one or more containers that run in a shared co=
ntext=
. The driver is responsible for creating and destroying executor Pods throu=
gh the Kubernetes API, while Kubernetes is fully responsible for scheduling=
 the Pods to run on available nodes in the cluster. In the cluster mode, th=
e driver also runs in a Pod in the cluster, created through the Kubernetes =
API by a Kubernetes-aware submission client called by the </span><span styl=
e=3D"font-size:11pt;font-family:&quot;Courier New&quot;;color:rgb(0,0,0);ba=
ckground-color:transparent;vertical-align:baseline;white-space:pre-wrap">sp=
ark-submit</span><span style=3D"font-size:11pt;font-family:Arial;color:rgb(=
0,0,0);background-color:transparent;vertical-align:baseline;white-space:pre=
-wrap"> script. Because the driver runs in a Pod, it is reachable by the ex=
ecutors in the cluster using its Pod IP. In the client mode, the driver run=
s outside the cluster and calls the Kubernetes API to create and destroy ex=
ecutor Pods. The driver must be routable from within the cluster for the ex=
ecutors to communicate with it. </span></p><br><p dir=3D"ltr" style=3D"line=
-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style=3D"font-size:11p=
t;font-family:Arial;color:rgb(0,0,0);background-color:transparent;vertical-=
align:baseline;white-space:pre-wrap">The main component running in the driv=
er is the </span><span style=3D"font-size:11pt;font-family:&quot;Courier Ne=
w&quot;;color:rgb(0,0,0);background-color:transparent;vertical-align:baseli=
ne;white-space:pre-wrap">KubernetesClusterSchedulerBack<wbr>end</span><span=
 style=3D"font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-colo=
r:transparent;vertical-align:baseline;white-space:pre-wrap">, an implementa=
tion of </span><span style=3D"font-size:11pt;font-family:&quot;Courier New&=
quot;;color:rgb(0,0,0);background-color:transparent;vertical-align:baseline=
;white-space:pre-wrap">CoarseGrainedSchedulerBackend</span><span style=3D"f=
ont-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transpare=
nt;vertical-align:baseline;white-space:pre-wrap">, which manages allocating=
 and destroying executors via the Kubernetes API, as instructed by Spark co=
re via calls to methods </span><span style=3D"font-size:11pt;font-family:&q=
uot;Courier New&quot;;color:rgb(0,0,0);background-color:transparent;vertica=
l-align:baseline;white-space:pre-wrap">doRequestTotalExecutors</span><span =
style=3D"font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color=
:transparent;vertical-align:baseline;white-space:pre-wrap"> and </span><spa=
n style=3D"font-size:11pt;font-family:&quot;Courier New&quot;;color:rgb(0,0=
,0);background-color:transparent;vertical-align:baseline;white-space:pre-wr=
ap">doKillExecutors</span><span style=3D"font-size:11pt;font-family:Arial;c=
olor:rgb(0,0,0);background-color:transparent;vertical-align:baseline;white-=
space:pre-wrap">, respectively. Within the </span><span style=3D"font-size:=
11pt;font-family:&quot;Courier New&quot;;color:rgb(0,0,0);background-color:=
transparent;vertical-align:baseline;white-space:pre-wrap">KubernetesCluster=
SchedulerBack<wbr>end</span><span style=3D"font-size:11pt;font-family:Arial=
;color:rgb(0,0,0);background-color:transparent;vertical-align:baseline;whit=
e-space:pre-wrap">, a separate </span><span style=3D"font-size:11pt;font-fa=
mily:&quot;Courier New&quot;;color:rgb(0,0,0);background-color:transparent;=
vertical-align:baseline;white-space:pre-wrap">kubernetes-pod-allocator</spa=
n><span style=3D"font-size:11pt;font-family:Arial;color:rgb(0,0,0);backgrou=
nd-color:transparent;vertical-align:baseline;white-space:pre-wrap"> thread =
handles the creation of new executor Pods with appropriate throttling and m=
onitoring. Throttling is achieved using a feedback loop that makes decisio=
ns on submitting new requests for executors based on whether previous execu=
to=
r Pod creation requests have completed. This indirection is necessary becau=
se the Kubernetes API server accepts requests for new Pods optimistically, =
with the anticipation of being able to eventually schedule them to run. How=
ever, it is undesirable to have a very large number of Pods that cannot be =
scheduled and stay pending within the cluster. The throttling mechanism giv=
es us control over how fast an application scales up (which can be configur=
ed), and helps prevent Spark applications from DOS-ing the Kubernetes API s=
erver with too many Pod creation requests. The executor Pods simply run the=
 </span><span style=3D"font-size:11pt;font-family:&quot;Courier New&quot;;c=
olor:rgb(0,0,0);background-color:transparent;vertical-align:baseline;white-=
space:pre-wrap">CoarseGrainedExecutorBackend</span><span style=3D"font-size=
:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;verti=
cal-align:baseline;white-space:pre-wrap"> class from a pre-built Docker ima=
ge that contains a Spark distribution. </span></p><br><p dir=3D"ltr" style=
=3D"line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style=3D"font-=
size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;v=
ertical-align:baseline;white-space:pre-wrap">There are auxiliary and option=
al components: </span><span style=3D"font-size:11pt;font-family:&quot;Couri=
er New&quot;;color:rgb(0,0,0);background-color:transparent;vertical-align:b=
aseline;white-space:pre-wrap">ResourceStagingServer</span><span style=3D"fo=
nt-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparen=
t;vertical-align:baseline;white-space:pre-wrap"> and </span><span style=3D"=
font-size:11pt;font-family:&quot;Courier New&quot;;color:rgb(0,0,0);backgro=
und-color:transparent;vertical-align:baseline;white-space:pre-wrap">Kuberne=
tesExternalShuffleServi<wbr>ce</span><span style=3D"font-size:11pt;font-fam=
ily:Arial;color:rgb(0,0,0);background-color:transparent;vertical-align:base=
line;white-space:pre-wrap">, which serve specific purposes described below.=
 The </span><span style=3D"font-size:11pt;font-family:&quot;Courier New&quo=
t;;color:rgb(0,0,0);background-color:transparent;vertical-align:baseline;wh=
ite-space:pre-wrap">ResourceStagingServer</span><span style=3D"font-size:11=
pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;vertical=
-align:baseline;white-space:pre-wrap"> serves as a file store (in the absen=
ce of a persistent storage layer in Kubernetes) for application dependencie=
s uploaded from the submission client machine, which then get downloaded fr=
om the server by the init-containers in the driver and executor Pods. It is=
 a Jetty server with JAX-RS and has two endpoints for uploading and downloa=
ding files, respectively. Security tokens are returned in the responses for=
 file uploading and must be carried in the requests for downloading the fil=
es. The </span><span style=3D"font-size:11pt;font-family:&quot;Courier New&=
quot;;color:rgb(0,0,0);background-color:transparent;vertical-align:baseline=
;white-space:pre-wrap">ResourceStagingServer</span><span style=3D"font-size=
:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;verti=
cal-align:baseline;white-space:pre-wrap"> is deployed as a Kubernetes </spa=
n><a href=3D"https://kubernetes.io/docs/concepts/services-networking/servic=
e/" style=3D"text-decoration-line:none" target=3D"_blank"><span style=3D"fo=
nt-size:11pt;font-family:Arial;background-color:transparent;text-decoration=
-line:underline;vertical-align:baseline;white-space:pre-wrap">Service</span=
></a><span style=3D"font-size:11pt;font-family:Arial;color:rgb(0,0,0);backg=
round-color:transparent;vertical-align:baseline;white-space:pre-wrap"> back=
ed by a </span><a href=3D"https://kubernetes.io/docs/concepts/workloads/con=
trollers/deployment/" style=3D"text-decoration-line:none" target=3D"_blank"=
><span style=3D"font-size:11pt;font-family:Arial;background-color:transpare=
nt;text-decoration-line:underline;vertical-align:baseline;white-space:pre-w=
rap">Deployment</span></a><span style=3D"font-size:11pt;font-family:Arial;c=
olor:rgb(0,0,0);background-color:transparent;vertical-align:baseline;white-=
space:pre-wrap"> in the cluster and multiple instances may be deployed in t=
he same cluster. Spark applications specify which </span><span style=3D"fon=
t-size:11pt;font-family:&quot;Courier New&quot;;color:rgb(0,0,0);background=
-color:transparent;vertical-align:baseline;white-space:pre-wrap">ResourceSt=
agingServer</span><span style=3D"font-size:11pt;font-family:Arial;color:rgb=
(0,0,0);background-color:transparent;vertical-align:baseline;white-space:pr=
e-wrap"> instance to use through a configuration property.</span></p><br><p=
 dir=3D"ltr" style=3D"line-height:1.38;margin-top:0pt;margin-bottom:0pt"><s=
pan style=3D"font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-c=
olor:transparent;vertical-align:baseline;white-space:pre-wrap">The </span><=
span style=3D"font-size:11pt;font-family:&quot;Courier New&quot;;color:rgb(=
0,0,0);background-color:transparent;vertical-align:baseline;white-space:pre=
-wrap">KubernetesExternalShuffleServi<wbr>ce</span><span style=3D"font-size=
:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;verti=
cal-align:baseline;white-space:pre-wrap"> is used to support dynamic resour=
ce allocation, with which the number of executors of a Spark application ca=
n change at runtime based on the resource needs. It provides an additional =
endpoint for drivers that allows the shuffle service to detect driver termi=
nation and clean up the shuffle files associated with the corresponding app=
lication. There are two ways of deploying the </span><span style=3D"font-si=
ze:1=
1pt;font-family:&quot;Courier New&quot;;color:rgb(0,0,0);background-color:t=
ransparent;vertical-align:baseline;white-space:pre-wrap">KubernetesExternal=
ShuffleServi<wbr>ce</span><span style=3D"font-size:11pt;font-family:Arial;c=
olor:rgb(0,0,0);background-color:transparent;vertical-align:baseline;white-=
space:pre-wrap">: running a shuffle service Pod on each node in the cluster=
 or a subset of the nodes using a </span><a href=3D"https://kubernetes.io/d=
ocs/concepts/workloads/controllers/daemonset/" style=3D"text-decoration-lin=
e:none" target=3D"_blank"><span style=3D"font-size:11pt;font-family:Arial;b=
ackground-color:transparent;text-decoration-line:underline;vertical-align:b=
aseline;white-space:pre-wrap">DaemonSet</span></a><span style=3D"font-size:=
11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;vertic=
al-align:baseline;white-space:pre-wrap">, or running a shuffle service cont=
ainer in each of the executor Pods. In the first option, each shuffle servi=
ce container mounts a </span><a href=3D"https://kubernetes.io/docs/concepts=
/storage/volumes/#hostpath" style=3D"text-decoration-line:none" target=3D"_=
blank"><span style=3D"font-size:11pt;font-family:Arial;background-color:tra=
nsparent;text-decoration-line:underline;vertical-align:baseline;white-space=
:pre-wrap">hostPath</span></a><span style=3D"font-size:11pt;font-family:Ari=
al;color:rgb(0,0,0);background-color:transparent;vertical-align:baseline;wh=
ite-space:pre-wrap"> volume. The same hostPath volume is also mounted by ea=
ch of the executor containers, which must also have the environment variabl=
e </span><span style=3D"font-size:11pt;font-family:&quot;Courier New&quot;;=
color:rgb(0,0,0);background-color:transparent;vertical-align:baseline;white=
-space:pre-wrap">SPARK_LOCAL_DIRS</span><span style=3D"font-size:11pt;font-=
family:Arial;color:rgb(0,0,0);background-color:transparent;vertical-align:b=
aseline;white-space:pre-wrap"> point to the hostPath. In the second option,=
 a shuffle service container is co-located with an executor container in ea=
ch of the executor Pods. The two containers share an </span><a href=3D"http=
s://kubernetes.io/docs/concepts/storage/volumes/#emptydir" style=3D"text-de=
coration-line:none" target=3D"_blank"><span style=3D"font-size:11pt;font-fa=
mily:Arial;background-color:transparent;text-decoration-line:underline;vert=
ical-align:baseline;white-space:pre-wrap">emptyDir</span></a><span style=3D=
"font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transpa=
rent;vertical-align:baseline;white-space:pre-wrap"> volume where the shuffl=
e data gets written to. There may be multiple instances of the shuffle serv=
ice deployed in a cluster that may be used for different versions of Spark,=
 or for different priority levels with different resource quotas.</span></p=
><br><p dir=3D"ltr" style=3D"line-height:1.38;margin-top:0pt;margin-bottom:=
0pt"><span style=3D"font-size:11pt;font-family:Arial;color:rgb(0,0,0);backg=
round-color:transparent;vertical-align:baseline;white-space:pre-wrap">New K=
ubernetes-specific configuration options are also introduced to facilitate =
specification and customization of driver and executor Pods and related Kub=
ernetes resources. For example, driver and executor Pods can be created in =
a particular Kubernetes namespace and on a particular set of the nodes in t=
he cluster. Users are allowed to apply labels and annotations to the driver=
 and executor Pods.</span></p><br><p dir=3D"ltr" style=3D"line-height:1.38;=
margin-top:0pt;margin-bottom:0pt"><span style=3D"font-size:11pt;font-family=
:Arial;color:rgb(0,0,0);background-color:transparent;vertical-align:baselin=
e;white-space:pre-wrap">Additionally, secure HDFS support is being actively=
 worked on following the design </span><a href=3D"https://docs.google.com/d=
ocument/d/1RBnXD9jMDjGonOdKJ2bA1lN4AAV_1RwpU_ewFuCNWKg/edit" style=3D"text-=
decoration-line:none" target=3D"_blank"><span style=3D"font-size:11pt;font-=
family:Arial;background-color:transparent;text-decoration-line:underline;ve=
rtical-align:baseline;white-space:pre-wrap">here</span></a><span style=3D"f=
ont-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transpare=
nt;vertical-align:baseline;white-space:pre-wrap">. Both short-running jobs =
and long-running jobs that need periodic delegation token refresh are suppo=
rted, leveraging built-in Kubernetes constructs like Secrets. Please refer =
to the design doc for details. </span></p><h1 dir=3D"ltr" style=3D"line-hei=
ght:1.38;margin-top:20pt;margin-bottom:6pt"><span style=3D"font-size:20pt;f=
ont-family:Arial;color:rgb(0,0,0);background-color:transparent;font-weight:=
400;vertical-align:baseline;white-space:pre-wrap">Rejected Designs</span></=
h1><h2 dir=3D"ltr" style=3D"line-height:1.38;margin-top:18pt;margin-bottom:=
6pt"><span style=3D"font-size:16pt;font-family:Arial;color:rgb(0,0,0);backg=
round-color:transparent;font-weight:400;vertical-align:baseline;white-space=
:pre-wrap">Resource Staging by the Driver</span></h2><p dir=3D"ltr" style=
=3D"line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style=3D"font-=
size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;v=
ertical-align:baseline;white-space:pre-wrap">A first implementation effecti=
vely included the </span><span style=3D"font-size:11pt;font-family:&quot;Co=
urier New&quot;;color:rgb(0,0,0);background-color:transparent;vertical-alig=
n:baseline;white-space:pre-wrap">ResourceStagingServer</span><span style=3D=
"font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transpa=
rent;vertical-align:baseline;white-space:pre-wrap"> in the driver container=
 itself. The driver container ran a custom command that opened an HTTP endp=
oint and waited for the submission client to send resources to it. The serv=
er would then run the driver code after it had received the resources from =
the submission client machine. The problem with this approach is that the s=
ubmission client needs to deploy the driver in such a way that the driver i=
tself would be reachable from outside of the cluster, but it is difficult f=
or an automated framework which is not aware of the cluster&#39;s configura=
tion to expose an arbitrary pod in a generic way. The Service-based design =
chosen allows a cluster administrator to expose the </span><span style=3D"f=
ont-size:11pt;font-family:&quot;Courier New&quot;;color:rgb(0,0,0);backgrou=
nd-color:transparent;vertical-align:baseline;white-space:pre-wrap">Resource=
StagingServer</span><span style=3D"font-size:11pt;font-family:Arial;color:r=
gb(0,0,0);background-color:transparent;vertical-align:baseline;white-space:=
pre-wrap"> in a manner that makes sense for their cluster, such as with an =
Ingress or with a NodePort service.</span></p><h2 dir=3D"ltr" style=3D"line=
-height:1.38;margin-top:18pt;margin-bottom:6pt"><span style=3D"font-size:16=
pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-wei=
ght:400;vertical-align:baseline;white-space:pre-wrap">Kubernetes External S=
huffle Service</span></h2><p dir=3D"ltr" style=3D"line-height:1.38;margin-t=
op:0pt;margin-bottom:0pt"><span style=3D"font-size:11pt;font-family:Arial;c=
olor:rgb(0,0,0);background-color:transparent;vertical-align:baseline;white-=
space:pre-wrap">Several alternatives were considered for the design of the =
shuffle service. The first design postulated the use of long-lived executor=
 pods and sidecar containers in them running the shuffle service. The advan=
tage of this model was that it would let us use emptyDir volumes for sharin=
g instead of node-local storage; emptyDir guarantees better lifecycle manag=
ement of storage by Kubernetes. The apparent disadvantage was that it would=
 b=
e a departure from the traditional Spark methodology of keeping executors f=
or only as long as required in dynamic allocation mode. It would additional=
ly use up more resources than strictly necessary during the course of long-=
running jobs, partially losing the advantage of dynamic scaling.</span></p>=
<br><p dir=3D"ltr" style=3D"line-height:1.38;margin-top:0pt;margin-bottom:0=
pt"><span style=3D"font-size:11pt;font-family:Arial;color:rgb(0,0,0);backgr=
ound-color:transparent;vertical-align:baseline;white-space:pre-wrap">Anothe=
r alternative considered was to use a separate shuffle service manager as a=
 nameserver. This design has a few drawbacks. First, this means another com=
ponent that needs authentication/authorization management and maintenance. =
Second, this separate component needs to be kept in sync with the Kubernete=
s cluster. Last but not least, most of the functionality of this separate c=
omponent can be performed by a combination of the in-cluster shuffle servic=
e an=
d the Kubernetes API server.</span></p><h2 dir=3D"ltr" style=3D"line-height=
:1.38;margin-top:18pt;margin-bottom:6pt"><span style=3D"font-size:16pt;font=
-family:Arial;color:rgb(0,0,0);background-color:transparent;font-weight:400=
;vertical-align:baseline;white-space:pre-wrap">Pluggable Scheduler Backends=
</span></h2><p dir=3D"ltr" style=3D"line-height:1.38;margin-top:0pt;margin-=
bottom:0pt"><span style=3D"font-size:11pt;font-family:Arial;color:rgb(0,0,0=
);background-color:transparent;vertical-align:baseline;white-space:pre-wrap=
">Fully pluggable scheduler backends were considered as a more generalized =
solution, and remain interesting as a possible avenue for future-proofing a=
gainst new scheduling targets.=C2=A0 For the purposes of this project, addi=
ng a new specialized scheduler backend for Kubernetes was chosen as the app=
roach due to its very low impact on the core Spark code; making the schedul=
er fully pluggable would be a high-impact, high-risk modification to Spark=
=E2=80=
=99s core libraries. The pluggable scheduler backends effort is being track=
ed in </span><a href=3D"https://issues.apache.org/jira/browse/SPARK-19700" =
style=3D"text-decoration-line:none" target=3D"_blank"><span style=3D"font-s=
ize:11pt;font-family:Arial;background-color:transparent;text-decoration-lin=
e:underline;vertical-align:baseline;white-space:pre-wrap">SPARK-19700=
</span>=
</a><span style=3D"font-size:11pt;font-family:Arial;color:rgb(0,0,0);backgr=
ound-color:transparent;vertical-align:baseline;white-space:pre-wrap">.</spa=
n></p><div><span style=3D"font-size:11pt;font-family:Arial;color:rgb(0,0,0)=
;background-color:transparent;vertical-align:baseline;white-space:pre-wrap"=
><br></span></div></span></div>
</blockquote></div><br></div></div></div></div></div>
</blockquote></div><br></div>

--001a1142dfa414aeb60556ceb62f--