From: Holden Karau
Date: Tue, 15 Aug 2017 17:09:32 +0000
Subject: Re: SPIP: Spark on Kubernetes
To: William Benton <willb@redhat.com>, dev <dev@spark.apache.org>

+1 (non-binding)

I (personally) think that Kubernetes as a scheduler backend should
eventually get merged in, and there is clearly a community interested in the work required to maintain it.

On Tue, Aug 15, 2017 at 9:51 AM William Benton <willb@redhat.com> wrote:

> +1 (non-binding)
>
> On Tue, Aug 15, 2017 at 10:32 AM, Anirudh Ramanathan <foxish@google.com.invalid> wrote:
>
>> The Spark on Kubernetes effort has been developed separately in a fork, and linked back from the Apache Spark project as an experimental backend. We're ~6 months in and have had 5 releases.
>>
>> - 2 Spark versions maintained (2.1 and 2.2)
>> - Extensive integration testing and refactoring efforts to maintain code quality
>> - Developer- and user-facing documentation
>> - 10+ consistent code contributors from different organizations involved in actively maintaining and using the project, with several more members involved in testing and providing feedback
>> - The community has delivered several talks on Spark-on-Kubernetes, generating lots of feedback from users
>> - In addition to these, we've seen efforts spawn off such as:
>>   - HDFS on Kubernetes with Locality and Performance Experiments
>>   - Kerberized access to HDFS from Spark running on Kubernetes
>>
>> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>>
>> - +1: Yeah, let's go forward and implement the SPIP.
>> - +0: Don't really care.
>> - -1: I don't think this is a good idea because of the following technical reasons.
>>
>> If any further clarification is desired, on the design or the implementation, please feel free to ask questions or provide feedback.
>>
>> SPIP: Kubernetes as a Native Cluster Manager
>>
>> Full Design Doc: link
>>
>> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>>
>> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>>
>> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt Cheah, Ilan Filonenko, Sean Suchter, Kimoon Kim
>>
>> Background and Motivation
>>
>> Containerization and cluster management technologies are constantly evolving in the cluster computing world. Apache Spark currently implements support for Apache Hadoop YARN and Apache Mesos, in addition to providing its own standalone cluster manager. In 2014, Google announced the development of Kubernetes, which has its own unique feature set and differentiates itself from YARN and Mesos. Since its debut, it has seen contributions from over 1300 contributors and over 50000 commits. Kubernetes has cemented itself as a core player in the cluster computing world, and cloud-computing providers such as Google Container Engine, Google Compute Engine, Amazon Web Services, and Microsoft Azure support running Kubernetes clusters.
>>
>> This document outlines a proposal for integrating Apache Spark with Kubernetes in a first-class way, adding Kubernetes to the list of cluster managers that Spark can be used with. Doing so would allow users to share their computing resources and containerization framework between their existing applications on Kubernetes and their computational Spark applications. Although there is existing support for running a Spark standalone cluster on Kubernetes, there are still major advantages to and significant interest in having native execution support. For example, this integration provides better support for multi-tenancy and dynamic resource allocation. It also allows users to run applications of different Spark versions of their choice in the same cluster.
>>
>> The feature is being developed in a separate fork in order to minimize risk to the main project during development. Since the start of development in November of 2016, it has received over 100 commits from over 20 contributors and supports two releases based on Spark 2.1 and 2.2 respectively. Documentation is also being actively worked on, both in the main project repository and in the repository https://github.com/apache-spark-on-k8s/userdocs. Regarding real-world use cases, we have seen cluster setups that use 1000+ cores. We are also seeing growing interest in this project from more and more organizations.
>>
>> While it is easy to bootstrap the project in a forked repository, it is hard to maintain in the long run because of the tricky process of rebasing onto upstream and the lack of awareness in the larger Spark community. It would be beneficial to both the Spark and Kubernetes communities to see this feature merged upstream. On one hand, it gives Spark users the option of running their Spark workloads along with other workloads that may already be running on Kubernetes, enabling better resource sharing and isolation, and better cluster administration. On the other hand, it gives Kubernetes a leap forward in the area of large-scale data processing by being an officially supported cluster manager for Spark. The risk of merging into upstream is low because most of the changes are purely incremental, i.e., new Kubernetes-aware implementations of existing interfaces/classes in Spark core are introduced. The development is also concentrated in a single place, at resource-managers/kubernetes. The risk is further reduced by a comprehensive integration test framework, and an active and responsive community of future maintainers.
>>
>> Target Personas
>>
>> DevOps engineers, data scientists, data engineers, application developers, and anyone who can benefit from having Kubernetes as a native cluster manager for Spark.
>>
>> Goals
>>
>> - Make Kubernetes a first-class cluster manager for Spark, alongside Spark Standalone, YARN, and Mesos.
>> - Support both client and cluster deployment modes.
>> - Support dynamic resource allocation.
>> - Support Spark Java/Scala, PySpark, and SparkR applications.
>> - Support secure HDFS access.
>> - Allow running applications of different Spark versions in the same cluster through the ability to specify the driver and executor Docker images on a per-application basis.
>> - Support specification and enforcement of limits on both CPU cores and memory.
>>
>> Non-Goals
>>
>> - Support cluster resource scheduling and sharing beyond the capabilities offered natively by the Kubernetes per-namespace resource quota model.
>>
>> Proposed API Changes
>>
>> Most API changes are purely incremental, i.e., new Kubernetes-aware implementations of existing interfaces/classes in Spark core are introduced. Detailed changes are as follows.
>>
>> - A new cluster manager option, KUBERNETES, is introduced, and some changes are made to SparkSubmit to make it aware of this option.
>> - A new implementation of CoarseGrainedSchedulerBackend, namely KubernetesClusterSchedulerBackend, is responsible for managing the creation and deletion of executor Pods through the Kubernetes API.
>> - A new implementation of TaskSchedulerImpl, namely KubernetesTaskSchedulerImpl, and a new implementation of TaskSetManager, namely KubernetesTaskSetManager, are introduced for Kubernetes-aware task scheduling.
>> - When dynamic resource allocation is enabled, a new implementation of ExternalShuffleService, namely KubernetesExternalShuffleService, is introduced.
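For concreteness, here is a minimal sketch of what selecting the proposed KUBERNETES cluster manager could look like from the application side. It is illustrative only: the k8s:// master URL form and the spark.kubernetes.namespace property follow the publicly documented Spark-on-Kubernetes options, the namespace and input path are made up, and the fork at this point primarily targets cluster mode via spark-submit rather than building the session in-process like this.

    import org.apache.spark.sql.SparkSession

    object KubernetesWordCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          // k8s:// points Spark at a Kubernetes API server instead of a
          // YARN ResourceManager or a standalone master.
          .master("k8s://https://kubernetes.default.svc:443")
          .appName("k8s-wordcount")
          // Executor Pods are created in this namespace, so its resource
          // quota applies to the application (illustrative value).
          .config("spark.kubernetes.namespace", "spark-jobs")
          .config("spark.executor.instances", "4")
          .getOrCreate()

        val counts = spark.sparkContext
          .textFile("hdfs:///data/articles")   // hypothetical input path
          .flatMap(_.split("\\s+"))
          .map((_, 1L))
          .reduceByKey(_ + _)

        counts.take(10).foreach(println)
        spark.stop()
      }
    }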
>>
>> Design Sketch
>>
>> Below we briefly describe the design. For more details on the design and architecture, please refer to the architecture documentation. The main idea of this design is to run the Spark driver and executors inside Kubernetes Pods. Pods are a co-located and co-scheduled group of one or more containers run in a shared context. The driver is responsible for creating and destroying executor Pods through the Kubernetes API, while Kubernetes is fully responsible for scheduling the Pods to run on available nodes in the cluster. In cluster mode, the driver also runs in a Pod in the cluster, created through the Kubernetes API by a Kubernetes-aware submission client called by the spark-submit script. Because the driver runs in a Pod, it is reachable by the executors in the cluster using its Pod IP. In client mode, the driver runs outside the cluster and calls the Kubernetes API to create and destroy executor Pods. The driver must be routable from within the cluster for the executors to communicate with it.
>>
>> The main component running in the driver is the KubernetesClusterSchedulerBackend, an implementation of CoarseGrainedSchedulerBackend, which manages allocating and destroying executors via the Kubernetes API, as instructed by Spark core via calls to the methods doRequestTotalExecutors and doKillExecutors, respectively. Within the KubernetesClusterSchedulerBackend, a separate kubernetes-pod-allocator thread handles the creation of new executor Pods with appropriate throttling and monitoring. Throttling is achieved using a feedback loop that makes decisions about submitting new requests for executors based on whether previous executor Pod creation requests have completed. This indirection is necessary because the Kubernetes API server accepts requests for new Pods optimistically, with the anticipation of eventually being able to schedule them to run. However, it is undesirable to have a very large number of Pods that cannot be scheduled and stay pending within the cluster. The throttling mechanism gives us control over how fast an application scales up (which can be configured), and helps prevent Spark applications from DoS-ing the Kubernetes API server with too many Pod creation requests. The executor Pods simply run the CoarseGrainedExecutorBackend class from a pre-built Docker image that contains a Spark distribution.
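To make the pod-allocator behavior above concrete, here is a small, self-contained model of the throttling feedback loop. The names (PodLauncher, ThrottledPodAllocator, maybeAllocate) are illustrative, not the fork's actual internals; in the real backend this logic is driven by the target count passed through doRequestTotalExecutors.

    import scala.collection.mutable

    // Abstracts the two facts the allocator cares about: submitting a Pod
    // creation request, and observing whether a previously requested Pod
    // has actually been scheduled (hypothetical interface).
    trait PodLauncher {
      def requestExecutorPod(id: Int): Unit
      def isPodRunning(id: Int): Boolean
    }

    final class ThrottledPodAllocator(launcher: PodLauncher, batchSize: Int) {
      private val requested = mutable.Set.empty[Int]
      private var nextId = 0

      // Called periodically by an allocator thread. A new batch of executor
      // Pods is requested only once every previously requested Pod has been
      // scheduled, so pending-but-unschedulable Pods never pile up and the
      // API server is not flooded with creation requests.
      def maybeAllocate(targetTotal: Int): Unit = {
        val outstanding = requested.count(id => !launcher.isPodRunning(id))
        if (outstanding == 0 && requested.size < targetTotal) {
          val toLaunch = math.min(batchSize, targetTotal - requested.size)
          (0 until toLaunch).foreach { _ =>
            launcher.requestExecutorPod(nextId)
            requested += nextId
            nextId += 1
          }
        }
      }
    }

In this model the ramp-up speed is governed by batchSize and the polling interval, which corresponds to the configurable scale-up control described above.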
>>
>> There are two auxiliary and optional components: the ResourceStagingServer and the KubernetesExternalShuffleService, which serve the specific purposes described below.
>>
>> The ResourceStagingServer serves as a file store (in the absence of a persistent storage layer in Kubernetes) for application dependencies uploaded from the submission client machine, which then get downloaded from the server by the init-containers in the driver and executor Pods. It is a Jetty server with JAX-RS and has two endpoints, for uploading and downloading files respectively. Security tokens are returned in the responses for file uploads and must be carried in the requests for downloading the files. The ResourceStagingServer is deployed as a Kubernetes Service backed by a Deployment in the cluster, and multiple instances may be deployed in the same cluster. Spark applications specify which ResourceStagingServer instance to use through a configuration property.
>>
>> The KubernetesExternalShuffleService is used to support dynamic resource allocation, with which the number of executors of a Spark application can change at runtime based on resource needs. It provides an additional endpoint for drivers that allows the shuffle service to detect driver termination and clean up the shuffle files associated with the corresponding application. There are two ways of deploying the KubernetesExternalShuffleService: running a shuffle service Pod on each node in the cluster (or a subset of the nodes) using a DaemonSet, or running a shuffle service container in each of the executor Pods. In the first option, each shuffle service container mounts a hostPath volume. The same hostPath volume is also mounted by each of the executor containers, which must also have the environment variable SPARK_LOCAL_DIRS point to the hostPath. In the second option, a shuffle service container is co-located with an executor container in each of the executor Pods, and the two containers share an emptyDir volume where the shuffle data gets written. There may be multiple instances of the shuffle service deployed in a cluster, to be used for different versions of Spark or for different priority levels with different resource quotas.
>>
>> New Kubernetes-specific configuration options are also introduced to facilitate specification and customization of driver and executor Pods and related Kubernetes resources. For example, driver and executor Pods can be created in a particular Kubernetes namespace and on a particular set of the nodes in the cluster. Users are allowed to apply labels and annotations to the driver and executor Pods.
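As a rough illustration of that configuration surface, the sketch below sets a namespace, Pod labels/annotations, and resource limits for one application. The property names follow the publicly documented spark.kubernetes.* options; the exact names available in the fork at this point may differ, so treat them as indicative rather than definitive, and the values are made up.

    import org.apache.spark.SparkConf

    object TeamAConf {
      // Indicative configuration only; see the caveat above.
      val conf: SparkConf = new SparkConf()
        .setAppName("team-a-etl")
        .setMaster("k8s://https://kubernetes.default.svc:443")
        // Driver and executor Pods are created in this namespace, so the
        // namespace's resource quota bounds the application.
        .set("spark.kubernetes.namespace", "team-a")
        // Labels and annotations applied to the Pods for selection/bookkeeping.
        .set("spark.kubernetes.driver.label.team", "a")
        .set("spark.kubernetes.executor.annotation.billing", "batch")
        // CPU and memory requests/limits enforced by Kubernetes.
        .set("spark.executor.memory", "4g")
        .set("spark.executor.cores", "2")
        .set("spark.kubernetes.executor.limit.cores", "2")
    }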
>>
>> Additionally, secure HDFS support is being actively worked on following the design here. Both short-running jobs and long-running jobs that need periodic delegation token refresh are supported, leveraging built-in Kubernetes constructs like Secrets. Please refer to the design doc for details.
>>
>> Rejected Designs
>>
>> Resource Staging by the Driver
>>
>> A first implementation effectively included the ResourceStagingServer in the driver container itself. The driver container ran a custom command that opened an HTTP endpoint and waited for the submission client to send resources to it. The server would then run the driver code after it had received the resources from the submission client machine. The problem with this approach is that the submission client needs to deploy the driver in such a way that the driver itself is reachable from outside of the cluster, but it is difficult for an automated framework that is not aware of the cluster's configuration to expose an arbitrary Pod in a generic way. The Service-based design chosen allows a cluster administrator to expose the ResourceStagingServer in a manner that makes sense for their cluster, such as with an Ingress or with a NodePort service.
>>
>> Kubernetes External Shuffle Service
>>
>> Several alternatives were considered for the design of the shuffle service. The first design postulated the use of long-lived executor Pods with sidecar containers in them running the shuffle service. The advantage of this model was that it would let us use emptyDir for sharing, as opposed to using node-local storage, which guarantees better lifecycle management of storage by Kubernetes. The apparent disadvantage was that it would be a departure from the traditional Spark methodology of keeping executors for only as long as required in dynamic allocation mode. It would additionally use up more resources than strictly necessary during the course of long-running jobs, partially losing the advantage of dynamic scaling.
>>
>> Another alternative considered was to use a separate shuffle service manager as a nameserver. This design has a few drawbacks. First, it means another component that needs authentication/authorization management and maintenance. Second, this separate component needs to be kept in sync with the Kubernetes cluster. Last but not least, most of the functionality of this separate component can be performed by a combination of the in-cluster shuffle service and the Kubernetes API server.
>>
>> Pluggable Scheduler Backends
>>
>> Fully pluggable scheduler backends were considered as a more generalized solution, and remain interesting as a possible avenue for future-proofing against new scheduling targets. For the purposes of this project, adding a new specialized scheduler backend for Kubernetes was chosen as the approach due to its very low impact on the core Spark code; making the scheduler fully pluggable would be a high-impact, high-risk modification to Spark's core libraries. The pluggable scheduler backends effort is being tracked in SPARK-19700.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau