spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Holden Karau <>
Subject Re: SPIP: Spark on Kubernetes
Date Tue, 15 Aug 2017 17:09:32 GMT
+1 (non-binding)

I (personally) think that Kubernetes as a scheduler backend should
eventually get merged in and there is clearly a community interested in the
work required to maintain it.

On Tue, Aug 15, 2017 at 9:51 AM William Benton <> wrote:

> +1 (non-binding)
> On Tue, Aug 15, 2017 at 10:32 AM, Anirudh Ramanathan <
>> wrote:
>> Spark on Kubernetes effort has been developed separately in a fork, and
>> linked back from the Apache Spark project as an experimental backend
>> <>.
>> We're ~6 months in, have had 5 releases
>> <>.
>>    - 2 Spark versions maintained (2.1, and 2.2)
>>    - Extensive integration testing and refactoring efforts to maintain
>>    code quality
>>    - Developer
>>    <> and
>>    user-facing <>
>>     documentation
>>    - 10+ consistent code contributors from different organizations
>>    <>
>>    in actively maintaining and using the project, with several more members
>>    involved in testing and providing feedback.
>>    - The community has delivered several talks on Spark-on-Kubernetes
>>    generating lots of feedback from users.
>>    - In addition to these, we've seen efforts spawn off such as:
>>    - HDFS on Kubernetes
>>       <> with
>>       Locality and Performance Experiments
>>       - Kerberized access
>>       <>
>>       HDFS from Spark running on Kubernetes
>> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>>    - +1: Yeah, let's go forward and implement the SPIP.
>>    - +0: Don't really care.
>>    - -1: I don't think this is a good idea because of the following
>>    technical reasons.
>> If there is any further clarification desired, on the design or the
>> implementation, please feel free to ask questions or provide feedback.
>> SPIP: Kubernetes as A Native Cluster Manager
>> Full Design Doc: link
>> <>
>> JIRA:
>> Kubernetes Issue:
>> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
>> Cheah,
>> Ilan Filonenko, Sean Suchter, Kimoon Kim
>> Background and Motivation
>> Containerization and cluster management technologies are constantly
>> evolving in the cluster computing world. Apache Spark currently implements
>> support for Apache Hadoop YARN and Apache Mesos, in addition to providing
>> its own standalone cluster manager. In 2014, Google announced development
>> of Kubernetes <> which has its own unique feature
>> set and differentiates itself from YARN and Mesos. Since its debut, it has
>> seen contributions from over 1300 contributors with over 50000 commits.
>> Kubernetes has cemented itself as a core player in the cluster computing
>> world, and cloud-computing providers such as Google Container Engine,
>> Google Compute Engine, Amazon Web Services, and Microsoft Azure support
>> running Kubernetes clusters.
>> This document outlines a proposal for integrating Apache Spark with
>> Kubernetes in a first class way, adding Kubernetes to the list of cluster
>> managers that Spark can be used with. Doing so would allow users to share
>> their computing resources and containerization framework between their
>> existing applications on Kubernetes and their computational Spark
>> applications. Although there is existing support for running a Spark
>> standalone cluster on Kubernetes
>> <>,
>> there are still major advantages and significant interest in having native
>> execution support. For example, this integration provides better support
>> for multi-tenancy and dynamic resource allocation. It also allows users to
>> run applications of different Spark versions of their choices in the same
>> cluster.
>> The feature is being developed in a separate fork
>> <> in order to minimize risk
>> to the main project during development. Since the start of the development
>> in November of 2016, it has received over 100 commits from over 20
>> contributors and supports two releases based on Spark 2.1 and 2.2
>> respectively. Documentation is also being actively worked on both in the
>> main project repository and also in the repository
>> Regarding real-world
>> use cases, we have seen cluster setup that uses 1000+ cores. We are also
>> seeing growing interests on this project from more and more organizations.
>> While it is easy to bootstrap the project in a forked repository, it is
>> hard to maintain it in the long run because of the tricky process of
>> rebasing onto the upstream and lack of awareness in the large Spark
>> community. It would be beneficial to both the Spark and Kubernetes
>> community seeing this feature being merged upstream. On one hand, it gives
>> Spark users the option of running their Spark workloads along with other
>> workloads that may already be running on Kubernetes, enabling better
>> resource sharing and isolation, and better cluster administration. On the
>> other hand, it gives Kubernetes a leap forward in the area of large-scale
>> data processing by being an officially supported cluster manager for Spark.
>> The risk of merging into upstream is low because most of the changes are
>> purely incremental, i.e., new Kubernetes-aware implementations of existing
>> interfaces/classes in Spark core are introduced. The development is also
>> concentrated in a single place at resource-managers/kubernetes
>> <>.
>> The risk is further reduced by a comprehensive integration test framework,
>> and an active and responsive community of future maintainers.
>> Target Personas
>> Devops, data scientists, data engineers, application developers, anyone
>> who can benefit from having Kubernetes
>> <> as a
>> native cluster manager for Spark.
>> Goals
>>    -
>>    Make Kubernetes a first-class cluster manager for Spark, alongside
>>    Spark Standalone, Yarn, and Mesos.
>>    -
>>    Support both client and cluster deployment mode.
>>    -
>>    Support dynamic resource allocation
>>    <>
>>    .
>>    -
>>    Support Spark Java/Scala, PySpark, and Spark R applications.
>>    -
>>    Support secure HDFS access.
>>    -
>>    Allow running applications of different Spark versions in the same
>>    cluster through the ability to specify the driver and executor Docker
>>    images on a per-application basis.
>>    -
>>    Support specification and enforcement of limits on both CPU cores and
>>    memory.
>> Non-Goals
>>    -
>>    Support cluster resource scheduling and sharing beyond capabilities
>>    offered natively by the Kubernetes per-namespace resource quota model.
>> Proposed API Changes
>> Most API changes are purely incremental, i.e., new Kubernetes-aware
>> implementations of existing interfaces/classes in Spark core are
>> introduced. Detailed changes are as follows.
>>    -
>>    A new cluster manager option KUBERNETES is introduced and some
>>    changes are made to SparkSubmit to make it be aware of this option.
>>    -
>>    A new implementation of CoarseGrainedSchedulerBackend, namely
>>    KubernetesClusterSchedulerBackend is responsible for managing the
>>    creation and deletion of executor Pods through the Kubernetes API.
>>    -
>>    A new implementation of TaskSchedulerImpl, namely
>>    KubernetesTaskSchedulerImpl, and a new implementation of
>>    TaskSetManager, namely Kubernetes TaskSetManager, are introduced for
>>    Kubernetes-aware task scheduling.
>>    -
>>    When dynamic resource allocation is enabled, a new implementation of
>>    ExternalShuffleService, namely KubernetesExternalShuffleService is
>>    introduced.
>> Design Sketch
>> Below we briefly describe the design. For more details on the design and
>> architecture, please refer to the architecture documentation
>> <>.
>> The main idea of this design is to run Spark driver and executors inside
>> Kubernetes Pods <>.
>> Pods are a co-located and co-scheduled group of one or more containers run
>> in a shared context. The driver is responsible for creating and destroying
>> executor Pods through the Kubernetes API, while Kubernetes is fully
>> responsible for scheduling the Pods to run on available nodes in the
>> cluster. In the cluster mode, the driver also runs in a Pod in the cluster,
>> created through the Kubernetes API by a Kubernetes-aware submission client
>> called by the spark-submit script. Because the driver runs in a Pod, it
>> is reachable by the executors in the cluster using its Pod IP. In the
>> client mode, the driver runs outside the cluster and calls the Kubernetes
>> API to create and destroy executor Pods. The driver must be routable from
>> within the cluster for the executors to communicate with it.
>> The main component running in the driver is the
>> KubernetesClusterSchedulerBackend, an implementation of
>> CoarseGrainedSchedulerBackend, which manages allocating and destroying
>> executors via the Kubernetes API, as instructed by Spark core via calls to
>> methods doRequestTotalExecutors and doKillExecutors, respectively.
>> Within the KubernetesClusterSchedulerBackend, a separate
>> kubernetes-pod-allocator thread handles the creation of new executor
>> Pods with appropriate throttling and monitoring. Throttling is achieved
>> using a feedback loop that makes decision on submitting new requests for
>> executors based on whether previous executor Pod creation requests have
>> completed. This indirection is necessary because the Kubernetes API server
>> accepts requests for new Pods optimistically, with the anticipation of
>> being able to eventually schedule them to run. However, it is undesirable
>> to have a very large number of Pods that cannot be scheduled and stay
>> pending within the cluster. The throttling mechanism gives us control over
>> how fast an application scales up (which can be configured), and helps
>> prevent Spark applications from DOS-ing the Kubernetes API server with too
>> many Pod creation requests. The executor Pods simply run the
>> CoarseGrainedExecutorBackend class from a pre-built Docker image that
>> contains a Spark distribution.
>> There are auxiliary and optional components: ResourceStagingServer and
>> KubernetesExternalShuffleService, which serve specific purposes
>> described below. The ResourceStagingServer serves as a file store (in
>> the absence of a persistent storage layer in Kubernetes) for application
>> dependencies uploaded from the submission client machine, which then get
>> downloaded from the server by the init-containers in the driver and
>> executor Pods. It is a Jetty server with JAX-RS and has two endpoints for
>> uploading and downloading files, respectively. Security tokens are returned
>> in the responses for file uploading and must be carried in the requests for
>> downloading the files. The ResourceStagingServer is deployed as a
>> Kubernetes Service
>> <>
>> backed by a Deployment
>> <>
>> in the cluster and multiple instances may be deployed in the same cluster.
>> Spark applications specify which ResourceStagingServer instance to use
>> through a configuration property.
>> The KubernetesExternalShuffleService is used to support dynamic resource
>> allocation, with which the number of executors of a Spark application can
>> change at runtime based on the resource needs. It provides an additional
>> endpoint for drivers that allows the shuffle service to delete driver
>> termination and clean up the shuffle files associated with corresponding
>> application. There are two ways of deploying the
>> KubernetesExternalShuffleService: running a shuffle service Pod on each
>> node in the cluster or a subset of the nodes using a DaemonSet
>> <>,
>> or running a shuffle service container in each of the executor Pods. In the
>> first option, each shuffle service container mounts a hostPath
>> <> volume.
>> The same hostPath volume is also mounted by each of the executor
>> containers, which must also have the environment variable
>> SPARK_LOCAL_DIRS point to the hostPath. In the second option, a shuffle
>> service container is co-located with an executor container in each of the
>> executor Pods. The two containers share an emptyDir
>> <> volume
>> where the shuffle data gets written to. There may be multiple instances of
>> the shuffle service deployed in a cluster that may be used for different
>> versions of Spark, or for different priority levels with different resource
>> quotas.
>> New Kubernetes-specific configuration options are also introduced to
>> facilitate specification and customization of driver and executor Pods and
>> related Kubernetes resources. For example, driver and executor Pods can be
>> created in a particular Kubernetes namespace and on a particular set of the
>> nodes in the cluster. Users are allowed to apply labels and annotations to
>> the driver and executor Pods.
>> Additionally, secure HDFS support is being actively worked on following
>> the design here
>> <>.
>> Both short-running jobs and long-running jobs that need periodic delegation
>> token refresh are supported, leveraging built-in Kubernetes constructs like
>> Secrets. Please refer to the design doc for details.
>> Rejected DesignsResource Staging by the Driver
>> A first implementation effectively included the ResourceStagingServer in
>> the driver container itself. The driver container ran a custom command that
>> opened an HTTP endpoint and waited for the submission client to send
>> resources to it. The server would then run the driver code after it had
>> received the resources from the submission client machine. The problem with
>> this approach is that the submission client needs to deploy the driver in
>> such a way that the driver itself would be reachable from outside of the
>> cluster, but it is difficult for an automated framework which is not aware
>> of the cluster's configuration to expose an arbitrary pod in a generic way.
>> The Service-based design chosen allows a cluster administrator to expose
>> the ResourceStagingServer in a manner that makes sense for their
>> cluster, such as with an Ingress or with a NodePort service.
>> Kubernetes External Shuffle Service
>> Several alternatives were considered for the design of the shuffle
>> service. The first design postulated the use of long-lived executor pods
>> and sidecar containers in them running the shuffle service. The advantage
>> of this model was that it would let us use emptyDir for sharing as opposed
>> to using node local storage, which guarantees better lifecycle management
>> of storage by Kubernetes. The apparent disadvantage was that it would be a
>> departure from the traditional Spark methodology of keeping executors for
>> only as long as required in dynamic allocation mode. It would additionally
>> use up more resources than strictly necessary during the course of
>> long-running jobs, partially losing the advantage of dynamic scaling.
>> Another alternative considered was to use a separate shuffle service
>> manager as a nameserver. This design has a few drawbacks. First, this means
>> another component that needs authentication/authorization management and
>> maintenance. Second, this separate component needs to be kept in sync with
>> the Kubernetes cluster. Last but not least, most of functionality of this
>> separate component can be performed by a combination of the in-cluster
>> shuffle service and the Kubernetes API server.
>> Pluggable Scheduler Backends
>> Fully pluggable scheduler backends were considered as a more generalized
>> solution, and remain interesting as a possible avenue for future-proofing
>> against new scheduling targets.  For the purposes of this project, adding a
>> new specialized scheduler backend for Kubernetes was chosen as the approach
>> due to its very low impact on the core Spark code; making scheduler fully
>> pluggable would be a high-impact high-risk modification to Spark’s core
>> libraries. The pluggable scheduler backends effort is being tracked in
>> JIRA-19700 <>.
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail:
> --
Cell : 425-233-8271

View raw message