spark-reviews mailing list archives

From foxish <...@git.apache.org>
Subject [GitHub] spark pull request #19946: [SPARK-22648] [Scheduler] Spark on Kubernetes - D...
Date Wed, 13 Dec 2017 15:30:45 GMT
Github user foxish commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19946#discussion_r156691165
  
    --- Diff: docs/running-on-kubernetes.md ---
    @@ -0,0 +1,498 @@
    +---
    +layout: global
    +title: Running Spark on Kubernetes
    +---
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Spark can run on clusters managed by [Kubernetes](https://kubernetes.io). This feature
makes use of the new experimental native
    +Kubernetes scheduler that has been added to Spark.
    +
    +# Prerequisites
    +
    +* A runnable distribution of Spark 2.3 or above.
    +* A running Kubernetes cluster at version >= 1.6 with access configured to it using
    +[kubectl](https://kubernetes.io/docs/user-guide/prereqs/). If you do not already have a working Kubernetes cluster,
    +you may set up a test cluster on your local machine using
    +[minikube](https://kubernetes.io/docs/getting-started-guides/minikube/).
    +  * We recommend using the latest release of minikube with the DNS addon enabled.
    +* You must have appropriate permissions to list, create, edit and delete
    +[pods](https://kubernetes.io/docs/user-guide/pods/) in your cluster. You can verify that you can access these resources
    +by running `kubectl auth can-i <list|create|edit|delete> pods`; see the sketch after this list.
    +  * The service account credentials used by the driver pods must be allowed to create
pods, services and configmaps.
    +* You must have [Kubernetes DNS](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/)
configured in your cluster.
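    +
    +As a minimal sketch (assuming `kubectl` is already configured to talk to your cluster), the permission
    +checks can be run verb by verb:
    +
    +```bash
    +# Each command prints "yes" or "no" depending on the permissions of the current context
    +kubectl auth can-i list pods
    +kubectl auth can-i create pods
    +kubectl auth can-i edit pods
    +kubectl auth can-i delete pods
    +```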
    +
    +# How it works
    +
    +<p style="text-align: center;">
    +  <img src="img/k8s-cluster-mode.png" title="Spark cluster components" alt="Spark cluster components" />
    +</p>
    +
    +spark-submit can be used directly to submit a Spark application to a Kubernetes cluster. The submission mechanism works as follows:
    +
    +* Spark creates a Spark driver running within a [Kubernetes pod](https://kubernetes.io/docs/concepts/workloads/pods/pod/).
    +* The driver creates executors, which also run within Kubernetes pods, connects to them, and executes application code.
    +* When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists
    +logs and remains in "completed" state in the Kubernetes API until it is eventually garbage collected or manually cleaned up.
    +
    +Note that in the completed state, the driver pod does *not* use any computational or
memory resources.
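    +
    +For example (an illustrative check; the actual driver pod name is generated at submission time), a completed
    +driver pod remains visible to `kubectl` until it is deleted:
    +
    +```bash
    +kubectl get pod <driver-pod-name>
    +```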
    +
    +Scheduling of the driver and executor pods is handled by Kubernetes. In a future release, it will be possible to affect Kubernetes scheduling
    +decisions for driver and executor pods using advanced primitives like
    +[node selectors](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodeselector)
    +and [node/pod affinities](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity).
    +
    +# Submitting Applications to Kubernetes
    +
    +## Docker Images
    +
    +Kubernetes requires users to supply images that can be deployed into containers within pods. The images are built to
    +be run in a container runtime environment that Kubernetes supports. Docker is a container runtime environment that is
    +frequently used with Kubernetes. Starting with Spark 2.3, the runnable distribution includes Dockerfiles that can be
    +customized and built for your usage.
    +
    +You may build these Docker images from sources.
    +There is a script, `sbin/build-push-docker-images.sh`, that you can use to build and push
    +customized Spark distribution images.
    +
    +Example usage is:
    +
    +    ./sbin/build-push-docker-images.sh -r <repo> -t my-tag build
    +    ./sbin/build-push-docker-images.sh -r <repo> -t my-tag push
    +
    +The Dockerfiles are under `dockerfiles/` and can be customized further before
    +building, either with the supplied script or manually.
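    +
    +Alternatively, the images can be built by hand. The following is a sketch only; the exact Dockerfile
    +paths follow the `dockerfiles/` layout mentioned above, and the image and repository names are assumptions:
    +
    +```bash
    +# Build a customized driver image from the distribution root and push it to your registry
    +docker build -t <repo>/spark-driver:my-tag -f dockerfiles/driver/Dockerfile .
    +docker push <repo>/spark-driver:my-tag
    +```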
    +
    +## Cluster Mode
    +
    +To launch Spark Pi in cluster mode,
    +
    +{% highlight bash %}
    +$ bin/spark-submit \
    +    --deploy-mode cluster \
    +    --class org.apache.spark.examples.SparkPi \
    +    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    +    --conf spark.kubernetes.namespace=default \
    +    --conf spark.executor.instances=5 \
    +    --conf spark.app.name=spark-pi \
    +    --conf spark.kubernetes.driver.docker.image=<driver-image> \
    +    --conf spark.kubernetes.executor.docker.image=<executor-image> \
    +    local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
    +{% endhighlight %}
    +
    +The Spark master, specified either by passing the `--master` command line argument to `spark-submit` or by setting
    +`spark.master` in the application's configuration, must be a URL with the format `k8s://<api_server_url>`. Prefixing the
    +master string with `k8s://` will cause the Spark application to launch on the Kubernetes cluster, with the API server
    +being contacted at `api_server_url`. If no HTTP protocol is specified in the URL, it defaults to `https`. For example,
    +setting the master to `k8s://example.com:443` is equivalent to setting it to `k8s://https://example.com:443`, but to
    +connect without TLS on a different port, the master would be set to `k8s://http://example.com:8080`.
    +
    +If you have a Kubernetes cluster set up, one way to discover the API server URL is by executing `kubectl cluster-info`.
    +
    +```bash
    +kubectl cluster-info
    +Kubernetes master is running at http://127.0.0.1:6443
    +```
    +
    +In the above example, the specific Kubernetes cluster can be used with spark-submit by specifying
    +`--master k8s://http://127.0.0.1:6443` as an argument to spark-submit. Additionally, it is possible to use the
    +authenticating proxy, `kubectl proxy`, to communicate with the Kubernetes API.
    +
    +The local proxy can be started with:
    +
    +```bash
    +kubectl proxy
    +```
    +
    +If the local proxy is running at localhost:8001, `--master k8s://http://127.0.0.1:8001` can be used as the argument to
    +spark-submit. Finally, notice that in the above example we specify a jar with a URI whose scheme is `local://`.
    +This URI is the location of the example jar that is already in the Docker image.
    +
    +## Dependency Management
    +
    +If your application's dependencies are all hosted in remote locations like HDFS or HTTP servers, they may be referred to
    +by their appropriate remote URIs. Also, application dependencies can be pre-mounted into custom-built Docker images.
    +Those dependencies can be added to the classpath by referencing them with `local://` URIs and/or setting the
    +`SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles.
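    +
    +As an illustrative sketch (the application and dependency jar names here are hypothetical), a submission
    +may mix a remote dependency with one pre-mounted in the image:
    +
    +```bash
    +bin/spark-submit \
    +    --deploy-mode cluster \
    +    --class com.example.MyApp \
    +    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    +    --jars https://example.com/deps/my-dep.jar,local:///opt/spark/deps/pre-mounted-dep.jar \
    +    local:///opt/spark/app-jars/my-app.jar
    +```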
    +
    +## Introspection and Debugging
    +
    +These are the different ways in which you can investigate a running or completed Spark application, monitor progress, and
    +take actions.
    +
    +### Accessing Logs
    +
    +Logs can be accessed using the Kubernetes API and the `kubectl` CLI. When a Spark application is running, it's possible
    +to stream logs from the application using:
    +
    +```bash
    +kubectl -n=<namespace> logs -f <driver-pod-name>
    +```
    +
    +The same logs can also be accessed through the
    +[Kubernetes dashboard](https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/) if it is installed on
    +the cluster.
    +
    +### Accessing Driver UI
    +
    +The UI associated with any application can be accessed locally using
    +[`kubectl port-forward`](https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/#forward-a-local-port-to-a-port-on-the-pod).
    +
    +```bash
    +kubectl port-forward <driver-pod-name> 4040:4040
    +```
    +
    +Then, the Spark driver UI can be accessed at `http://localhost:4040`.
    +
    +### Debugging
    +
    +There may be several kinds of failures. If the Kubernetes API server rejects the request made from spark-submit, or the
    +connection is refused for a different reason, the submission logic should indicate the error encountered. However, if there
    +are errors while the application is running, the best way to investigate is often through the Kubernetes CLI.
    +
    +To get some basic information about the scheduling decisions made around the driver pod, you can run:
    +
    +```bash
    +kubectl describe pod <spark-driver-pod>
    +```
    +
    +If the pod has encountered a runtime error, the status can be probed further using:
    +
    +```bash
    +kubectl logs <spark-driver-pod>
    +```
    +
    +Status and logs of failed executor pods can be checked in similar ways. Finally, deleting the driver pod will clean up the entire Spark
    +application, including all executors, associated services, etc. The driver pod can be thought of as the Kubernetes representation of
    +the Spark application.
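    +
    +For example, the whole application can be torn down by deleting the driver pod:
    +
    +```bash
    +kubectl delete pod <spark-driver-pod>
    +```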
    +
    +## Kubernetes Features
    +
    +### Namespaces
    +
    +Kubernetes has the concept of [namespaces](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/).
    +Namespaces are ways to divide cluster resources between multiple users (via resource quota). Spark on Kubernetes can
    +use namespaces to launch Spark applications. The namespace is selected through the `spark.kubernetes.namespace`
    +configuration property passed to spark-submit, as in the sketch below.
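    +
    +A minimal sketch, reusing the cluster mode example from above (the namespace name `spark-jobs` is just
    +an example):
    +
    +```bash
    +kubectl create namespace spark-jobs
    +bin/spark-submit \
    +    --deploy-mode cluster \
    +    --class org.apache.spark.examples.SparkPi \
    +    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    +    --conf spark.kubernetes.namespace=spark-jobs \
    +    --conf spark.kubernetes.driver.docker.image=<driver-image> \
    +    --conf spark.kubernetes.executor.docker.image=<executor-image> \
    +    local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
    +```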
    +
    +Kubernetes allows using [ResourceQuota](https://kubernetes.io/docs/concepts/policy/resource-quotas/) to set limits on
    +resources, number of objects, etc. on individual namespaces. Namespaces and ResourceQuota can be used in combination by
    +administrators to control sharing and resource allocation in a Kubernetes cluster running Spark applications.
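    +
    +For instance (a hypothetical quota; the limits shown are arbitrary), an administrator could cap the
    +resources available to Spark applications in a given namespace with:
    +
    +```bash
    +kubectl create quota spark-quota --hard=cpu=20,memory=64Gi --namespace=spark-jobs
    +```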
    +
    +### RBAC
    +
    +In Kubernetes clusters with [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)
enabled, users can configure
    +Kubernetes RBAC roles and service accounts used by the various Spark on Kubernetes components
to access the Kubernetes
    +API server.
    +
    +The Spark driver pod uses a Kubernetes service account to access the Kubernetes API server
to create and watch executor
    +pods. The service account used by the driver pod must have the appropriate permission
for the driver to be able to do
    +its work. Specifically, at minimum, the service account must be granted a
    +[`Role` or `ClusterRole`](https://kubernetes.io/docs/admin/authorization/rbac/#role-and-clusterrole)
that allows driver
    +pods to create pods and services. By default, the driver pod is automatically assigned
the `default` service account in
    +the namespace specified by `spark.kubernetes.namespace`, if no service account is specified
when the pod gets created.
    +
    +Depending on the version and setup of Kubernetes deployed, this `default` service account
may or may not have the role
    +that allows driver pods to create pods and services under the default Kubernetes
    +[RBAC](https://kubernetes.io/docs/admin/authorization/rbac/) policies. Sometimes users
may need to specify a custom
    +service account that has the right role granted. Spark on Kubernetes supports specifying
a custom service account to
    +be used by the driver pod through the configuration property
    +`spark.kubernetes.authenticate.driver.serviceAccountName=<service account name>`. For example, to make the driver pod
    +use the `spark` service account, a user simply adds the following option to the `spark-submit` command:
    +
    +```
    +--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
    +```
    +
    +To create a custom service account, a user can use the `kubectl create serviceaccount`
command. For example, the
    +following command creates a service account named `spark`:
    +
    +```bash
    +kubectl create serviceaccount spark
    +```
    +
    +To grant a service account a `Role` or `ClusterRole`, a `RoleBinding` or `ClusterRoleBinding` is needed. To create
    +a `RoleBinding` or `ClusterRoleBinding`, a user can use the `kubectl create rolebinding` (or `clusterrolebinding`
    +for `ClusterRoleBinding`) command. For example, the following command creates a `ClusterRoleBinding` named `spark-role`
    +that grants the built-in `edit` `ClusterRole` to the `spark` service account created above in the `default` namespace:
    +
    +```bash
    +kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
    +```
    +
    +Note that a `Role` can only be used to grant access to resources (like pods) within a
single namespace, whereas a
    +`ClusterRole` can be used to grant access to cluster-scoped resources (like nodes) as
well as namespaced resources
    +(like pods) across all namespaces. For Spark on Kubernetes, since the driver always creates
executor pods in the
    +same namespace, a `Role` is sufficient, although users may use a `ClusterRole` instead.
For more information on
    +RBAC authorization and how to configure Kubernetes service accounts for pods, please
refer to
    +[Using RBAC Authorization](https://kubernetes.io/docs/admin/authorization/rbac/) and
    +[Configure Service Accounts for Pods](https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/).
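    +
    +Following up on the note above that a namespaced `Role` is sufficient, here is a hedged sketch of that
    +alternative (the role name, verbs, and resources below are assumptions, not something Spark prescribes):
    +
    +```bash
    +# Create a Role limited to the default namespace that can manage pods, services and configmaps
    +kubectl create role spark-driver-role --verb=get,list,watch,create,delete \
    +  --resource=pods,services,configmaps --namespace=default
    +# Bind it to the spark service account created above
    +kubectl create rolebinding spark-driver-role-binding --role=spark-driver-role \
    +  --serviceaccount=default:spark --namespace=default
    +```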
    +
    +## Client Mode
    +
    +Client mode is not currently supported.
    +
    +## Future Work
    +
    +There are several Spark on Kubernetes features that are currently being incubated in a fork,
    +[apache-spark-on-k8s/spark](https://github.com/apache-spark-on-k8s/spark), which are expected to eventually make it into
    +future versions of the spark-kubernetes integration.
    +
    +Some of these include:
    +
    +* PySpark
    +* R
    +* Dynamic Executor Scaling
    +* Local File Dependency Management
    +* Spark Application Management
    +* Job Queues and Resource Management
    +
    +You can refer to the [documentation](https://apache-spark-on-k8s.github.io/userdocs/)
if you want to try these features
    +and provide feedback to the development team.
    +
    +# Configuration
    +
    +See the [configuration page](configuration.html) for information on Spark configurations.
 The following configurations are
    +specific to Spark on Kubernetes.
    +
    +#### Spark Properties
    +
    +<table class="table">
    +<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
    +<tr>
    +  <td><code>spark.kubernetes.namespace</code></td>
    +  <td><code>default</code></td>
    +  <td>
    +    The namespace that will be used for running the driver and executor pods.
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>spark.kubernetes.driver.docker.image</code></td>
    +  <td><code>(none)</code></td>
    +  <td>
    +    Docker image to use for the driver. Specify this using the standard
    +    <a href="https://docs.docker.com/engine/reference/commandline/tag/">Docker
tag</a> format.
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>spark.kubernetes.executor.docker.image</code></td>
    +  <td><code>(none)</code></td>
    +  <td>
    +    Docker image to use for the executors. Specify this using the standard
    +    <a href="https://docs.docker.com/engine/reference/commandline/tag/">Docker
tag</a> format.
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>spark.kubernetes.allocation.batch.size</code></td>
    +  <td><code>5</code></td>
    +  <td>
    +    Number of pods to launch at once in each round of executor pod allocation.
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>spark.kubernetes.allocation.batch.delay</code></td>
    +  <td><code>1</code></td>
    +  <td>
    +    Number of seconds to wait between each round of executor pod allocation.
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>spark.kubernetes.authenticate.submission.caCertFile</code></td>
    +  <td>(none)</td>
    +  <td>
    +    Path to the CA cert file for connecting to the Kubernetes API server over TLS when
starting the driver. This file
    +    must be located on the submitting machine's disk. Specify this as a path as opposed
to a URI (i.e. do not provide
    +    a scheme).
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>spark.kubernetes.authenticate.submission.clientKeyFile</code></td>
    +  <td>(none)</td>
    +  <td>
    +    Path to the client key file for authenticating against the Kubernetes API server
when starting the driver. This file
    +    must be located on the submitting machine's disk. Specify this as a path as opposed
to a URI (i.e. do not provide
    +    a scheme).
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>spark.kubernetes.authenticate.submission.clientCertFile</code></td>
    +  <td>(none)</td>
    +  <td>
    +    Path to the client cert file for authenticating against the Kubernetes API server
when starting the driver. This
    +    file must be located on the submitting machine's disk. Specify this as a path as
opposed to a URI (i.e. do not
    +    provide a scheme).
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>spark.kubernetes.authenticate.submission.oauthToken</code></td>
    +  <td>(none)</td>
    +  <td>
    +    OAuth token to use when authenticating against the Kubernetes API server when starting
the driver. Note
    +    that unlike the other authentication options, this is expected to be the exact string
value of the token to use for
    +    the authentication.
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>spark.kubernetes.authenticate.driver.caCertFile</code></td>
    +  <td>(none)</td>
    +  <td>
    +    Path to the CA cert file for connecting to the Kubernetes API server over TLS from
the driver pod when requesting
    +    executors. This file must be located on the submitting machine's disk, and will be
uploaded to the driver pod.
    +    Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>spark.kubernetes.authenticate.driver.clientKeyFile</code></td>
    +  <td>(none)</td>
    +  <td>
    +    Path to the client key file for authenticating against the Kubernetes API server
from the driver pod when requesting
    +    executors. This file must be located on the submitting machine's disk, and will be
uploaded to the driver pod.
    +    Specify this as a path as opposed to a URI (i.e. do not provide a scheme). If this
is specified, it is highly
    +    recommended to set up TLS for the driver submission server, as this value is sensitive
information that would be
    +    passed to the driver pod in plaintext otherwise.
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>spark.kubernetes.authenticate.driver.clientCertFile</code></td>
    +  <td>(none)</td>
    +  <td>
    +    Path to the client cert file for authenticating against the Kubernetes API server
from the driver pod when
    +    requesting executors. This file must be located on the submitting machine's disk,
and will be uploaded to the
    +    driver pod. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>spark.kubernetes.authenticate.driver.oauthToken</code></td>
    +  <td>(none)</td>
    +  <td>
    +    OAuth token to use when authenticating against the Kubernetes API server from the driver pod when
    +    requesting executors. Note that unlike the other authentication options, this must
be the exact string value of
    +    the token to use for the authentication. This token value is uploaded to the driver
pod. If this is specified, it is
    +    highly recommended to set up TLS for the driver submission server, as this value
is sensitive information that would
    +    be passed to the driver pod in plaintext otherwise.
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>spark.kubernetes.authenticate.driver.serviceAccountName</code></td>
    +  <td><code>default</code></td>
    +  <td>
    +    Service account that is used when running the driver pod. The driver pod uses this
service account when requesting
    +    executor pods from the API server. Note that this cannot be specified alongside a
CA cert file, client key file,
    +    client cert file, and/or OAuth token.
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>spark.kubernetes.driver.label.[LabelName]</code></td>
    +  <td>(none)</td>
    +  <td>
    +    Add the label specified by <code>LabelName</code> to the driver pod.
    +    For example, <code>spark.kubernetes.driver.label.something=true</code>.
    +    Note that Spark also adds its own labels to the driver pod
    +    for bookkeeping purposes.
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>spark.kubernetes.driver.annotation.[AnnotationName]</code></td>
    +  <td>(none)</td>
    +  <td>
    +    Add the annotation specified by <code>AnnotationName</code> to the driver
pod.
    +    For example, <code>spark.kubernetes.driver.annotation.something=true</code>.
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>spark.kubernetes.executor.label.[LabelName]</code></td>
    +  <td>(none)</td>
    +  <td>
    +    Add the label specified by <code>LabelName</code> to the executor pods.
    +    For example, <code>spark.kubernetes.executor.label.something=true</code>.
    +    Note that Spark also adds its own labels to the executor pods
    +    for bookkeeping purposes.
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>spark.kubernetes.executor.annotation.[AnnotationName]</code></td>
    +  <td>(none)</td>
    +  <td>
    +    Add the annotation specified by <code>AnnotationName</code> to the executor
pods.
    +    For example, <code>spark.kubernetes.executor.annotation.something=true</code>.
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>spark.kubernetes.driver.pod.name</code></td>
    +  <td>(none)</td>
    +  <td>
    +    Name of the driver pod. If not set, the driver pod name is set to the value of <code>spark.app.name</code>
    +    suffixed by the current timestamp to avoid name conflicts.
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>spark.kubernetes.submission.waitAppCompletion</code></td>
    +  <td><code>true</code></td>
    +  <td>
    +    In cluster mode, whether to wait for the application to finish before exiting the launcher process. When changed to
    +    false, the launcher has a "fire-and-forget" behavior when launching the Spark job.
    +  </td>
    --- End diff --
    
    No, it wouldn't. It would simply stop spark-submit from continuing to watch for status
from the cluster. The application runs the same in either case.


---
