From: davor@apache.org
To: commits@beam.incubator.apache.org
Date: Fri, 09 Dec 2016 20:49:39 -0000
Subject: [2/3] incubator-beam-site git commit: Regenerate website

Regenerate website

Project: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/commit/6b76c3f4
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/tree/6b76c3f4
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/diff/6b76c3f4
Branch: refs/heads/asf-site
Commit: 6b76c3f4e0b15236d433909732453e95b635ce87
Parents: 7a9190c
Author: Davor Bonaci
Authored: Fri Dec 9 12:49:30 2016 -0800
Committer: Davor Bonaci
Committed: Fri Dec 9 12:49:30 2016 -0800

----------------------------------------------------------------------
 content/documentation/runners/spark/index.html | 157 +++++++++++++++++++-
 1 file changed, 156 insertions(+), 1 deletion(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/6b76c3f4/content/documentation/runners/spark/index.html
----------------------------------------------------------------------
diff --git a/content/documentation/runners/spark/index.html b/content/documentation/runners/spark/index.html
index 521cf5d..23960e0 100644
--- a/content/documentation/runners/spark/index.html
+++ b/content/documentation/runners/spark/index.html
@@ -146,7 +146,162 @@

Using the Apache Spark Runner

-This page is under construction (BEAM-507).
+

The Apache Spark Runner can be used to execute Beam pipelines using Apache Spark. The Spark Runner runs pipelines just like a native Spark application: you can deploy a self-contained application in local mode, run on Spark’s Standalone resource manager (RM), or run on YARN or Mesos.

+ +
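For illustration, here is a minimal sketch of what such a self-contained application might look like in Java. The class name and file paths are placeholders, and the snippet assumes the Beam Java SDK and the Spark Runner dependency described later on this page:

import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class BeamPipeline {
  public static void main(String[] args) {
    // Picks up options such as --runner=SparkRunner and --sparkMaster=... from the command line.
    SparkPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);

    Pipeline p = Pipeline.create(options);
    // A trivial copy pipeline; the input and output paths are placeholders.
    p.apply(TextIO.Read.from("/tmp/input.txt"))
     .apply(TextIO.Write.to("/tmp/output"));
    p.run();
  }
}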

The Spark Runner executes Beam pipelines on top of Apache Spark, providing:

+ +
  • Batch and streaming (and combined) pipelines.
  • The same fault-tolerance guarantees as provided by RDDs and DStreams.
  • The same security features Spark provides.
  • Built-in metrics reporting using Spark’s metrics system, which reports Beam Aggregators as well.
  • Native support for Beam side-inputs via Spark’s Broadcast variables.
+ +

The Beam Capability Matrix documents the currently supported capabilities of the Spark Runner.

+ +

Note: support for the Beam Model in streaming is currently experimental; follow the mailing list for status updates.

+ +

Spark Runner prerequisites and setup

+ +

The Spark Runner currently supports Spark’s 1.6 branch; more specifically, any version greater than 1.6.0.

+ +

You can add a dependency on the latest version of the Spark Runner by adding the following to your pom.xml:

+
<dependency>
+  <groupId>org.apache.beam</groupId>
+  <artifactId>beam-runners-spark</artifactId>
+  <version>0.3.0-incubating</version>
+</dependency>
+
+
+ +

Deploying Spark with your application

+ +

In some cases, such as running in local mode or Standalone, your (self-contained) application is required to package Spark by explicitly adding the following dependencies to your pom.xml:

+
<dependency>
+  <groupId>org.apache.spark</groupId>
+  <artifactId>spark-core_2.10</artifactId>
+  <version>${spark.version}</version>
+</dependency>
+
+<dependency>
+  <groupId>org.apache.spark</groupId>
+  <artifactId>spark-streaming_2.10</artifactId>
+  <version>${spark.version}</version>
+</dependency>
+
+
+ +

And shading the application jar using the Maven Shade Plugin:

+
<plugin>
+  <groupId>org.apache.maven.plugins</groupId>
+  <artifactId>maven-shade-plugin</artifactId>
+  <configuration>
+    <createDependencyReducedPom>false</createDependencyReducedPom>
+    <filters>
+      <filter>
+        <artifact>*:*</artifact>
+        <excludes>
+          <exclude>META-INF/*.SF</exclude>
+          <exclude>META-INF/*.DSA</exclude>
+          <exclude>META-INF/*.RSA</exclude>
+        </excludes>
+      </filter>
+    </filters>
+  </configuration>
+  <executions>
+    <execution>
+      <phase>package</phase>
+      <goals>
+        <goal>shade</goal>
+      </goals>
+      <configuration>
+        <shadedArtifactAttached>true</shadedArtifactAttached>
+        <shadedClassifierName>shaded</shadedClassifierName>
+      </configuration>
+    </execution>
+  </executions>
+</plugin>
+
+
+ +

After running mvn package, run ls target, and you should see the following (assuming your artifactId is beam-examples and the version is 1.0.0):

+
beam-examples-1.0.0-shaded.jar
+
+
+ +

To run against a Standalone cluster, simply run:

+
spark-submit --class com.beam.examples.BeamPipeline --master spark://HOST:PORT target/beam-examples-1.0.0-shaded.jar --runner=SparkRunner
+
+
+ +

Running on a pre-deployed Spark cluster

+ +

Deploying your Beam pipeline on a cluster that already has a Spark deployment (Spark classes are available in the container classpath) does not require any additional dependencies. For more details on the different deployment modes see: Standalone, YARN, or Mesos. A submission example for YARN follows.

+ +
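For example, submitting the shaded jar from the previous section to YARN might look like this (the class and jar names are the hypothetical ones used earlier on this page):

spark-submit --class com.beam.examples.BeamPipeline --master yarn --deploy-mode cluster \
  target/beam-examples-1.0.0-shaded.jar --runner=SparkRunner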

Pipeline options for the Spark Runner

+ +

When executing your pipeline with the Spark Runner, you should consider the following pipeline options.

Field | Description | Default Value
------|-------------|--------------
runner | The pipeline runner to use. This option allows you to determine the pipeline runner at runtime. | Set to SparkRunner to run using Spark.
sparkMaster | The URL of the Spark Master. This is the equivalent of setting SparkConf#setMaster(String) and can be local[x] to run locally with x cores, spark://host:port to connect to a Spark Standalone cluster, mesos://host:port to connect to a Mesos cluster, or yarn to connect to a YARN cluster. | local[4]
storageLevel | The StorageLevel to use when caching RDDs in batch pipelines. The Spark Runner automatically caches RDDs that are evaluated repeatedly. This is a batch-only property, as streaming pipelines in Beam are stateful, which requires Spark DStream’s StorageLevel to be MEMORY_ONLY. | MEMORY_ONLY
batchIntervalMillis | The StreamingContext’s batchDuration, setting Spark’s batch interval. | 1000
enableSparkMetricSinks | Enable reporting metrics to Spark’s metrics sinks. | true
+ +
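These options can be passed on the command line (e.g. --sparkMaster=local[4]) or set programmatically. A minimal sketch, assuming SparkPipelineOptions exposes accessors matching the option names in the table above:

SparkPipelineOptions options = PipelineOptionsFactory.as(SparkPipelineOptions.class);
options.setRunner(SparkRunner.class);
options.setSparkMaster("spark://HOST:PORT"); // equivalent to SparkConf#setMaster(String)
options.setStorageLevel("MEMORY_ONLY");      // batch-only RDD caching level
options.setBatchIntervalMillis(1000L);       // Spark's streaming batch interval
options.setEnableSparkMetricSinks(true);     // report metrics to Spark's sinks

Pipeline p = Pipeline.create(options);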

Additional notes

+ +

Using spark-submit

+ +

When submitting a Spark application to a cluster, it is common (and recommended) to use the spark-submit script provided with the Spark installation. The PipelineOptions described above are not meant to replace spark-submit, but to complement it. Passing any of the above-mentioned options can be done as one of the application arguments (see the example below), and a --master set via spark-submit takes precedence. For more on how to use spark-submit in general, check out the Spark documentation.

+ +
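For example, using the hypothetical jar and class from earlier on this page, pipeline options are passed after the application jar:

spark-submit --class com.beam.examples.BeamPipeline --master spark://HOST:PORT \
  target/beam-examples-1.0.0-shaded.jar \
  --runner=SparkRunner --storageLevel=MEMORY_ONLY --enableSparkMetricSinks=true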

Monitoring your job

+ +

You can monitor a running Spark job using the Spark Web Interfaces. By default, this is available at port 4040 on the driver node; if you run Spark on your local machine, that would be http://localhost:4040. Spark also has a history server for viewing after the fact, and metrics are also available via the REST API. Spark provides a metrics system that allows reporting Spark metrics to a variety of sinks. The Spark Runner reports user-defined Beam Aggregators using this same metrics system, and currently supports GraphiteSink and CSVSink; providing support for additional sinks supported by Spark is straightforward.

+ +
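Spark sinks are typically enabled through conf/metrics.properties in the Spark installation. As a hypothetical sketch, enabling Spark's built-in GraphiteSink might look like the following (host and port are placeholders, and the Beam-specific sink classes mentioned above may use different class names):

*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=YOUR_GRAPHITE_HOST
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds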

Streaming Execution

+ +

If your pipeline uses an UnboundedSource, the Spark Runner automatically sets streaming mode. Forcing streaming mode is mostly used for testing and is not recommended.

+ +
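A minimal sketch of both behaviors; CountingInput.unbounded() and the StreamingOptions flag are assumptions about the Beam SDK API at this version:

// Assumed imports: org.apache.beam.sdk.io.CountingInput,
// org.apache.beam.sdk.options.StreamingOptions

// Applying an unbounded source makes the Spark Runner execute in streaming mode.
Pipeline p = Pipeline.create(options);
p.apply(CountingInput.unbounded());
p.run();

// Forcing streaming mode explicitly (mostly useful for testing):
options.as(StreamingOptions.class).setStreaming(true);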

Using a provided SparkContext and StreamingListeners

+ +

If you would like to execute your Spark job with a provided SparkContext, such as when using the spark-jobserver, or to use StreamingListeners, you can’t use SparkPipelineOptions (the context or a listener cannot be passed as a command-line argument anyway). Instead, you should use SparkContextOptions, which can only be used programmatically and is not a common PipelineOptions implementation.
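A minimal programmatic sketch, assuming SparkContextOptions exposes setters for a provided context (the exact accessor names are assumptions):

// Assumed imports: org.apache.beam.runners.spark.SparkContextOptions,
// org.apache.spark.SparkConf, org.apache.spark.api.java.JavaSparkContext

JavaSparkContext jsc =
    new JavaSparkContext(new SparkConf().setAppName("beam-app").setMaster("local[4]"));

SparkContextOptions options = PipelineOptionsFactory.as(SparkContextOptions.class);
options.setRunner(SparkRunner.class);
options.setUsesProvidedSparkContext(true);
options.setProvidedSparkContext(jsc); // reuse the externally managed context

Pipeline p = Pipeline.create(options);
p.run();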