From: davor@apache.org
To: commits@beam.incubator.apache.org
Date: Fri, 09 Dec 2016 20:49:39 -0000
Subject: [2/3] incubator-beam-site git commit: Regenerate website

Regenerate website

Project: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/commit/6b76c3f4
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/tree/6b76c3f4
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/diff/6b76c3f4
Branch: refs/heads/asf-site
Commit: 6b76c3f4e0b15236d433909732453e95b635ce87
Parents: 7a9190c
Author: Davor Bonaci
Authored: Fri Dec 9 12:49:30 2016 -0800
Committer: Davor Bonaci
Committed: Fri Dec 9 12:49:30 2016 -0800

----------------------------------------------------------------------
 content/documentation/runners/spark/index.html | 157 +++++++++++++++++++-
 1 file changed, 156 insertions(+), 1 deletion(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/6b76c3f4/content/documentation/runners/spark/index.html
----------------------------------------------------------------------
diff --git a/content/documentation/runners/spark/index.html b/content/documentation/runners/spark/index.html
index 521cf5d..23960e0 100644
--- a/content/documentation/runners/spark/index.html
+++ b/content/documentation/runners/spark/index.html
@@ -146,7 +146,162 @@

Using the Apache Spark Runner

-This page is under construction (BEAM-507).
+

The Apache Spark Runner can be used to execute Beam pipelines using Apache Spark. The Spark Runner runs pipelines just like a native Spark application: you can deploy a self-contained application in local mode, run on Spark’s Standalone resource manager (RM), or run on YARN or Mesos.

+ +
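For illustration, here is a minimal sketch of what such a self-contained application might look like in Java. The class name and file paths are placeholders, and the snippet assumes the Beam Java SDK and the Spark Runner dependency described later on this page:

import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class BeamPipeline {
  public static void main(String[] args) {
    // Picks up options such as --runner=SparkRunner and --sparkMaster=... from the command line.
    SparkPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);

    Pipeline p = Pipeline.create(options);
    // A trivial copy pipeline; the input and output paths are placeholders.
    p.apply(TextIO.Read.from("/tmp/input.txt"))
     .apply(TextIO.Write.to("/tmp/output"));
    p.run();
  }
}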

The Spark Runner executes Beam pipelines on top of Apache Spark, providing:

+ +
  • Batch and streaming (and combined) pipelines.
  • The same fault-tolerance guarantees as provided by RDDs and DStreams.
  • The same security features Spark provides.
  • Built-in metrics reporting using Spark’s metrics system, which reports Beam Aggregators as well.
  • Native support for Beam side-inputs via Spark’s Broadcast variables.
+ +

The Beam Capability Matrix documents the currently supported capabilities of the Spark Runner.

+ +

Note: support for the Beam Model in streaming is currently experimental; follow the mailing list for status updates.

+ +

Spark Runner prerequisites and setup

+ +

The Spark Runner currently supports Spark’s 1.6 branch; more specifically, any version greater than 1.6.0.

+ +

You can add a dependency on the latest version of the Spark Runner by adding the following to your pom.xml:

+
<dependency>
+  <groupId>org.apache.beam</groupId>
+  <artifactId>beam-runners-spark</artifactId>
+  <version>0.3.0-incubating</version>
+</dependency>
+
+
+ +

Deploying Spark with your application

+ +

In some cases, such as running in local mode or Standalone, your (self-contained) application is required to package Spark by explicitly adding the following dependencies to your pom.xml:

+
<dependency>
+  <groupId>org.apache.spark</groupId>
+  <artifactId>spark-core_2.10</artifactId>
+  <version>${spark.version}</version>
+</dependency>
+
+<dependency>
+  <groupId>org.apache.spark</groupId>
+  <artifactId>spark-streaming_2.10</artifactId>
+  <version>${spark.version}</version>
+</dependency>
+
+
+ +

And shading the application jar using the Maven Shade Plugin:

+
<plugin>
+  <groupId>org.apache.maven.plugins</groupId>
+  <artifactId>maven-shade-plugin</artifactId>
+  <configuration>
+    <createDependencyReducedPom>false</createDependencyReducedPom>
+    <filters>
+      <filter>
+        <artifact>*:*</artifact>
+        <excludes>
+          <exclude>META-INF/*.SF</exclude>
+          <exclude>META-INF/*.DSA</exclude>
+          <exclude>META-INF/*.RSA</exclude>
+        </excludes>
+      </filter>
+    </filters>
+  </configuration>
+  <executions>
+    <execution>
+      <phase>package</phase>
+      <goals>
+        <goal>shade</goal>
+      </goals>
+      <configuration>
+        <shadedArtifactAttached>true</shadedArtifactAttached>
+        <shadedClassifierName>shaded</shadedClassifierName>
+      </configuration>
+    </execution>
+  </executions>
+</plugin>
+
+
+ +

After running mvn package, run ls target, and you should see the following (assuming your artifactId is beam-examples and the version is 1.0.0):

+
beam-examples-1.0.0-shaded.jar
+
+
+ +

To run against a Standalone cluster, simply run:

+
spark-submit --class com.beam.examples.BeamPipeline --master spark://HOST:PORT target/beam-examples-1.0.0-shaded.jar --runner=SparkRunner
+
+
+ +

Running on a pre-deployed Spark cluster

+ +

Deploying your Beam pipeline on a cluster that already has a Spark deployment (Spark classes are available in the container classpath) does not require any additional dependencies. For more details on the different deployment modes see: Standalone, YARN, or Mesos. A submission example for YARN follows.

+ +
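For example, submitting the shaded jar from the previous section to YARN might look like this (the class and jar names are the hypothetical ones used earlier on this page):

spark-submit --class com.beam.examples.BeamPipeline --master yarn --deploy-mode cluster \
  target/beam-examples-1.0.0-shaded.jar --runner=SparkRunner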

Pipeline options for the Spark Runner

+ +

When executing your pipeline with the Spark Runner, you should consider the following pipeline options.

Field | Description | Default Value
------|-------------|--------------
runner | The pipeline runner to use. This option allows you to determine the pipeline runner at runtime. | Set to SparkRunner to run using Spark.
sparkMaster | The URL of the Spark Master. This is the equivalent of setting SparkConf#setMaster(String) and can be local[x] to run locally with x cores, spark://host:port to connect to a Spark Standalone cluster, mesos://host:port to connect to a Mesos cluster, or yarn to connect to a YARN cluster. | local[4]
storageLevel | The StorageLevel to use when caching RDDs in batch pipelines. The Spark Runner automatically caches RDDs that are evaluated repeatedly. This is a batch-only property, as streaming pipelines in Beam are stateful, which requires Spark DStream’s StorageLevel to be MEMORY_ONLY. | MEMORY_ONLY
batchIntervalMillis | The StreamingContext’s batchDuration, setting Spark’s batch interval. | 1000
enableSparkMetricSinks | Enable reporting metrics to Spark’s metrics sinks. | true
+ +
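These options can be passed on the command line (e.g. --sparkMaster=local[4]) or set programmatically. A minimal sketch, assuming SparkPipelineOptions exposes accessors matching the option names in the table above:

SparkPipelineOptions options = PipelineOptionsFactory.as(SparkPipelineOptions.class);
options.setRunner(SparkRunner.class);
options.setSparkMaster("spark://HOST:PORT"); // equivalent to SparkConf#setMaster(String)
options.setStorageLevel("MEMORY_ONLY");      // batch-only RDD caching level
options.setBatchIntervalMillis(1000L);       // Spark's streaming batch interval
options.setEnableSparkMetricSinks(true);     // report metrics to Spark's sinks

Pipeline p = Pipeline.create(options);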

Additional notes

+ +

Using spark-submit

+ +

When submitting a Spark application to a cluster, it is common (and recommended) to use the spark-submit script provided with the Spark installation. The PipelineOptions described above are not meant to replace spark-submit, but to complement it. Passing any of the above-mentioned options can be done as one of the application arguments (see the example below), and a --master set via spark-submit takes precedence. For more on how to use spark-submit in general, check out the Spark documentation.

+ +
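For example, using the hypothetical jar and class from earlier on this page, pipeline options are passed after the application jar:

spark-submit --class com.beam.examples.BeamPipeline --master spark://HOST:PORT \
  target/beam-examples-1.0.0-shaded.jar \
  --runner=SparkRunner --storageLevel=MEMORY_ONLY --enableSparkMetricSinks=true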

Monitoring your job

+ +

You can monitor a running Spark job using the Spark Web Interfaces. By default, this is available at port 4040 on the driver node; if you run Spark on your local machine, that would be http://localhost:4040. Spark also has a history server for viewing after the fact, and metrics are also available via the REST API. Spark provides a metrics system that allows reporting Spark metrics to a variety of sinks. The Spark Runner reports user-defined Beam Aggregators using this same metrics system, and currently supports GraphiteSink and CSVSink; providing support for additional sinks supported by Spark is straightforward.

+ +
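Spark sinks are typically enabled through conf/metrics.properties in the Spark installation. As a hypothetical sketch, enabling Spark's built-in GraphiteSink might look like the following (host and port are placeholders, and the Beam-specific sink classes mentioned above may use different class names):

*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=YOUR_GRAPHITE_HOST
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds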

Streaming Execution

+ +

If your pipeline uses an UnboundedSource, the Spark Runner automatically sets streaming mode. Forcing streaming mode is mostly used for testing and is not recommended.

+ +
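A minimal sketch of both behaviors; CountingInput.unbounded() and the StreamingOptions flag are assumptions about the Beam SDK API at this version:

// Assumed imports: org.apache.beam.sdk.io.CountingInput,
// org.apache.beam.sdk.options.StreamingOptions

// Applying an unbounded source makes the Spark Runner execute in streaming mode.
Pipeline p = Pipeline.create(options);
p.apply(CountingInput.unbounded());
p.run();

// Forcing streaming mode explicitly (mostly useful for testing):
options.as(StreamingOptions.class).setStreaming(true);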

Using a provided SparkContext and StreamingListeners

+ +

If you would like to execute your Spark job with a provided SparkContext, such as when using the spark-jobserver, or to use StreamingListeners, you can’t use SparkPipelineOptions (the context or a listener cannot be passed as a command-line argument anyway). Instead, you should use SparkContextOptions, which can only be used programmatically and is not a common PipelineOptions implementation.
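A minimal programmatic sketch, assuming SparkContextOptions exposes setters for a provided context (the exact accessor names are assumptions):

// Assumed imports: org.apache.beam.runners.spark.SparkContextOptions,
// org.apache.spark.SparkConf, org.apache.spark.api.java.JavaSparkContext

JavaSparkContext jsc =
    new JavaSparkContext(new SparkConf().setAppName("beam-app").setMaster("local[4]"));

SparkContextOptions options = PipelineOptionsFactory.as(SparkContextOptions.class);
options.setRunner(SparkRunner.class);
options.setUsesProvidedSparkContext(true);
options.setProvidedSparkContext(jsc); // reuse the externally managed context

Pipeline p = Pipeline.create(options);
p.run();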