beam-commits mailing list archives

Subject [2/3] incubator-beam-site git commit: Regenerate website
Date Tue, 15 Nov 2016 01:11:03 GMT
Regenerate website


Branch: refs/heads/asf-site
Commit: 832d2abe2aa43c5c0a14366dae574f01f57f4f0d
Parents: 5fbc7b7
Author: Davor Bonaci <>
Authored: Mon Nov 14 17:10:54 2016 -0800
Committer: Davor Bonaci <>
Committed: Mon Nov 14 17:10:54 2016 -0800

 .../documentation/runners/dataflow/index.html   | 128 ++++++++++++++++++-
 1 file changed, 126 insertions(+), 2 deletions(-)
diff --git a/content/documentation/runners/dataflow/index.html b/content/documentation/runners/dataflow/index.html
index aa403ec..507be0b 100644
--- a/content/documentation/runners/dataflow/index.html
+++ b/content/documentation/runners/dataflow/index.html
@@ -140,9 +140,133 @@
     <div class="container" role="main">
       <div class="row">
-        <h1 id="using-the-cloud-dataflow-runner">Using the Cloud Dataflow Runner</h1>
+        <h1 id="using-the-google-cloud-dataflow-runner">Using the Google Cloud Dataflow Runner</h1>
-<p>This page is under construction (<a href="">BEAM-508</a>).</p>
+<p>The Google Cloud Dataflow Runner uses the <a href="">Cloud
Dataflow managed service</a>. When you run your pipeline with the Cloud Dataflow service,
the runner uploads your executable code and dependencies to a Google Cloud Storage bucket
and creates a Cloud Dataflow job, which executes your pipeline on managed resources in Google
Cloud Platform.</p>
+<p>The Cloud Dataflow Runner and service are suitable for large scale, continuous jobs,
and provide:</p>
+<ul>
+  <li>a fully managed service</li>
+  <li><a href="">autoscaling</a> of the number of workers throughout the lifetime of the job</li>
+  <li><a href="">dynamic work rebalancing</a></li>
+</ul>
+<p>The <a href="/documentation/runners/capability-matrix/">Beam Capability Matrix</a>
documents the supported capabilities of the Cloud Dataflow Runner.</p>
+<h2 id="cloud-dataflow-runner-prerequisites-and-setup">Cloud Dataflow Runner prerequisites
and setup</h2>
+<p>To use the Cloud Dataflow Runner, you must complete the following setup:</p>
+<ol>
+  <li>
+    <p>Select or create a Google Cloud Platform Console project.</p>
+  </li>
+  <li>
+    <p>Enable billing for your project.</p>
+  </li>
+  <li>
+    <p>Enable required Google Cloud APIs: Cloud Dataflow, Compute Engine, Stackdriver
Logging, Cloud Storage, and Cloud Storage JSON. You may need to enable additional APIs (such
as BigQuery, Cloud Pub/Sub, or Cloud Datastore) if you use them in your pipeline code.</p>
+  </li>
+  <li>
+    <p>Install the Google Cloud SDK.</p>
+  </li>
+  <li>
+    <p>Create a Cloud Storage bucket.</p>
+    <ul>
+      <li>In the Google Cloud Platform Console, go to the Cloud Storage browser.</li>
+      <li>Click <strong>Create bucket</strong>.</li>
+      <li>In the <strong>Create bucket</strong> dialog, specify the following:
+        <ul>
+          <li><em>Name</em>: A unique bucket name. Do not include sensitive
information in the bucket name, as the bucket namespace is global and publicly visible.</li>
+          <li><em>Storage class</em>: Multi-Regional</li>
+          <li><em>Location</em>: Choose your desired location.</li>
+        </ul>
+      </li>
+      <li>Click <strong>Create</strong>.</li>
+    </ul>
+  </li>
+</ol>
+<p>For more information, see the <em>Before you begin</em> section of the
<a href="">Cloud Dataflow quickstarts</a>.</p>
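The project and bucket steps above can be sketched with the Cloud SDK command line. This is a minimal sketch, not from this page: the project ID, bucket name, and location are placeholder values you would substitute with your own.

```shell
# Set the active project (my-project-id is a placeholder).
gcloud config set project my-project-id

# Create a Cloud Storage bucket for staging and temporary files.
# -l sets the bucket location; bucket names are globally unique.
gsutil mb -l US gs://my-dataflow-bucket
```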
+<h3 id="specify-your-dependency">Specify your dependency</h3>
+<p>You must specify your dependency on the Cloud Dataflow Runner.</p>
+<div class="language-java highlighter-rouge"><pre class="highlight"><code><span
class="o">&lt;</span><span class="n">dependency</span><span class="o">&gt;</span>
+  <span class="o">&lt;</span><span class="n">groupId</span><span
class="o">&gt;</span><span class="n">org</span><span class="o">.</span><span
class="na">apache</span><span class="o">.</span><span class="na">beam</span><span
class="o">&lt;/</span><span class="n">groupId</span><span class="o">&gt;</span>
+  <span class="o">&lt;</span><span class="n">artifactId</span><span
class="o">&gt;</span><span class="n">beam</span><span class="o">-</span><span
class="n">runners</span><span class="o">-</span><span class="n">google</span><span
class="o">-</span><span class="n">cloud</span><span class="o">-</span><span
class="n">dataflow</span><span class="o">-</span><span class="n">java</span><span
class="o">&lt;/</span><span class="n">artifactId</span><span class="o">&gt;</span>
+  <span class="o">&lt;</span><span class="n">version</span><span
class="o">&gt;</span><span class="mf">0.3</span><span class="o">.</span><span
class="mi">0</span><span class="o">-</span><span class="n">incubating</span><span
class="o">&lt;/</span><span class="n">version</span><span class="o">&gt;</span>
+  <span class="o">&lt;</span><span class="n">scope</span><span
class="o">&gt;</span><span class="n">runtime</span><span class="o">&lt;/</span><span
class="n">scope</span><span class="o">&gt;</span>
+<span class="o">&lt;/</span><span class="n">dependency</span><span
+<h3 id="authentication">Authentication</h3>
+<p>Before running your pipeline, you must authenticate with the Google Cloud Platform.
Run the following command to get <a href="">Application
Default Credentials</a>.</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>gcloud auth application-default login
+</code></pre></div>
+<h2 id="pipeline-options-for-the-cloud-dataflow-runner">Pipeline options for the Cloud
Dataflow Runner</h2>
+<p>When executing your pipeline with the Cloud Dataflow Runner, set these pipeline options.</p>
+<table class="table table-bordered">
+  <tr>
+    <th>Field</th>
+    <th>Description</th>
+    <th>Default Value</th>
+  </tr>
+  <tr>
+    <td><code>runner</code></td>
+    <td>The pipeline runner to use. This option allows you to determine the pipeline
runner at runtime.</td>
+    <td>Set to <code>dataflow</code> to run on the Cloud Dataflow Service.</td>
+  </tr>
+  <tr>
+    <td><code>project</code></td>
+    <td>The project ID for your Google Cloud Project.</td>
+    <td>If not set, defaults to the default project in the current environment. The default
project is set via <code>gcloud</code>.</td>
+  </tr>
+  <tr>
+    <td><code>streaming</code></td>
+    <td>Whether streaming mode is enabled or disabled; <code>true</code>
if enabled. Set to <code>true</code> if running pipelines with unbounded <code>PCollection</code>s.</td>
+    <td><code>false</code></td>
+  </tr>
+  <tr>
+    <td><code>tempLocation</code></td>
+    <td>Optional. Path for temporary files. If set to a valid Google Cloud Storage URL
that begins with <code>gs://</code>, <code>tempLocation</code> is
used as the default value for <code>gcpTempLocation</code>.</td>
+    <td>No default value.</td>
+  </tr>
+  <tr>
+    <td><code>gcpTempLocation</code></td>
+    <td>Cloud Storage bucket path for temporary files. Must be a valid Cloud Storage
URL that begins with <code>gs://</code>.</td>
+    <td>If not set, defaults to the value of <code>tempLocation</code>, provided
that <code>tempLocation</code> is a valid Cloud Storage URL. If <code>tempLocation</code>
is not a valid Cloud Storage URL, you must set <code>gcpTempLocation</code>.</td>
+  </tr>
+  <tr>
+    <td><code>stagingLocation</code></td>
+    <td>Optional. Cloud Storage bucket path for staging your binary and any temporary
files. Must be a valid Cloud Storage URL that begins with <code>gs://</code>.</td>
+    <td>If not set, defaults to a staging directory within <code>gcpTempLocation</code>.</td>
+  </tr>
+</table>
+<p>See the reference documentation for the <span class="language-java"><a
href="">PipelineOptions</a></span><span class="language-python"><a href="">PipelineOptions</a></span>
interface (and its subinterfaces) for the complete list of pipeline configuration options.</p>
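To illustrate how the options in the table fit together, the plain-Java sketch below assembles the corresponding command-line flags that a pipeline's main method would receive. The project ID and bucket path are hypothetical placeholders, and the helper class name is our own, not part of the Beam SDK.

```java
import java.util.ArrayList;
import java.util.List;

public class DataflowArgs {
    // Build the pipeline-option flags described in the table above.
    // The project and bucket values here are placeholders, not real resources.
    public static String[] build(String project, String gcpTempLocation, boolean streaming) {
        List<String> args = new ArrayList<>();
        args.add("--runner=dataflow");                      // select the Cloud Dataflow Runner
        args.add("--project=" + project);                   // Google Cloud project ID
        args.add("--gcpTempLocation=" + gcpTempLocation);   // must be a gs:// URL
        args.add("--streaming=" + streaming);               // true for unbounded PCollections
        return args.toArray(new String[0]);
    }

    public static void main(String[] unused) {
        for (String flag : build("my-project-id", "gs://my-bucket/tmp", false)) {
            System.out.println(flag);
        }
    }
}
```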
+<h2 id="additional-information-and-caveats">Additional information and caveats</h2>
+<h3 id="monitoring-your-job">Monitoring your job</h3>
+<p>While your pipeline executes, you can monitor the job’s progress, view details
on execution, and receive updates on the pipeline’s results by using the <a href="">Dataflow
Monitoring Interface</a> or the <a href="">Dataflow
Command-line Interface</a>.</p>
+<h3 id="blocking-execution">Blocking Execution</h3>
+<p>To connect to your job and block until it is completed, call <code class="highlighter-rouge">waitUntilFinish</code>
on the <code class="highlighter-rouge">PipelineResult</code> returned from <code
class="highlighter-rouge">pipeline.run()</code>. The Cloud Dataflow Runner prints
job status updates and console messages while it waits. While the result is connected to the
active job, note that pressing <strong>Ctrl+C</strong> from the command line does
not cancel your job. To cancel the job, use the <a href="">Dataflow
Monitoring Interface</a> or the <a href="">Dataflow
Command-line Interface</a>.</p>
+<h3 id="streaming-execution">Streaming Execution</h3>
+<p>If your pipeline uses an unbounded data source or sink, you must set the <code
class="highlighter-rouge">streaming</code> option to <code class="highlighter-rouge">true</code>.</p>
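Taken together, the sections above amount to the usual pattern sketched below. This is a non-authoritative sketch, not code from this page: it assumes the <code>beam-runners-google-cloud-dataflow-java</code> dependency shown earlier is on the classpath, and that the options (<code>--runner</code>, <code>--project</code>, <code>--gcpTempLocation</code>, <code>--streaming</code>, and so on) are supplied on the command line.

```java
// Sketch only: requires the Apache Beam SDK and the Dataflow runner on the classpath.
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DataflowExample {
    public static void main(String[] args) {
        // Parse --runner, --project, --gcpTempLocation, --streaming, etc. from args.
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);

        Pipeline pipeline = Pipeline.create(options);
        // ... apply your PTransforms here ...

        // Block until the Cloud Dataflow job completes (see "Blocking Execution" above);
        // the runner prints job status updates while it waits.
        pipeline.run().waitUntilFinish();
    }
}
```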
