beam-commits mailing list archives

Subject [1/3] incubator-beam-site git commit: [BEAM-508] Fill in the documentation/runners/dataflow portion of the website
Date Tue, 15 Nov 2016 01:11:02 GMT
Repository: incubator-beam-site
Updated Branches:
  refs/heads/asf-site a82a0f3bb -> d5b722e70

[BEAM-508] Fill in the documentation/runners/dataflow portion of the website


Branch: refs/heads/asf-site
Commit: 5fbc7b764d224b58d71ef43c52dff438fe1ddc6d
Parents: a82a0f3
Author: melissa <>
Authored: Fri Nov 11 10:57:28 2016 -0800
Committer: Davor Bonaci <>
Committed: Mon Nov 14 17:10:18 2016 -0800

 src/documentation/runners/ | 113 ++++++++++++++++++++++++++++-
 1 file changed, 111 insertions(+), 2 deletions(-)
diff --git a/src/documentation/runners/ b/src/documentation/runners/
index c49223b..57dec4c 100644
--- a/src/documentation/runners/
+++ b/src/documentation/runners/
@@ -4,6 +4,115 @@ title: "Cloud Dataflow Runner"
 permalink: /documentation/runners/dataflow/
 redirect_from: /learn/runners/dataflow/
-# Using the Cloud Dataflow Runner
-This page is under construction ([BEAM-508](
+# Using the Google Cloud Dataflow Runner
+The Google Cloud Dataflow Runner uses the [Cloud Dataflow managed service]( When you run your pipeline with the Cloud Dataflow service, the runner uploads your executable code and dependencies to a Google Cloud Storage bucket and creates a Cloud Dataflow job, which executes your pipeline on managed resources in Google Cloud Platform.
+The Cloud Dataflow Runner and service are suitable for large scale, continuous jobs, and provide:
+* a fully managed service
+* [autoscaling]( of the number of workers throughout the lifetime of the job
+* [dynamic work rebalancing](
+The [Beam Capability Matrix]({{ site.baseurl }}/documentation/runners/capability-matrix/) documents the supported capabilities of the Cloud Dataflow Runner.
+## Cloud Dataflow Runner prerequisites and setup
+To use the Cloud Dataflow Runner, you must complete the following setup:
+1. Select or create a Google Cloud Platform Console project.
+2. Enable billing for your project.
+3. Enable required Google Cloud APIs: Cloud Dataflow, Compute Engine, Stackdriver Logging, Cloud Storage, and Cloud Storage JSON. You may need to enable additional APIs (such as BigQuery, Cloud Pub/Sub, or Cloud Datastore) if you use them in your pipeline code.
+4. Install the Google Cloud SDK.
+5. Create a Cloud Storage bucket.
+    * In the Google Cloud Platform Console, go to the Cloud Storage browser.
+    * Click **Create bucket**.
+    * In the **Create bucket** dialog, specify the following attributes:
+      * _Name_: A unique bucket name. Do not include sensitive information in the bucket name, as the bucket namespace is global and publicly visible.
+      * _Storage class_: Multi-Regional
+      * _Location_: Choose your desired location.
+    * Click **Create**.
+For more information, see the *Before you begin* section of the [Cloud Dataflow quickstarts](
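As an alternative to the console steps above, a bucket with the same attributes can be created from the command line. This is a sketch, not part of the original commit; the bucket name `my-bucket` and location `US` are placeholders you must replace:

```
# Create a Multi-Regional bucket (name and location are placeholders).
gsutil mb -c multi_regional -l US gs://my-bucket/
```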
+### Specify your dependency
+You must specify your dependency on the Cloud Dataflow Runner in your `pom.xml`.
+<dependency>
+  <groupId>org.apache.beam</groupId>
+  <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
+  <version>{{ site.release_latest }}</version>
+  <scope>runtime</scope>
+</dependency>
+### Authentication
+Before running your pipeline, you must authenticate with the Google Cloud Platform. Run the following command to get [Application Default Credentials](
+gcloud auth application-default login
+## Pipeline options for the Cloud Dataflow Runner
+When executing your pipeline with the Cloud Dataflow Runner, set these pipeline options.
+<table class="table table-bordered">
+  <tr>
+    <th>Field</th>
+    <th>Description</th>
+    <th>Default Value</th>
+  </tr>
+  <tr>
+    <td><code>runner</code></td>
+    <td>The pipeline runner to use. This option allows you to determine the pipeline runner at runtime.</td>
+    <td>Set to <code>dataflow</code> to run on the Cloud Dataflow Service.</td>
+  </tr>
+  <tr>
+    <td><code>project</code></td>
+    <td>The project ID for your Google Cloud Project.</td>
+    <td>If not set, defaults to the default project in the current environment. The default project is set via <code>gcloud</code>.</td>
+  </tr>
+  <tr>
+    <td><code>streaming</code></td>
+    <td>Whether streaming mode is enabled or disabled; <code>true</code> if enabled. Set to <code>true</code> if running pipelines with unbounded <code>PCollection</code>s.</td>
+    <td><code>false</code></td>
+  </tr>
+  <tr>
+    <td><code>tempLocation</code></td>
+    <td>Optional. Path for temporary files. If set to a valid Google Cloud Storage URL that begins with <code>gs://</code>, <code>tempLocation</code> is used as the default value for <code>gcpTempLocation</code>.</td>
+    <td>No default value.</td>
+  </tr>
+  <tr>
+    <td><code>gcpTempLocation</code></td>
+    <td>Cloud Storage bucket path for temporary files. Must be a valid Cloud Storage URL that begins with <code>gs://</code>.</td>
+    <td>If not set, defaults to the value of <code>tempLocation</code>, provided that <code>tempLocation</code> is a valid Cloud Storage URL. If <code>tempLocation</code> is not a valid Cloud Storage URL, you must set <code>gcpTempLocation</code>.</td>
+  </tr>
+  <tr>
+    <td><code>stagingLocation</code></td>
+    <td>Optional. Cloud Storage bucket path for staging your binary and any temporary files. Must be a valid Cloud Storage URL that begins with <code>gs://</code>.</td>
+    <td>If not set, defaults to a staging directory within <code>gcpTempLocation</code>.</td>
+  </tr>
+</table>
+See the reference documentation for the <span class="language-java">[DataflowPipelineOptions]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/runners/dataflow/options/DataflowPipelineOptions.html)</span> interface (and its subinterfaces) for the complete list of pipeline configuration options.
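For illustration, these options are typically passed to the pipeline as `--name=value` command-line arguments. This is a sketch, not content from the commit; the project ID and bucket paths below are placeholders, and the `dataflow` runner value follows the table above:

```
--runner=dataflow
--project=my-project-id
--tempLocation=gs://my-bucket/temp
--stagingLocation=gs://my-bucket/staging
```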
+## Additional information and caveats
+### Monitoring your job
+While your pipeline executes, you can monitor the job's progress, view details on execution, and receive updates on the pipeline's results by using the [Dataflow Monitoring Interface]( or the [Dataflow Command-line Interface](
+### Blocking Execution
+To connect to your job and block until it is completed, call `waitUntilFinish` on the `PipelineResult` returned from `pipeline.run()`. The Cloud Dataflow Runner prints job status updates and console messages while it waits. While the result is connected to the active job, pressing **Ctrl+C** from the command line does not cancel your job. To cancel the job, you can use the [Dataflow Monitoring Interface]( or the [Dataflow Command-line Interface](
+### Streaming Execution
+If your pipeline uses an unbounded data source or sink, you must set the `streaming` option to `true`.
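The blocking-execution and streaming points above can be sketched in Java. This is an illustrative fragment, not code from the commit: it assumes the Beam Java SDK and the Dataflow runner dependency are on the classpath, and the transform steps are elided:

```java
// Sketch: build Dataflow options from command-line args and validate them.
DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
    .withValidation()
    .as(DataflowPipelineOptions.class);
options.setStreaming(true); // required when using unbounded sources or sinks

Pipeline pipeline = Pipeline.create(options);
// ... apply transforms ...

// run() returns a PipelineResult; waitUntilFinish() blocks until the job completes,
// printing job status updates and console messages while it waits.
pipeline.run().waitUntilFinish();
```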
