From: davor@apache.org
To: commits@beam.apache.org
Date: Tue, 31 Jan 2017 07:09:21 -0000
Message-Id: <2e69c2126e264c9f82427576c9830e19@git.apache.org>
Subject: [1/3] beam-site git commit: Update documentation to remove python-sdk branch references

Repository: beam-site
Updated Branches:
  refs/heads/asf-site b81afa390 -> 689c36863

Update documentation to remove python-sdk branch references

Project: http://git-wip-us.apache.org/repos/asf/beam-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam-site/commit/f9eb9fc3
Tree: http://git-wip-us.apache.org/repos/asf/beam-site/tree/f9eb9fc3
Diff: http://git-wip-us.apache.org/repos/asf/beam-site/diff/f9eb9fc3

Branch: refs/heads/asf-site
Commit: f9eb9fc3cc6ac62fd8989f1addf7612213e8dbe6
Parents: b81afa3
Author: Ahmet Altay
Authored: Mon Jan 30 19:22:50 2017 -0800
Committer: Davor Bonaci
Committed: Mon Jan 30 23:06:36 2017 -0800

----------------------------------------------------------------------
 .../2016-10-12-strata-hadoop-world-and-beam.md |  2 +-
 src/contribute/work-in-progress.md             |  1 -
 src/documentation/programming-guide.md         | 38 ++++++++++----------
 src/documentation/runners/dataflow.md          |  2 +-
 src/documentation/runners/direct.md            |  4 +--
 src/documentation/runners/flink.md             |  2 +-
 src/get-started/quickstart-py.md               |  4 +--
 src/get-started/wordcount-example.md           | 24 ++++++-------
 8 files changed, 38 insertions(+), 39 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/beam-site/blob/f9eb9fc3/src/_posts/2016-10-12-strata-hadoop-world-and-beam.md
----------------------------------------------------------------------
diff --git a/src/_posts/2016-10-12-strata-hadoop-world-and-beam.md b/src/_posts/2016-10-12-strata-hadoop-world-and-beam.md
index b78fa4a..4cc9fb4 100644
--- a/src/_posts/2016-10-12-strata-hadoop-world-and-beam.md
+++ b/src/_posts/2016-10-12-strata-hadoop-world-and-beam.md
@@ -18,7 +18,7 @@ I want to share some of takeaways I had about Beam during the conference.
 The Data Engineers are looking to Beam as a way to [future-proof](https://www.oreilly.com/ideas/future-proof-and-scale-proof-your-code), meaning that code is portable between the various Big Data frameworks. In fact, many of the attendees were still on Hadoop MapReduce and looking to transition to a new framework. They’re realizing that continually rewriting code isn’t the most productive approach.

-Data Scientists are really interested in using Beam. They interested in having a single API for doing analysis instead of several different APIs. We talked about Beam’s progress on the Python API. If you want to take a peek, it’s being actively developed on a [feature branch](https://github.com/apache/beam/tree/python-sdk). As Beam matures, we’re looking to add other supported languages.
+Data Scientists are really interested in using Beam. They interested in having a single API for doing analysis instead of several different APIs. We talked about Beam’s progress on the Python API. If you want to take a peek, it’s being actively developed on a [feature branch](https://github.com/apache/beam/tree/master/sdks/python). As Beam matures, we’re looking to add other supported languages.

 We heard [loud and clear](https://twitter.com/jessetanderson/status/781124173108305920) from Beam users that great runner support is crucial to adoption. We have great Apache Flink support. During the conference we had some more volunteers offer their help on the Spark runner.

http://git-wip-us.apache.org/repos/asf/beam-site/blob/f9eb9fc3/src/contribute/work-in-progress.md
----------------------------------------------------------------------
diff --git a/src/contribute/work-in-progress.md b/src/contribute/work-in-progress.md
index 258f87c..c3a4d17 100644
--- a/src/contribute/work-in-progress.md
+++ b/src/contribute/work-in-progress.md
@@ -25,7 +25,6 @@ Current branches include:
 | Feature | Branch | JIRA Component | More Info |
 | ---- | ---- | ---- | ---- |
 | Apache Gearpump Runner | [gearpump-runner](https://github.com/apache/beam/tree/gearpump-runner) | [runner-gearpump](https://issues.apache.org/jira/browse/BEAM/component/12330829) | [README](https://github.com/apache/beam/blob/gearpump-runner/runners/gearpump/README.md) |
-| Python SDK | [python-sdk](https://github.com/apache/beam/tree/python-sdk) | [sdk-py](https://issues.apache.org/jira/browse/BEAM/component/12328910) | [README](https://github.com/apache/beam/blob/python-sdk/sdks/python/README.md) |
 | Apache Spark 2.0 Runner | [runners-spark2](https://github.com/apache/beam/tree/runners-spark2) | - | [thread](https://lists.apache.org/thread.html/e38ac4e4914a6cb1b865b1f32a6ca06c2be28ea4aa0f6b18393de66f@%3Cdev.beam.apache.org%3E) |
 {:.table}

http://git-wip-us.apache.org/repos/asf/beam-site/blob/f9eb9fc3/src/documentation/programming-guide.md
----------------------------------------------------------------------
diff --git a/src/documentation/programming-guide.md b/src/documentation/programming-guide.md
index 71ee487..9846929 100644
--- a/src/documentation/programming-guide.md
+++ b/src/documentation/programming-guide.md
@@ -71,13 +71,13 @@ When you run your Beam driver program, the Pipeline Runner that you designate co

 ## Creating the pipeline

-The `Pipeline` abstraction encapsulates all the data and steps in your data processing task.
Your Beam driver program typically starts by constructing a [Pipeline]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/Pipeline.html)[Pipeline](https://github.com/apache/beam/blob/python-sdk/sdks/python/apache_beam/pipeline.py) object, and then using that object as the basis for creating the pipeline's data sets as `PCollection`s and its operations as `Transform`s. +The `Pipeline` abstraction encapsulates all the data and steps in your data processing task. Your Beam driver program typically starts by constructing a [Pipeline]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/Pipeline.html)[Pipeline](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/pipeline.py) object, and then using that object as the basis for creating the pipeline's data sets as `PCollection`s and its operations as `Transform`s. To use Beam, your driver program must first create an instance of the Beam SDK class `Pipeline` (typically in the `main()` function). When you create your `Pipeline`, you'll also need to set some **configuration options**. You can set your pipeline's configuration options programatically, but it's often easier to set the options ahead of time (or read them from the command line) and pass them to the `Pipeline` object when you create the object. The pipeline configuration options determine, among other things, the `PipelineRunner` that determines where the pipeline gets executed: locally, or using a distributed back-end of your choice. Depending on where your pipeline gets executed and what your specifed Runner requires, the options can also help you specify other aspects of execution. -To set your pipeline's configuration options and create the pipeline, create an object of type [PipelineOptions]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/options/PipelineOptions.html)[PipelineOptions](https://github.com/apache/beam/blob/python-sdk/sdks/python/apache_beam/utils/pipeline_options.py) and pass it to `Pipeline.Create()`. The most common way to do this is by parsing arguments from the command-line: +To set your pipeline's configuration options and create the pipeline, create an object of type [PipelineOptions]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/options/PipelineOptions.html)[PipelineOptions](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/pipeline_options.py) and pass it to `Pipeline.Create()`. The most common way to do this is by parsing arguments from the command-line: ```java public static void main(String[] args) { @@ -333,7 +333,7 @@ class ComputeWordLengthFn(beam.DoFn): # Use return to emit the output element. return [len(word)] -{% github_sample /apache/beam/blob/python-sdk/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:model_pardo_apply +{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:model_pardo_apply %}``` In the example, our input `PCollection` contains `String` values. We apply a `ParDo` transform that specifies a function (`ComputeWordLengthFn`) to compute the length of each string, and outputs the result to a new `PCollection` of `Integer` values that stores the length of each word. @@ -418,7 +418,7 @@ words = ... # Apply a lambda function to the PCollection words. # Save the result as the PCollection word_lengths. 
-{% github_sample /apache/beam/blob/python-sdk/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:model_pardo_using_flatmap +{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:model_pardo_using_flatmap %}``` If your `ParDo` performs a one-to-one mapping of input elements to output elements--that is, for each input element, it applies a function that produces *exactly one* output element, you can use the higher-level `MapElements``Map` transform. `MapElements` can accept an anonymous Java 8 lambda function for additional brevity. @@ -442,7 +442,7 @@ words = ... # Apply a Map with a lambda function to the PCollection words. # Save the result as the PCollection word_lengths. -{% github_sample /apache/beam/blob/python-sdk/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:model_pardo_using_map +{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:model_pardo_using_map %}``` {:.language-java} @@ -490,7 +490,7 @@ Thus, `GroupByKey` represents a transform from a multimap (multiple keys to indi #### Using Combine -[`Combine`]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/transforms/Combine.html)[`Combine`](https://github.com/apache/beam/blob/python-sdk/sdks/python/apache_beam/transforms/core.py) is a Beam transform for combining collections of elements or values in your data. `Combine` has variants that work on entire `PCollection`s, and some that combine the values for each key in `PCollection`s of key/value pairs. +[`Combine`]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/transforms/Combine.html)[`Combine`](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py) is a Beam transform for combining collections of elements or values in your data. `Combine` has variants that work on entire `PCollection`s, and some that combine the values for each key in `PCollection`s of key/value pairs. When you apply a `Combine` transform, you must provide the function that contains the logic for combining the elements or values. The combining function should be commutative and associative, as the function is not necessarily invoked exactly once on all values with a given key. Because the input data (including the value collection) may be distributed across multiple workers, the combining function might be called multiple times to perform partial combining on subsets of the value collection. The Beam SDK also provides some pre-built combine functions for common numeric combination operations such as sum, min, and max. @@ -515,7 +515,7 @@ public static class SumInts implements SerializableFunction, I ``` ```py -{% github_sample /apache/beam/blob/python-sdk/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:combine_bounded_sum +{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:combine_bounded_sum %}``` ##### **Advanced combinations using CombineFn** @@ -570,7 +570,7 @@ public class AverageFn extends CombineFn { ```py pc = ... 
-{% github_sample /apache/beam/blob/python-sdk/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:combine_custom_average +{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:combine_custom_average %}``` If you are combining a `PCollection` of key-value pairs, [per-key combining](#transforms-combine-per-key) is often enough. If you need the combining strategy to change based on the key (for example, MIN for some users and MAX for other users), you can define a `KeyedCombineFn` to access the key within the combining strategy. @@ -661,7 +661,7 @@ avg_accuracy_per_player = (player_accuracies #### Using Flatten and Partition -[`Flatten`]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/transforms/Flatten.html)[`Flatten`](https://github.com/apache/beam/blob/python-sdk/sdks/python/apache_beam/transforms/core.py) and [`Partition`]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/transforms/Partition.html)[`Partition`](https://github.com/apache/beam/blob/python-sdk/sdks/python/apache_beam/transforms/core.py) are Beam transforms for `PCollection` objects that store the same data type. `Flatten` merges multiple `PCollection` objects into a single logical `PCollection`, and `Partition` splits a single `PCollection` into a fixed number of smaller collections. +[`Flatten`]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/transforms/Flatten.html)[`Flatten`](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py) and [`Partition`]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/transforms/Partition.html)[`Partition`](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py) are Beam transforms for `PCollection` objects that store the same data type. `Flatten` merges multiple `PCollection` objects into a single logical `PCollection`, and `Partition` splits a single `PCollection` into a fixed number of smaller collections. ##### **Flatten** @@ -811,13 +811,13 @@ Side inputs are useful if your `ParDo` needs to inject additional data when proc # For example, using pvalue.AsIter(pcoll) at pipeline construction time results in an iterable of the actual elements of pcoll being passed into each process invocation. # In this example, side inputs are passed to a FlatMap transform as extra arguments and consumed by filter_using_length. -{% github_sample /apache/beam/blob/python-sdk/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:model_pardo_side_input +{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:model_pardo_side_input %} # We can also pass side inputs to a ParDo transform, which will get passed to its process method. # The only change is that the first arguments are self and a context, rather than the PCollection element itself. -{% github_sample /apache/beam/blob/python-sdk/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:model_pardo_side_input_dofn +{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:model_pardo_side_input_dofn %} ... @@ -893,12 +893,12 @@ While `ParDo` always produces a main output `PCollection` (as the return value f # with_outputs() returns a DoOutputsTuple object. 
Tags specified in with_outputs are attributes on the returned DoOutputsTuple object. # The tags give access to the corresponding output PCollections. -{% github_sample /apache/beam/blob/python-sdk/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:model_pardo_with_side_outputs +{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:model_pardo_with_side_outputs %} # The result is also iterable, ordered in the same order that the tags were passed to with_outputs(), the main tag (if specified) first. -{% github_sample /apache/beam/blob/python-sdk/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:model_pardo_with_side_outputs_iter +{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:model_pardo_with_side_outputs_iter %}``` ##### Emitting to side outputs in your DoFn: @@ -932,13 +932,13 @@ While `ParDo` always produces a main output `PCollection` (as the return value f # using the pvalue.SideOutputValue wrapper class. # Based on the previous example, this shows the DoFn emitting to the main and side outputs. -{% github_sample /apache/beam/blob/python-sdk/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:model_pardo_emitting_values_on_side_outputs +{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:model_pardo_emitting_values_on_side_outputs %} # Side outputs are also available in Map and FlatMap. # Here is an example that uses FlatMap and shows that the tags do not need to be specified ahead of time. -{% github_sample /apache/beam/blob/python-sdk/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:model_pardo_with_side_outputs_undeclared +{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:model_pardo_with_side_outputs_undeclared %}``` ## Pipeline I/O @@ -1047,14 +1047,14 @@ See the language specific source code directories for the Beam supported I/O API Python -

[Table diff not fully recoverable from the archive: the HTML markup of the built-in I/O transforms table was stripped. The affected rows are the Python entries -- avroio, textio, Google BigQuery, and Google Cloud Datastore -- whose source links were updated to drop the python-sdk branch references; the URLs themselves did not survive extraction.]
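To make the side-output mechanics described earlier in this guide concrete, here is a minimal, illustrative Python sketch. It assumes the `with_outputs` and `pvalue.SideOutputValue` API mentioned above; the tag names and the length cutoff are invented for the example and are not the snippet code referenced in the diff.

```py
# Illustrative sketch only: tag names and the cutoff value are assumptions,
# not the snippets_test.py code referenced above.
import apache_beam as beam
from apache_beam import pvalue

def split_by_length(element, cutoff_length):
  if len(element) <= cutoff_length:
    yield element  # emitted to the main output
  else:
    # emitted to the side output tagged 'above_cutoff'
    yield pvalue.SideOutputValue('above_cutoff', element)

p = beam.Pipeline()
words = p | beam.Create(['a', 'bb', 'ccccc', 'dddddd'])

results = (words
           | beam.FlatMap(split_by_length, cutoff_length=4)
              .with_outputs('above_cutoff', main='below_cutoff'))

below = results.below_cutoff   # tags are attributes on the DoOutputsTuple
above = results.above_cutoff
below, above = results         # also iterable; the main tag comes first
```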

http://git-wip-us.apache.org/repos/asf/beam-site/blob/f9eb9fc3/src/documentation/runners/dataflow.md ---------------------------------------------------------------------- diff --git a/src/documentation/runners/dataflow.md b/src/documentation/runners/dataflow.md index f707d47..f2037a2 100644 --- a/src/documentation/runners/dataflow.md +++ b/src/documentation/runners/dataflow.md @@ -101,7 +101,7 @@ When executing your pipeline with the Cloud Dataflow Runner, set these pipeline -See the reference documentation for the [DataflowPipelineOptions]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/runners/dataflow/options/DataflowPipelineOptions.html)[PipelineOptions](https://github.com/apache/beam/blob/python-sdk/sdks/python/apache_beam/utils/pipeline_options.py) interface (and its subinterfaces) for the complete list of pipeline configuration options. +See the reference documentation for the [DataflowPipelineOptions]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/runners/dataflow/options/DataflowPipelineOptions.html)[PipelineOptions](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/pipeline_options.py) interface (and its subinterfaces) for the complete list of pipeline configuration options. ## Additional information and caveats http://git-wip-us.apache.org/repos/asf/beam-site/blob/f9eb9fc3/src/documentation/runners/direct.md ---------------------------------------------------------------------- diff --git a/src/documentation/runners/direct.md b/src/documentation/runners/direct.md index c96e7b8..babe4cb 100644 --- a/src/documentation/runners/direct.md +++ b/src/documentation/runners/direct.md @@ -37,9 +37,9 @@ You must specify your dependency on the Direct Runner. When executing your pipeline from the command-line, set `runner` to `direct`. The default values for the other pipeline options are generally sufficient. -See the reference documentation for the [`DirectOptions`]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/runners/direct/DirectOptions.html)[`PipelineOptions`](https://github.com/apache/beam/blob/python-sdk/sdks/python/apache_beam/utils/pipeline_options.py) interface (and its subinterfaces) for defaults and the complete list of pipeline configuration options. +See the reference documentation for the [`DirectOptions`]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/runners/direct/DirectOptions.html)[`PipelineOptions`](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/pipeline_options.py) interface (and its subinterfaces) for defaults and the complete list of pipeline configuration options. ## Additional information and caveats -Local execution is limited by the memory available in your local environment. It is highly recommended that you run your pipeline with data sets small enough to fit in local memory. 
You can create a small in-memory data set using a [`Create`]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/transforms/Create.html)[`Create`](https://github.com/apache/beam/blob/python-sdk/sdks/python/apache_beam/transforms/core.py) transform, or you can use a [`Read`]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/io/Read.html)[`Read`](https://github.com/apache/beam/blob/python-sdk/sdks/python/apache_beam/io/iobase.py) transform to work with small local or remote files. +Local execution is limited by the memory available in your local environment. It is highly recommended that you run your pipeline with data sets small enough to fit in local memory. You can create a small in-memory data set using a [`Create`]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/transforms/Create.html)[`Create`](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py) transform, or you can use a [`Read`]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/io/Read.html)[`Read`](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py) transform to work with small local or remote files. http://git-wip-us.apache.org/repos/asf/beam-site/blob/f9eb9fc3/src/documentation/runners/flink.md ---------------------------------------------------------------------- diff --git a/src/documentation/runners/flink.md b/src/documentation/runners/flink.md index f2e59b5..ed52689 100644 --- a/src/documentation/runners/flink.md +++ b/src/documentation/runners/flink.md @@ -129,7 +129,7 @@ When executing your pipeline with the Flink Runner, you can set these pipeline o -See the reference documentation for the [FlinkPipelineOptions]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/runners/flink/FlinkPipelineOptions.html)[PipelineOptions](https://github.com/apache/beam/blob/python-sdk/sdks/python/apache_beam/utils/pipeline_options.py) interface (and its subinterfaces) for the complete list of pipeline configuration options. +See the reference documentation for the [FlinkPipelineOptions]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/runners/flink/FlinkPipelineOptions.html)[PipelineOptions](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/pipeline_options.py) interface (and its subinterfaces) for the complete list of pipeline configuration options. ## Additional information and caveats http://git-wip-us.apache.org/repos/asf/beam-site/blob/f9eb9fc3/src/get-started/quickstart-py.md ---------------------------------------------------------------------- diff --git a/src/get-started/quickstart-py.md b/src/get-started/quickstart-py.md index a198eba..57bdefc 100644 --- a/src/get-started/quickstart-py.md +++ b/src/get-started/quickstart-py.md @@ -63,7 +63,7 @@ For instructions using other shells, see the [virtualenv documentation](https:// ### Download and install 1. Clone the Apache Beam repo from GitHub: - `git clone https://github.com/apache/beam.git --branch python-sdk` + `git clone https://github.com/apache/beam.git` 2. 
Navigate to the `python` directory: `cd beam/sdks/python/` @@ -79,7 +79,7 @@ For instructions using other shells, see the [virtualenv documentation](https:// ## Execute a pipeline locally -The Apache Beam [examples](https://github.com/apache/beam/tree/python-sdk/sdks/python/apache_beam/examples) directory has many examples. All examples can be run locally by passing the required arguments described in the example script. +The Apache Beam [examples](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples) directory has many examples. All examples can be run locally by passing the required arguments described in the example script. For example, to run `wordcount.py`, run: http://git-wip-us.apache.org/repos/asf/beam-site/blob/f9eb9fc3/src/get-started/wordcount-example.md ---------------------------------------------------------------------- diff --git a/src/get-started/wordcount-example.md b/src/get-started/wordcount-example.md index bf484b2..b6e1985 100644 --- a/src/get-started/wordcount-example.md +++ b/src/get-started/wordcount-example.md @@ -69,7 +69,7 @@ You can specify a runner for executing your pipeline, such as the `DataflowRunne ``` ```py -{% github_sample /apache/beam/blob/python-sdk/sdks/python/apache_beam/examples/snippets/snippets.py tag:examples_wordcount_minimal_options +{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py tag:examples_wordcount_minimal_options %}``` The next step is to create a Pipeline object with the options we've just constructed. The Pipeline object builds up the graph of transformations to be executed, associated with that particular pipeline. @@ -79,7 +79,7 @@ Pipeline p = Pipeline.create(options); ``` ```py -{% github_sample /apache/beam/blob/python-sdk/sdks/python/apache_beam/examples/snippets/snippets.py tag:examples_wordcount_minimal_create +{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py tag:examples_wordcount_minimal_create %}``` ### Applying Pipeline Transforms @@ -100,7 +100,7 @@ The Minimal WordCount pipeline contains five transforms: ``` ```py - {% github_sample /apache/beam/blob/python-sdk/sdks/python/apache_beam/examples/snippets/snippets.py tag:examples_wordcount_minimal_read + {% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py tag:examples_wordcount_minimal_read %}``` 2. A [ParDo]({{ site.baseurl }}/documentation/programming-guide/#transforms-pardo) transform that invokes a `DoFn` (defined in-line as an anonymous class) on each element that tokenizes the text lines into individual words. The input for this transform is the `PCollection` of text lines generated by the previous `TextIO.Read` transform. The `ParDo` transform outputs a new `PCollection`, where each element represents an individual word in the text. @@ -120,7 +120,7 @@ The Minimal WordCount pipeline contains five transforms: ```py # The Flatmap transform is a simplified version of ParDo. - {% github_sample /apache/beam/blob/python-sdk/sdks/python/apache_beam/examples/snippets/snippets.py tag:examples_wordcount_minimal_pardo + {% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py tag:examples_wordcount_minimal_pardo %}``` 3. The SDK-provided `Count` transform is a generic transform that takes a `PCollection` of any type, and returns a `PCollection` of key/value pairs. 
Each key represents a unique element from the input collection, and each value represents the number of times that key appeared in the input collection. @@ -132,7 +132,7 @@ The Minimal WordCount pipeline contains five transforms: ``` ```py - {% github_sample /apache/beam/blob/python-sdk/sdks/python/apache_beam/examples/snippets/snippets.py tag:examples_wordcount_minimal_count + {% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py tag:examples_wordcount_minimal_count %}``` 4. The next transform formats each of the key/value pairs of unique words and occurrence counts into a printable string suitable for writing to an output file. @@ -149,7 +149,7 @@ The Minimal WordCount pipeline contains five transforms: ``` ```py - {% github_sample /apache/beam/blob/python-sdk/sdks/python/apache_beam/examples/snippets/snippets.py tag:examples_wordcount_minimal_map + {% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py tag:examples_wordcount_minimal_map %}``` 5. A text file write transform. This transform takes the final `PCollection` of formatted Strings as input and writes each element to an output text file. Each element in the input `PCollection` represents one line of text in the resulting output file. @@ -159,7 +159,7 @@ The Minimal WordCount pipeline contains five transforms: ``` ```py - {% github_sample /apache/beam/blob/python-sdk/sdks/python/apache_beam/examples/snippets/snippets.py tag:examples_wordcount_minimal_write + {% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py tag:examples_wordcount_minimal_write %}``` Note that the `Write` transform produces a trivial result value of type `PDone`, which in this case is ignored. @@ -173,7 +173,7 @@ p.run().waitUntilFinish(); ``` ```py -{% github_sample /apache/beam/blob/python-sdk/sdks/python/apache_beam/examples/snippets/snippets.py tag:examples_wordcount_minimal_run +{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py tag:examples_wordcount_minimal_run %}``` Note that the `run` method is asynchronous. For a blocking execution instead, run your pipeline appending the `waitUntilFinish` method. 
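Taken together, the five transforms and the final run call correspond roughly to the following self-contained Python outline. It is a sketch, not the actual wordcount.py example code: the file paths, the token regex, and the step labels are illustrative assumptions, and the `PipelineOptions` import path follows the module referenced elsewhere in this diff.

```py
# Illustrative sketch assembled from the five steps described above; paths,
# regex, and labels are assumptions, not the actual wordcount.py code.
import re

import apache_beam as beam
from apache_beam.transforms import combiners
from apache_beam.utils.pipeline_options import PipelineOptions

def run(argv=None):
  options = PipelineOptions(argv)   # usually parsed from the command line
  p = beam.Pipeline(options=options)

  (p
   | 'read' >> beam.io.ReadFromText('input.txt')                            # 1. read lines
   | 'split' >> beam.FlatMap(lambda line: re.findall(r"[A-Za-z']+", line))  # 2. words
   | 'count' >> combiners.Count.PerElement()                                # 3. (word, count) pairs
   | 'format' >> beam.Map(lambda wc: '%s: %d' % wc)                         # 4. printable strings
   | 'write' >> beam.io.WriteToText('output'))                              # 5. write output shards

  result = p.run()             # run() may return before the pipeline finishes
  result.wait_until_finish()   # block until completion where the runner supports it

if __name__ == '__main__':
  run()
```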
@@ -214,7 +214,7 @@ static class ExtractWordsFn extends DoFn { ```py # In this example, the DoFns are defined as classes: -{% github_sample /apache/beam/blob/python-sdk/sdks/python/apache_beam/examples/snippets/snippets.py tag:examples_wordcount_wordcount_dofn +{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py tag:examples_wordcount_wordcount_dofn %}``` ### Creating Composite Transforms @@ -253,7 +253,7 @@ public static void main(String[] args) throws IOException { ``` ```py -{% github_sample /apache/beam/blob/python-sdk/sdks/python/apache_beam/examples/snippets/snippets.py tag:examples_wordcount_wordcount_composite +{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py tag:examples_wordcount_wordcount_composite %}``` ### Using Parameterizable PipelineOptions @@ -280,7 +280,7 @@ public static void main(String[] args) { ``` ```py -{% github_sample /apache/beam/blob/python-sdk/sdks/python/apache_beam/examples/snippets/snippets.py tag:examples_wordcount_wordcount_options +{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py tag:examples_wordcount_wordcount_options %}``` ## Debugging WordCount Example @@ -330,7 +330,7 @@ public class DebuggingWordCount { ``` ```py -{% github_sample /apache/beam/blob/python-sdk/sdks/python/apache_beam/examples/snippets/snippets.py tag:example_wordcount_debugging_logging +{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py tag:example_wordcount_debugging_logging %}``` If you execute your pipeline using `DataflowRunner`, you can control the worker log levels. Dataflow workers that execute user code are configured to log to Cloud Logging by default at "INFO" log level and higher. You can override log levels for specific logging namespaces by specifying: `--workerLogLevelOverrides={"Name1":"Level1","Name2":"Level2",...}`. For example, by specifying `--workerLogLevelOverrides={"org.apache.beam.examples":"DEBUG"}` when executing this pipeline using the Dataflow service, Cloud Logging would contain only "DEBUG" or higher level logs for the package in addition to the default "INFO" or higher level logs.
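As a rough companion to the logging discussion above, a Python `DoFn` can emit diagnostic records through the standard `logging` module. The class name, filter pattern, and log levels below are illustrative assumptions, not the code behind the snippet tag referenced in the diff.

```py
# Illustrative sketch of DoFn-level logging; names and the pattern are assumptions.
import logging
import re

import apache_beam as beam

class FilterTextFn(beam.DoFn):
  """Keeps only (word, count) pairs whose word matches a pattern, with logging."""

  def __init__(self, pattern):
    super(FilterTextFn, self).__init__()
    self.pattern = pattern

  def process(self, context):
    word, count = context.element
    if re.match(self.pattern, word):
      logging.debug('Matched %s', word)   # visible only if DEBUG-level logs are enabled
      yield word, count
    else:
      logging.info('Did not match %s', word)

# Usage sketch: filtered_counts = counts | beam.ParDo(FilterTextFn('beam'))
```

Whether the DEBUG records actually appear depends on the worker log level configuration described above.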