From: davor@apache.org To: commits@beam.apache.org Date: Mon, 20 Mar 2017 21:56:58 -0000 Subject: [1/3] beam-site git commit: Add windowing section to programming guide archived-at: Mon, 20 Mar 2017 21:57:01 -0000 Repository: beam-site Updated Branches: refs/heads/asf-site 3b3bc65c2 -> 61de614fb Add windowing section to programming guide Project: http://git-wip-us.apache.org/repos/asf/beam-site/repo Commit: http://git-wip-us.apache.org/repos/asf/beam-site/commit/ddb60795 Tree: http://git-wip-us.apache.org/repos/asf/beam-site/tree/ddb60795 Diff: 
http://git-wip-us.apache.org/repos/asf/beam-site/diff/ddb60795 Branch: refs/heads/asf-site Commit: ddb6079538a1683b798c72644eaa32f5aa7211bb Parents: 3b3bc65 Author: melissa Authored: Wed Mar 1 19:54:17 2017 -0800 Committer: Davor Bonaci Committed: Mon Mar 20 14:56:17 2017 -0700 ---------------------------------------------------------------------- src/documentation/programming-guide.md | 226 ++++++++++++++++++++++- src/images/fixed-time-windows.png | Bin 0 -> 11717 bytes src/images/session-windows.png | Bin 0 -> 16697 bytes src/images/sliding-time-windows.png | Bin 0 -> 16537 bytes src/images/unwindowed-pipeline-bounded.png | Bin 0 -> 9589 bytes src/images/windowing-pipeline-bounded.png | Bin 0 -> 13325 bytes src/images/windowing-pipeline-unbounded.png | Bin 0 -> 21890 bytes 7 files changed, 222 insertions(+), 4 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/beam-site/blob/ddb60795/src/documentation/programming-guide.md ---------------------------------------------------------------------- diff --git a/src/documentation/programming-guide.md b/src/documentation/programming-guide.md index 57b49e8..e91a856 100644 --- a/src/documentation/programming-guide.md +++ b/src/documentation/programming-guide.md @@ -197,7 +197,7 @@ A `PCollection` can be either **bounded** or **unbounded** in size. A **bounded* The bounded (or unbounded) nature of your `PCollection` affects how Beam processes your data. A bounded `PCollection` can be processed using a batch job, which might read the entire data set once, and perform processing in a job of finite length. An unbounded `PCollection` must be processed using a streaming job that runs continuously, as the entire collection can never be available for processing at any one time. 
-When performing an operation that groups elements in an unbounded `PCollection`, Beam requires a concept called **Windowing** to divide a continuously updating data set into logical windows of finite size. Beam processes each window as a bundle, and processing continues as the data set is generated. These logical windows are determined by some characteristic associated with a data element, such as a **timestamp**. +When performing an operation that groups elements in an unbounded `PCollection`, Beam requires a concept called **windowing** to divide a continuously updating data set into logical windows of finite size. Beam processes each window as a bundle, and processing continues as the data set is generated. These logical windows are determined by some characteristic associated with a data element, such as a **timestamp**. #### Element timestamps @@ -1193,7 +1193,7 @@ To set the default Coder for a Java Integer int values for a pipeline. -```java +```java PipelineOptions options = PipelineOptionsFactory.create(); Pipeline p = Pipeline.create(options); @@ -1235,7 +1235,225 @@ public class MyCustomDataType { {:.language-py} The Beam SDK for Python does not support annotating data types with a default coder. If you would like to set a default coder, use the method described in the previous section, *Setting the default coder for a type*. - - +## Working with windowing + +Windowing subdivides a `PCollection` according to the timestamps of its individual elements. Transforms that aggregate multiple elements, such as `GroupByKey` and `Combine`, work implicitly on a per-window basis—that is, they process each `PCollection` as a succession of multiple, finite windows, though the entire collection itself may be of unbounded size. + +A related concept, called **triggers**, determines when to emit the results of aggregation as unbounded data arrives. 
Using a trigger can help to refine the windowing strategy for your `PCollection` to deal with late-arriving data or to provide early results. See the [triggers](#triggers) section for more information. + +### Windowing basics + +Some Beam transforms, such as `GroupByKey` and `Combine`, group multiple elements by a common key. Ordinarily, that grouping operation groups all of the elements that have the same key within the entire data set. With an unbounded data set, it is impossible to collect all of the elements, since new elements are constantly being added and may be infinitely many (e.g. streaming data). If you are working with unbounded `PCollection`s, windowing is especially useful. + +In the Beam model, any `PCollection` (including unbounded `PCollection`s) can be subdivided into logical windows. Each element in a `PCollection` is assigned to one or more windows according to the `PCollection`'s windowing function, and each individual window contains a finite number of elements. Grouping transforms then consider each `PCollection`'s elements on a per-window basis. `GroupByKey`, for example, implicitly groups the elements of a `PCollection` by _key and window_. + +**Caution:** The default windowing behavior is to assign all elements of a `PCollection` to a single, global window, _even for unbounded `PCollection`s_. Before you use a grouping transform such as `GroupByKey` on an unbounded `PCollection`, you must do at least one of the following: + * Set a non-global windowing function. See [Setting your PCollection's windowing function](#setwindowingfunction). + * Set a non-default [trigger](#triggers). This allows the global window to emit results under other conditions, since the default windowing behavior (waiting for all data to arrive) will never occur. 
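The key-and-window grouping model described above can be sketched in plain Python. This is an illustrative sketch only, not the Beam API; the record format and helper names (`assign_fixed_window`, `group_by_key_and_window`) are invented for the example:

```python
from collections import defaultdict

def assign_fixed_window(timestamp, size):
    """Return the [start, end) fixed window that contains the timestamp."""
    start = timestamp - (timestamp % size)
    return (start, start + size)

def group_by_key_and_window(records, window_size):
    """Group (key, value, timestamp) records by key *and* window,
    mirroring how GroupByKey treats a windowed PCollection."""
    groups = defaultdict(list)
    for key, value, ts in records:
        window = assign_fixed_window(ts, window_size)
        groups[(key, window)].append(value)
    return dict(groups)

# Two records share the key "user1" but fall into different 60s windows,
# so they end up in separate groups.
records = [("user1", "click", 12), ("user1", "click", 75), ("user2", "view", 30)]
print(group_by_key_and_window(records, 60))
```

In Beam itself, window assignment is performed by the `PCollection`'s windowing function and the grouping by `GroupByKey`; the sketch just shows why two elements with the same key but different timestamps can land in different groups.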
+ +If you don't set a non-global windowing function or a non-default trigger for your unbounded `PCollection` and subsequently use a grouping transform such as `GroupByKey` or `Combine`, your pipeline will generate an error upon construction and your job will fail. + +#### Windowing constraints + +After you set the windowing function for a `PCollection`, the elements' windows are used the next time you apply a grouping transform to that `PCollection`. Window grouping occurs on an as-needed basis. If you set a windowing function using the `Window` transform, each element is assigned to a window, but the windows are not considered until `GroupByKey` or `Combine` aggregates across a window and key. This can have different effects on your pipeline. +Consider the example pipeline in the figure below: + +![Diagram of pipeline applying windowing]({{ "/images/windowing-pipeline-unbounded.png" | prepend: site.baseurl }} "Pipeline applying windowing") + +**Figure:** Pipeline applying windowing + +In the above pipeline, we create an unbounded `PCollection` by reading a set of key/value pairs using `KafkaIO`, and then apply a windowing function to that collection using the `Window` transform. We then apply a `ParDo` to the collection, and later group the result of that `ParDo` using `GroupByKey`. The windowing function has no effect on the `ParDo` transform, because the windows are not actually used until they're needed for the `GroupByKey`. +Subsequent transforms, however, are applied to the result of the `GroupByKey` -- data is grouped by both key and window. + +#### Using windowing with bounded PCollections + +You can use windowing with fixed-size data sets in **bounded** `PCollection`s. However, note that windowing considers only the implicit timestamps attached to each element of a `PCollection`, and data sources that create fixed data sets (such as `TextIO`) assign the same timestamp to every element. 
This means that all the elements are by default part of a single, global window. + +To use windowing with fixed data sets, you can assign your own timestamps to each element. To assign timestamps to elements, use a `ParDo` transform with a `DoFn` that outputs each element with a new timestamp (for example, the [WithTimestamps]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/transforms/WithTimestamps.html) transform in the Beam SDK for Java). + +To illustrate how windowing with a bounded `PCollection` can affect how your pipeline processes data, consider the following pipeline: + +![Diagram of GroupByKey and ParDo without windowing, on a bounded collection]({{ "/images/unwindowed-pipeline-bounded.png" | prepend: site.baseurl }} "GroupByKey and ParDo without windowing, on a bounded collection") + +**Figure:** `GroupByKey` and `ParDo` without windowing, on a bounded collection. + +In the above pipeline, we create a bounded `PCollection` by reading a set of key/value pairs using `TextIO`. We then group the collection using `GroupByKey`, and apply a `ParDo` transform to the grouped `PCollection`. In this example, the `GroupByKey` creates a collection of unique keys, and then `ParDo` gets applied exactly once per key. + +Note that even if you don’t set a windowing function, there is still a window -- all elements in your `PCollection` are assigned to a single global window. + +Now, consider the same pipeline, but using a windowing function: + +![Diagram of GroupByKey and ParDo with windowing, on a bounded collection]({{ "/images/windowing-pipeline-bounded.png" | prepend: site.baseurl }} "GroupByKey and ParDo with windowing, on a bounded collection") + +**Figure:** `GroupByKey` and `ParDo` with windowing, on a bounded collection. + +As before, the pipeline creates a bounded `PCollection` of key/value pairs. We then set a [windowing function](#setwindowingfunction) for that `PCollection`. 
The `GroupByKey` transform groups the elements of the `PCollection` by both key and window, based on the windowing function. The subsequent `ParDo` transform gets applied multiple times per key, once for each window. + +### Windowing functions + +You can define different kinds of windows to divide the elements of your `PCollection`. Beam provides several windowing functions, including: + +* Fixed Time Windows +* Sliding Time Windows +* Per-Session Windows +* Single Global Window +* Calendar-based Windows (not supported by the Beam SDK for Python) + +Note that each element can logically belong to more than one window, depending on the windowing function you use. Sliding time windowing, for example, creates overlapping windows wherein a single element can be assigned to multiple windows. + +#### Fixed time windows + +The simplest form of windowing is using **fixed time windows**: given a timestamped `PCollection` which might be continuously updating, each window might capture (for example) all elements with timestamps that fall into a five-minute interval. + +A fixed time window represents a consistent, non-overlapping time interval in the data stream. Consider windows with a five-minute duration: all of the elements in your unbounded `PCollection` with timestamp values from 0:00:00 up to (but not including) 0:05:00 belong to the first window, elements with timestamp values from 0:05:00 up to (but not including) 0:10:00 belong to the second window, and so on. + +![Diagram of fixed time windows, 30s in duration]({{ "/images/fixed-time-windows.png" | prepend: site.baseurl }} "Fixed time windows, 30s in duration") + +**Figure:** Fixed time windows, 30s in duration. + +#### Sliding time windows + +A **sliding time window** also represents time intervals in the data stream; however, sliding time windows can overlap. For example, each window might capture five minutes' worth of data, but a new window starts every ten seconds. 
The frequency with which sliding windows begin is called the _period_. Therefore, our example would have a window _duration_ of five minutes and a _period_ of ten seconds. + +Because multiple windows overlap, most elements in a data set will belong to more than one window. This kind of windowing is useful for taking running averages of data; using sliding time windows in our example, you can compute a running average of the past five minutes' worth of data, updated every ten seconds. + +![Diagram of sliding time windows, with 1 minute window duration and 30s window period]({{ "/images/sliding-time-windows.png" | prepend: site.baseurl }} "Sliding time windows, with 1 minute window duration and 30s window period") + +**Figure:** Sliding time windows, with 1 minute window duration and 30s window period. + +#### Session windows + +A **session window** function defines windows that contain elements that are within a certain gap duration of another element. Session windowing applies on a per-key basis and is useful for data that is irregularly distributed with respect to time. For example, a data stream representing user mouse activity may have long periods of idle time interspersed with high concentrations of clicks. If data arrives after the minimum specified gap duration, a new window begins. + +![Diagram of session windows with a minimum gap duration]({{ "/images/session-windows.png" | prepend: site.baseurl }} "Session windows, with a minimum gap duration") + +**Figure:** Session windows, with a minimum gap duration. Note how each data key has different windows, according to its data distribution. + +#### Single global window + +By default, all data in a `PCollection` is assigned to a single global window. If your data set is of a fixed size, you can leave the global window default for your `PCollection`. + +You can use a single global window if you are working with an unbounded data set, e.g. 
from a streaming data source; however, use caution when applying aggregating transforms such as `GroupByKey` and `Combine`. A single global window with a default trigger generally requires the entire data set to be available before processing, which is not possible with continuously updating data. To perform aggregations on an unbounded `PCollection` that uses global windowing, you should specify a non-default trigger for that `PCollection`. + +### Setting your PCollection's windowing function + +You can set the windowing function for a `PCollection` by applying the `Window` transform. When you apply the `Window` transform, you must provide a `WindowFn`. The `WindowFn` determines the windowing function your `PCollection` will use for subsequent grouping transforms, such as a fixed or sliding time window. + +Beam provides pre-defined `WindowFn`s for the basic windowing functions described here. You can also define your own `WindowFn` if you have a more complex need. + +When setting a windowing function, you may also want to set a trigger for your `PCollection`. The trigger determines when each individual window is aggregated and emitted, and helps refine how the windowing function performs with respect to late data and computing early results. See the [triggers](#triggers) section for more information. + +#### Setting fixed-time windows + +The following example code shows how to apply `Window` to divide a `PCollection` into fixed windows, each one minute in length: +```java + PCollection<String> items = ...; + PCollection<String> fixed_windowed_items = items.apply( + Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))); +``` +```py +{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:setting_fixed_windows +%} +``` + +#### Setting sliding time windows + +The following example code shows how to apply `Window` to divide a `PCollection` into sliding time windows. 
Each window is 30 minutes in length, and a new window begins every five seconds: +```java + PCollection<String> items = ...; + PCollection<String> sliding_windowed_items = items.apply( + Window.<String>into(SlidingWindows.of(Duration.standardMinutes(30)).every(Duration.standardSeconds(5)))); +``` +```py +{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:setting_sliding_windows +%} +``` + +#### Setting session windows + +The following example code shows how to apply `Window` to divide a `PCollection` into session windows, where each session must be separated by a time gap of at least 10 minutes: +```java + PCollection<String> items = ...; + PCollection<String> session_windowed_items = items.apply( + Window.<String>into(Sessions.withGapDuration(Duration.standardMinutes(10)))); +``` +```py +{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:setting_session_windows +%} +``` + +Note that the sessions are per-key -- each key in the collection will have its own session groupings depending on the data distribution. + +#### Setting a single global window + +If your `PCollection` is bounded (the size is fixed), you can assign all the elements to a single global window. The following example code shows how to set a single global window for a `PCollection`: + +```java + PCollection<String> items = ...; + PCollection<String> batch_items = items.apply( + Window.<String>into(new GlobalWindows())); +``` +```py +{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:setting_global_window +%} +``` + +### Time skew, data lag, and late data + +In any data processing system, there is a certain amount of lag between the time a data event occurs (the "event time", determined by the timestamp on the data element itself) and the time the actual data element gets processed at any stage in your pipeline (the "processing time", determined by the clock on the system processing the element). 
In addition, there are no guarantees that data events will appear in your pipeline in the same order that they were generated. + +For example, let's say we have a `PCollection` that's using fixed-time windowing, with windows that are five minutes long. For each window, Beam must collect all the data with an _event time_ timestamp in the given window range (between 0:00 and 4:59 in the first window, for instance). Data with timestamps outside that range (data from 5:00 or later) belong to a different window. + +However, data isn't always guaranteed to arrive in a pipeline in time order, or to always arrive at predictable intervals. Beam tracks a _watermark_, which is the system's notion of when all data in a certain window can be expected to have arrived in the pipeline. Data that arrives with a timestamp after the watermark is considered **late data**. + +Continuing our example, suppose we have a simple watermark that assumes approximately 30s of lag time between the data timestamps (the event time) and the time the data appears in the pipeline (the processing time); Beam would then close the first window at 5:30. If a data record arrives at 5:34, but with a timestamp that would put it in the 0:00-4:59 window (say, 3:38), then that record is late data. + +Note: For simplicity, we've assumed that we're using a very straightforward watermark that estimates the lag time/time skew. In practice, your `PCollection`'s data source determines the watermark, and watermarks can be more precise or complex. + +#### Managing time skew and late data + +> **Note:** Managing time skew and late data is not supported in the Beam SDK for Python. + +You can allow late data by invoking the `.withAllowedLateness` operation when you set your `PCollection`'s windowing strategy. The following code example demonstrates a windowing strategy that will allow late data up to two days after the end of a window. 
+```java + PCollection<String> items = ...; + PCollection<String> fixed_windowed_items = items.apply( + Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))) + .withAllowedLateness(Duration.standardDays(2))); +``` + +When you set `.withAllowedLateness` on a `PCollection`, that allowed lateness propagates forward to any subsequent `PCollection` derived from the first `PCollection` you applied allowed lateness to. If you want to change the allowed lateness later in your pipeline, you must do so explicitly by applying `Window.withAllowedLateness()` again. + +You can also use triggers to help you refine the windowing strategy for a `PCollection`. You can use triggers to determine exactly when each individual window aggregates and reports its results, including how the window emits late elements. + +### Adding timestamps to a PCollection's elements + +An unbounded source provides a timestamp for each element. Depending on your unbounded source, you may need to configure how the timestamp is extracted from the raw data stream. + +However, bounded sources (such as a file from `TextIO`) do not provide timestamps. If you need timestamps, you must add them to your `PCollection`'s elements. + +You can assign new timestamps to the elements of a `PCollection` by applying a [ParDo](#transforms-pardo) transform that outputs new elements with timestamps that you set. + +For example, suppose your pipeline reads log records from an input file, and each log record includes a timestamp field; because your pipeline reads the records in from a file, the file source doesn't assign timestamps automatically. You can parse the timestamp field from each record and use a `ParDo` transform with a `DoFn` to attach the timestamps to each element in your `PCollection`. + +```java + PCollection<LogEntry> unstampedLogs = ...; + PCollection<LogEntry> stampedLogs = + unstampedLogs.apply(ParDo.of(new DoFn<LogEntry, LogEntry>() { + @ProcessElement + public void processElement(ProcessContext c) { + // Extract the timestamp from the log entry we're currently processing. 
+ Instant logTimeStamp = extractTimeStampFromLogEntry(c.element()); + // Use ProcessContext.outputWithTimestamp (rather than + // ProcessContext.output) to emit the entry with timestamp attached. + c.outputWithTimestamp(c.element(), logTimeStamp); + } + })); +``` +```py +{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:setting_timestamp +%} +``` + +## Working with triggers > **Note:** This guide is still in progress. There is an open issue to finish the guide ([BEAM-193](https://issues.apache.org/jira/browse/BEAM-193)) http://git-wip-us.apache.org/repos/asf/beam-site/blob/ddb60795/src/images/fixed-time-windows.png ---------------------------------------------------------------------- diff --git a/src/images/fixed-time-windows.png b/src/images/fixed-time-windows.png new file mode 100644 index 0000000..832dc64 Binary files /dev/null and b/src/images/fixed-time-windows.png differ http://git-wip-us.apache.org/repos/asf/beam-site/blob/ddb60795/src/images/session-windows.png ---------------------------------------------------------------------- diff --git a/src/images/session-windows.png b/src/images/session-windows.png new file mode 100644 index 0000000..3ce844c Binary files /dev/null and b/src/images/session-windows.png differ http://git-wip-us.apache.org/repos/asf/beam-site/blob/ddb60795/src/images/sliding-time-windows.png ---------------------------------------------------------------------- diff --git a/src/images/sliding-time-windows.png b/src/images/sliding-time-windows.png new file mode 100644 index 0000000..056732b Binary files /dev/null and b/src/images/sliding-time-windows.png differ http://git-wip-us.apache.org/repos/asf/beam-site/blob/ddb60795/src/images/unwindowed-pipeline-bounded.png ---------------------------------------------------------------------- diff --git a/src/images/unwindowed-pipeline-bounded.png b/src/images/unwindowed-pipeline-bounded.png new file mode 100644 index 0000000..7725f34 
Binary files /dev/null and b/src/images/unwindowed-pipeline-bounded.png differ http://git-wip-us.apache.org/repos/asf/beam-site/blob/ddb60795/src/images/windowing-pipeline-bounded.png ---------------------------------------------------------------------- diff --git a/src/images/windowing-pipeline-bounded.png b/src/images/windowing-pipeline-bounded.png new file mode 100644 index 0000000..198ed11 Binary files /dev/null and b/src/images/windowing-pipeline-bounded.png differ http://git-wip-us.apache.org/repos/asf/beam-site/blob/ddb60795/src/images/windowing-pipeline-unbounded.png ---------------------------------------------------------------------- diff --git a/src/images/windowing-pipeline-unbounded.png b/src/images/windowing-pipeline-unbounded.png new file mode 100644 index 0000000..b5c5ee0 Binary files /dev/null and b/src/images/windowing-pipeline-unbounded.png differ