beam-commits mailing list archives

From jamesmal...@apache.org
Subject [2/3] incubator-beam-site git commit: Addition of sAF blog; delay of PCollection post.
Date Thu, 19 May 2016 04:08:00 GMT
Addition of sAF blog; delay of PCollection post.

sAF blog fixes

Title fix


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/commit/568f051a
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/tree/568f051a
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/diff/568f051a

Branch: refs/heads/asf-site
Commit: 568f051a6f59ffc7e4dc5d300c75b26ec03a78aa
Parents: 697f438
Author: James Malone <jamalone@gmail.com>
Authored: Wed May 18 20:36:51 2016 -0700
Committer: James Malone <jamalone@gmail.com>
Committed: Wed May 18 21:01:24 2016 -0700

----------------------------------------------------------------------
 _data/authors.yml                               |   4 +
 ...016-05-13-where-is-my-pcollection-dot-map.md |  89 --------
 _posts/2016-05-18-splitAtFraction-method.md     |  17 ++
 ...016-05-20-where-is-my-pcollection-dot-map.md |  89 ++++++++
 .../05/13/where-is-my-pcollection-dot-map.html  | 212 -------------------
 .../blog/2016/05/18/splitAtFraction-method.html | 139 ++++++++++++
 content/blog/index.html                         |   8 +-
 content/capability-matrix/index.html            |   2 +-
 content/contribution-guide/index.html           |   4 +-
 content/feed.xml                                |  99 +--------
 content/index.html                              |   2 +-
 11 files changed, 268 insertions(+), 397 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/568f051a/_data/authors.yml
----------------------------------------------------------------------
diff --git a/_data/authors.yml b/_data/authors.yml
index ce732c7..930a514 100644
--- a/_data/authors.yml
+++ b/_data/authors.yml
@@ -14,3 +14,7 @@ robertwb:
     name: Robert Bradshaw
     email: robertwb@apache.org
     twitter:
+dhalperi:
+    name: Dan Halperin
+    email: dhalperi@apache.org
+    twitter:

http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/568f051a/_posts/2016-05-13-where-is-my-pcollection-dot-map.md
----------------------------------------------------------------------
diff --git a/_posts/2016-05-13-where-is-my-pcollection-dot-map.md b/_posts/2016-05-13-where-is-my-pcollection-dot-map.md
deleted file mode 100644
index 3193ea1..0000000
--- a/_posts/2016-05-13-where-is-my-pcollection-dot-map.md
+++ /dev/null
@@ -1,89 +0,0 @@
----
-layout: post
-title:  "Where's my PCollection.map()?"
-date:   2016-05-13 11:00:00 -0700
-excerpt_separator: <!--more-->
-categories: beam model pcollection
-authors:
-  - robertwb
----
-Have you ever wondered why Beam has PTransforms for everything instead of having methods on PCollection? Take a look at the history that led to this (and other) design decisions.
-
-<!--more-->
-
-Though Beam is relatively new, its design draws heavily on many years of experience with real-world pipelines. One of the primary inspirations is FlumeJava, which is Google's internal successor to MapReduce first introduced in 2009.
-
-The original FlumeJava API has methods like `count` and `parallelDo` on the `PCollection`s. Though slightly more succinct, this approach has many disadvantages to extensibility. Every new user to FlumeJava wanted to add transforms, and adding them as methods to `PCollection` simply doesn't scale well. In contrast, a PCollection in Beam has a single `apply` method which takes any PTransform as an argument.
-
-<table class="table">
-  <tr>
-    <th>FlumeJava</th>
-    <th>Beam</th>
-  </tr>
-  <tr>
-    <td><pre>
-PCollection&lt;T&gt; input = …
-PCollection&lt;O&gt; output = input.count()
-                             .parallelDo(...);
-    </pre></td>
-    <td><pre>
-PCollection&lt;T&gt; input = …
-PCollection&lt;O&gt; output = input.apply(Count.perElement())
-                             .apply(ParDo.of(...));
-    </pre></td>
-  </tr>
-</table>
-
-This is a more scalable approach for several reasons.
-
-## Where to draw the line?
-Adding methods to `PCollection` forces a line to be drawn between operations that are "useful" enough to merit this special treatment and those that are not. It is easy to make the case for flat map, group by key, and combine per key. But what about filter? Count? Approximate count? Approximate quantiles? Most frequent? WriteToMyFavoriteSource? Going too far down this path leads to a single enormous class that contains nearly everything one could want to do. (FlumeJava's PCollection class is over 5000 lines long with around 70 distinct operations, and it could have been *much* larger had we accepted every proposal.) Furthermore, since Java doesn’t allow adding methods to a class, there is a sharp divide syntactically between those operations that are added to `PCollection` and those that aren’t. A traditional way to share code is with a library of functions, but functions (in traditional languages like Java at least) are written prefix-style, which doesn't mix well with the fluent builder style (e.g. `input.operation1().operation2().operation3()` vs. `operation3(operation1(input).operation2())`).
-
-Instead in Beam we've chosen a style that places all transforms--whether they be primitive operations, composite operations bundled in the SDK, or part of an external library--on equal footing. This also facilitates alternative implementations (which may even take different options) that are easily interchangeable.
-
-<table class="table">
-  <tr>
-    <th>FlumeJava</th>
-    <th>Beam</th>
-  </tr>
-  <tr>
-    <td><pre>
-PCollection&lt;O&gt; output =
-    ExternalLibrary.doStuff(
-        MyLibrary.transform(input, myArgs)
-            .parallelDo(...),
-        externalLibArgs);
-    </pre></td>
-    <td><pre>
-PCollection&lt;O&gt; output = input
-    .apply(MyLibrary.transform(myArgs))
-    .apply(ParDo.of(...))
-    .apply(ExternalLibrary.doStuff(externalLibArgs));
-    &nbsp;
-    </pre></td>
-  </tr>
-</table>
-
-## Configurability
-While it makes for a more fluent style to let values (PCollections) be the objects passed around and manipulated (i.e. the handles to the deferred execution graph), it is the operations themselves that need to be significantly more composable, configurable, and extendable. Using methods doesn't scale well here, especially in a language without default or keyword arguments. For example, a ParDo operation can have any number of side inputs and side outputs, or a write operation may have configurations dealing with encoding and compression. One option is to separate these out into separate overloads or even methods, but that exacerbates the problems above. (FlumeJava evolved over a dozen overloads of the `parallelDo` method!) Another option is to pass a configuration object that can be built up using more fluent idioms like the builder pattern, but at that point one might as well make the configuration object the operation itself, which is what Beam does.
-
-## Type Safety
-Many operations can only be applied to collections whose elements are of a specific type. For example, the GroupByKey operation should only apply to PCollection<KV<K, V>>s. In Java at least, it's not possible to restrict methods based on the element type parameter alone. In FlumeJava, this led us to add a PTable<K, V> subclassing PCollection<KV<K, V>> to contain all the operations specific to PCollections of key-value pairs. This leads to the same question of which T's are special enough to merit being captured by PCollection subclasses. It is not very extensible for third parties and often requires manual downcasts/conversions (which can't be safely chained in Java) and special operations that produce these PCollection specializations.
-
-This is particularly inconvenient for transforms that produce outputs whose element types are the same as (or related to) their input's element types, requiring extra support to generate the right subclasses (e.g. a filter on a PTable<K, V> should produce another PTable<K, V> rather than just a raw PCollection of key-value pairs).
-
-Using PTransforms allows us to sidestep this entire issue. We can place arbitrary constraints on the context in which a transform may be used based on the type of its inputs; for instance GroupByKey is statically typed to only apply to a PCollection<KV<K, V>>. The way this happens is generalizable to arbitrary shapes, without needing to introduce specialized types like PTable.
-
-## Reusability and Structure
-Though PTransforms are generally constructed at the site at which they're used, by pulling them out as separate objects one is able to store them and pass them around.
-
-As pipelines grow and evolve, it is useful to structure your pipeline into modular, often reusable components, and PTransforms allow one to do this nicely in a data-processing pipeline. In addition, modular PTransforms also expose to the system what the logical structure of your code (e.g. for monitoring). Of the three different representations of the WordCount pipeline below, only the structured view captures the high-level intent of the pipeline. Letting even the simple operations be PTransforms means there's less of an abrupt edge to packaging things up into composite operations.
-
-<img class="center-block" src="{{ "/images/blog/simple-wordcount-pipeline.png" | prepend: site.baseurl }}" alt="Three different visualizations of a simple WordCount pipeline" width="500">
-
-<div class="text-center">
-<i>Three different visualizations of a simple WordCount pipeline which computes the number of occurrences of every work in a set of text files. The flag view gives the full DAG of all operations performed. The execution view groups operations according to how they're executed, e.g. after performing runner-specific optimizations like function composition. The structured view nests operations according to their grouping in PTransforms.</i>
-</div>
-
-## Summary
-Although it's tempting to add methods to PCollections, such an approach is not scalable, extensible, or sufficiently expressive. Putting a single apply() method on PCollection and all the logic into the operation itself lets us have the best of both worlds, and avoids hard cliffs of complexity by having a single consistent style across simple and complex pipelines, and between predefined and user-defined operations.

http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/568f051a/_posts/2016-05-18-splitAtFraction-method.md
----------------------------------------------------------------------
diff --git a/_posts/2016-05-18-splitAtFraction-method.md b/_posts/2016-05-18-splitAtFraction-method.md
new file mode 100644
index 0000000..649af74
--- /dev/null
+++ b/_posts/2016-05-18-splitAtFraction-method.md
@@ -0,0 +1,17 @@
+---
+layout: post
+title:  "Dynamic work rebalancing for Beam"
+date:   2016-05-18 11:00:00 -0700
+excerpt_separator: <!--more-->
+categories: blog
+authors:
+  - dhalperi
+---
+
+This morning, Eugene and Malo from the Google Cloud Dataflow team posted [*No shard left behind: dynamic work rebalancing in Google Cloud Dataflow*](https://cloud.google.com/blog/big-data/2016/05/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow). This article discusses Cloud Dataflow’s solution to the well-known straggler problem.
+
+<!--more-->
+
+In a large batch processing job with many tasks executing in parallel, some of the tasks -- the stragglers -- can take a much longer time to complete than others, perhaps due to imperfect splitting of the work into parallel chunks when issuing the job. Typically, waiting for stragglers means that the overall job completes later than it should, and may also reserve too many machines that may be underutilized at the end. Cloud Dataflow’s dynamic work rebalancing can mitigate stragglers in most cases.
+
+What I’d like to highlight for the Apache Beam (incubating) community is that Cloud Dataflow’s dynamic work rebalancing is implemented using *runner-specific* control logic on top of Beam’s *runner-independent* [`BoundedSource API`](https://github.com/apache/incubator-beam/blob/9fa97fb2491bc784df53fb0f044409dbbc2af3d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedSource.java). Specifically, to steal work from a straggler, a runner need only call the reader’s [`splitAtFraction method`](https://github.com/apache/incubator-beam/blob/3edae9b8b4d7afefb5c803c19bb0a1c21ebba89d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedSource.java#L266). This will generate a new source containing leftover work, and then the runner can pass that source off to another idle worker. As Beam matures, I hope that other runners are interested in figuring out whether these APIs can help them improve performance, implementing dynamic work rebalancing, and collaborating on API changes that will help solve other pain points.

http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/568f051a/_posts/2016-05-20-where-is-my-pcollection-dot-map.md
----------------------------------------------------------------------
diff --git a/_posts/2016-05-20-where-is-my-pcollection-dot-map.md b/_posts/2016-05-20-where-is-my-pcollection-dot-map.md
new file mode 100644
index 0000000..3114164
--- /dev/null
+++ b/_posts/2016-05-20-where-is-my-pcollection-dot-map.md
@@ -0,0 +1,89 @@
+---
+layout: post
+title:  "Where's my PCollection.map()?"
+date:   2016-05-20 11:00:00 -0700
+excerpt_separator: <!--more-->
+categories: blog
+authors:
+  - robertwb
+---
+Have you ever wondered why Beam has PTransforms for everything instead of having methods on PCollection? Take a look at the history that led to this (and other) design decisions.
+<!--more-->
+Though Beam is relatively new, its design draws heavily on many years of experience with real-world pipelines. One of the primary inspirations is [FlumeJava](http://research.google.com/pubs/pub35650.html), which is Google's internal successor to MapReduce first introduced in 2009.
+
+The original FlumeJava API has methods like `count` and `parallelDo` on the PCollections. Though slightly more succinct, this approach has many disadvantages to extensibility. Every new user to FlumeJava wanted to add transforms, and adding them as methods to PCollection simply doesn't scale well. In contrast, a PCollection in Beam has a single `apply` method which takes any PTransform as an argument.
+
+Have you ever wondered why Beam has PTransforms for everything instead of having methods on PCollection? Take a look at the history that led to this (and other) design decisions.
+
+<table class="table">
+  <tr>
+    <th>FlumeJava</th>
+    <th>Beam</th>
+  </tr>
+  <tr>
+    <td><pre>
+PCollection&lt;T&gt; input = …
+PCollection&lt;O&gt; output = input.count()
+                             .parallelDo(...);
+    </pre></td>
+    <td><pre>
+PCollection&lt;T&gt; input = …
+PCollection&lt;O&gt; output = input.apply(Count.perElement())
+                             .apply(ParDo.of(...));
+    </pre></td>
+  </tr>
+</table>
+
+This is a more scalable approach for several reasons.
+
+## Where to draw the line?
+Adding methods to PCollection forces a line to be drawn between operations that are "useful" enough to merit this special treatment and those that are not. It is easy to make the case for flat map, group by key, and combine per key. But what about filter? Count? Approximate count? Approximate quantiles? Most frequent? WriteToMyFavoriteSource? Going too far down this path leads to a single enormous class that contains nearly everything one could want to do. (FlumeJava's PCollection class is over 5000 lines long with around 70 distinct operations, and it could have been *much* larger had we accepted every proposal.) Furthermore, since Java doesn’t allow adding methods to a class, there is a sharp syntactic divide between those operations that are added to PCollection and those that aren’t. A traditional way to share code is with a library of functions, but functions (in traditional languages like Java at least) are written prefix-style, which doesn't mix well with the fluent builder style (e.g. `input.operation1().operation2().operation3()` vs. `operation3(operation1(input).operation2())`).
+
+Instead in Beam we've chosen a style that places all transforms--whether they be primitive operations, composite operations bundled in the SDK, or part of an external library--on equal footing. This also facilitates alternative implementations (which may even take different options) that are easily interchangeable.
+
+<table class="table">
+  <tr>
+    <th>FlumeJava</th>
+    <th>Beam</th>
+  </tr>
+  <tr>
+    <td><pre>
+PCollection&lt;O&gt; output =
+    ExternalLibrary.doStuff(
+        MyLibrary.transform(input, myArgs)
+            .parallelDo(...),
+        externalLibArgs);
+    </pre></td>
+    <td><pre>
+PCollection&lt;O&gt; output = input
+    .apply(MyLibrary.transform(myArgs))
+    .apply(ParDo.of(...))
+    .apply(ExternalLibrary.doStuff(externalLibArgs));
+    &nbsp;
+    </pre></td>
+  </tr>
+</table>
+
+## Configurability
+It makes for a fluent style to let values (PCollections) be the objects passed around and manipulated (i.e. the handles to the deferred execution graph), but it is the operations themselves that need to be composable, configurable, and extendable. Using PCollection methods for the operations doesn't scale well here, especially in a language without default or keyword arguments. For example, a ParDo operation can have any number of side inputs and side outputs, or a write operation may have configurations dealing with encoding and compression. One option is to separate these out into multiple overloads or even methods, but that exacerbates the problems above. (FlumeJava evolved over a dozen overloads of the `parallelDo` method!) Another option is to pass each method a configuration object that can be built up using more fluent idioms like the builder pattern, but at that point one might as well make the configuration object the operation itself, which is what Beam does.
+
+## Type Safety
+Many operations can only be applied to collections whose elements are of a specific type. For example, the GroupByKey operation should only be applied to `PCollection<KV<K, V>>`s. In Java at least, it's not possible to restrict methods based on the element type parameter alone. In FlumeJava, this led us to add a `PTable<K, V>` subclassing `PCollection<KV<K, V>>` to contain all the operations specific to PCollections of key-value pairs. This leads to the same question of which element types are special enough to merit being captured by PCollection subclasses. It is not very extensible for third parties and often requires manual downcasts/conversions (which can't be safely chained in Java) and special operations that produce these PCollection specializations.
+
+This is particularly inconvenient for transforms that produce outputs whose element types are the same as (or related to) their input's element types, requiring extra support to generate the right subclasses (e.g. a filter on a PTable should produce another PTable rather than just a raw PCollection of key-value pairs).
+
+Using PTransforms allows us to sidestep this entire issue. We can place arbitrary constraints on the context in which a transform may be used based on the type of its inputs; for instance GroupByKey is statically typed to only apply to a `PCollection<KV<K, V>>`. The way this happens is generalizable to arbitrary shapes, without needing to introduce specialized types like PTable.
+
+## Reusability and Structure
+Though PTransforms are generally constructed at the site at which they're used, by pulling them out as separate objects one is able to store them and pass them around.
+
+As pipelines grow and evolve, it is useful to structure your pipeline into modular, often reusable components, and PTransforms allow one to do this nicely in a data-processing pipeline. In addition, modular PTransforms also expose the logical structure of your code to the system (e.g. for monitoring). Of the three different representations of the WordCount pipeline below, only the structured view captures the high-level intent of the pipeline. Letting even the simple operations be PTransforms means there's less of an abrupt edge to packaging things up into composite operations.
+
+<img class="center-block" src="{{ "/images/blog/simple-wordcount-pipeline.png" | prepend: site.baseurl }}" alt="Three different visualizations of a simple WordCount pipeline" width="500">
+
+<div class="text-center">
+<i>Three different visualizations of a simple WordCount pipeline which computes the number of occurrences of every word in a set of text files. The flag view gives the full DAG of all operations performed. The execution view groups operations according to how they're executed, e.g. after performing runner-specific optimizations like function composition. The structured view nests operations according to their grouping in PTransforms.</i>
+</div>
+
+## Summary
+Although it's tempting to add methods to PCollections, such an approach is not scalable, extensible, or sufficiently expressive. Putting a single apply method on PCollection and all the logic into the operation itself lets us have the best of both worlds, and avoids hard cliffs of complexity by having a single consistent style across simple and complex pipelines, and between predefined and user-defined operations.

http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/568f051a/content/beam/model/pcollection/2016/05/13/where-is-my-pcollection-dot-map.html
----------------------------------------------------------------------
diff --git a/content/beam/model/pcollection/2016/05/13/where-is-my-pcollection-dot-map.html b/content/beam/model/pcollection/2016/05/13/where-is-my-pcollection-dot-map.html
deleted file mode 100644
index 0017365..0000000
--- a/content/beam/model/pcollection/2016/05/13/where-is-my-pcollection-dot-map.html
+++ /dev/null
@@ -1,212 +0,0 @@
-<!DOCTYPE html>
-<html lang="en">
-
-  <head>
-  <meta charset="utf-8">
-  <meta http-equiv="X-UA-Compatible" content="IE=edge">
-  <meta name="viewport" content="width=device-width, initial-scale=1">
-
-  <title>Where&#39;s my PCollection.map()?</title>
-  <meta name="description" content="Have you ever wondered why Beam has PTransforms for everything instead of having methods on PCollection? Take a look at the history that led to this (and oth...">
-
-  <link rel="stylesheet" href="/styles/site.css">
-  <link rel="stylesheet" href="/css/theme.css">
-  <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.2.0/jquery.min.js"></script>
-  <script src="/js/bootstrap.min.js"></script>
-  <link rel="canonical" href="http://beam.incubator.apache.org/beam/model/pcollection/2016/05/13/where-is-my-pcollection-dot-map.html">
-  <link rel="alternate" type="application/rss+xml" title="Apache Beam (incubating)" href="http://beam.incubator.apache.org/feed.xml">
-  <script>
-    (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
-    (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
-    m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
-    })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
-
-    ga('create', 'UA-73650088-1', 'auto');
-    ga('send', 'pageview');
-
-  </script>
-  <link rel="shortcut icon" type="image/x-icon" href="/images/favicon.ico">
-</head>
-
-
-  <body role="document">
-
-    <nav class="navbar navbar-default navbar-fixed-top">
-  <div class="container">
-    <div class="navbar-header">
-      <a href="/" class="navbar-brand" >
-        <img alt="Brand" style="height: 25px" src="/images/beam_logo_navbar.png">
-      </a>
-      <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
-        <span class="sr-only">Toggle navigation</span>
-        <span class="icon-bar"></span>
-        <span class="icon-bar"></span>
-        <span class="icon-bar"></span>
-      </button>
-    </div>
-    <div id="navbar" class="navbar-collapse collapse">
-      <ul class="nav navbar-nav">
-        <li class="dropdown">
-          <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Documentation <span class="caret"></span></a>
-          <ul class="dropdown-menu">
-            <li class="dropdown-header">Guides</li>
-            <li><a href="/getting_started/">Getting Started</a></li>
-            <li role="separator" class="divider"></li>
-            <li class="dropdown-header">Technical Documentation</li>
-            <li><a href="/capability-matrix/">Capability Matrix</a></li>
-            <li><a href="https://goo.gl/ps8twC">Technical Docs</a></li>
-            <li><a href="https://goo.gl/nk5OM0">Technical Vision</a></li>
-          </ul>
-        </li>
-        <li class="dropdown">
-          <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Community <span class="caret"></span></a>
-          <ul class="dropdown-menu">
-            <li class="dropdown-header">Community</li>
-            <li><a href="/mailing_lists/">Mailing Lists</a></li>
-            <li><a href="/team/">Apache Beam Team</a></li>
-            <li><a href="/public-meetings/">Public Meetings</a></li>
-            <li role="separator" class="divider"></li>
-            <li class="dropdown-header">Contribute</li>
-            <li><a href="/contribution-guide/">Contribution Guide</a></li>
-            <li><a href="/source_repository/">Source Repository</a></li>
-            <li><a href="/issue_tracking/">Issue Tracking</a></li>
-          </ul>
-        </li>
-        <li><a href="/blog">Blog</a></li>
-        <li class="dropdown">
-          <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Project <span class="caret"></span></a>
-          <ul class="dropdown-menu">
-            <li><a href="/presentation-materials/">Presentation Materials</a></li>
-            <li><a href="/material/">Logos and design</a></li>
-            <li><a href="http://apache.org/licenses/LICENSE-2.0.html">License</a></li>
-          </ul>
-        </li>
-      </ul>
-    </div><!--/.nav-collapse -->
-  </div>
-</nav>
-
-
-<link rel="stylesheet" href="">
-
-
-    <div class="container" role="main">
-
-      <div class="container">
-        
-
-<article class="post" itemscope itemtype="http://schema.org/BlogPosting">
-
-  <header class="post-header">
-    <h1 class="post-title" itemprop="name headline">Where's my PCollection.map()?</h1>
-    <p class="post-meta"><time datetime="2016-05-13T11:00:00-07:00" itemprop="datePublished">May 13, 2016</time> •  Robert Bradshaw 
-</p>
-  </header>
-
-  <div class="post-content" itemprop="articleBody">
-    <p>Have you ever wondered why Beam has PTransforms for everything instead of having methods on PCollection? Take a look at the history that led to this (and other) design decisions.</p>
-
-<!--more-->
-
-<p>Though Beam is relatively new, its design draws heavily on many years of experience with real-world pipelines. One of the primary inspirations is FlumeJava, which is Google’s internal successor to MapReduce first introduced in 2009.</p>
-
-<p>The original FlumeJava API has methods like <code class="highlighter-rouge">count</code> and <code class="highlighter-rouge">parallelDo</code> on the <code class="highlighter-rouge">PCollection</code>s. Though slightly more succinct, this approach has many disadvantages to extensibility. Every new user to FlumeJava wanted to add transforms, and adding them as methods to <code class="highlighter-rouge">PCollection</code> simply doesn’t scale well. In contrast, a PCollection in Beam has a single <code class="highlighter-rouge">apply</code> method which takes any PTransform as an argument.</p>
-
-<table class="table">
-  <tr>
-    <th>FlumeJava</th>
-    <th>Beam</th>
-  </tr>
-  <tr>
-    <td><pre>
-PCollection&lt;T&gt; input = …
-PCollection&lt;O&gt; output = input.count()
-                             .parallelDo(...);
-    </pre></td>
-    <td><pre>
-PCollection&lt;T&gt; input = …
-PCollection&lt;O&gt; output = input.apply(Count.perElement())
-                             .apply(ParDo.of(...));
-    </pre></td>
-  </tr>
-</table>
-
-<p>This is a more scalable approach for several reasons.</p>
-
-<h2 id="where-to-draw-the-line">Where to draw the line?</h2>
-<p>Adding methods to <code class="highlighter-rouge">PCollection</code> forces a line to be drawn between operations that are “useful” enough to merit this special treatment and those that are not. It is easy to make the case for flat map, group by key, and combine per key. But what about filter? Count? Approximate count? Approximate quantiles? Most frequent? WriteToMyFavoriteSource? Going too far down this path leads to a single enormous class that contains nearly everything one could want to do. (FlumeJava’s PCollection class is over 5000 lines long with around 70 distinct operations, and it could have been <em>much</em> larger had we accepted every proposal.) Furthermore, since Java doesn’t allow adding methods to a class, there is a sharp divide syntactically between those operations that are added to <code class="highlighter-rouge">PCollection</code> and those that aren’t. A traditional way to share code is with a library of functions, but functions (in traditional languages like Java at least) are written prefix-style, which doesn’t mix well with the fluent builder style (e.g. <code class="highlighter-rouge">input.operation1().operation2().operation3()</code> vs. <code class="highlighter-rouge">operation3(operation1(input).operation2())</code>).</p>
-
-<p>Instead in Beam we’ve chosen a style that places all transforms–whether they be primitive operations, composite operations bundled in the SDK, or part of an external library–on equal footing. This also facilitates alternative implementations (which may even take different options) that are easily interchangeable.</p>
-
-<table class="table">
-  <tr>
-    <th>FlumeJava</th>
-    <th>Beam</th>
-  </tr>
-  <tr>
-    <td><pre>
-PCollection&lt;O&gt; output =
-    ExternalLibrary.doStuff(
-        MyLibrary.transform(input, myArgs)
-            .parallelDo(...),
-        externalLibArgs);
-    </pre></td>
-    <td><pre>
-PCollection&lt;O&gt; output = input
-    .apply(MyLibrary.transform(myArgs))
-    .apply(ParDo.of(...))
-    .apply(ExternalLibrary.doStuff(externalLibArgs));
-    &nbsp;
-    </pre></td>
-  </tr>
-</table>
-
-<h2 id="configurability">Configurability</h2>
-<p>While it makes for a more fluent style to let values (PCollections) be the objects passed around and manipulated (i.e. the handles to the deferred execution graph), it is the operations themselves that need to be significantly more composable, configurable, and extendable. Using methods doesn’t scale well here, especially in a language without default or keyword arguments. For example, a ParDo operation can have any number of side inputs and side outputs, or a write operation may have configurations dealing with encoding and compression. One option is to separate these out into separate overloads or even methods, but that exacerbates the problems above. (FlumeJava evolved over a dozen overloads of the <code class="highlighter-rouge">parallelDo</code> method!) Another option is to pass a configuration object that can be built up using more fluent idioms like the builder pattern, but at that point one might as well make the configuration object the operation itself, which is what Beam does.</p>
-
-<h2 id="type-safety">Type Safety</h2>
-<p>Many operations can only be applied to collections whose elements are of a specific type. For example, the GroupByKey operation should only apply to PCollection&lt;KV&lt;K, V»s. In Java at least, it’s not possible to restrict methods based on the element type parameter alone. In FlumeJava, this led us to add a PTable&lt;K, V&gt; subclassing PCollection&lt;KV&lt;K, V» to contain all the operations specific to PCollections of key-value pairs. This leads to the same question of which T’s are special enough to merit being captured by PCollection subclasses. It is not very extensible for third parties and often requires manual downcasts/conversions (which can’t be safely chained in Java) and special operations that produce these PCollection specializations.</p>
-
-<p>This is particularly inconvenient for transforms that produce outputs whose element types are the same as (or related to) their input’s element types, requiring extra support to generate the right subclasses (e.g. a filter on a PTable&lt;K, V&gt; should produce another PTable&lt;K, V&gt; rather than just a raw PCollection of key-value pairs).</p>
-
-<p>Using PTransforms allows us to sidestep this entire issue. We can place arbitrary constraints on the context in which a transform may be used based on the type of its inputs; for instance GroupByKey is statically typed to only apply to a PCollection&lt;KV&lt;K, V». The way this happens is generalizable to arbitrary shapes, without needing to introduce specialized types like PTable.</p>
-
-<h2 id="reusability-and-structure">Reusability and Structure</h2>
-<p>Though PTransforms are generally constructed at the site at which they’re used, by pulling them out as separate objects one is able to store them and pass them around.</p>
-
-<p>As pipelines grow and evolve, it is useful to structure your pipeline into modular, often reusable components, and PTransforms allow one to do this nicely in a data-processing pipeline. In addition, modular PTransforms also expose to the system what the logical structure of your code (e.g. for monitoring). Of the three different representations of the WordCount pipeline below, only the structured view captures the high-level intent of the pipeline. Letting even the simple operations be PTransforms means there’s less of an abrupt edge to packaging things up into composite operations.</p>
-
-<p><img class="center-block" src="/images/blog/simple-wordcount-pipeline.png" alt="Three different visualizations of a simple WordCount pipeline" width="500" /></p>
-
-<div class="text-center">
-<i>Three different visualizations of a simple WordCount pipeline which computes the number of occurrences of every work in a set of text files. The flag view gives the full DAG of all operations performed. The execution view groups operations according to how they're executed, e.g. after performing runner-specific optimizations like function composition. The structured view nests operations according to their grouping in PTransforms.</i>
-</div>
-
-<h2 id="summary">Summary</h2>
-<p>Although it’s tempting to add methods to PCollections, such an approach is not scalable, extensible, or sufficiently expressive. Putting a single apply() method on PCollection and all the logic into the operation itself lets us have the best of both worlds, and avoids hard cliffs of complexity by having a single consistent style across simple and complex pipelines, and between predefined and user-defined operations.</p>
-
-  </div>
-
-</article>
-
-      </div>
-
-
-    <hr>
-  <div class="row">
-      <div class="col-xs-12">
-          <footer>
-              <p class="text-center">&copy; Copyright 2016
-                <a href="http://www.apache.org">The Apache Software Foundation.</a> All Rights Reserved.</p>
-                <p class="text-center"><a href="/privacy_policy">Privacy Policy</a> |
-                <a href="/feed.xml">RSS Feed</a></p>
-          </footer>
-      </div>
-  </div>
-  <!-- container div end -->
-</div>
-
-
-  </body>
-
-</html>

http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/568f051a/content/blog/2016/05/18/splitAtFraction-method.html
----------------------------------------------------------------------
diff --git a/content/blog/2016/05/18/splitAtFraction-method.html b/content/blog/2016/05/18/splitAtFraction-method.html
new file mode 100644
index 0000000..d723556
--- /dev/null
+++ b/content/blog/2016/05/18/splitAtFraction-method.html
@@ -0,0 +1,139 @@
+<!DOCTYPE html>
+<html lang="en">
+
+  <head>
+  <meta charset="utf-8">
+  <meta http-equiv="X-UA-Compatible" content="IE=edge">
+  <meta name="viewport" content="width=device-width, initial-scale=1">
+
+  <title>Dynamic work rebalancing for Beam</title>
+  <meta name="description" content="This morning, Eugene and Malo from the Google Cloud Dataflow team posted No shard left behind: dynamic work rebalancing in Google Cloud Dataflow. This articl...">
+
+  <link rel="stylesheet" href="/styles/site.css">
+  <link rel="stylesheet" href="/css/theme.css">
+  <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.2.0/jquery.min.js"></script>
+  <script src="/js/bootstrap.min.js"></script>
+  <link rel="canonical" href="http://beam.incubator.apache.org/blog/2016/05/18/splitAtFraction-method.html">
+  <link rel="alternate" type="application/rss+xml" title="Apache Beam (incubating)" href="http://beam.incubator.apache.org/feed.xml">
+  <script>
+    (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
+    (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
+    m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
+    })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
+
+    ga('create', 'UA-73650088-1', 'auto');
+    ga('send', 'pageview');
+
+  </script>
+  <link rel="shortcut icon" type="image/x-icon" href="/images/favicon.ico">
+</head>
+
+
+  <body role="document">
+
+    <nav class="navbar navbar-default navbar-fixed-top">
+  <div class="container">
+    <div class="navbar-header">
+      <a href="/" class="navbar-brand" >
+        <img alt="Brand" style="height: 25px" src="/images/beam_logo_navbar.png">
+      </a>
+      <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
+        <span class="sr-only">Toggle navigation</span>
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+      </button>
+    </div>
+    <div id="navbar" class="navbar-collapse collapse">
+      <ul class="nav navbar-nav">
+        <li class="dropdown">
+          <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Documentation <span class="caret"></span></a>
+          <ul class="dropdown-menu">
+            <li class="dropdown-header">Guides</li>
+            <li><a href="/getting_started/">Getting Started</a></li>
+            <li role="separator" class="divider"></li>
+            <li class="dropdown-header">Technical Documentation</li>
+            <li><a href="/capability-matrix/">Capability Matrix</a></li>
+            <li><a href="https://goo.gl/ps8twC">Technical Docs</a></li>
+            <li><a href="https://goo.gl/nk5OM0">Technical Vision</a></li>
+          </ul>
+        </li>
+        <li class="dropdown">
+          <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Community <span class="caret"></span></a>
+          <ul class="dropdown-menu">
+            <li class="dropdown-header">Community</li>
+            <li><a href="/mailing_lists/">Mailing Lists</a></li>
+            <li><a href="/team/">Apache Beam Team</a></li>
+            <li><a href="/public-meetings/">Public Meetings</a></li>
+            <li role="separator" class="divider"></li>
+            <li class="dropdown-header">Contribute</li>
+            <li><a href="/contribution-guide/">Contribution Guide</a></li>
+            <li><a href="/source_repository/">Source Repository</a></li>
+            <li><a href="/issue_tracking/">Issue Tracking</a></li>
+          </ul>
+        </li>
+        <li><a href="/blog">Blog</a></li>
+        <li class="dropdown">
+          <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Project <span class="caret"></span></a>
+          <ul class="dropdown-menu">
+            <li><a href="/presentation-materials/">Presentation Materials</a></li>
+            <li><a href="/material/">Logos and design</a></li>
+            <li><a href="http://apache.org/licenses/LICENSE-2.0.html">License</a></li>
+          </ul>
+        </li>
+      </ul>
+    </div><!--/.nav-collapse -->
+  </div>
+</nav>
+
+
+<link rel="stylesheet" href="">
+
+
+    <div class="container" role="main">
+
+      <div class="container">
+        
+
+<article class="post" itemscope itemtype="http://schema.org/BlogPosting">
+
+  <header class="post-header">
+    <h1 class="post-title" itemprop="name headline">Dynamic work rebalancing for Beam</h1>
+    <p class="post-meta"><time datetime="2016-05-18T11:00:00-07:00" itemprop="datePublished">May 18, 2016</time> •  Dan Halperin 
+</p>
+  </header>
+
+  <div class="post-content" itemprop="articleBody">
+    <p>This morning, Eugene and Malo from the Google Cloud Dataflow team posted <a href="https://cloud.google.com/blog/big-data/2016/05/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow"><em>No shard left behind: dynamic work rebalancing in Google Cloud Dataflow</em></a>. This article discusses Cloud Dataflow’s solution to the well-known straggler problem.</p>
+
+<!--more-->
+
+<p>In a large batch processing job with many tasks executing in parallel, some of the tasks – the stragglers – can take a much longer time to complete than others, perhaps due to imperfect splitting of the work into parallel chunks when issuing the job. Typically, waiting for stragglers means that the overall job completes later than it should, and may also reserve too many machines that may be underutilized at the end. Cloud Dataflow’s dynamic work rebalancing can mitigate stragglers in most cases.</p>
+
+<p>What I’d like to highlight for the Apache Beam (incubating) community is that Cloud Dataflow’s dynamic work rebalancing is implemented using <em>runner-specific</em> control logic on top of Beam’s <em>runner-independent</em> <a href="https://github.com/apache/incubator-beam/blob/9fa97fb2491bc784df53fb0f044409dbbc2af3d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedSource.java"><code class="highlighter-rouge">BoundedSource API</code></a>. Specifically, to steal work from a straggler, a runner need only call the reader’s <a href="https://github.com/apache/incubator-beam/blob/3edae9b8b4d7afefb5c803c19bb0a1c21ebba89d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedSource.java#L266"><code class="highlighter-rouge">splitAtFraction method</code></a>. This will generate a new source containing leftover work, and then the runner can pass that source off to another idle worker. As Beam matures, I hope that other runners are interested in figuring out whether these APIs can help them improve performance, implementing dynamic work rebalancing, and collaborating on API changes that will help solve other pain points.</p>
+
+  </div>
+
+</article>
+
+      </div>
+
+
+    <hr>
+  <div class="row">
+      <div class="col-xs-12">
+          <footer>
+              <p class="text-center">&copy; Copyright 2016
+                <a href="http://www.apache.org">The Apache Software Foundation.</a> All Rights Reserved.</p>
+                <p class="text-center"><a href="/privacy_policy">Privacy Policy</a> |
+                <a href="/feed.xml">RSS Feed</a></p>
+          </footer>
+      </div>
+  </div>
+  <!-- container div end -->
+</div>
+
+
+  </body>
+
+</html>

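Note on the post added above: it says a runner steals work by calling the reader's splitAtFraction method and handing the resulting source to an idle worker. The following is only a minimal sketch of that idea, assuming work is a range of integer offsets. RangeReader, advance, and the long[] return value are hypothetical names made up for illustration; this is not Beam's BoundedSource code and it does not use the Beam SDK.

```java
/** Hypothetical reader over the half-open offset range [start, end); not a Beam class. */
final class RangeReader {
    private final long start;
    private long end;   // shrinks when a split succeeds
    private long next;  // next offset that advance() will hand out

    RangeReader(long start, long end) {
        this.start = start;
        this.end = end;
        this.next = start;
    }

    /** Returns the next offset to process, or -1 once the (possibly shrunk) range is done. */
    synchronized long advance() {
        return next < end ? next++ : -1;
    }

    /**
     * Tries to give away all offsets at or after the given fraction of the current range.
     * Returns the residual range {from, to} for another worker, or null when the split
     * is rejected (assumption: null signals "no split", as a simple convention here).
     */
    synchronized long[] splitAtFraction(double fraction) {
        if (fraction <= 0.0 || fraction >= 1.0) {
            return null;
        }
        long splitPos = start + (long) Math.ceil(fraction * (end - start));
        // Never hand out offsets that were already consumed, and never an empty residual.
        if (splitPos < next || splitPos >= end) {
            return null;
        }
        long[] residual = {splitPos, end};
        end = splitPos;  // this reader now stops earlier
        return residual;
    }

    public static void main(String[] args) {
        RangeReader straggler = new RangeReader(0, 100);
        for (int i = 0; i < 10; i++) {
            straggler.advance();  // the straggler has consumed offsets 0..9
        }
        long[] residual = straggler.splitAtFraction(0.5);
        System.out.println("residual = [" + residual[0] + ", " + residual[1] + ")");
    }
}
```

In this sketch the runner-side logic (choosing the fraction, finding an idle worker) stays outside the reader, which matches the post's split between runner-specific control logic and a runner-independent API.
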
http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/568f051a/content/blog/index.html
----------------------------------------------------------------------
diff --git a/content/blog/index.html b/content/blog/index.html
index 41862bd..d72dfbb 100644
--- a/content/blog/index.html
+++ b/content/blog/index.html
@@ -104,16 +104,16 @@
     <p>This is the blog for the Apache Beam project. This blog contains news and updates
 for the project.</p>
 
-<h3 id="a-classpost-link-hrefbeammodelpcollection20160513where-is-my-pcollection-dot-maphtmlwheres-my-pcollectionmapa"><a class="post-link" href="/beam/model/pcollection/2016/05/13/where-is-my-pcollection-dot-map.html">Where’s my PCollection.map()?</a></h3>
-<p><i>May 13, 2016 •  Robert Bradshaw 
+<h3 id="a-classpost-link-hrefblog20160518splitatfraction-methodhtmldynamic-work-rebalancing-for-beama"><a class="post-link" href="/blog/2016/05/18/splitAtFraction-method.html">Dynamic work rebalancing for Beam</a></h3>
+<p><i>May 18, 2016 •  Dan Halperin 
 </i></p>
 
-<p>Have you ever wondered why Beam has PTransforms for everything instead of having methods on PCollection? Take a look at the history that led to this (and other) design decisions.</p>
+<p>This morning, Eugene and Malo from the Google Cloud Dataflow team posted <a href="https://cloud.google.com/blog/big-data/2016/05/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow"><em>No shard left behind: dynamic work rebalancing in Google Cloud Dataflow</em></a>. This article discusses Cloud Dataflow’s solution to the well-known straggler problem.</p>
 
 <!-- Render a "read more" button if the post is longer than the excerpt -->
 
 <p>
-<a class="btn btn-default btn-sm" href="/beam/model/pcollection/2016/05/13/where-is-my-pcollection-dot-map.html#read-more" role="button">
+<a class="btn btn-default btn-sm" href="/blog/2016/05/18/splitAtFraction-method.html#read-more" role="button">
 Read more&nbsp;<span class="glyphicon glyphicon-menu-right" aria-hidden="true"></span>
 </a>
 </p>

http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/568f051a/content/capability-matrix/index.html
----------------------------------------------------------------------
diff --git a/content/capability-matrix/index.html b/content/capability-matrix/index.html
index 0940360..767df88 100644
--- a/content/capability-matrix/index.html
+++ b/content/capability-matrix/index.html
@@ -95,7 +95,7 @@
 
       <div class="container">
         <h1 id="apache-beam-capability-matrix">Apache Beam Capability Matrix</h1>
-<p><span style="font-size:11px;float:none">Last updated: 2016-05-16 09:11 PDT</span></p>
+<p><span style="font-size:11px;float:none">Last updated: 2016-05-18 20:55 PDT</span></p>
 
 <p>Apache Beam (incubating) provides a portable API layer for building sophisticated data-parallel processing engines that may be executed across a diversity of exeuction engines, or <i>runners</i>. The core concepts of this layer are based upon the Beam Model (formerly referred to as the <a href="http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf">Dataflow Model</a>), and implemented to varying degrees in each Beam runner. To help clarify the capabilities of individual runners, we’ve created the capability matrix below.</p>
 

http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/568f051a/content/contribution-guide/index.html
----------------------------------------------------------------------
diff --git a/content/contribution-guide/index.html b/content/contribution-guide/index.html
index 65ee67d..1d71e8b 100644
--- a/content/contribution-guide/index.html
+++ b/content/contribution-guide/index.html
@@ -356,8 +356,8 @@ github	https://github.com/apache/incubator-beam.git (push)
 <p>Fetch references from all remote repositories, and checkout the specific pull request branch.</p>
 
 <pre>
-$ git fetch --all
-$ git checkout -b finish-pr-<b>&lt;pull-request-#&gt;</b> github/pr/<b>&lt;pull-request-#&gt;</b></pre>
+&lt;/code&gt;$ git fetch --all
+$ git checkout -b finish-pr-<b>&lt;pull-request-#&gt;</b> github/pr/<b>&lt;pull-request-#&gt;</b>&lt;/code&gt;</pre>
 
 <p>At this point, you can commit any final touches to the pull request. For example, you should:</p>
 

http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/568f051a/content/feed.xml
----------------------------------------------------------------------
diff --git a/content/feed.xml b/content/feed.xml
index 841bac2..9a41d56 100644
--- a/content/feed.xml
+++ b/content/feed.xml
@@ -6,103 +6,26 @@
 </description>
     <link>http://beam.incubator.apache.org/</link>
     <atom:link href="http://beam.incubator.apache.org/feed.xml" rel="self" type="application/rss+xml"/>
-    <pubDate>Mon, 16 May 2016 09:11:00 -0700</pubDate>
-    <lastBuildDate>Mon, 16 May 2016 09:11:00 -0700</lastBuildDate>
-    <generator>Jekyll v3.1.2</generator>
+    <pubDate>Wed, 18 May 2016 20:55:36 -0700</pubDate>
+    <lastBuildDate>Wed, 18 May 2016 20:55:36 -0700</lastBuildDate>
+    <generator>Jekyll v3.1.3</generator>
     
       <item>
-        <title>Where&#39;s my PCollection.map()?</title>
-        <description>&lt;p&gt;Have you ever wondered why Beam has PTransforms for everything instead of having methods on PCollection? Take a look at the history that led to this (and other) design decisions.&lt;/p&gt;
+        <title>Dynamic work rebalancing for Beam</title>
+        <description>&lt;p&gt;This morning, Eugene and Malo from the Google Cloud Dataflow team posted &lt;a href=&quot;https://cloud.google.com/blog/big-data/2016/05/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow&quot;&gt;&lt;em&gt;No shard left behind: dynamic work rebalancing in Google Cloud Dataflow&lt;/em&gt;&lt;/a&gt;. This article discusses Cloud Dataflow’s solution to the well-known straggler problem.&lt;/p&gt;
 
 &lt;!--more--&gt;
 
-&lt;p&gt;Though Beam is relatively new, its design draws heavily on many years of experience with real-world pipelines. One of the primary inspirations is FlumeJava, which is Google’s internal successor to MapReduce first introduced in 2009.&lt;/p&gt;
+&lt;p&gt;In a large batch processing job with many tasks executing in parallel, some of the tasks – the stragglers – can take a much longer time to complete than others, perhaps due to imperfect splitting of the work into parallel chunks when issuing the job. Typically, waiting for stragglers means that the overall job completes later than it should, and may also reserve too many machines that may be underutilized at the end. Cloud Dataflow’s dynamic work rebalancing can mitigate stragglers in most cases.&lt;/p&gt;
 
-&lt;p&gt;The original FlumeJava API has methods like &lt;code class=&quot;highlighter-rouge&quot;&gt;count&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;parallelDo&lt;/code&gt; on the &lt;code class=&quot;highlighter-rouge&quot;&gt;PCollection&lt;/code&gt;s. Though slightly more succinct, this approach has many disadvantages to extensibility. Every new user to FlumeJava wanted to add transforms, and adding them as methods to &lt;code class=&quot;highlighter-rouge&quot;&gt;PCollection&lt;/code&gt; simply doesn’t scale well. In contrast, a PCollection in Beam has a single &lt;code class=&quot;highlighter-rouge&quot;&gt;apply&lt;/code&gt; method which takes any PTransform as an argument.&lt;/p&gt;
-
-&lt;table class=&quot;table&quot;&gt;
-  &lt;tr&gt;
-    &lt;th&gt;FlumeJava&lt;/th&gt;
-    &lt;th&gt;Beam&lt;/th&gt;
-  &lt;/tr&gt;
-  &lt;tr&gt;
-    &lt;td&gt;&lt;pre&gt;
-PCollection&amp;lt;T&amp;gt; input = …
-PCollection&amp;lt;O&amp;gt; output = input.count()
-                             .parallelDo(...);
-    &lt;/pre&gt;&lt;/td&gt;
-    &lt;td&gt;&lt;pre&gt;
-PCollection&amp;lt;T&amp;gt; input = …
-PCollection&amp;lt;O&amp;gt; output = input.apply(Count.perElement())
-                             .apply(ParDo.of(...));
-    &lt;/pre&gt;&lt;/td&gt;
-  &lt;/tr&gt;
-&lt;/table&gt;
-
-&lt;p&gt;This is a more scalable approach for several reasons.&lt;/p&gt;
-
-&lt;h2 id=&quot;where-to-draw-the-line&quot;&gt;Where to draw the line?&lt;/h2&gt;
-&lt;p&gt;Adding methods to &lt;code class=&quot;highlighter-rouge&quot;&gt;PCollection&lt;/code&gt; forces a line to be drawn between operations that are “useful” enough to merit this special treatment and those that are not. It is easy to make the case for flat map, group by key, and combine per key. But what about filter? Count? Approximate count? Approximate quantiles? Most frequent? WriteToMyFavoriteSource? Going too far down this path leads to a single enormous class that contains nearly everything one could want to do. (FlumeJava’s PCollection class is over 5000 lines long with around 70 distinct operations, and it could have been &lt;em&gt;much&lt;/em&gt; larger had we accepted every proposal.) Furthermore, since Java doesn’t allow adding methods to a class, there is a sharp divide syntactically between those operations that are added to &lt;code class=&quot;highlighter-rouge&quot;&gt;PCollection&lt;/code&gt; and those that aren’t. A traditional way to share code is with a library of functions, but functions (in traditional languages like Java at least) are written prefix-style, which doesn’t mix well with the fluent builder style (e.g. &lt;code class=&quot;highlighter-rouge&quot;&gt;input.operation1().operation2().operation3()&lt;/code&gt; vs. &lt;code class=&quot;highlighter-rouge&quot;&gt;operation3(operation1(input).operation2())&lt;/code&gt;).&lt;/p&gt;
-
-&lt;p&gt;Instead in Beam we’ve chosen a style that places all transforms–whether they be primitive operations, composite operations bundled in the SDK, or part of an external library–on equal footing. This also facilitates alternative implementations (which may even take different options) that are easily interchangeable.&lt;/p&gt;
-
-&lt;table class=&quot;table&quot;&gt;
-  &lt;tr&gt;
-    &lt;th&gt;FlumeJava&lt;/th&gt;
-    &lt;th&gt;Beam&lt;/th&gt;
-  &lt;/tr&gt;
-  &lt;tr&gt;
-    &lt;td&gt;&lt;pre&gt;
-PCollection&amp;lt;O&amp;gt; output =
-    ExternalLibrary.doStuff(
-        MyLibrary.transform(input, myArgs)
-            .parallelDo(...),
-        externalLibArgs);
-    &lt;/pre&gt;&lt;/td&gt;
-    &lt;td&gt;&lt;pre&gt;
-PCollection&amp;lt;O&amp;gt; output = input
-    .apply(MyLibrary.transform(myArgs))
-    .apply(ParDo.of(...))
-    .apply(ExternalLibrary.doStuff(externalLibArgs));
-    &amp;nbsp;
-    &lt;/pre&gt;&lt;/td&gt;
-  &lt;/tr&gt;
-&lt;/table&gt;
-
-&lt;h2 id=&quot;configurability&quot;&gt;Configurability&lt;/h2&gt;
-&lt;p&gt;While it makes for a more fluent style to let values (PCollections) be the objects passed around and manipulated (i.e. the handles to the deferred execution graph), it is the operations themselves that need to be significantly more composable, configurable, and extendable. Using methods doesn’t scale well here, especially in a language without default or keyword arguments. For example, a ParDo operation can have any number of side inputs and side outputs, or a write operation may have configurations dealing with encoding and compression. One option is to separate these out into separate overloads or even methods, but that exacerbates the problems above. (FlumeJava evolved over a dozen overloads of the &lt;code class=&quot;highlighter-rouge&quot;&gt;parallelDo&lt;/code&gt; method!) Another option is to pass a configuration object that can be built up using more fluent idioms like the builder pattern, but at that point one might as well make the configuration object the operation itself, which is what Beam does.&lt;/p&gt;
-
-&lt;h2 id=&quot;type-safety&quot;&gt;Type Safety&lt;/h2&gt;
-&lt;p&gt;Many operations can only be applied to collections whose elements are of a specific type. For example, the GroupByKey operation should only apply to PCollection&amp;lt;KV&amp;lt;K, V»s. In Java at least, it’s not possible to restrict methods based on the element type parameter alone. In FlumeJava, this led us to add a PTable&amp;lt;K, V&amp;gt; subclassing PCollection&amp;lt;KV&amp;lt;K, V» to contain all the operations specific to PCollections of key-value pairs. This leads to the same question of which T’s are special enough to merit being captured by PCollection subclasses. It is not very extensible for third parties and often requires manual downcasts/conversions (which can’t be safely chained in Java) and special operations that produce these PCollection specializations.&lt;/p&gt;
-
-&lt;p&gt;This is particularly inconvenient for transforms that produce outputs whose element types are the same as (or related to) their input’s element types, requiring extra support to generate the right subclasses (e.g. a filter on a PTable&amp;lt;K, V&amp;gt; should produce another PTable&amp;lt;K, V&amp;gt; rather than just a raw PCollection of key-value pairs).&lt;/p&gt;
-
-&lt;p&gt;Using PTransforms allows us to sidestep this entire issue. We can place arbitrary constraints on the context in which a transform may be used based on the type of its inputs; for instance GroupByKey is statically typed to only apply to a PCollection&amp;lt;KV&amp;lt;K, V». The way this happens is generalizable to arbitrary shapes, without needing to introduce specialized types like PTable.&lt;/p&gt;
-
-&lt;h2 id=&quot;reusability-and-structure&quot;&gt;Reusability and Structure&lt;/h2&gt;
-&lt;p&gt;Though PTransforms are generally constructed at the site at which they’re used, by pulling them out as separate objects one is able to store them and pass them around.&lt;/p&gt;
-
-&lt;p&gt;As pipelines grow and evolve, it is useful to structure your pipeline into modular, often reusable components, and PTransforms allow one to do this nicely in a data-processing pipeline. In addition, modular PTransforms also expose to the system what the logical structure of your code (e.g. for monitoring). Of the three different representations of the WordCount pipeline below, only the structured view captures the high-level intent of the pipeline. Letting even the simple operations be PTransforms means there’s less of an abrupt edge to packaging things up into composite operations.&lt;/p&gt;
-
-&lt;p&gt;&lt;img class=&quot;center-block&quot; src=&quot;/images/blog/simple-wordcount-pipeline.png&quot; alt=&quot;Three different visualizations of a simple WordCount pipeline&quot; width=&quot;500&quot; /&gt;&lt;/p&gt;
-
-&lt;div class=&quot;text-center&quot;&gt;
-&lt;i&gt;Three different visualizations of a simple WordCount pipeline which computes the number of occurrences of every work in a set of text files. The flag view gives the full DAG of all operations performed. The execution view groups operations according to how they&#39;re executed, e.g. after performing runner-specific optimizations like function composition. The structured view nests operations according to their grouping in PTransforms.&lt;/i&gt;
-&lt;/div&gt;
-
-&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;
-&lt;p&gt;Although it’s tempting to add methods to PCollections, such an approach is not scalable, extensible, or sufficiently expressive. Putting a single apply() method on PCollection and all the logic into the operation itself lets us have the best of both worlds, and avoids hard cliffs of complexity by having a single consistent style across simple and complex pipelines, and between predefined and user-defined operations.&lt;/p&gt;
+&lt;p&gt;What I’d like to highlight for the Apache Beam (incubating) community is that Cloud Dataflow’s dynamic work rebalancing is implemented using &lt;em&gt;runner-specific&lt;/em&gt; control logic on top of Beam’s &lt;em&gt;runner-independent&lt;/em&gt; &lt;a href=&quot;https://github.com/apache/incubator-beam/blob/9fa97fb2491bc784df53fb0f044409dbbc2af3d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedSource.java&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;BoundedSource API&lt;/code&gt;&lt;/a&gt;. Specifically, to steal work from a straggler, a runner need only call the reader’s &lt;a href=&quot;https://github.com/apache/incubator-beam/blob/3edae9b8b4d7afefb5c803c19bb0a1c21ebba89d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedSource.java#L266&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;splitAtFraction method&lt;/code&gt;&lt;/a&gt;. This will generate a new source containing leftover work, and then the runner can pass that source off to another idle worker. As Beam matures, I hope that other runners are interested in figuring out whether these APIs can help them improve performance, implementing dynamic work rebalancing, and collaborating on API changes that will help solve other pain points.&lt;/p&gt;
 </description>
-        <pubDate>Fri, 13 May 2016 11:00:00 -0700</pubDate>
-        <link>http://beam.incubator.apache.org/beam/model/pcollection/2016/05/13/where-is-my-pcollection-dot-map.html</link>
-        <guid isPermaLink="true">http://beam.incubator.apache.org/beam/model/pcollection/2016/05/13/where-is-my-pcollection-dot-map.html</guid>
-        
-        
-        <category>beam</category>
+        <pubDate>Wed, 18 May 2016 11:00:00 -0700</pubDate>
+        <link>http://beam.incubator.apache.org/blog/2016/05/18/splitAtFraction-method.html</link>
+        <guid isPermaLink="true">http://beam.incubator.apache.org/blog/2016/05/18/splitAtFraction-method.html</guid>
         
-        <category>model</category>
         
-        <category>pcollection</category>
+        <category>blog</category>
         
       </item>
     

http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/568f051a/content/index.html
----------------------------------------------------------------------
diff --git a/content/index.html b/content/index.html
index 02100b1..f875a64 100644
--- a/content/index.html
+++ b/content/index.html
@@ -177,7 +177,7 @@
     <h3>Blog</h3>
     <div class="list-group">
     
-    <a class="list-group-item" href="/beam/model/pcollection/2016/05/13/where-is-my-pcollection-dot-map.html">May 13, 2016 - Where's my PCollection.map()?</a>
+    <a class="list-group-item" href="/blog/2016/05/18/splitAtFraction-method.html">May 18, 2016 - Dynamic work rebalancing for Beam</a>
     
     <a class="list-group-item" href="/beam/capability/2016/04/03/presentation-materials.html">Apr 3, 2016 - Apache Beam Presentation Materials</a>
     

