beam-commits mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (BEAM-1867) Element counts missing on Cloud Dataflow when PCollection has anything other than hardcoded name pattern
Date Thu, 20 Apr 2017 23:04:04 GMT


ASF GitHub Bot commented on BEAM-1867:

GitHub user kennknowles opened a pull request:

    [BEAM-1867] Use step-derived PCollection names in Dataflow

    Be sure to do all of the following to help us incorporate your contribution
    quickly and easily:
     - [ ] Make sure the PR title is formatted like:
       `[BEAM-<Jira issue #>] Description of pull request`
     - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
           Travis-CI on your fork and ensure the whole test matrix passes).
     - [ ] Replace `<Jira issue #>` in the title with the actual Jira issue
           number, if there is one.
     - [ ] If this contribution is large, please file an Apache
           [Individual Contributor License Agreement](
    R: @bjchambers

    This mitigates an issue in Dataflow. I also removed some checked exceptions that are never
    caught and probably never should be.

    I have empirically checked that the element counts and byte sizes are restored by this
    change, and added unit tests to the translator. Integration tests TBD.

You can merge this pull request into a Git repository by running:

    $ git pull Dataflow-PCollection-names

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2618

commit 4c0bdd6c002b83c67daedd5e01ee2ad0dd47c233
Author: Kenneth Knowles <>
Date:   2017-04-20T21:32:29Z

    Make crashing errors in Structs unchecked exceptions

commit c9ed8f9a69d2b3f17e782f4bd0da9bd4305f2320
Author: Kenneth Knowles <>
Date:   2017-04-20T22:32:51Z

    Derive Dataflow output names from steps, not PCollection names

    Long ago, PCollection names were assigned after transform replacements took
    place, because this happened interleaved with pipeline construction. Now,
    runner-independent graphs are constructed with named PCollections and when
    replacements occur, the names are preserved. This exposed a bug in Dataflow
    whereby the names of steps and the names of PCollections are tightly coupled.
    This change uses the mandatory derived names during translation, shielding
    users from the bug.


> Element counts missing on Cloud Dataflow when PCollection has anything other than hardcoded name pattern
> --------------------------------------------------------------------------------------------------------
>                 Key: BEAM-1867
>                 URL:
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-dataflow
>            Reporter: Kenneth Knowles
>            Assignee: Kenneth Knowles
>            Priority: Blocker
>             Fix For: First stable release
>
> In 0.6.0 and 0.7.0-SNAPSHOT (and possibly all earlier versions; these are just the ones
> where it is confirmed), element count and byte metrics are not reported correctly when the
> output PCollection of a primitive transform is not named {{transformname + ".out" + index}}.
>
> In 0.7.0-SNAPSHOT, the DataflowRunner uses pipeline surgery to replace the composite
> {{ParDoSingle}} (which contains a {{ParDoMulti}}) with a Dataflow-specific non-composite
> {{ParDoSingle}}. As a result, metrics are reported for names like
> {{"ParDoSingle(MyDoFn).out"}} when they should be reported for
> {{"ParDoSingle/ParDoMulti(MyDoFn).out"}}, so all single-output ParDo transforms lack these
> metrics on their outputs.
>
> In 0.6.0 the same problem occurs whenever the user calls {{PCollection.setName}} to give
> their collection a meaningful name.
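
For illustration only, here is a minimal Java sketch of the naming pattern quoted in the issue, {{transformname + ".out" + index}}, and of how a user-assigned name fails to match it. The class OutputNames and both of its methods are hypothetical and are not part of Beam or the Dataflow runner; the exact index formatting is also an assumption, since the examples in the issue show ".out" without an index.

```java
/** Hypothetical helper for illustration; not part of Beam or the Dataflow runner. */
final class OutputNames {
  private OutputNames() {}

  /** Builds a name from the step name, following the pattern quoted in the issue. */
  public static String derivedName(String stepName, int index) {
    return stepName + ".out" + index;
  }

  /** Returns true when the PCollection name equals the step-derived name. */
  public static boolean matchesDerivedName(String stepName, int index, String pcollectionName) {
    return derivedName(stepName, index).equals(pcollectionName);
  }

  public static void main(String[] args) {
    String step = "ParDoSingle(MyDoFn)";
    // Name derived purely from the step.
    System.out.println(derivedName(step, 0));
    // A name set by the user does not match the derived pattern.
    System.out.println(matchesDerivedName(step, 0, "my-meaningful-name"));
  }
}
```

Deriving the name from the step, as in derivedName, keeps the output name independent of whatever name the PCollection carries, which is the decoupling the second commit message describes.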

This message was sent by Atlassian JIRA
