beam-commits mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (BEAM-1250) Remove leaf when materializing PCollection to avoid re-evaluation.
Date Fri, 06 Jan 2017 22:04:59 GMT


ASF GitHub Bot commented on BEAM-1250:

GitHub user amitsela opened a pull request:

    [BEAM-1250] Remove leaf when materializing PCollection to avoid re-ev…

You can merge this pull request into a Git repository by running:

    $ git pull remove-leaf-getvalues

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1747

commit 7e7715035c870c28f4294fe52a0cc7c5d838aee2
Author: Sela <>
Date:   2017-01-06T22:03:34Z

    [BEAM-1250] Remove leaf when materializing PCollection to avoid re-evaluation.


> Remove leaf when materializing PCollection to avoid re-evaluation.
> ------------------------------------------------------------------
>                 Key: BEAM-1250
>                 URL:
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Amit Sela
>            Assignee: Amit Sela
> When materializing a {{PCollection}} (implemented as an {{RDD}}), for example to create
a {{PCollectionView}}, the runner should remove the materialized {{RDD}} from the "leaves" set.
> The runner keeps track of leaves left unhandled in the DAG so that it can force an action on
them. {{Write}}, for one, is implemented via a sequence of ParDos, which the runner implements
via {{mapPartitions}}, so an action has to be forced.
> Materializing an {{RDD}} is done via the action {{collect()}}, so there is no reason to keep
it in the "leaves" set.
> Currently, it remains in the "leaves" set, so it is forced again and the lineage is
re-evaluated; if the {{RDD}} is not cached, the lineage executes twice (unless caches are applied for some
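
The double evaluation described above can be illustrated with a minimal sketch. It is not the Spark runner's code and does not use Spark: `LazyDataset`, `EvaluationContext`, `materialize`, `compute_outputs`, and `run` are hypothetical names that only model a lazily evaluated dataset and a "leaves" set that is forced at the end of the pipeline.

```python
class LazyDataset:
    """Hypothetical stand-in for an RDD: the lineage runs only when an action is called."""

    def __init__(self, compute, cached=False):
        self._compute = compute      # callable producing the elements (the "lineage")
        self._cached = cached
        self._result = None
        self.evaluations = 0         # number of times the lineage was executed

    def _run(self):
        if self._cached and self._result is not None:
            return self._result      # cached: reuse the earlier result
        self.evaluations += 1
        result = list(self._compute())
        if self._cached:
            self._result = result
        return result

    def collect(self):
        """Action: execute the lineage and return its elements."""
        return self._run()

    def force(self):
        """Action used only to trigger execution; the result is discarded."""
        self._run()


class EvaluationContext:
    """Hypothetical context that tracks unhandled leaves of the DAG."""

    def __init__(self, remove_materialized_leaves):
        self.leaves = set()
        self.remove_materialized_leaves = remove_materialized_leaves

    def register(self, dataset):
        self.leaves.add(dataset)
        return dataset

    def materialize(self, dataset):
        values = dataset.collect()   # collect() is already an action
        if self.remove_materialized_leaves:
            self.leaves.discard(dataset)  # the behaviour the issue asks for
        return values

    def compute_outputs(self):
        for leaf in self.leaves:     # force an action on whatever is still a leaf
            leaf.force()
        self.leaves.clear()


def run(remove_materialized_leaves, cached=False):
    """Materialize one dataset, then force the remaining leaves; return the evaluation count."""
    ctx = EvaluationContext(remove_materialized_leaves)
    dataset = ctx.register(LazyDataset(lambda: (x * x for x in range(3)), cached=cached))
    ctx.materialize(dataset)
    ctx.compute_outputs()
    return dataset.evaluations


if __name__ == "__main__":
    print(run(remove_materialized_leaves=False))  # prints 2: lineage executed twice
    print(run(remove_materialized_leaves=True))   # prints 1: lineage executed once
```

In this model, caching also hides the second execution (`run(False, cached=True)` returns 1), but removing the dataset from the leaves avoids the redundant forced action altogether.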

This message was sent by Atlassian JIRA