crunch-dev mailing list archives

From "Josh Wills (JIRA)"
Subject [jira] [Updated] (CRUNCH-88) Multiple parallelDos on a PGroupedTableImpl does not work
Date Fri, 05 Oct 2012 18:58:02 GMT


Josh Wills updated CRUNCH-88:

    Attachment: CRUNCH-88.patch

So this turned out to be an execution problem, not a planning problem. If a groupByKey has multiple
children, the first child can consume all of the Iterable<V> values before
the other children get a chance to process them. The fix I implemented detects when we're
in this situation and caches the Iterable<V> in memory so that each child can process it
in turn. I imagine we'll need to make it more clever over time (e.g., to support spilling
to disk), but this fixes the immediate problem.
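
To illustrate the idea (this is only a rough sketch, not the code in the attached patch): assuming the grouped values arrive as a single-pass Iterable<V>, copying them into a list when there is more than one child lets every child see the full set. The class and method names below (GroupedValuesFanOut, cacheInMemory, process, singlePass) are hypothetical and do not come from Crunch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration; none of these names come from Crunch or the patch.
class GroupedValuesFanOut {

    // Copies a possibly single-pass Iterable into a list so it can be replayed.
    static <V> List<V> cacheInMemory(Iterable<V> values) {
        List<V> cached = new ArrayList<>();
        for (V v : values) {
            cached.add(v);
        }
        return cached;
    }

    // Hands the values to every child. With a single child no copy is made.
    static <V> void process(Iterable<V> values, List<Consumer<Iterable<V>>> children) {
        Iterable<V> source = children.size() > 1 ? cacheInMemory(values) : values;
        for (Consumer<Iterable<V>> child : children) {
            child.accept(source);
        }
    }

    // Simulates single-pass values: every call to iterator() returns the
    // same underlying iterator, so a second pass sees nothing.
    static <V> Iterable<V> singlePass(List<V> data) {
        Iterator<V> it = data.iterator();
        return () -> it;
    }

    public static void main(String[] args) {
        List<Integer> sums = new ArrayList<>();
        Consumer<Iterable<Integer>> summer = vs -> {
            int s = 0;
            for (int v : vs) {
                s += v;
            }
            sums.add(s);
        };
        // Two children share one single-pass input; both should see 1, 2, 3.
        process(singlePass(Arrays.asList(1, 2, 3)), Arrays.asList(summer, summer));
        System.out.println(sums);
    }
}
```

Without the copy, the second child in this sketch would receive an already exhausted iterator, which matches the symptom in the report. A real implementation would probably also have to copy each value if the framework reuses value objects between iterations, and the cache is bounded by available memory, hence the note about spilling to disk.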
> Multiple parallelDos on a PGroupedTableImpl does not work
> ---------------------------------------------------------
>                 Key: CRUNCH-88
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>    Affects Versions: 0.3.0
>            Reporter: Gabriel Reid
>            Assignee: Gabriel Reid
>         Attachments: CRUNCH-88.patch, CRUNCH-88.patch
> Creating multiple distinct PCollections based on a single PGroupedTableImpl does not
> work correctly - the content of the PGroupedTableImpl will only be sent to a single outgoing
> PCollection, and all other PCollections that stem from the grouped table will not receive
> any data.
