crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-88) Multiple parallelDos on a PGroupedTableImpl does not work
Date Sat, 06 Oct 2012 04:14:03 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-88?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13470904#comment-13470904 ]

Gabriel Reid commented on CRUNCH-88:
------------------------------------

[~jwills] Yep, I think we came to the same conclusion but different solutions. I think there
are a few issues with the patch you posted though.

The memory issue is definitely a worry, as loading all values under a single key into memory
will be a problem for some pipelines at my work. We're not sending grouped tables to multiple
outputs anywhere at the moment, but if we were to try to do that (as I was when I ran into
this), the memory overload would be a showstopper.

The even bigger issue is object reuse. In most cases (i.e. any case where the values in the
iterable have the same type as their serialization type, which is pretty much everything apart
from primitive types and strings), the Iterable just returns the same single object instance
with updated state on each iteration. The result is that the cached Iterable ends up being
a list of references to that one object, whose state is that of the last-read value in the
input Iterable.
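
To make the object reuse problem concrete, here is a minimal sketch in plain Java with no
Crunch or Hadoop classes involved. Holder, reusingIterable and cacheNaively are made-up names
for this illustration only; Holder stands in for a value whose type is also its
serialization type.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class ObjectReuseDemo {

    /** Hypothetical mutable value, standing in for a reused serialized object. */
    static final class Holder {
        int value;
        @Override public String toString() { return Integer.toString(value); }
    }

    /** Iterable that returns the same Holder instance from every call to next(). */
    static Iterable<Holder> reusingIterable(final int[] data) {
        return new Iterable<Holder>() {
            @Override public Iterator<Holder> iterator() {
                final Holder shared = new Holder();
                return new Iterator<Holder>() {
                    private int index = 0;
                    @Override public boolean hasNext() { return index < data.length; }
                    @Override public Holder next() {
                        if (!hasNext()) throw new NoSuchElementException();
                        shared.value = data[index++]; // overwrite state, keep the instance
                        return shared;
                    }
                    @Override public void remove() { throw new UnsupportedOperationException(); }
                };
            }
        };
    }

    /** Caches references only, without copying the values. */
    static List<Holder> cacheNaively(Iterable<Holder> values) {
        List<Holder> cached = new ArrayList<Holder>();
        for (Holder h : values) {
            cached.add(h);
        }
        return cached;
    }

    public static void main(String[] args) {
        List<Holder> cached = cacheNaively(reusingIterable(new int[] {1, 2, 3}));
        System.out.println(cached); // prints [3, 3, 3]
    }
}
```

All three cached entries are the same instance, so they all show the last value read.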

We could get around this object reuse issue by using PType#getDetachedValue to create
a deep copy of every value in the Iterable before sending it through to child RTNodes, but
that would mean that we'd need to have access to the PType in RTNode. This would also double
the memory usage of caching all values per key.
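
A rough sketch of the copy-before-caching idea, again in plain Java. The detach function
passed in here is only a stand-in for what PType#getDetachedValue would do; Holder,
reusingIterable and cacheDetached are hypothetical names, and this is not how RTNode is
actually written.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.UnaryOperator;

public class DetachedCopyDemo {

    /** Hypothetical mutable value that the input Iterable reuses. */
    static final class Holder {
        int value;
        Holder(int value) { this.value = value; }
        @Override public String toString() { return Integer.toString(value); }
    }

    /** Yields one shared Holder whose state is overwritten on each next(). */
    static Iterable<Holder> reusingIterable(int... data) {
        return () -> new Iterator<Holder>() {
            private final Holder shared = new Holder(0);
            private int index = 0;
            @Override public boolean hasNext() { return index < data.length; }
            @Override public Holder next() {
                if (!hasNext()) throw new NoSuchElementException();
                shared.value = data[index++];
                return shared;
            }
        };
    }

    /** Copies each value with the supplied detach function before caching it. */
    static <T> List<T> cacheDetached(Iterable<T> values, UnaryOperator<T> detach) {
        List<T> cached = new ArrayList<>();
        for (T v : values) {
            cached.add(detach.apply(v)); // one extra object per value held in memory
        }
        return cached;
    }

    public static void main(String[] args) {
        List<Holder> cached =
            cacheDetached(reusingIterable(1, 2, 3), h -> new Holder(h.value));
        System.out.println(cached); // prints [1, 2, 3]
    }
}
```

The cache now holds distinct objects, at the cost of keeping a copy of every value for the key
in memory.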

The patch that I posted results in two parallel jobs being run to get around this issue, which
is obviously less efficient, but doesn't have these issues. I was thinking that this could
be done in a more efficient way in the future by tagging records by which output path they
would need to have before the groupByKey (in line with the whole MSCR fusion approach), but
didn't see that as feasible (at least not for me) to do in the short term.
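
One possible reading of the tagging idea, sketched in plain Java with an in-memory map
standing in for the shuffle. This is an assumption about how it could work, not a design
that exists in Crunch: each record is emitted once per output path under a composite
"tag:key", so every output path gets its own group that is iterated exactly once.
TaggedGroupingDemo and groupWithTags are made-up names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TaggedGroupingDemo {

    /**
     * Groups values under a composite "outputTag:key" so that each output path
     * receives its own independent list of values for every original key.
     */
    static Map<String, List<Integer>> groupWithTags(String[] keys, int[] values, int numOutputs) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (int i = 0; i < keys.length; i++) {
            for (int tag = 0; tag < numOutputs; tag++) {
                // Emit the record once per output path, tagged with that path's index.
                groups.computeIfAbsent(tag + ":" + keys[i], k -> new ArrayList<>())
                      .add(values[i]);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> groups =
            groupWithTags(new String[] {"a", "b", "a"}, new int[] {1, 2, 3}, 2);
        System.out.println(groups);
    }
}
```

In this sketch the duplication happens before grouping, so it trades per-key caching in
memory for a larger shuffle.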
                
> Multiple parallelDos on a PGroupedTableImpl does not work
> ---------------------------------------------------------
>
>                 Key: CRUNCH-88
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-88
>             Project: Crunch
>          Issue Type: Bug
>    Affects Versions: 0.3.0
>            Reporter: Gabriel Reid
>            Assignee: Gabriel Reid
>         Attachments: CRUNCH-88.patch, CRUNCH-88.patch
>
>
> Creating multiple distinct PCollections based on a single PGroupedTableImpl does not
> work correctly - the content of the PGroupedTableImpl will only be sent to a single outgoing
> PCollection, and all other PCollections that stem from the grouped table will not receive
> any data.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
