crunch-dev mailing list archives

From "Josh Wills (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-211) Add one-to-many join functionality
Date Sun, 02 Jun 2013 16:32:20 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672587#comment-13672587 ]

Josh Wills commented on CRUNCH-211:
-----------------------------------

Agreed that this is a common use case, although I usually handle it with a Cogroup, which
gives me a Collection of the values from each side. Is the idea here that one of the
collections may be large enough that we would prefer to stream it through directly as an
Iterable rather than storing it in a Collection?
                
> Add one-to-many join functionality
> ----------------------------------
>
>                 Key: CRUNCH-211
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-211
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Gabriel Reid
>         Attachments: CRUNCH-211.patch
>
>
> A common pattern is a join between two tables where the left-side table contains a single
> value per key, and the right-side table contains multiple values per key. An example of such
> a join would be a join between users and web click entries:
>     PTable<Long,User> usersById = ...;
>     PTable<Long,WebClick> webClicksByUserId = ...;
> In this case, there can be some situations where it is desirable to bring the User together
> with the iterable of all WebClicks. The current join functionality will replicate the User
> for each WebClick that it's related to, but each WebClick then needs to be dealt with completely
> separately.
> Currently, the only way of getting an iterable of WebClicks together with a single User
> in a single method call is by materializing all WebClicks per user in memory using something
> like PTable#collectValues, and this approach doesn't work when there are a large number of
> WebClicks.
> The intention of this ticket is to add functionality whereby the User and Iterable of
> WebClicks are available in a single method call, without the Iterable of WebClicks being materialized
> in memory (i.e. a feasible approach for millions or more WebClicks).
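
To make the requested behavior concrete, here is a minimal plain-Java sketch of the
idea. It is not taken from CRUNCH-211.patch and does not use the Crunch API. It assumes
the records of both tables arrive sorted by key with the single left-side record ahead
of the right-side records for that key (roughly what a reduce-side join with a secondary
sort could provide). The names OneToManySketch, Rec, Peeking and oneToManyJoin are
hypothetical. The handler receives the left value together with a lazy, single-pass
Iterable over the right values, so the right side is never collected in memory.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.BiFunction;

/** Hypothetical illustration of a streaming one-to-many join; not Crunch code. */
final class OneToManySketch {

    /** One input record, tagged with the side of the join it came from. */
    static final class Rec<U, V> {
        final long key;
        final boolean isLeft; // true for the "one" side, false for the "many" side
        final U one;
        final V many;

        private Rec(long key, boolean isLeft, U one, V many) {
            this.key = key;
            this.isLeft = isLeft;
            this.one = one;
            this.many = many;
        }

        static <U, V> Rec<U, V> left(long key, U one) {
            return new Rec<>(key, true, one, null);
        }

        static <U, V> Rec<U, V> right(long key, V many) {
            return new Rec<>(key, false, null, many);
        }
    }

    /** Iterator wrapper with one element of lookahead. */
    static final class Peeking<E> {
        private final Iterator<E> it;
        private E next;
        private boolean has;

        Peeking(Iterator<E> it) {
            this.it = it;
            advance();
        }

        private void advance() {
            has = it.hasNext();
            next = has ? it.next() : null;
        }

        boolean hasNext() { return has; }
        E peek() { return next; }
        E next() { E e = next; advance(); return e; }
    }

    /**
     * Calls fn once per left-side record, passing a lazy view of the matching
     * right-side records. Assumes input sorted by key, left record first per key.
     */
    static <U, V, T> List<T> oneToManyJoin(Iterator<Rec<U, V>> sorted,
                                           BiFunction<U, Iterable<V>, T> fn) {
        List<T> out = new ArrayList<>();
        Peeking<Rec<U, V>> in = new Peeking<>(sorted);
        while (in.hasNext()) {
            Rec<U, V> head = in.next();
            if (!head.isLeft) {
                continue; // right-side record without a left-side match: dropped
            }
            final long key = head.key;
            Iterator<V> values = new Iterator<V>() {
                @Override
                public boolean hasNext() {
                    return in.hasNext() && !in.peek().isLeft && in.peek().key == key;
                }

                @Override
                public V next() {
                    if (!hasNext()) {
                        throw new NoSuchElementException();
                    }
                    return in.next().many;
                }
            };
            // The Iterable is single-pass: it reads straight from the input stream.
            out.add(fn.apply(head.one, () -> values));
            // Skip any right-side values the handler did not consume.
            while (values.hasNext()) {
                values.next();
            }
        }
        return out;
    }
}
```

In this sketch the Iterable handed to the handler can be traversed only once, which is the
price of not buffering; a left-side record with no matching right-side records still
reaches the handler, with an empty Iterable.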

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
