Return-Path:
X-Original-To: apmail-crunch-dev-archive@www.apache.org
Delivered-To: apmail-crunch-dev-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 04D08100E9 for ; Sun, 2 Jun 2013 16:51:21 +0000 (UTC)
Received: (qmail 42139 invoked by uid 500); 2 Jun 2013 16:51:20 -0000
Delivered-To: apmail-crunch-dev-archive@crunch.apache.org
Received: (qmail 42099 invoked by uid 500); 2 Jun 2013 16:51:20 -0000
Mailing-List: contact dev-help@crunch.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@crunch.apache.org
Delivered-To: mailing list dev@crunch.apache.org
Received: (qmail 42037 invoked by uid 500); 2 Jun 2013 16:51:20 -0000
Delivered-To: apmail-incubator-crunch-dev@incubator.apache.org
Received: (qmail 41980 invoked by uid 99); 2 Jun 2013 16:51:20 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 02 Jun 2013 16:51:20 +0000
Date: Sun, 2 Jun 2013 16:51:20 +0000 (UTC)
From: "Josh Wills (JIRA)"
To: crunch-dev@incubator.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Commented] (CRUNCH-211) Add one-to-many join functionality
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

    [ https://issues.apache.org/jira/browse/CRUNCH-211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672594#comment-13672594 ]

Josh Wills commented on CRUNCH-211:
-----------------------------------

Nah, I'm on board with it-- especially if it provides a way for folks to do the "right thing" (i.e., one MR job) instead of two (one to do the aggregation over the large collection and a second job to do the join).

+1, but please remove the extra java.lang.* and org.apache.crunch.* names in the OneToManyDoFn.

> Add one-to-many join functionality
> ----------------------------------
>
> Key: CRUNCH-211
> URL: https://issues.apache.org/jira/browse/CRUNCH-211
> Project: Crunch
> Issue Type: Bug
> Reporter: Gabriel Reid
> Attachments: CRUNCH-211.patch
>
>
> A common pattern is a join between two tables where the left-side table contains a single value per key, and the right-side table contains multiple values per key. An example of such a join would be a join between users and web click entries:
> PTable usersById = ...;
> PTable webClicksByUserId = ...;
> In this case, there can be some situations where it is desirable to bring the User together with the iterable of all WebClicks. The current join functionality will replicate the User for each WebClick that it's related to, but each WebClick then needs to be dealt with completely separately.
> Currently, the only way of getting an iterable of WebClicks together with a single User in a single method call is by materializing all WebClicks per user in memory using something like PTable#collectValues, and this approach doesn't work when there are a large number of WebClicks.
> The intention of this ticket is to add functionality whereby the User and Iterable of WebClicks are available in a single method call, without the Iterable of WebClicks being materialized in memory (i.e. a feasible approach for millions or more WebClicks).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
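
For readers of the archive, the pattern described in the ticket (one User handed to a single call together with a streamed Iterable of its WebClicks) can be sketched in plain Java. This is only an illustration, not the attached patch and not Crunch API: the names OneToManyJoinSketch, Row, Peek and join are hypothetical, and the sketch assumes the input is already sorted by key with the single left-side row ahead of the right-side rows for that key. It also assumes left values are never null.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.BiFunction;

// Plain-Java illustration only; none of these names come from Crunch or the patch.
final class OneToManyJoinSketch {

    /** One record of the key-sorted input: exactly one of left/right is non-null. */
    static final class Row<K, L, R> {
        final K key; final L left; final R right;
        private Row(K key, L left, R right) { this.key = key; this.left = left; this.right = right; }
        static <K, L, R> Row<K, L, R> ofLeft(K key, L left) { return new Row<>(key, left, null); }
        static <K, L, R> Row<K, L, R> ofRight(K key, R right) { return new Row<>(key, null, right); }
        boolean isLeft() { return left != null; }
    }

    /** Iterator wrapper with one element of lookahead. */
    static final class Peek<E> {
        private final Iterator<E> it;
        private E head;
        private boolean has;
        Peek(Iterator<E> it) { this.it = it; }
        boolean hasNext() {
            if (!has && it.hasNext()) { head = it.next(); has = true; }
            return has;
        }
        E peek() {
            if (!hasNext()) throw new NoSuchElementException();
            return head;
        }
        E next() { E e = peek(); has = false; head = null; return e; }
    }

    /**
     * Calls fn once per left row, passing a single-pass Iterable that pulls the
     * matching right rows lazily from the shared input, so they are never buffered.
     */
    static <K, L, R, T> List<T> join(Iterator<Row<K, L, R>> sorted,
                                     BiFunction<L, Iterable<R>, T> fn) {
        final Peek<Row<K, L, R>> rows = new Peek<>(sorted);
        List<T> out = new ArrayList<>();
        while (rows.hasNext()) {
            Row<K, L, R> row = rows.next();
            if (!row.isLeft()) continue; // right row without a left row: dropped (inner join)
            final K key = row.key;
            final Iterator<R> rights = new Iterator<R>() {
                @Override public boolean hasNext() {
                    return rows.hasNext() && !rows.peek().isLeft() && rows.peek().key.equals(key);
                }
                @Override public R next() {
                    if (!hasNext()) throw new NoSuchElementException();
                    return rows.next().right;
                }
            };
            out.add(fn.apply(row.left, () -> rights));
            while (rights.hasNext()) rights.next(); // skip right rows that fn did not consume
        }
        return out;
    }

    /** Tiny example: count clicks per user without collecting them. */
    static List<String> demo() {
        List<Row<Integer, String, String>> rows = new ArrayList<>();
        rows.add(Row.ofLeft(1, "alice"));
        rows.add(Row.ofRight(1, "/a"));
        rows.add(Row.ofRight(1, "/b"));
        rows.add(Row.ofLeft(2, "bob"));
        rows.add(Row.ofRight(3, "/x")); // click with no matching user
        return join(rows.iterator(), (user, clicks) -> {
            int n = 0;
            for (String click : clicks) n++;
            return user + ":" + n;
        });
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [alice:2, bob:0]
    }
}
```

The Iterable passed to the function can be traversed only once, which is the price of not materializing the right-side values. Getting the left row ahead of the right rows within each key is outside this sketch; in a MapReduce setting that ordering would have to come from the shuffle, which is what would make the single-job approach mentioned in the comment possible.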