crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-174) Add support for join3 and cogroup3
Date Thu, 18 Jul 2013 06:48:49 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13712074#comment-13712074 ]

Gabriel Reid commented on CRUNCH-174:
-------------------------------------

Nice. I have been working with Pig quite a bit lately, and the ability to do stuff like this
(at least in terms of joins) was making me wonder why we didn't have it in Crunch yet :-)

One idea around the implementation: as I understand it from an initial read-through,
the initial mapper maps values into sparse arrays, and then the second phase combines those
sparse arrays, so for triples it's like this:

   A = { 1: '1A' }
   B = { 1: '1B' }
   C = { 1: '1C' }
   D = { 2: '2D' }
   
   // pcollection after first phase
   UNION = [
                   1: ('1A', null, null),
                   1: (null, '1B', null),
                   1: (null, null, '1C'),
                   2: ('2D', null, null)]

And then the final value is made by looping through all tupleNs under the same key and
combining their non-null values into a collection that's at the same tuple index as the
non-null value.
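
To make that second phase concrete, here is a minimal sketch in plain Java collections. It is
not the code from the patch and does not use the Crunch API; the class and method names
(SparseTupleCombine, combine) are made up for illustration, and values are assumed to be
Strings to keep it short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: not the patch's PostGroupFn.
public class SparseTupleCombine {

    // Combine all sparse tuples that share one key. Each non-null slot value
    // is appended to the collection at the same tuple index.
    static List<List<String>> combine(List<String[]> sparseTuples, int arity) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < arity; i++) {
            result.add(new ArrayList<>());
        }
        for (String[] tuple : sparseTuples) {
            // Every slot has to be checked, although only one is non-null.
            for (int i = 0; i < arity; i++) {
                if (tuple[i] != null) {
                    result.get(i).add(tuple[i]);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The three sparse tuples under key 1 from the example above.
        List<String[]> key1 = Arrays.asList(
            new String[] {"1A", null, null},
            new String[] {null, "1B", null},
            new String[] {null, null, "1C"});
        System.out.println(combine(key1, 3)); // prints [[1A], [1B], [1C]]
    }
}
```

Note that each input value costs an array of length `arity` plus a scan over that array.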

Seeing as there's only ever one value per sparse array after the first phase, I was thinking
it could probably be more efficient for larger tuples (particularly tupleNs) to just work
with a pair of (index, value) instead of using sparse tuples. Using this method, the union
after the first phase would look like this:

    UNION = [
                    1: (0, '1A'),
                    1: (1, '1B'),
                    1: (2, '1C'),
                    2: (0, '2D')]

I think this'll make it a bit more efficient in terms of not needing to allocate arrays that
are mostly not going to be used, as well as removing the need for the loop over the sparse
tuples in PostGroupFn.
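
A rough sketch of the (index, value) variant, again in plain Java collections rather than the
Crunch API. The names TaggedValueCombine, Tagged and combine are hypothetical, and values are
assumed to be Strings; this only illustrates the idea and is not a proposed patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: not an actual Crunch class.
public class TaggedValueCombine {

    // An (index, value) pair: the index says which input table the value came from.
    static final class Tagged {
        final int index;
        final String value;

        Tagged(int index, String value) {
            this.index = index;
            this.value = value;
        }
    }

    // Combine all tagged values that share one key. The index routes each
    // value straight to its collection, so there is no scan over null slots.
    static List<List<String>> combine(List<Tagged> taggedValues, int arity) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < arity; i++) {
            result.add(new ArrayList<>());
        }
        for (Tagged t : taggedValues) {
            result.get(t.index).add(t.value);
        }
        return result;
    }

    public static void main(String[] args) {
        // The three (index, value) pairs under key 1 from the example above.
        List<Tagged> key1 = Arrays.asList(
            new Tagged(0, "1A"),
            new Tagged(1, "1B"),
            new Tagged(2, "1C"));
        System.out.println(combine(key1, 3)); // prints [[1A], [1B], [1C]]
    }
}
```

Here the per-value cost is one small pair regardless of how many tables are being joined.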

What do you think?
                
> Add support for join3 and cogroup3
> ----------------------------------
>
>                 Key: CRUNCH-174
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-174
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core, MapReduce Patterns
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-174.patch
>
>
> This seemed like a nice starter JIRA: it would be great to have the three (and even four!)
> argument analogues of Join.join() and Cogroup.cogroup(), something like:
> PTable<K, Tuple3<V1, V2, V3>> j = Join.join(PTable<K, V1> a, PTable<K, V2> b, PTable<K, V3> c);
> ... and similar for co-groups.
