crunch-dev mailing list archives

From "Micah Whitacre (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-301) Cogrouping tables where RHS has a Scala tuple value type causes duplicated and missing RHS values
Date Thu, 21 Nov 2013 15:03:36 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828996#comment-13828996 ]

Micah Whitacre commented on CRUNCH-301:
---------------------------------------

Point of clarification: why did you choose to use the context-based configuration over the
value that is explicitly set?  If you aren't really expecting both to be set, then I can
understand that it doesn't matter, but if there is a case where both could be set, then
I'd expect the explicitly set value to take precedence.

> Cogrouping tables where RHS has a Scala tuple value type causes duplicated and missing RHS values
> -------------------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-301
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-301
>             Project: Crunch
>          Issue Type: Bug
>          Components: Scrunch
>    Affects Versions: 0.8.0
>         Environment: Hadoop 2
>            Reporter: David Whiting
>         Attachments: CRUNCH-301.patch, IsolatedBug.scala
>
>
> Suppose you have three record types, Rec1, Rec2 and Rec3.
> Rec1 references Rec2 via key1, and Rec2 references Rec3 (one-to-many) by key2. If you innerJoin Rec2 and Rec3 to make a PCollection[(Rec2,Rec3)] and then cogroup it against Rec1, then instead of surfacing n different (Rec2,Rec3) tuples applicable to the Rec1, it surfaces just one of the (Rec2, Rec3) tuples multiple times.
> This only happens when running with MRPipeline, and not with MemPipeline.
> Attached is the simplest complete program I could come up with which will produce this unexpected result.
> The result that is produced is:
> Rec1(1,tjena)	Rec1(1,hello)	(Rec2(1,a,0.5),Rec3(a,4))	(Rec2(1,a,0.5),Rec3(a,4))	(Rec2(1,a,0.5),Rec3(a,4))	(Rec2(1,a,0.5),Rec3(a,4))
> Rec1(2,goodbye)	(Rec2(2,c,9.9),Rec3(c,6))
> As you can see, there's a single (Rec2, Rec3) tuple repeated many times, instead of showing all the distinct ones. This does not happen if you join against Rec2 on its own.
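
For reference, the result the description expects can be sketched with plain Scala collections. This is not the attached IsolatedBug.scala and it uses no Crunch or Scrunch APIs. The field names, and every Rec3 row other than Rec3(a,4) and Rec3(c,6), are assumptions made purely for illustration.

```scala
// Plain Scala collections only; no Crunch or Scrunch classes are involved.
// Field names and the Rec3 rows with counts 1 to 3 are hypothetical.
case class Rec1(key1: Int, text: String)
case class Rec2(key1: Int, key2: String, weight: Double)
case class Rec3(key2: String, count: Int)

object ExpectedCogroup {
  val rec1s = Seq(Rec1(1, "tjena"), Rec1(1, "hello"), Rec1(2, "goodbye"))
  val rec2s = Seq(Rec2(1, "a", 0.5), Rec2(2, "c", 9.9))
  val rec3s = Seq(Rec3("a", 1), Rec3("a", 2), Rec3("a", 3), Rec3("a", 4), Rec3("c", 6))

  // Inner join Rec2 with Rec3 on key2, then key each pair by Rec2's key1.
  val joined: Seq[(Int, (Rec2, Rec3))] =
    for {
      r2 <- rec2s
      r3 <- rec3s if r3.key2 == r2.key2
    } yield (r2.key1, (r2, r3))

  // Cogroup: for every key, collect the Rec1 values and the joined pairs.
  val cogrouped: Map[Int, (Seq[Rec1], Seq[(Rec2, Rec3)])] = {
    val left = rec1s.groupBy(_.key1)
    val right = joined.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }
    (left.keySet ++ right.keySet).map { k =>
      (k, (left.getOrElse(k, Seq.empty), right.getOrElse(k, Seq.empty)))
    }.toMap
  }

  def main(args: Array[String]): Unit =
    cogrouped.toSeq.sortBy(_._1).foreach { case (k, (l, r)) =>
      println(s"$k\t${l.mkString("\t")}\t${r.mkString("\t")}")
    }
}
```

With this sample data, key 1 should carry four distinct (Rec2, Rec3) pairs on the right-hand side rather than one pair repeated four times.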



--
This message was sent by Atlassian JIRA
(v6.1#6144)
