crunch-dev mailing list archives

From "Josh Wills (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (CRUNCH-301) Cogrouping tables where RHS has a Scala tuple value type causes duplicated and missing RHS values
Date Thu, 21 Nov 2013 20:59:36 GMT

     [ https://issues.apache.org/jira/browse/CRUNCH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Wills resolved CRUNCH-301.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 0.9.0

Committed. Thanks [~davw], [~gabriel.reid], and [~mkwhitacre]!

> Cogrouping tables where RHS has a Scala tuple value type causes duplicated and missing RHS values
> -------------------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-301
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-301
>             Project: Crunch
>          Issue Type: Bug
>          Components: Scrunch
>    Affects Versions: 0.8.0
>         Environment: Hadoop 2
>            Reporter: David Whiting
>             Fix For: 0.9.0
>
>         Attachments: CRUNCH-301.patch, CRUNCH-301b.patch, IsolatedBug.scala
>
>
> Suppose you have three record types, Rec1, Rec2 and Rec3.
> Rec1 references Rec2 via key1, and Rec2 references Rec3 (one-to-many) by key2. If you
> innerJoin Rec2 and Rec3 to make a PCollection[(Rec2,Rec3)] and then cogroup it against Rec1,
> then instead of surfacing the n different (Rec2, Rec3) tuples applicable to the Rec1, it
> surfaces just one of the (Rec2, Rec3) tuples multiple times.
> This only happens when running with MRPipeline, and not with MemPipeline.
> Attached (IsolatedBug.scala) is the simplest complete program I could come up with that
> reproduces this unexpected result.
> The result that is produced is:
> Rec1(1,tjena)	Rec1(1,hello)	(Rec2(1,a,0.5),Rec3(a,4))	(Rec2(1,a,0.5),Rec3(a,4))	(Rec2(1,a,0.5),Rec3(a,4))	(Rec2(1,a,0.5),Rec3(a,4))
> Rec1(2,goodbye)	(Rec2(2,c,9.9),Rec3(c,6))
> As you can see, there's a single (Rec2, Rec3) tuple repeated many times, instead of showing
> all the distinct ones. This does not happen if you join against Rec2 on its own.
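
For readers without the attachment, the expected semantics described above can be sketched with plain Scala collections. This is only a model of what the join followed by the cogroup should return. It does not use the Crunch or Scrunch API, and it is not the contents of IsolatedBug.scala. The field names (key1, key2, text, weight, count) and the sample values are assumptions made for illustration.

```scala
// Illustrative model only: plain Scala collections, no Crunch/Scrunch calls.
// Field names and sample values are hypothetical, not taken from IsolatedBug.scala.
case class Rec1(key1: Int, text: String)
case class Rec2(key1: Int, key2: String, weight: Double)
case class Rec3(key2: String, count: Int)

object ExpectedCogroup {
  val rec1s: Seq[Rec1] = Seq(Rec1(1, "hello"), Rec1(1, "tjena"), Rec1(2, "goodbye"))
  val rec2s: Seq[Rec2] = Seq(Rec2(1, "a", 0.5), Rec2(1, "b", 0.7), Rec2(2, "c", 9.9))
  val rec3s: Seq[Rec3] = Seq(Rec3("a", 4), Rec3("a", 5), Rec3("b", 1), Rec3("c", 6))

  // Inner join of Rec2 and Rec3 on key2 (one-to-many), yielding (Rec2, Rec3) pairs.
  val joined: Seq[(Rec2, Rec3)] =
    for (r2 <- rec2s; r3 <- rec3s if r2.key2 == r3.key2) yield (r2, r3)

  // Cogroup on key1: each key maps to all of its Rec1 values and all of its joined pairs.
  def cogroup: Map[Int, (Seq[Rec1], Seq[(Rec2, Rec3)])] = {
    val left  = rec1s.groupBy(_.key1)
    val right = joined.groupBy(_._1.key1)
    (left.keySet ++ right.keySet).map { k =>
      (k, (left.getOrElse(k, Seq.empty[Rec1]), right.getOrElse(k, Seq.empty[(Rec2, Rec3)])))
    }.toMap
  }

  def main(args: Array[String]): Unit =
    cogroup.toSeq.sortBy(_._1).foreach { case (k, (l, r)) =>
      // One tab-separated line per key, mirroring the layout of the output quoted above.
      println(s"$k\t${l.mkString("\t")}\t${r.mkString("\t")}")
    }
}
```

In this model every (Rec2, Rec3) pair for a key appears exactly once on the right-hand side. The report describes MRPipeline instead returning one pair repeated once per expected element.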



--
This message was sent by Atlassian JIRA
(v6.1#6144)
