hadoop-pig-dev mailing list archives

From "Antonio Magnaghi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-97) Jobs produce wrong results when a cogroup is in the script and the compiler chooses to use the combiner feature of hadoop.
Date Mon, 11 Feb 2008 21:50:07 GMT

    [ https://issues.apache.org/jira/browse/PIG-97?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567829#action_12567829 ]

Antonio Magnaghi commented on PIG-97:
-------------------------------------

The patch looks good for the POVisitor pattern. I just have one comment/question:

It looks like POPrinter currently does not override some of the visitX methods in its
superclass (POVisitor), such as visitCogroup, visitSplit, and visitUnion. If the local physical
query plan contains such operators, no information would be printed for them.
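
To illustrate the concern, here is a minimal sketch. It assumes the base visitor supplies no-op defaults for every visitX method; the class names (Op, LoadOp, CogroupOp, Visitor, Printer) are hypothetical stand-ins and not the actual Pig classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for physical operators; not the real Pig classes.
abstract class Op {
    abstract void accept(Visitor v);
}

class LoadOp extends Op {
    void accept(Visitor v) { v.visitLoad(this); }
}

class CogroupOp extends Op {
    void accept(Visitor v) { v.visitCogroup(this); }
}

// Base visitor with no-op defaults (an assumption about how POVisitor behaves).
class Visitor {
    void visitLoad(LoadOp op) { }
    void visitCogroup(CogroupOp op) { }
}

// Printer that overrides only visitLoad, so cogroup operators are skipped silently.
class Printer extends Visitor {
    final List<String> lines = new ArrayList<String>();

    @Override
    void visitLoad(LoadOp op) { lines.add("Load"); }
}

public class VisitorSketch {
    // Walks a flat list of operators and returns what the printer recorded.
    static List<String> print(List<Op> plan) {
        Printer p = new Printer();
        for (Op op : plan) {
            op.accept(p);
        }
        return p.lines;
    }

    public static void main(String[] args) {
        List<Op> plan = Arrays.asList(new LoadOp(), new CogroupOp());
        System.out.println(print(plan)); // prints [Load]
    }
}
```

In this sketch the cogroup operator never appears in the output, because the inherited no-op is what runs. Overriding the remaining visitX methods in the printer would be one way to close the gap.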

> Jobs produce wrong results when a cogroup is in the script and the compiler chooses to
> use the combiner feature of hadoop.
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-97
>                 URL: https://issues.apache.org/jira/browse/PIG-97
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.0.0
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: cogroupcombiner.patch
>
>
> The following script will produce 0 output records, even when it should produce records:
> a = load 'file1';
> b = load 'file2';
> c = cogroup a by $0, b by $0;
> d = foreach c generate $0, COUNT($1), COUNT($2);
> dump d;
> In this case Pig chooses to use the combiner in order to be more efficient.  However,
> the following code in PigCombiner.java causes a problem:
> for (int i = 0; i < inputCount; i++) {  // XXX: shouldn't we only do this if INNER flag is set?
>     if (t.getBagField(1 + i).size() == 0) return;
> }
> In this case a map is often running on a machine where it has access to only one of the
> two files and thus there is nothing in one of the bags, so the above lines of code cause the
> combiner to bail out without pushing any tuples to the OutputCollector.
> The proposed solution for the short term is to disable use of the combiner in cases where
> more than one file is grouped together.
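
The failure mode described in the issue can be sketched as follows. This is only an illustration under stated assumptions: bags are modelled as plain lists, and the method names (emitsWithEmptyBagCheck, useCombiner) are hypothetical, not taken from PigCombiner.java or the attached patch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CombinerSketch {
    // Mirrors the quoted loop: emit nothing as soon as any input bag is empty.
    static boolean emitsWithEmptyBagCheck(List<List<String>> bags) {
        for (List<String> bag : bags) {
            if (bag.isEmpty()) {
                return false; // corresponds to the early "return" in the combiner
            }
        }
        return true;
    }

    // Hypothetical short-term guard: use the combiner only for single-input groups.
    static boolean useCombiner(int inputCount) {
        return inputCount <= 1;
    }

    public static void main(String[] args) {
        // A map task that saw rows from file1 only, so the bag for b is empty.
        List<List<String>> bags = Arrays.asList(
                Arrays.asList("k,1"),
                Collections.<String>emptyList());

        System.out.println(emitsWithEmptyBagCheck(bags)); // prints false
        System.out.println(useCombiner(bags.size()));     // prints false
    }
}
```

With two inputs and one empty bag, the check drops the key entirely, which matches the zero-record symptom; the guard simply keeps the combiner out of that situation.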

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

