pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-983) PERFORMANCE: multi-query optimization on multiple group bys following a join or cogroup
Date Mon, 05 Oct 2009 22:32:31 GMT

    [ https://issues.apache.org/jira/browse/PIG-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762425#action_12762425

Hadoop QA commented on PIG-983:

+1 overall.  Here are the results of testing the latest attachment 
  against trunk revision 821101.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 5 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/13/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/13/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/13/console

This message is automatically generated.

> PERFORMANCE: multi-query optimization on multiple group bys following a join or cogroup
> ---------------------------------------------------------------------------------------
>                 Key: PIG-983
>                 URL: https://issues.apache.org/jira/browse/PIG-983
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Richard Ding
>            Assignee: Richard Ding
>         Attachments: PIG-983.patch
> The current multi-query optimizer works well with pig scripts like this one:
> {code}
> data = LOAD 'input' AS (a:chararray, b:int, c:int);
> A = GROUP data BY b;
> B = GROUP data BY c;
> C = FOREACH A GENERATE group, COUNT(data);
> D = FOREACH B GENERATE group, SUM(data.b);
> STORE C INTO 'output1';
> STORE D INTO 'output2';
> {code}
> In this case the original three Map-Reduce jobs are merged into one MR job by the optimizer.
> The current optimizer, however, won't reduce the number of MR jobs for the scripts in
which multiple group bys follow a join or a cogroup, such as this one:
> {code}
> data1 = LOAD 'input1' AS (a1:chararray, b1:int, c1:int);
> data2 = LOAD 'input2' AS (a2:chararray, b2:int, c2:int);
> A = JOIN data1 BY a1, data2 BY a2;
> B = GROUP A BY data1::b1;
> C = GROUP B BY data2::c2;
> E = FOREACH C GENERATE group, SUM(A.data2::b2);
> STORE D INTO 'output1';
> STORE E INTO 'output2';                        
> {code}
> Three MR jobs are still needed to run this script.
> Multi-query optimizer should work with this kind of scripts by merging the group bys
and reducing the overall MR jobs. 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message