hadoop-pig-dev mailing list archives

From "Daniel Dai (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1637) Combiner not use because optimizor inserts a foreach between group and algebric function
Date Tue, 28 Sep 2010 19:00:41 GMT

    [ https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915880#action_12915880 ]

Daniel Dai commented on PIG-1637:
---------------------------------

test-patch result for PIG-1637-2.patch:

     [exec] +1 overall.  
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.


> Combiner not use because optimizor inserts a foreach between group and algebric function
> ----------------------------------------------------------------------------------------
>
>                 Key: PIG-1637
>                 URL: https://issues.apache.org/jira/browse/PIG-1637
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Daniel Dai
>            Assignee: Daniel Dai
>             Fix For: 0.8.0
>
>         Attachments: PIG-1637-1.patch, PIG-1637-2.patch
>
>
> The following script no longer uses the combiner after the new optimizer change.
> {code}
> A = load ':INPATH:/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
>     as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links);
> B = foreach A generate user, (int)timespent as timespent, (double)estimated_revenue as estimated_revenue;
> C = group B all; 
> D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue);
> store D into ':OUTPATH:';
> {code}
> This happens because the optimizer detects that the group key is not used after the
> group, so it adds a foreach statement after C. This is what the script looks like after
> optimization:
> {code}
> A = load ':INPATH:/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
>     as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links);
> B = foreach A generate user, (int)timespent as timespent, (double)estimated_revenue as estimated_revenue;
> C = group B all; 
> C1 = foreach C generate B;
> D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue);
> store D into ':OUTPATH:';
> {code}
> This disables the combiner optimization for D.
> The fix is to merge the inserted C1 with D. Currently, these two foreach statements are
> not merged because one output of the first foreach (B) is referenced twice in D, and the
> current rule assumes that after the merge B would have to be calculated twice in D. In
> fact, C1 only projects B and performs no calculation, so merging C1 and D will not cause
> B to be calculated twice. C1 and D should therefore be merged.
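
The merge condition described above can be modeled with a small sketch. This is not Pig's optimizer code; the `Output` class, the `can_merge` function, and the field names are hypothetical and only illustrate the reasoning: an output of the first foreach that is referenced more than once blocks the merge only if producing it involves real computation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Output:
    """One output column of the first foreach (hypothetical model)."""
    name: str
    is_plain_projection: bool  # True if the column is passed through unchanged


def can_merge(first_outputs, second_refs):
    """Return True if two adjacent foreach statements may be merged.

    first_outputs: Output objects produced by the first foreach.
    second_refs:   column names read by the second foreach,
                   listed once per reference.
    """
    for out in first_outputs:
        uses = second_refs.count(out.name)
        # A column referenced several times would be recomputed per reference
        # after merging, unless it is a plain projection with nothing to recompute.
        if uses > 1 and not out.is_plain_projection:
            return False
    return True


# C1 = foreach C generate B;  -> B is only projected
c1_outputs = [Output("B", is_plain_projection=True)]
# D reads B twice: SUM(B.timespent) and AVG(B.estimated_revenue)
d_refs = ["B", "B"]

print(can_merge(c1_outputs, d_refs))  # True
```

Under this model, C1 and D from the script above are mergeable, while a first foreach that computes an expression used twice downstream would still be left alone.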

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

