hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (PIG-580) PERFORMANCE: Combiner should also be used when there are distinct aggregates in a foreach following a group provided there are no non-algebraics in the foreach
Date Mon, 05 Jan 2009 21:51:44 GMT

    [ https://issues.apache.org/jira/browse/PIG-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660944#action_12660944 ]

alangates edited comment on PIG-580 at 1/5/09 1:51 PM:
--------------------------------------------------------

In CombinerOptimizer.visitDistinct you have:

{code}
+            if(sawDistinctAgg) {
+                // we want to combine only in the case where there is only
+                // one PODistinct which is the only input to an agg
+                // we apparently have seen a PODistinct before, so lets not
+                // combine.
+                sawNonAlgebraic = true;
+            }
{code}

but I can envision a case where you want to count multiple distinct things:

{code}
A = load ...
B = group A by $0;
C = foreach B {
       Aa = A.$1;
       Ab = distinct Aa;
       Ba = A.$2;
       Bb = distinct Ba;
       generate group, COUNT(Ab), COUNT(Bb);
}
{code}

Is there a reason we cannot use the combiner when there are multiple distincts?
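
To make the question concrete, here is a minimal sketch of why multiple distincts look combinable. It is a plain Java simulation, not Pig code: the class MultiDistinctCombinerSketch and its methods are hypothetical names and do not correspond to CombinerOptimizer or PODistinct. It assumes each row is (group key, $1, $2), the combiner keeps one set of distinct values per column, and the reducer unions the partial sets and only then counts.

```java
import java.util.*;

// Hypothetical simulation; none of these names exist in Pig.
class MultiDistinctCombinerSketch {
    // One empty set per column that gets its own distinct.
    static List<Set<String>> emptySets(int n) {
        List<Set<String>> sets = new ArrayList<>();
        for (int i = 0; i < n; i++) sets.add(new TreeSet<>());
        return sets;
    }

    // Combiner step for one map task: row[0] is the group key and
    // row[1..n] are the columns that are each made distinct independently.
    static Map<String, List<Set<String>>> combine(List<String[]> rows, int n) {
        Map<String, List<Set<String>>> partial = new TreeMap<>();
        for (String[] row : rows) {
            List<Set<String>> sets = partial.computeIfAbsent(row[0], k -> emptySets(n));
            for (int i = 0; i < n; i++) sets.get(i).add(row[i + 1]);
        }
        return partial;
    }

    // Reduce step: union the partial sets from every combiner, then
    // evaluate COUNT on each fully merged distinct set.
    static Map<String, long[]> reduce(List<Map<String, List<Set<String>>>> partials, int n) {
        Map<String, List<Set<String>>> merged = new TreeMap<>();
        for (Map<String, List<Set<String>>> partial : partials) {
            for (Map.Entry<String, List<Set<String>>> e : partial.entrySet()) {
                List<Set<String>> sets = merged.computeIfAbsent(e.getKey(), k -> emptySets(n));
                for (int i = 0; i < n; i++) sets.get(i).addAll(e.getValue().get(i));
            }
        }
        Map<String, long[]> counts = new TreeMap<>();
        for (Map.Entry<String, List<Set<String>>> e : merged.entrySet()) {
            long[] c = new long[n];
            for (int i = 0; i < n; i++) c[i] = e.getValue().get(i).size();
            counts.put(e.getKey(), c);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two made-up map splits; values repeat within and across splits.
        List<String[]> split1 = Arrays.asList(
            new String[]{"k", "a", "x"}, new String[]{"k", "a", "y"}, new String[]{"k", "b", "x"});
        List<String[]> split2 = Arrays.asList(
            new String[]{"k", "a", "y"}, new String[]{"k", "c", "x"}, new String[]{"j", "a", "x"});
        Map<String, long[]> out = reduce(Arrays.asList(combine(split1, 2), combine(split2, 2)), 2);
        for (Map.Entry<String, long[]> e : out.entrySet()) {
            System.out.println(e.getKey() + " " + Arrays.toString(e.getValue()));
        }
    }
}
```

In this simulation the two distinct sets never interact, so nothing in the data flow itself forces a limit of one distinct per foreach. Whether the physical plan can carry several distinct bags through the combiner is a separate question that the sketch does not address.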

  
> PERFORMANCE: Combiner should also be used when there are distinct aggregates in a foreach following a group provided there are no non-algebraics in the foreach
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-580
>                 URL: https://issues.apache.org/jira/browse/PIG-580
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: types_branch
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>             Fix For: types_branch
>
>         Attachments: PIG-580-v2.patch, PIG-580.patch
>
>
> Currently Pig uses the combiner only when a foreach follows a group and every element in the foreach's generate is one of the following:
> 1) a simple projection of the "group" column
> 2) an algebraic UDF
> These conditions exclude distinct aggregates from the combiner, even though the distinct operation itself is combinable (regardless of whether it feeds an algebraic or non-algebraic UDF). So the following foreach should also be combinable:
> {code}
> ..
> b = group a by $0;
> c = foreach b { x = distinct a; generate group, COUNT(x), SUM(x.$1); }
> {code}
> The combiner optimizer should cause the distinct to be combined, and the final combine output should feed the COUNT() and SUM() in the reduce.
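
The data flow described in the issue can be sketched as follows. This is a plain Java simulation under stated assumptions, not Pig's implementation: DistinctThenAggregateSketch and its methods are hypothetical names, each tuple of a is assumed to be ($0, $1) with long fields, and the sample data is made up. The combiner only removes duplicate tuples; COUNT and SUM run in the reduce on the merged distinct set, because summing in the combiner would double-count a tuple that appears in more than one map task.

```java
import java.util.*;

// Hypothetical simulation; none of these names exist in Pig.
class DistinctThenAggregateSketch {
    // Combiner: drop duplicate tuples within one map task, grouped by $0.
    static Map<Long, Set<List<Long>>> combineDistinct(List<List<Long>> tuples) {
        Map<Long, Set<List<Long>>> partial = new TreeMap<>();
        for (List<Long> t : tuples) {
            partial.computeIfAbsent(t.get(0), k -> new HashSet<>()).add(t);
        }
        return partial;
    }

    // Reducer: union the partial distinct sets, then evaluate
    // COUNT(x) and SUM(x.$1) on the fully deduplicated bag.
    static Map<Long, long[]> reduce(List<Map<Long, Set<List<Long>>>> partials) {
        Map<Long, Set<List<Long>>> merged = new TreeMap<>();
        for (Map<Long, Set<List<Long>>> partial : partials) {
            for (Map.Entry<Long, Set<List<Long>>> e : partial.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new HashSet<>()).addAll(e.getValue());
            }
        }
        Map<Long, long[]> out = new TreeMap<>();
        for (Map.Entry<Long, Set<List<Long>>> e : merged.entrySet()) {
            long count = e.getValue().size();
            long sum = 0;
            for (List<Long> t : e.getValue()) sum += t.get(1);
            out.put(e.getKey(), new long[]{count, sum});
        }
        return out;
    }

    public static void main(String[] args) {
        // The tuple (1, 10) appears twice in split1 and again in split2.
        List<List<Long>> split1 = Arrays.asList(
            Arrays.asList(1L, 10L), Arrays.asList(1L, 10L), Arrays.asList(1L, 20L));
        List<List<Long>> split2 = Arrays.asList(
            Arrays.asList(1L, 10L), Arrays.asList(1L, 30L), Arrays.asList(2L, 5L));
        Map<Long, long[]> out = reduce(Arrays.asList(combineDistinct(split1), combineDistinct(split2)));
        for (Map.Entry<Long, long[]> e : out.entrySet()) {
            System.out.println(e.getKey() + " " + Arrays.toString(e.getValue()));
        }
    }
}
```

For group 1 the merged distinct bag holds three tuples, so the reduce sees a count of 3 and a sum of 60, even though (1, 10) arrived from both splits.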

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

