hadoop-pig-dev mailing list archives

From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-580) PERFORMANCE: Combiner should also be used when there are distinct aggregates in a foreach following a group provided there are no non-algebraics in the foreach
Date Tue, 30 Dec 2008 19:33:44 GMT

     [ https://issues.apache.org/jira/browse/PIG-580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-580:
-------------------------------

    Patch Info: [Patch Available]
       Summary: PERFORMANCE: Combiner should also be used when there are distinct aggregates
in a foreach following a group provided there are no non-algebraics in the foreach   (was:
Combiner should also be used when there are distinct aggregates in a foreach following a group
provided there are no non-algebraics in the foreach )

The changes are mostly in CombinerOptimizer, plus an additional UDF, Distinct.java. In the
combiner optimizer, if the inner plan of the foreach following a group has a distinct feeding
into a UDF (algebraic or not), that plan is considered "combinable" and is marked
as being of type ExprType.DISTINCT. If there is at least one inner plan in the foreach of type
ExprType.DISTINCT or ExprType.ALGEBRAIC and no inner plan of type ExprType.NOT_ALGEBRAIC,
then the optimizer considers the foreach to be combinable. If an inner plan is of type ExprType.DISTINCT,
then the leaf of the plan is a UDF whose input comes from a PODistinct, with a Project[bag][*]
in between the two to take the tuples from the PODistinct and supply them as a bag to the UDF.
This part of the plan is modified as explained in the code comment below:
{noformat}
                // A PODistinct in the plan will always have
                // a Project[bag](*) as its successor.
                // We will replace it with a POUserFunc with "Distinct" as 
                // the underlying UDF. 
                // In the map and combine, we will make this POUserFunc
                // the leaf of the plan by removing other operators which
                // are descendants up to the leaf.
                // In the reduce we will keep descendants intact. Further
                // down in fixProjectAndInputs we will change the inputs to
                // this POUserFunc in the combine and reduce plans to be
                // just projections of the column "i"
{noformat}
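
As a rough illustration of the classification rule described above, the check on the inner plans could be sketched as follows. This is not the code from the patch: the class name CombinableCheck, the method isCombinable, and the SIMPLE_PROJECT value (standing in for a plain projection of the "group" column) are hypothetical names chosen for this sketch. Only DISTINCT, ALGEBRAIC and NOT_ALGEBRAIC are named in this message.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not the CombinerOptimizer code from the patch.
final class CombinableCheck {
    // DISTINCT, ALGEBRAIC and NOT_ALGEBRAIC come from the message above;
    // SIMPLE_PROJECT is an assumed name for a projection of the "group" column.
    enum ExprType { SIMPLE_PROJECT, ALGEBRAIC, DISTINCT, NOT_ALGEBRAIC }

    // A foreach is treated as combinable when at least one inner plan is
    // DISTINCT or ALGEBRAIC and no inner plan is NOT_ALGEBRAIC.
    static boolean isCombinable(List<ExprType> innerPlans) {
        boolean sawCombinablePlan = false;
        for (ExprType type : innerPlans) {
            if (type == ExprType.NOT_ALGEBRAIC) {
                return false;
            }
            if (type == ExprType.ALGEBRAIC || type == ExprType.DISTINCT) {
                sawCombinablePlan = true;
            }
        }
        return sawCombinablePlan;
    }

    public static void main(String[] args) {
        // generate group, COUNT(x), SUM(x.$1) with x = distinct a
        System.out.println(isCombinable(Arrays.asList(
                ExprType.SIMPLE_PROJECT, ExprType.DISTINCT, ExprType.DISTINCT)));
        // One non-algebraic inner plan disables the combiner for the foreach.
        System.out.println(isCombinable(Arrays.asList(
                ExprType.DISTINCT, ExprType.NOT_ALGEBRAIC)));
    }
}
```

The first call prints true and the second prints false; a foreach that only projects the group column would also yield false, since it contains nothing to combine.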

The idea is that the "distinct" portion of this plan is combined and then the combined result
is fed to the UDF. So in the map and combine phases only partial distinct outputs are computed
and the UDF which takes the distinct output as input is not called. Then in the reduce, the
"combined" distinct output is fed as input to the UDF. 
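
To make the data flow concrete, here is a small simulation of the idea for a single group key, written in plain Java rather than against the Pig operators. All names in it (DistinctCombineSketch, partialDistinct, merge, count, sumColumn) are hypothetical, and tuples are modelled as lists of longs purely for illustration. The map and combine steps only deduplicate; the functions standing in for COUNT(x) and SUM(x.$1) run once, on the merged distinct set, in the reduce.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative simulation only; it does not use Pig's physical operators.
final class DistinctCombineSketch {
    // Map/combine side: keep only the distinct tuples seen locally.
    static Set<List<Long>> partialDistinct(List<List<Long>> tuples) {
        return new LinkedHashSet<>(tuples);
    }

    // Combine/reduce side: merging partial distinct sets is itself a distinct,
    // which is why the operation can be pushed into the combiner.
    static Set<List<Long>> merge(List<Set<List<Long>>> partials) {
        Set<List<Long>> merged = new LinkedHashSet<>();
        for (Set<List<Long>> partial : partials) {
            merged.addAll(partial);
        }
        return merged;
    }

    // Stand-in for COUNT(x): number of tuples in the distinct bag.
    static long count(Set<List<Long>> bag) {
        return bag.size();
    }

    // Stand-in for SUM(x.$col): sum of one column over the distinct bag.
    static long sumColumn(Set<List<Long>> bag, int col) {
        long sum = 0;
        for (List<Long> tuple : bag) {
            sum += tuple.get(col);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Tuples for one group key, split across two map tasks.
        List<List<Long>> map1 = Arrays.asList(
                Arrays.asList(1L, 10L), Arrays.asList(1L, 10L), Arrays.asList(1L, 20L));
        List<List<Long>> map2 = Arrays.asList(
                Arrays.asList(1L, 20L), Arrays.asList(1L, 30L));

        // Only the partial distinct sets travel to the reducer.
        Set<List<Long>> distinct = merge(Arrays.asList(
                partialDistinct(map1), partialDistinct(map2)));

        // The UDFs are applied only here, on the fully combined distinct bag.
        System.out.println(count(distinct) + " " + sumColumn(distinct, 1)); // prints "3 60"
    }
}
```

Applying the aggregate functions to the partial sets instead would double-count the tuple (1, 20), which appears in both map outputs; this is why the UDF is held back until the reduce.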

All these changes are independent of the changes on other non-distinct agg inner plans which
may have Algebraic UDFs - in those plans, the existing modifications will remain.



> PERFORMANCE: Combiner should also be used when there are distinct aggregates in a foreach
following a group provided there are no non-algebraics in the foreach 
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-580
>                 URL: https://issues.apache.org/jira/browse/PIG-580
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: types_branch
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>             Fix For: types_branch
>
>
> Currently Pig uses the combiner only when there is a foreach following a group and the
elements in the foreach generate are each one of the following:
> 1) simple project of the "group" column
> 2) Algebraic UDF
> The above conditions exclude use of the combiner for distinct aggregates, even though the distinct
operation itself is combinable (irrespective of whether it feeds an algebraic or non-algebraic
UDF). So the following foreach should also be combinable:
> {code}
> ..
> b = group a by $0;
> c = foreach b { x = distinct a; generate group, COUNT(x), SUM(x.$1); }
> {code}
> The combiner optimizer should cause the distinct to be combined and the final combine
output should feed the COUNT() and SUM() in the reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
