hadoop-pig-dev mailing list archives

From Mridul <mrid...@yahoo-inc.com>
Subject Re: [jira] Updated: (PIG-563) PERFORMANCE: enable combiner to be called 0 or more times whenever the combiner is used for a pig query
Date Thu, 18 Dec 2008 00:21:38 GMT

IIRC, the last time support for combiners was added, Utkarsh unearthed
a bunch of bugs (hence the restricted use of combiners in Pig). I can't
access the test cases in the patch, but hopefully those cases are also covered!

Regards,
Mridul

Pradeep Kamath (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/PIG-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Pradeep Kamath updated PIG-563:
> -------------------------------
>
>     Status: Patch Available  (was: Open)
>
> Changes are in two main places:
> 1) CombinerOptimizer, which decides whether to use the combiner and also modifies the
> map/combine/reduce plans to use the combiner
> 2) Builtin Aggregate UDFs - SUM, MIN, MAX, AVG and their typed variants and COUNT
>
> The CombinerOptimizer is changed as follows:
> The combiner is used only in the case of a group by followed by
> foreach generate <simple project>*, <algebraic udf>*, where <simple project>
> is the projection of the group by key (not a nested project like group.$0).
> Two new foreachs are inserted, one in the combine plan and one in the map
> plan, both based on the reduce foreach.
>
> The map foreach will have one inner plan for each inner plan in the foreach
> we're duplicating. For projections, the plan will be the same. For algebraic
> udfs, the plan will have the initial version of the function.
>
> The combine foreach will have one inner plan for each inner plan in the
> foreach we're duplicating. For projections, the project operators will be
> changed to project the column matching their position in the foreach. For
> algebraic udfs, the plan will have the intermediate version of the function.
>
> In the inner plans of the reduce foreach, the project operators for
> projections will be changed to project the column matching their position in
> the foreach. For algebraic udfs, the plan will have the final version of the
> function. The input to the udf will be a POProject which will project the
> column corresponding to the position of the udf in the foreach.
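
As a rough illustration of the positional projection described above, here is a sketch in plain Java rather than Pig's operator classes. It assumes a hypothetical query of the form foreach ... generate group, SUM(x), COUNT(x), so each row has the key at position 0, the SUM partial at position 1 and the COUNT partial at position 2. The class and method names are made up for this example.

```java
import java.util.List;

// Hypothetical sketch; none of these names are Pig classes.
final class PositionalCombineSketch {
    // Map foreach: the projection keeps the key, and each algebraic udf
    // contributes its "initial" partial (x for SUM, 1 for COUNT).
    static Object[] mapForeach(Object key, long x) {
        return new Object[] { key, x, 1L };
    }

    // Combine foreach over all rows that share one key. Each output position
    // reads the input column with the same index, so the output row has the
    // same shape as the input rows.
    static Object[] combineForeach(List<Object[]> rows) {
        Object key = rows.get(0)[0];   // position 0: project column 0
        long sum = 0L;
        long count = 0L;
        for (Object[] row : rows) {
            sum += (Long) row[1];      // position 1: merge SUM partials from column 1
            count += (Long) row[2];    // position 2: merge COUNT partials from column 2
        }
        return new Object[] { key, sum, count };
    }
}
```

Because the combine output has the same layout as its input, the same step can be applied again to already combined rows, and the reduce foreach can read either kind of row.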
> The map plan is changed by replacing the existing local rearrange with a
> special operator, POPreCombinerLocalRearrange. In getNext() it behaves like
> the regular local rearrange as far as getting its input and constructing the
> "key" out of the input. It then returns a tuple with two fields: the key in
> the first position and the "value" inside a bag in the second position. This
> output format resembles the output of a Package, and it feeds the map
> foreach, which expects this format. A normal local rearrange is then attached
> as the leaf of the map plan, with a project as its input which projects the
> key from the map foreach.
>
> The combine plan will have the POCombiner package (formerly
> POPOstCombinerPackage), the combiner foreach and a local rearrange. The
> reduce plan will have a POCombiner package and the modified foreach at its
> root.
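
The two-field output described above can be pictured with a small sketch. This is only an approximation: plain Java lists stand in for Pig's tuple and bag types, and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; java.util.List stands in for Pig's tuple and bag types.
final class PreCombinerShapeSketch {
    // Returns a two-field "tuple": the key first, then a "bag" holding the value.
    // This mirrors the (key, bag) layout that a Package would normally produce.
    static List<Object> wrap(Object key, Object value) {
        List<Object> bag = new ArrayList<>();
        bag.add(value);

        List<Object> tuple = new ArrayList<>();
        tuple.add(key);   // field 0: the key
        tuple.add(bag);   // field 1: the value wrapped in a bag
        return tuple;
    }
}
```

Under these assumptions, the map foreach can treat map-side rows exactly like packaged rows, since both carry the key in field 0 and a bag in field 1.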
>
> The UDFs are changed to have correct implementations for Initial, Intermediate and Final.
> TestBuiltin has also been changed to test this new setup.
>
>
>   
>> PERFORMANCE: enable combiner to be called 0 or more times whenever the combiner is used for a pig query
>> ------------------------------------------------------------------------------------------------------
>>
>>                 Key: PIG-563
>>                 URL: https://issues.apache.org/jira/browse/PIG-563
>>             Project: Pig
>>          Issue Type: Improvement
>>    Affects Versions: types_branch
>>            Reporter: Pradeep Kamath
>>            Assignee: Pradeep Kamath
>>             Fix For: types_branch
>>
>>
>> Currently Pig's use of the combiner assumes the combiner is called exactly once in
>> Hadoop. With Hadoop 0.18, the combiner could be called 0, 1 or more times. This issue is to
>> track the changes needed in the CombinerOptimizer visitor and the builtin Algebraic UDFs (SUM,
>> COUNT, MIN, MAX, AVG) to work in this new model.
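
To see why the UDFs need separate stages when the combiner may run 0, 1 or more times, here is a minimal sketch of AVG split into three steps. It is plain Java with invented names, not Pig's Algebraic interface or the patch's code. The assumption is that the partial result is a (sum, count) pair, so the intermediate step takes partials in and hands a partial back out.

```java
import java.util.List;

// Hypothetical sketch; these names are not Pig's actual classes.
final class AvgSketch {
    // Partial aggregate carried between stages: running sum and row count.
    static final class Partial {
        final double sum;
        final long count;
        Partial(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    // Initial: runs in the map and turns one raw value into a partial.
    static Partial initial(double value) {
        return new Partial(value, 1L);
    }

    // Intermediate: runs in the combiner. Input and output have the same
    // shape, so applying it zero, one or many times gives the same totals.
    static Partial intermediate(List<Partial> partials) {
        double sum = 0.0;
        long count = 0L;
        for (Partial p : partials) {
            sum += p.sum;
            count += p.count;
        }
        return new Partial(sum, count);
    }

    // Final: runs in the reduce and merges whatever partials arrive,
    // whether or not a combiner has touched them.
    static double finalAvg(List<Partial> partials) {
        Partial total = intermediate(partials);
        return total.sum / total.count;
    }
}
```

If the combiner instead emitted a finished average, a second combiner pass or a skipped one would feed the reduce the wrong kind of input, which is the situation this issue is meant to avoid.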
>>     
>
>   



