hadoop-pig-user mailing list archives

From Chris Olston <ols...@yahoo-inc.com>
Subject Re: Algebraic UDFs in Pig
Date Tue, 16 Dec 2008 19:23:34 GMT
You could have the abstract class do:

> exec {
>    initial();
>    final();
> }

so that by default the code paths are the same, and let subclasses override
that method if they want to do a specific performance optimization.
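
A minimal Java sketch of that idea, under stated assumptions: StagedFunc, Sum and Count are hypothetical names, not Pig's actual EvalFunc or Algebraic API, and the second stage is called finish() because "final" is a reserved word in Java. Sum relies on the inherited exec(), while Count overrides it as a performance shortcut.

```java
import java.util.List;

// Hypothetical base class, not Pig's real EvalFunc/Algebraic API.
abstract class StagedFunc<I, P, O> {
    // Stage that would run in the map: raw inputs -> partial result.
    abstract P initial(List<I> inputs);

    // Stage that would run in the reduce: partial results -> final result.
    // Named finish() because "final" is a reserved word in Java.
    abstract O finish(List<P> partials);

    // Default non-combinable path: run both stages back to back.
    O exec(List<I> inputs) {
        return finish(List.of(initial(inputs)));
    }
}

// Uses the inherited exec(), so there is only one code path to maintain.
class Sum extends StagedFunc<Long, Long, Long> {
    @Override Long initial(List<Long> inputs) { return total(inputs); }
    @Override Long finish(List<Long> partials) { return total(partials); }

    static long total(List<Long> values) {
        long sum = 0;
        for (long v : values) sum += v;
        return sum;
    }
}

// Overrides exec() to skip building the intermediate single-element list.
class Count extends StagedFunc<Object, Long, Long> {
    @Override Long initial(List<Object> inputs) { return (long) inputs.size(); }
    @Override Long finish(List<Long> partials) { return Sum.total(partials); }
    @Override Long exec(List<Object> inputs) { return (long) inputs.size(); }
}
```

With this shape, a writer who does not care about the optimization only implements the two stages, and one who does adds a single override.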

-Chris


On 12/16/08 10:11 AM, "Alan Gates" <gates@yahoo-inc.com> wrote:

> +1 for option 1.  We definitely want to enable the performance
> optimizations.  And the Con listed for option 1 (two code paths to
> implement) is minimal in the cases where the writer isn't going
> to make performance optimizations, because exec() can be done as:
> 
> exec {
>    initial();
>    final();
> }
> 
> This is a very minor burden.
> 
> Alan.
> 
> On Dec 15, 2008, at 10:52 AM, Pradeep Kamath wrote:
> 
>> Hi,
>> 
>>   Currently the Algebraic interface allows a UDF writer to have an
>> Initial, Intermediate and Final class (each of which should implement
>> EvalFunc). The idea is that the UDF can be called in stages:
>> Initial.exec() in the map, Intermediate.exec() in the combiner and
>> Final.exec() in the reduce. The UDF (say COUNT) which implements
>> Algebraic also extends EvalFunc. This means that it has an exec()
>> method. Currently Pig calls this exec() method at the top level when
>> the UDF is not "combinable". When it is "combinable", Pig currently
>> calls Initial.exec() in the combine and Final.exec() in the reduce. I
>> will be changing the "combinable" case to call Initial.exec() in the
>> map, Intermediate.exec() in the combine and Final.exec() in the reduce
>> as part of https://issues.apache.org/jira/browse/PIG-563.
>> 
>> 
>> 
>> There are two options for the non-combinable case:
>> 
>> 1) The way it is described above: the top-level UDF's exec() is
>> called when the combiner is not used, and if the combiner is used,
>> Initial.exec() is called in the map, Intermediate.exec() in the
>> combine and Final.exec() in the reduce.
>> 
>> * Pros:
>> 
>> a. Initial.exec() can be optimized with the knowledge that it is
>> only called in the map. For example, in UDFs like COUNT, since
>> Initial.exec() is always going to be called in the map, the
>> implementation can be optimized to simply emit Integer 1.
>> 
>> * Cons:
>> 
>> a. The UDF writer potentially has to write two different code paths:
>> one where UDF.exec() computes the result completely in the reduce,
>> and another where Initial.exec(), Intermediate.exec() and Final.exec()
>> compute the result in parts in the map, combine and reduce
>> respectively.
>> 
>> 
>> 
>> 2) If a UDF implements Algebraic, Pig will have to guarantee that
>> Initial.exec() will be called and later Final.exec() will be called.
>> If the UDF is combinable, these will be called from the map and reduce
>> respectively, and Intermediate.exec() will be called from the combine.
>> If the UDF is NOT combinable, Initial.exec() will be called first in
>> the reduce, then its output will be put in a bag and supplied to a
>> call of Final.exec(). In both cases the top-level exec() of the UDF
>> will never be called.
>> 
>> * Pros:
>> 
>> a. The guarantee that Initial.exec() and Final.exec() are called
>> in both the combinable and non-combinable cases.
>> 
>> * Cons:
>> 
>> a. The UDF writer has to give a dummy implementation for UDF.exec()
>> to satisfy the EvalFunc interface even though UDF.exec() is never
>> called.
>> 
>> b. The UDF writer should make sure that Initial.exec() and
>> Final.exec() work in both the combinable and non-combinable cases.
>> 
>> c. There are performance penalties. In the combinable case,
>> Initial.exec() cannot be optimized since there is no guarantee that it
>> is always called in the map. In the non-combinable case, the call to
>> Initial.exec() will contain all input and hence the result can be
>> computed in that call itself. Despite this, Pig will have to take the
>> result of Initial.exec(), put it in a bag and call Final.exec(), which
>> can be highly inefficient.
>> 
>> 
>> 
>> I would vote for option 1 since it is much better from a performance
>> angle.
>> 
>> 
>> 
>> Please provide Comments/Suggestions on the proposal.
>> 
>> 
>> 
>> Thanks,
>> 
>> Pradeep
>> 
>> 
>> 
> 
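
To make the COUNT example from option 1 of the quoted proposal concrete, here is a rough Java simulation of the two code paths. Everything in it is an assumption for illustration: CountStages and its methods are hypothetical stand-ins rather than Pig classes, the map stage is assumed to see one record per call, and the reduce stage is named finalStage() because "final" is a reserved word in Java.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a COUNT UDF; not Pig's actual classes.
class CountStages {
    // Map stage, assumed to see one record per call: just emit 1.
    static long initial(Object record) { return 1L; }

    // Combine stage: add up the partial counts from one map task.
    static long intermediate(List<Long> partials) { return sum(partials); }

    // Reduce stage: add up the combined counts.
    static long finalStage(List<Long> partials) { return sum(partials); }

    // Top-level exec(): computes the whole result in one call.
    static long exec(List<?> records) { return records.size(); }

    static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) total += v;
        return total;
    }

    // Simulates the combinable path over records split across map tasks.
    static long runStaged(List<? extends List<?>> splits) {
        List<Long> combined = new ArrayList<>();
        for (List<?> split : splits) {
            List<Long> ones = new ArrayList<>();
            for (Object record : split) ones.add(initial(record));
            combined.add(intermediate(ones));
        }
        return finalStage(combined);
    }
}
```

Both paths should agree on the count for the same records; the staged path only differs in where the work happens.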

--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research




