hadoop-pig-dev mailing list archives

From Alan Gates <ga...@yahoo-inc.com>
Subject Making use of the combiner in the new pig pipeline
Date Wed, 16 Jul 2008 00:41:21 GMT
One of the changes we want to make in the new pig pipeline is to make 
much more aggressive use of the combiner.  In thinking through when we 
should use the combiner, I came up with the following.  The list is not 
exhaustive, but it includes common expressions and should be possible to 
implement within a week or so.  If you can think of other rules that fit 
this profile, please suggest them.

1) Filters. 

    a) If the predicate does not operate on any of the bags (that is, it
    only operates on the grouping key) then the filter will be relocated
    to the combiner phase.  For example:

    b = group a by $0;
    c = filter b by group != 'fred';
    ...

    In this case, operations subsequent to the filter could also be
    considered for pushing into the combiner.

    b) If it operates on the bags with an algebraic function, then a
    foreach with the initial function will be placed in the combiner
    phase and the filter in the reduce phase will be changed to use the
    final function.  For example:

    b = group a by $0;
    c = filter b by COUNT(a) > 0;
    ...
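To make case 1b concrete, here is a hedged Python sketch (not Pig's actual implementation) of how an algebraic COUNT can be split into an initial function run in the combiner and a final function run in the reducer, with the filter predicate evaluated only once the final count is known:

```python
from collections import defaultdict

def count_initial(bag):
    """Combiner side: collapse one map task's partial bag to a partial count."""
    return len(bag)

def count_final(partial_counts):
    """Reducer side: combine the partial counts into the true count."""
    return sum(partial_counts)

# Two map tasks each produce a partial bag per grouping key.
map_outputs = [
    {"fred": [(1,), (2,)], "joe": []},
    {"fred": [(3,)], "joe": []},
]

# Combiner phase: replace each bag with its initial (partial) count.
combined = [{k: count_initial(bag) for k, bag in out.items()}
            for out in map_outputs]

# Reduce phase: gather the partials per key, apply the final function,
# then evaluate the filter predicate COUNT(a) > 0 on the completed count.
partials = defaultdict(list)
for out in combined:
    for k, c in out.items():
        partials[k].append(c)

counts = {k: count_final(cs) for k, cs in partials.items()}
kept = {k: c for k, c in counts.items() if c > 0}
print(kept)  # {'fred': 3}
```

The point is that only the small partial counts, not the full bags, travel through the shuffle.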

2) Foreach. 

    a) If the foreach does not contain a nested plan and all UDFs in the
    generate statement are algebraic, then the foreach will be copied
    and placed in the combiner phase.  The version of the foreach in the
    combiner stage will use the initial function, and the version in the
    reduce stage will be changed to use the final function.  For
    example:

    b = group a by $0;
    c = foreach b generate group, group + 5, SUM(a.$1);
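A hedged sketch of case 2a (illustrative Python, not Pig's actual code): SUM is algebraic, so the foreach is duplicated, with the combiner running the initial function on each map's partial bag and the reducer running the final function over the partials:

```python
from collections import defaultdict

def sum_initial(bag):
    """Combiner: partial sum of column $1 within one map task's bag."""
    return sum(t[1] for t in bag)

def sum_final(partials):
    """Reducer: add the partial sums together."""
    return sum(partials)

map_outputs = [
    {0: [(0, 10), (0, 20)]},   # one map task's bag for key $0 == 0
    {0: [(0, 5)]},             # another map task's bag for the same key
]

# Combiner-side foreach: initial function replaces each bag.
partials = defaultdict(list)
for out in map_outputs:
    for key, bag in out.items():
        partials[key].append(sum_initial(bag))

# Reduce-side foreach: generate group, group + 5, final sum.
rows = [(key, key + 5, sum_final(ps)) for key, ps in partials.items()]
print(rows)  # [(0, 5, 35)]
```

Expressions on the grouping key alone, like `group + 5`, need no special handling and stay in the reduce-side foreach.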

    b) If the foreach has an inner plan that has a distinct and no
    filters, then it will be left as is in the reduce plan and a
    combiner plan will be created that runs the inner plan minus the
    generate on the tuples, thus creating the distinct portion of the
    data without applying the UDF.  For example:

    b = group a by $0;
    c = foreach b {
        c1 = distinct $1;
        generate group, COUNT(c1);
    }
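A hedged sketch of case 2b (illustrative Python, not Pig's actual code): the combiner runs only the distinct part of the inner plan, shrinking each bag to its unique tuples, while the reducer repeats the distinct across the merged partials and then applies the UDF once, in the generate:

```python
def combiner_distinct(bag):
    """Combiner: dedup one map task's bag; no UDF is applied here."""
    return sorted(set(bag))

# Two map tasks, both contributing tuples for the same group key.
partial_bags = [[(1,), (1,), (2,)], [(2,), (3,)]]

# Combiner output: each partial bag deduped locally.
combined = [combiner_distinct(b) for b in partial_bags]

# Reducer: merge the partials and dedup again, since duplicates can
# span map tasks, then generate group, COUNT(distinct tuples).
merged = sorted(set(t for bag in combined for t in bag))
count = len(merged)
print(count)  # 3
```

The second distinct in the reducer is what makes the early one safe: it catches duplicates that arrived from different map tasks.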

3) Distinct.  This will be converted to apply the distinct in the 
combiner as well as in the reducer.

4) Limit.  The limit will be applied in the combiner and again in the 
reducer.
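Cases 3 and 4 share the same justification: distinct and limit are idempotent under re-application, so running them early in the combiner shrinks the shuffle without changing the final answer. A hedged Python sketch of the limit case (assuming, as Pig's limit does, no ordering guarantee):

```python
LIMIT = 2

def apply_limit(records, n):
    """Keep at most n records; which n is unspecified without an order by."""
    return records[:n]

map_outputs = [[10, 20, 30], [40, 50]]

# Combiner: each map task's output is capped at the limit.
combined = [apply_limit(out, LIMIT) for out in map_outputs]

# Reducer: merge the capped streams and apply the limit once more.
merged = [x for out in combined for x in out]
final = apply_limit(merged, LIMIT)
print(final)  # [10, 20]
```

The reducer-side re-application is required because each combiner can only cap its own map's output; the merged stream may still exceed the limit.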

Alan.
