hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Daniel Dai" <dai...@gmail.com>
Subject PushUpProject optimization design
Date Mon, 03 Aug 2009 19:43:28 GMT
Pig need to prune unused columns as early as possible. Here is a design 
about how to implement it in Pig.

* Prune columns of loader, save time for record parsing

   a = load 'a' as (n1:chararray, n2:chararray, n3:chararray);
   b = foreach a generate n1, n2;
     => a = load 'a' (n1:chararray, n2:chararray)
We do not need to parse n3 in our loader.

* Prune columns across map-reduce boundary (between map-reduce jobs or 
inter map-reduce jobs), save bandwidth

        a = load 'a' as (n1:chararray, n2:chararray, n3:chararray);
        b = group a by n1;
        c = sort b by n2;
        d = foreach c generate n2, n3;

     => a = load 'a' as (n1:chararray, n2:chararray, n3:chararray);
        b = group a by n1;
        b1 = foreach b generate n2, n3;
        c = sort b1 by n2;
   d = foreach c generate n2, n3;
*.Prune column within map-reduce boundary does not seem to be helpful

        store a into 'a';
        b = filter a by n1='1';
        c = foreach b generate n2;
        dump c;

     => store a into 'a';
   a1 = foreach a generate n1, n2;
        b = filter a1 by n1='1';
        c = foreach b generate n2;
   dump c;

In this case, an extra foreach step is processed, but we gain no benefit.

Algorithm description:
1. Divide all logical operators into two categories: create map-reduce 
boundary and not create map-reduce boundary.

        boundary = true: LOCoGroup, LOCross, LOJoin, LODistinct, LOSort
        boundary = false: LOFilter, LOForEach, LODefine, LOLoad, 
LOStore, LOSplit, LOSplitOutput, LOStream, LOUnion
  LOJoin can be boundary or not, depends on the type of join

2. We collect required fields from the bottom, a reverse dependency 
order walker algorithm is required to do this

3. We do not actually start from the leaf. We start from the last 
LOForEach?. Only LOForEach? prune columns. If there is no LOForEach? in 
the script, then we cannot prune anything.

4. From a required output, we need an algorithm to figure required input

                                        <= require $0, $2, $3
        b = foreach a generate $0, $2+$3;
                                        <= require $0, $1

5. From the bottom LOForEach?, we collect required fields all the way 
up, if we move over a boundary operator, save the position because it is 
possible to put projection there

     => projection here?
        x = CoGroup .....
     => projection here?
        y = order ......

Put the projection right before boundary to make sure fewer data cross 
the boundary

However, we do not make this decision and do the actual prune now, we 
will do the actual pruning top down

6. While we traversing up, if we see operator containing more than one 
inputs, we trace required fields in all directions; We rely on the 
output schema of this operator to figure out which required fields 
belong to which input. If we see operator containing more than one 
outputs, we collects required fields until all outputs has been traced

7. If we see LOStream, LOStore, we stop

8. If we see LOLoad, we stop and set required fields in LOLoad

9. From LOLoad, we do a top down traverse to decide whether we need to 
put projection, and if yes, insert ForEach?

10. We only add projection if it is necessary. It is only necessary when 
the required fields of that boundary operator is more than output fields 
of operator before it.

        Filter ...... (output fields: n1, n2, n3)
                                            <= we can prune n3 here
        x = CoGroup .... (required fields: n1, n2)

11. It is possible that we create a foreach which can be combined into 
previous foreach, however, we do not handle it in PushUpProject rule

                      <= we will add a ForEach anyway here
        x = CoGroup .....

12. Everytime we insert a LOForEach?, we need to adjust the projection 
map all the way down

13. To fit the PushUpProject into current optimizor framework, we hook 
the check rule to LOForEach?. Everytime we start from LOForEach? and we 
never push up over another LOForEach?. So we stop at LOForEach? and save 
required fields upto this point.

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message