The first job processes any operators that come before "order by". If there
is nothing before "order by", there will be only 2 jobs.
If your script is:
a = load '1.txt' as (a0:int, a1:int);
b = order a by $0;
Pig will insert a "foreach" after "load" to cast the fields to the declared
types, so you will have 3 jobs; the first job processes the "foreach".
If your script is:
a = load '1.txt';
b = order a by $0;
Then you only have two jobs.
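To check the job count for either script, you can ask Pig for the plan with "explain" and count the "MapReduce node" entries in the Map Reduce Plan section. This is a minimal sketch that assumes '1.txt' exists and has at least two fields. The aliases c and d are only there to keep the two variants apart, and the exact plan may differ between Pig versions.

```pig
-- Variant 1: typed schema. Pig adds a foreach with casts after the load,
-- so the plan is expected to show 3 MapReduce nodes.
a = load '1.txt' as (a0:int, a1:int);
b = order a by $0;
explain b;

-- Variant 2: no schema, so nothing sits between load and order by.
-- The plan is expected to show 2 MapReduce nodes (sample job + sort job).
c = load '1.txt';
d = order c by $0;
explain d;
```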
Daniel
On 05/04/2011 12:14 AM, Jeff Zhang wrote:
> Hi all,
>
> I find that an order by operation is split into three map reduce jobs in
> Pig, as shown below.
>
> As I understand it, two mapreduce jobs should be enough: the first is the
> sample job, and the second is the real sort job.
>
> But from the plan below I see that the first job is a trivial job which only
> converts the data into Pig's intermediate data format, and the next two jobs
> use its output as their input.
>
> I guess this may be a performance consideration (Pig's intermediate data
> format is much more compact). But I doubt whether three mapreduce jobs
> perform better than two.
>
> Has anyone done such a comparison?
>
>
>
> #--------------------------------------------------
> # Map Reduce Plan
> #--------------------------------------------------
> MapReduce node 1-22
> Map Plan
> Store(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.builtin.BinStorage) - 1-23
> |
> |---New For Each(false,false)[bag] - 1-18
> | |
> | Cast[chararray] - 1-15
> | |
> | |---Project[bytearray][0] - 1-14
> | |
> | Cast[int] - 1-17
> | |
> | |---Project[bytearray][1] - 1-16
> |
> |---Load(hdfs://srwaishdc1nn0001/apps/sq/jianfezhang/mobius_outputs/hadoop-out49:PigStorage) - 1-13--------
> Global sort: false
> ----------------
>
> MapReduce node 1-25
> Map Plan
> Local Rearrange[tuple]{tuple}(false) - 1-29
> | |
> | Constant(all) - 1-28
> |
> |---New For Each(true)[tuple] - 1-27
> | |
> | Project[int][1] - 1-26
> |
> |---Load(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.impl.builtin.RandomSampleLoader('org.apache.pig.builtin.BinStorage','100')) - 1-24--------
> Reduce Plan
> Store(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp1146107855:org.apache.pig.builtin.BinStorage) - 1-38
> |
> |---New For Each(false)[tuple] - 1-37
> | |
> | POUserFunc(org.apache.pig.impl.builtin.FindQuantiles)[tuple] - 1-36
> | |
> | |---Project[tuple][*] - 1-35
> |
> |---New For Each(false,false)[tuple] - 1-34
> | |
> | Constant(444) - 1-33
> | |
> | RelationToExpressionProject[bag][*] - 1-45
> | |
> | |---Project[tuple][1] - 1-31
> |
> |---Package[tuple]{chararray} - 1-30--------
> Global sort: false
> Secondary sort: true
> ----------------
>
> MapReduce node 1-40
> Map Plan
> Local Rearrange[tuple]{int}(false) - 1-41
> | |
> | Project[int][1] - 1-19
> |
> |---Load(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.builtin.BinStorage) - 1-39--------
> Reduce Plan
> Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-21
> |
> |---New For Each(true)[tuple] - 1-44
> | |
> | Project[bag][1] - 1-43
> |
> |---Package[tuple]{int} - 1-42--------
> Global sort: true
> Quantile file: hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp1146107855
> ----------------
>
>