pig-dev mailing list archives

From Daniel Dai <jiany...@yahoo-inc.com>
Subject Re: Why order by operation is split into three map reduce jobs ?
Date Wed, 04 May 2011 23:47:40 GMT
The first job processes any operators that come before the "order by". If
there is nothing before the "order by", there will be only 2 jobs.

If your script is:

a = load '1.txt' as (a0:int, a1:int);
b = order a by $0;

Pig will insert a "foreach" after the "load" to cast fields to the declared
types, so you will have 3 jobs; the first job will process that "foreach".
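
Conceptually, that implicit "foreach" just applies the casts from the load
schema. A rough hand-written equivalent (a sketch only; the alias names here
are just for illustration) would be:

a_raw = load '1.txt';
a = foreach a_raw generate (int)$0 as a0, (int)$1 as a1; -- the casts the schema implies
b = order a by a0;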

If your script is:
a = load '1.txt';
b = order a by $0;

Then you only have two jobs.
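
You can verify the number of jobs yourself with "explain", which prints a
MapReduce plan like the one quoted below (a quick sketch; the temp paths and
node numbers in your output will differ):

a = load '1.txt';
b = order a by $0;
explain b; -- count the "MapReduce node" entries: here just the sampling job and the sort job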

Daniel

On 05/04/2011 12:14 AM, Jeff Zhang wrote:
> Hi all,
>
> I find that an order by operation is split into three map reduce jobs in
> Pig, as shown below.
>
> As I understand it, two mapreduce jobs should be enough: the first is the
> sampling job, and the second is the real sort job.
>
> But from the plan below I see that the first job is a trivial job which only
> converts the data into Pig's intermediate data format, and the next two jobs
> use its output as their input.
>
> I guess maybe this is a performance consideration (Pig's intermediate data
> format is much more compact), but I doubt whether three mapreduce jobs
> perform better than two.
>
> Has anyone done such a comparison?
>
>
>
> #--------------------------------------------------
> # Map Reduce Plan
> #--------------------------------------------------
> MapReduce node 1-22
> Map Plan
> Store(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.builtin.BinStorage) - 1-23
> |
> |---New For Each(false,false)[bag] - 1-18
>      |   |
>      |   Cast[chararray] - 1-15
>      |   |
>      |   |---Project[bytearray][0] - 1-14
>      |   |
>      |   Cast[int] - 1-17
>      |   |
>      |   |---Project[bytearray][1] - 1-16
>      |
>
> |---Load(hdfs://srwaishdc1nn0001/apps/sq/jianfezhang/mobius_outputs/hadoop-out49:PigStorage) - 1-13--------
> Global sort: false
> ----------------
>
> MapReduce node 1-25
> Map Plan
> Local Rearrange[tuple]{tuple}(false) - 1-29
> |   |
> |   Constant(all) - 1-28
> |
> |---New For Each(true)[tuple] - 1-27
>      |   |
>      |   Project[int][1] - 1-26
>      |
>
> |---Load(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.impl.builtin.RandomSampleLoader('org.apache.pig.builtin.BinStorage','100')) - 1-24--------
> Reduce Plan
> Store(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp1146107855:org.apache.pig.builtin.BinStorage) - 1-38
> |
> |---New For Each(false)[tuple] - 1-37
>      |   |
>      |   POUserFunc(org.apache.pig.impl.builtin.FindQuantiles)[tuple] - 1-36
>      |   |
>      |   |---Project[tuple][*] - 1-35
>      |
>      |---New For Each(false,false)[tuple] - 1-34
>          |   |
>          |   Constant(444) - 1-33
>          |   |
>          |   RelationToExpressionProject[bag][*] - 1-45
>          |   |
>          |   |---Project[tuple][1] - 1-31
>          |
>          |---Package[tuple]{chararray} - 1-30--------
> Global sort: false
> Secondary sort: true
> ----------------
>
> MapReduce node 1-40
> Map Plan
> Local Rearrange[tuple]{int}(false) - 1-41
> |   |
> |   Project[int][1] - 1-19
> |
> |---Load(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.builtin.BinStorage) - 1-39--------
> Reduce Plan
> Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-21
> |
> |---New For Each(true)[tuple] - 1-44
>      |   |
>      |   Project[bag][1] - 1-43
>      |
>      |---Package[tuple]{int} - 1-42--------
> Global sort: true
> Quantile file: hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp1146107855
> ----------------
>
>

