hadoop-pig-dev mailing list archives

From Daniel Dai <jiany...@yahoo-inc.com>
Subject Re: split operator
Date Mon, 23 Aug 2010 23:50:34 GMT
Hi, Gang,
Yes, that is what MultiQueryOptimizer addresses. The split step breaks
the script into smaller, combinable pieces, and MultiQueryOptimizer then
combines as many splitters and splittees as possible into the same
map-reduce job. So after SplitInserter you might see more jobs, but after
MultiQueryOptimizer you will end up with fewer. The algorithm for
MultiQueryOptimizer is: for every splitter, find as many combinable
splittees as possible and combine them into the same map-reduce job.
You can find more details at
http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification

Daniel
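
The merge rule described above ("for every splitter, find as many combinable splittees as possible") can be sketched as a toy model. This is only an illustration under stated assumptions, not Pig's actual MultiQueryOptimizer code: the `Job` class, the `combinable` flag, and `merge_splittees` are all hypothetical names, and real combinability depends on the shape of each job's plan rather than on a boolean.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Job:
    """Toy stand-in for one map-reduce job (hypothetical, not a Pig class)."""
    name: str
    reads: str                # path the job loads from
    writes: str               # path the job stores to
    combinable: bool = True   # assumption: mergeability is a simple flag
    nested: List["Job"] = field(default_factory=list)  # plans merged into this job


def merge_splittees(jobs: List[Job]) -> List[Job]:
    """For every splitter, fold as many combinable splittees as possible into it."""
    # Index jobs by the path they read, so a splitter can find its splittees.
    readers: Dict[str, List[Job]] = {}
    for job in jobs:
        readers.setdefault(job.reads, []).append(job)

    merged = set()
    for splitter in jobs:
        splittees = readers.get(splitter.writes, [])
        if len(splittees) < 2:
            continue  # output has fewer than two readers: not a split
        for splittee in splittees:
            if splittee.combinable:
                splitter.nested.append(splittee)
                merged.add(splittee.name)

    # Jobs that were folded into a splitter no longer run on their own.
    return [job for job in jobs if job.name not in merged]


# One splitter writing a temporary file that three branches read.
jobs = [
    Job("split", reads="input", writes="tmp"),
    Job("branch1", reads="tmp", writes="out1"),
    Job("branch2", reads="tmp", writes="out2"),
    Job("branch3", reads="tmp", writes="out3", combinable=False),
]
remaining = merge_splittees(jobs)
print([job.name for job in remaining])
print([job.name for job in remaining[0].nested])
```

In this toy run the four jobs shrink to two: `split` carries `branch1` and `branch2` as nested plans, while `branch3` stays a separate job because it was marked as not combinable.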

Gang Luo wrote:
> Hi Daniel,
> This is a question from long ago, but I have suddenly come up with some more
> thoughts on it. In a query as simple as this:
>
> A = LOAD 'input';
> B = FILTER A BY $1 == 1;
> C = COGROUP A BY $0, B BY $0;
>
> the optimizer will insert a split operator to reuse A. According to the source
> code, a map-reduce job ends when it sees the split and writes the result to
> A1 and A2, which are used by two subsequent jobs to process B and C. In this
> case, the first job does nothing meaningful but copy the source 'input' twice. Is
> there some optimization applied here (like the MultiQueryOptimizer you mentioned
> previously)? How?
>
> Since I haven't looked at the MultiQueryOptimizer, it would be a great help if
> you could briefly describe how MultiQueryOptimizer works. Thanks a lot.
>
> -Gang
>
> ----- Original Message ----
> From: Daniel Dai <jianyong@yahoo-inc.com>
> To: "pig-dev@hadoop.apache.org" <pig-dev@hadoop.apache.org>
> Sent: 2010/7/26 (Mon) 4:58:49 PM
> Subject: Re: split operator
>
> Hi, Gang,
> It is about multiquery optimization. In MRCompiler, we create a
> map-reduce boundary for each split; later, in MultiQueryOptimizer, we
> merge several splits into one map-reduce job. In this map-reduce job, we
> nest several split plans.
>
> Daniel
>
> Gang Luo wrote:
>> Hi Daniel,
>> In 4.3.1, the example and Figure 6 show this. The last paragraph of 5.1 says the
>> split operator maintains a one-tuple buffer for each branch and talks about how
>> to synchronize multiple branches. I do think that is the in-memory split.
>>
>> here is the paper: http://www.vldb.org/pvldb/2/vldb09-1074.pdf
>>
>>
>> -Gang
>>
>> ----- Original Message ----
>> From: Daniel Dai <jianyong@yahoo-inc.com>
>> To: "pig-dev@hadoop.apache.org" <pig-dev@hadoop.apache.org>
>> Sent: 2010/7/26 (Mon) 2:09:25 PM
>> Subject: Re: split operator
>>
>> Hi, Gang,
>> Which part of the paper are you talking about? We don't do in-memory split. We
>> dump the split result to a temporary file and start a new map-reduce job. Split
>> does create a map-reduce boundary (though that is not entirely true: the
>> multiquery optimizer may combine some of these jobs).
>>
>> Daniel
>>
>> Gang Luo wrote:
>>> Hi all,
>>> According to the VLDB '09 paper, the split operator and all its subsequent
>>> operators reside in memory without any blocking in between. However, the source
>>> code (version 0.7) shows that an MR job actually ends when it meets the split
>>> operator, and multiple new MR jobs are created, each representing one branch.
>>> This write-once-read-multiple-times method is different from the in-memory
>>> method mentioned in that paper. Did Pig change the strategy for split, or is
>>> there still an in-memory version of split I didn't discover?
>>>
>>> Thanks,
>>> -Gang