hadoop-pig-dev mailing list archives

From Jianyong Dai <jiany...@yahoo-inc.com>
Subject Re: Request for feedback: cost-based optimizer
Date Wed, 02 Sep 2009 18:06:01 GMT
Yes, physical properties are important for an optimizer. To optimize Pig
well, we need to know the underlying Hadoop execution environment, such
as the number of map-reduce jobs, how many maps/reducers there are, how
the job is configured, etc. This is true even for a rule-based optimizer.
Unfortunately, the physical layer does not provide much physical
information, despite what the name suggests. Basically, the physical
layer is a restatement of the logical layer using physical operators.
Compared to logical operators, physical operators include the
implementation of pipeline processing but strip away many logical
details such as the schema. Also, in the logical layer we have
infrastructure to restructure logical operators (move nodes around, swap
nodes, etc.), which does not exist in the physical layer. From the
optimizer's point of view, the physical layer does not give the
necessary information and is harder to deal with. If you would like to
work with physical details, I think the map-reduce layer is the right
place to look. However, restructuring the map-reduce layer is hard
because we do not have all the infrastructure to move things around.
Another approach is to use a combined logical layer and map-reduce layer
for the optimization. In this approach, you restructure the logical
layer by observing the physical details from the map-reduce layer. The
downside is that we have to tightly couple Pig to Hadoop. But since Pig
is now a subproject of Hadoop and almost all Pig users are using Hadoop,
I think it is fine to optimize things towards Hadoop.
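
To make the combined approach a bit more concrete, here is a rough
sketch of ranking alternative logical plans by physical details observed
from the map-reduce layer. Everything in it is an assumption made for
illustration: the names (MRStats, Environment, estimate_cost, pick_plan),
the cost formula, and the numbers are hypothetical and are not Pig or
Hadoop APIs.

```python
from dataclasses import dataclass

# Hypothetical sketch: none of these names exist in Pig or Hadoop.


@dataclass(frozen=True)
class MRStats:
    """Physical details observed from the map-reduce layer for one plan."""
    num_jobs: int        # number of map-reduce jobs the plan compiles to
    shuffle_bytes: int   # bytes moved from maps to reducers, all jobs
    num_reducers: int    # reducers configured per job


@dataclass(frozen=True)
class Environment:
    """Assumed cost parameters of the execution environment."""
    job_startup_cost: float       # fixed overhead per map-reduce job
    cost_per_shuffle_byte: float  # cost of shuffling one byte


def estimate_cost(stats, env):
    """Toy cost model: per-job startup plus shuffle spread over reducers."""
    startup = stats.num_jobs * env.job_startup_cost
    shuffle = (stats.shuffle_bytes * env.cost_per_shuffle_byte
               / max(1, stats.num_reducers))
    return startup + shuffle


def pick_plan(candidates, env):
    """Return the name of the cheapest candidate logical plan.

    candidates maps a logical plan name to the MRStats observed after
    compiling that plan down to the map-reduce layer.
    """
    return min(candidates, key=lambda name: estimate_cost(candidates[name], env))


# Two logically equivalent plans with different physical footprints.
candidates = {
    "one_job_big_shuffle": MRStats(num_jobs=1, shuffle_bytes=50_000_000, num_reducers=10),
    "two_jobs_small_shuffle": MRStats(num_jobs=2, shuffle_bytes=1_000_000, num_reducers=10),
}

# Assumed numbers: job startup is expensive on a cluster, cheap locally.
cluster = Environment(job_startup_cost=30.0, cost_per_shuffle_byte=1e-6)
local = Environment(job_startup_cost=0.1, cost_per_shuffle_byte=1e-6)

best_on_cluster = pick_plan(candidates, cluster)
best_locally = pick_plan(candidates, local)
```

With these made-up numbers, the single-job plan wins on the cluster
because job startup dominates, while the two-job plan wins locally. The
restructuring itself would still happen in the logical layer; the
map-reduce layer only supplies the numbers.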


Dmitriy Ryaboy wrote:
> Our initial survey of related literature showed that the usual place
> for a CBO tends to be between the physical and logical layer (in fact,
> the famous Cascades paper advocates removing the distinction between
> physical and logical operators altogether, and using an "is_logical"
> and "is_physical" flag instead -- meaning an operator can be one,
> both, or neither).
>
> The reasoning is that you cannot properly determine a cost of a plan
> if you don't know the physical "properties" of the operators that
> implement it. An optimizer that works at a logical layer would by
> definition create the same plan whether in local or mapreduce mode
> (since such differences are abstracted from it). This is clearly
> incorrect, as the properties of the environment in which these plans
> are executed are drastically different.  Working at the physical layer
> lets us stay close to the iron and adjust based on the specifics of
> the execution environment.
>
> Certainly one can posit a framework for a CBO that would set up the
> necessary interfaces and plumbing for optimizing in any execution
> mode, and invoke the proper implementations at run time; we are not
> discounting that possibility (haven't gotten quite that far in the
> design, to be honest).  But we feel that the implementations have to
> be execution mode specific.
>
> -Dmitriy
>
> On Tue, Sep 1, 2009 at 6:26 PM, Jianyong Dai<jianyong@yahoo-inc.com> wrote:
>   
>> I am still reading, but one interesting question is why you decided to
>> put the CBO in the physical layer.
>>
>> Dmitriy Ryaboy wrote:
>>     
>>> Whoops :-)
>>> Here's the Google doc:
>>>
>>> http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdA&hl=en
>>>
>>> -Dmitriy
>>>
>>> On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan<sms@yahoo-inc.com>
>>> wrote:
>>>
>>>       
>>>> Dmitriy and Gang,
>>>>
>>>> The mailing list does not allow attachments. Can you post it on a
>>>> website and just send the URL ?
>>>>
>>>> Thanks,
>>>> Santhosh
>>>>
>>>> -----Original Message-----
>>>> From: Dmitriy Ryaboy [mailto:dvryaboy@gmail.com]
>>>> Sent: Tuesday, September 01, 2009 9:48 AM
>>>> To: pig-dev@hadoop.apache.org
>>>> Subject: Request for feedback: cost-based optimizer
>>>>
>>>> Hi everyone,
>>>> Attached is a (very) preliminary document outlining a rough design we
>>>> are proposing for a cost-based optimizer for Pig.
>>>> This is being done as a capstone project by three CMU Master's students
>>>> (myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not
>>>> necessarily meant for immediate incorporation into the Pig codebase,
>>>> although it would be nice if it, or parts of it, are found to be useful
>>>> in the mainline.
>>>>
>>>> We would love to get some feedback from the developer community
>>>> regarding the ideas expressed in the document, any concerns about the
>>>> design, suggestions for improvement, etc.
>>>>
>>>> Thanks,
>>>> Dmitriy, Ashutosh, Tejal
>>>>
>>>>
>>>>         
>>     

