pig-dev mailing list archives

From Dmitriy Ryaboy <dvrya...@cloudera.com>
Subject Re: Request for feedback: cost-based optimizer
Date Thu, 03 Sep 2009 13:53:35 GMT
Daniel, thanks for the information, this is useful.

On Wed, Sep 2, 2009 at 2:06 PM, Jianyong Dai<jianyong@yahoo-inc.com> wrote:
> Yes, physical properties are important for an optimizer. To optimize Pig
> well, we need to know the underlying Hadoop execution environment: the
> number of map-reduce jobs, how many maps and reducers there are, how each
> job is configured, etc. This is true even for a rule-based optimizer.
> Unfortunately, the physical layer does not provide as much physical
> information as its name suggests. Basically, the physical layer is a
> rephrasing of the logical layer using physical operators. Compared to
> logical operators, physical operators include the implementation of
> pipeline processing but strip away many logical details such as "schema".
> Also, in the logical layer we have infrastructure for restructuring
> logical operators (moving nodes around, swapping nodes, etc.), which does
> not exist in the physical layer. From the optimizer's point of view, the
> physical layer does not give the necessary information and is harder to
> deal with. If you would like to work with physical details, I think the
> map-reduce layer is the right place to look. However, restructuring the
> map-reduce layer is hard because we do not have all the infrastructure to
> move things around. Another approach is to use a combined logical layer
> and map-reduce layer for the optimization. In this approach, you
> restructure the logical layer by observing the physical details from the
> map-reduce layer. The downside is that we have to tightly couple Pig to
> Hadoop. But now that Pig is a subproject of Hadoop and almost all Pig
> users are using Hadoop, I think it is fine to optimize things towards
> Hadoop.
> Dmitriy Ryaboy wrote:
>> Our initial survey of related literature showed that the usual place
>> for a CBO tends to be between the physical and logical layers (in fact,
>> the famous Cascades paper advocates removing the distinction between
>> physical and logical operators altogether, and using an "is_logical"
>> and "is_physical" flag instead -- meaning an operator can be one,
>> both, or neither).
>> The reasoning is that you cannot properly determine the cost of a plan
>> if you don't know the physical "properties" of the operators that
>> implement it. An optimizer that works at a logical layer would by
>> definition create the same plan whether in local or mapreduce mode
>> (since such differences are abstracted from it). This is clearly
>> incorrect, as the properties of the environment in which these plans
>> are executed are drastically different.  Working at the physical layer
>> lets us stay close to the iron and adjust based on the specifics of
>> the execution environment.
>> Certainly one can posit a framework for a CBO that would set up the
>> necessary interfaces and plumbing for optimizing in any execution
>> mode, and invoke the proper implementations at run time; we are not
>> discounting that possibility (haven't gotten quite that far in the
>> design, to be honest).  But we feel that the implementations have to
>> be execution mode specific.
>> -Dmitriy
>> On Tue, Sep 1, 2009 at 6:26 PM, Jianyong Dai<jianyong@yahoo-inc.com>
>> wrote:
>>> I am still reading, but one interesting question is: why did you decide
>>> to put the CBO in the physical layer?
>>> Dmitriy Ryaboy wrote:
>>>> Whoops :-)
>>>> Here's the Google doc:
>>>> http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdA&hl=en
>>>> -Dmitriy
>>>> On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan<sms@yahoo-inc.com>
>>>> wrote:
>>>>> Dmitriy and Gang,
>>>>> The mailing list does not allow attachments. Can you post it on a
>>>>> website and just send the URL?
>>>>> Thanks,
>>>>> Santhosh
>>>>> -----Original Message-----
>>>>> From: Dmitriy Ryaboy [mailto:dvryaboy@gmail.com]
>>>>> Sent: Tuesday, September 01, 2009 9:48 AM
>>>>> To: pig-dev@hadoop.apache.org
>>>>> Subject: Request for feedback: cost-based optimizer
>>>>> Hi everyone,
>>>>> Attached is a (very) preliminary document outlining a rough design we
>>>>> are proposing for a cost-based optimizer for Pig.
>>>>> This is being done as a capstone project by three CMU Master's students
>>>>> (myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not
>>>>> necessarily meant for immediate incorporation into the Pig codebase,
>>>>> although it would be nice if it, or parts of it, are found to be useful
>>>>> in the mainline.
>>>>> We would love to get some feedback from the developer community
>>>>> regarding the ideas expressed in the document, any concerns about the
>>>>> design, suggestions for improvement, etc.
>>>>> Thanks,
>>>>> Dmitriy, Ashutosh, Tejal
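
Editorial sketch (not part of the original thread): the idea discussed above, a common optimizer framework whose cost implementations are specific to the execution mode, might look roughly like the following. This is only an illustration under stated assumptions. Every name here (CandidatePlan, CostModel, LocalCostModel, MapReduceCostModel, choose_plan) is hypothetical and is not Pig's API, and the cost formulas and numbers are invented purely to show that the same candidates can rank differently in local and mapreduce modes.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidatePlan:
    """Hypothetical candidate plan, described only by rough estimates."""
    name: str
    rows_processed: int   # estimated rows touched across all operators
    mapreduce_jobs: int   # number of map-reduce jobs the plan compiles to


class CostModel(ABC):
    """Shared interface; each execution mode supplies its own costing."""

    @abstractmethod
    def cost(self, plan: CandidatePlan) -> float:
        ...


class LocalCostModel(CostModel):
    """Assumption: in local mode only the amount of data processed matters."""

    def cost(self, plan: CandidatePlan) -> float:
        return float(plan.rows_processed)


class MapReduceCostModel(CostModel):
    """Assumption: each map-reduce job adds a fixed startup overhead."""

    def __init__(self, job_startup_cost: float) -> None:
        self.job_startup_cost = job_startup_cost

    def cost(self, plan: CandidatePlan) -> float:
        return plan.rows_processed + self.job_startup_cost * plan.mapreduce_jobs


def choose_plan(candidates: list[CandidatePlan], model: CostModel) -> CandidatePlan:
    """Mode-independent plumbing: pick the cheapest plan under the given model."""
    return min(candidates, key=model.cost)


# Invented numbers, for illustration only.
candidates = [
    CandidatePlan("single_job", rows_processed=1_000_000, mapreduce_jobs=1),
    CandidatePlan("fewer_rows", rows_processed=600_000, mapreduce_jobs=3),
]

local_choice = choose_plan(candidates, LocalCostModel())
mapreduce_choice = choose_plan(candidates, MapReduceCostModel(job_startup_cost=500_000))
```

With these made-up estimates, the local model prefers "fewer_rows" while the map-reduce model prefers "single_job", because the per-job overhead outweighs the savings in rows processed. The selection logic in choose_plan stays the same in both modes; only the cost model is swapped.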
