hadoop-pig-dev mailing list archives

From Chris Olston <ols...@yahoo-inc.com>
Subject Re: Case for Optimization
Date Thu, 13 Dec 2007 23:13:51 GMT
For operator order we already have a "hint" mechanism: the order in  
which you give the statements!

For choosing a physical join algorithm, we could add a keyword along  
the lines of

X = join A by $0, B by $1 using fragment-and-replicate strategy;

(We have to implement fragment-and-replicate join first :)
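
For concreteness, a fuller version might read something like this (the
relation names are made up, and the clause is hypothetical until the
strategy exists):

  A = load 'big_clicks' using PigStorage(',');   -- large input; stays fragmented across nodes
  B = load 'small_users' using PigStorage(',');  -- small input; would be replicated to every fragment of A
  X = join A by $0, B by $1 using fragment-and-replicate strategy;
  store X into 'joined';

The point of the strategy being that the small relation is shipped in
full to each fragment of the large one, so the large relation never has
to be repartitioned.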

-Chris


On Dec 13, 2007, at 11:01 AM, Amir Youssefi wrote:

> For starters, we can give the ability to manually specify (or suggest)
> the execution plan, i.e. adding hints. The next step would be for Pig
> to guess optimizations.
>
> Cardinality hints can be used to supply statistics manually.
>
> Another example I ran into was the type of join. Pig keeps one set of
> data in memory, which is good for a nested-loop join. We can hint to
> Pig how we think it should do the join, depending on the relative data
> sizes, etc.
>
> Perhaps we can let the user provide optional metadata and statistics
> (at least cardinality).
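
To make the hint idea concrete, one could imagine something along these
lines (the "declare cardinality" statement is invented syntax, purely
for illustration):

  A = load 'clicks' using PigStorage(',');
  B = load 'users' using PigStorage(',');
  declare cardinality A 1000000000;  -- hypothetical statement, not real Pig Latin
  declare cardinality B 100000;      -- hypothetical statement, not real Pig Latin
  X = join A by $0, B by $1;         -- given such hints, Pig could keep the smaller input (B) in memory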
>
> On the theoretical/academic side, there seems to be a lot ahead of us
> on the optimization front. I'm looking forward to contributing on that
> side.
>
> -Amir
>
> -----Original Message-----
> From: Chris Olston [mailto:olston@yahoo-inc.com]
> Sent: Wednesday, December 12, 2007 8:56 PM
> To: pig-dev@incubator.apache.org
> Subject: Re: Case for Optimization
>
> Yup. It would be great to sprinkle a little relational query
> optimization technology onto Pig.
>
> Given that query optimization is a double-edged sword, we might want
> to consider some guidelines of the form:
>
> 1. Optimizations should always be easy to override by the user.
> (Sometimes the system is smarter than the user, but other times the
> reverse is true, and that can be incredibly frustrating.)
>
> 2. Only "safe" optimizations should be performed, where a safe
> optimization is one that with 95% probability doesn't make the
> program slower. (An example is pushing filters before joins, given
> that the filter is known to be cheap; if the filter has a user-
> defined function it is not guaranteed to be cheap.) Or perhaps there
> is a knob that controls worst-case versus expected-case minimization.
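
To make that concrete, the kind of rewrite in question is (relation and
field names made up):

  A  = load 'events' using PigStorage(',');
  B  = load 'users' using PigStorage(',');
  -- as written: join first, then filter
  J  = join A by $0, B by $0;
  X1 = filter J by $2 > 100;
  -- equivalent, with the (cheap) filter pushed below the join
  A2 = filter A by $2 > 100;
  X2 = join A2 by $0, B by $0;

Both versions produce the same result, but the second one joins far
less data, provided evaluating the filter itself is cheap.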
>
> We're at a severe disadvantage relative to relational query engines,
> because at the moment we have zero metadata. We don't even know the
> schema of our data sets, much less the distributions of data values
> (which in turn govern intermediate data sizes between operators). We
> have to think about how to approach this in a way that is compatible
> with the Pig philosophy of having metadata always be optional. It
> could be as
> simple as (fine, if the user doesn't want to "register" his data with
> Pig, then Pig won't be able to optimize programs over that data very
> well), or as sophisticated as on-line sampling and/or on-line
> operator reordering.
>
> -Chris
>
>
> On Dec 12, 2007, at 7:10 PM, Amir Youssefi wrote:
>
>> Comparing two Pig scripts, join+filter and filter+join, I see that
>> Pig has an optimization opportunity: first apply the filter
>> constraints, then do the actual join. Do we have a JIRA open for this
>> (or other optimization scenarios)?
>>
>>
>>
>> In my case, the first one resulted in an OutOfMemory exception, but
>> the second one runs just fine.
>>
>>
>>
>> -Amir
>>
>
> --
> Christopher Olston, Ph.D.
> Sr. Research Scientist
> Yahoo! Research
>
>
>

--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research


