hadoop-pig-dev mailing list archives

From "Amir Youssefi" <am...@yahoo-inc.com>
Subject RE: Case for Optimization
Date Thu, 13 Dec 2007 19:01:44 GMT
For starters, we could give users the ability to manually specify (or suggest)
an execution plan, i.e. by adding hints. The next step would be for Pig to
infer optimizations on its own.

Cardinality hints could serve as a way of manually supplying statistics.

Another example I ran into is the choice of join type. Pig keeps one set of
data in memory, which is good for a nested-loop join. We could hint to Pig
how we think it should perform the join, depending on the relative sizes of
the data sets, etc.
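For instance, such a join hint might look something like the following sketch (the USING clause, relation names, and fields here are illustrative, not syntax Pig supports today):

```pig
-- Hypothetical hint: tell Pig the second relation is small enough to
-- hold in memory, so it can replicate it to every map task instead of
-- shuffling both inputs.
big   = LOAD 'page_views' AS (user, url);
small = LOAD 'users' AS (user, age);
joined = JOIN big BY user, small BY user USING 'replicated';
```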

Perhaps we could let the user provide optional metadata and statistics (at
least cardinality).

On the theoretical/academic side, there seems to be a lot ahead of us on the
optimization front. I'm looking forward to contributing there.

-Amir

-----Original Message-----
From: Chris Olston [mailto:olston@yahoo-inc.com] 
Sent: Wednesday, December 12, 2007 8:56 PM
To: pig-dev@incubator.apache.org
Subject: Re: Case for Optimization

Yup. It would be great to sprinkle a little relational query  
optimization technology onto Pig.

Given that query optimization is a double-edged sword, we might want  
to consider some guidelines of the form:

1. Optimizations should always be easy to override by the user.  
(Sometimes the system is smarter than the user, but other times the  
reverse is true, and that can be incredibly frustrating.)

2. Only "safe" optimizations should be performed, where a safe  
optimization is one that with 95% probability doesn't make the  
program slower. (An example is pushing filters before joins, given  
that the filter is known to be cheap; if the filter has a user- 
defined function it is not guaranteed to be cheap.) Or perhaps there  
is a knob that controls worst-case versus expected-case minimization.
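As a concrete Pig sketch of that filter-pushdown rewrite (relation and field
names are made up for illustration):

```pig
-- Before: join first, then filter -- the join sees all of the data.
views = LOAD 'page_views' AS (user, url);
users = LOAD 'users' AS (user, age);
joined = JOIN views BY user, users BY user;
adults = FILTER joined BY age > 18;

-- After (the "safe" rewrite, assuming the filter predicate is cheap):
-- filter first, so the join processes a smaller input.
adult_users = FILTER users BY age > 18;
joined2     = JOIN views BY user, adult_users BY user;
```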

We're at a severe disadvantage relative to relational query engines,  
because at the moment we have zero metadata. We don't even know the  
schema of our data sets, much less the distributions of data values  
(which in turn govern intermediate data sizes between operators). We  
have to think about how to approach this in a way that is compatible with the  
Pig philosophy of having metadata always be optional. It could be as  
simple as (fine, if the user doesn't want to "register" his data with  
Pig, then Pig won't be able to optimize programs over that data very  
well), or as sophisticated as on-line sampling and/or on-line  
operator reordering.

-Chris


On Dec 12, 2007, at 7:10 PM, Amir Youssefi wrote:

> Comparing two Pig scripts, join+filter and filter+join, I see that Pig
> has an optimization opportunity: first apply the filter constraints,
> then do the actual join. Do we have a JIRA open for this (or other
> optimization scenarios)?
>
>
>
> In my case, the first one resulted in an OutOfMemory exception, but
> the second one ran just fine.
>
>
>
> -Amir
>

--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research



