hadoop-pig-dev mailing list archives

From "Pi Song (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-223) Optimization Idea: Dynamic histogram generation for join ordering?
Date Sun, 04 May 2008 00:04:55 GMT

    [ https://issues.apache.org/jira/browse/PIG-223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594044#action_12594044 ]

Pi Song commented on PIG-223:

That's right. Collecting metadata will help a lot. There are two cases:
1. User-directed metadata creation. This is like creating indexes in an RDBMS.
2. Dynamic metadata creation. This may happen as part of optimization when a user runs an
ad hoc query.

> Optimization Idea: Dynamic histogram generation for join ordering?
> ------------------------------------------------------------------
>                 Key: PIG-223
>                 URL: https://issues.apache.org/jira/browse/PIG-223
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Pi Song
> This idea sprang to mind when I was implementing explicit cast insertion for
> type checking.
> Problem:
> Given a query containing 3 or more joins, what is the most efficient join order? (Pig
> has no indexing feature, so statistics are not available.)
> Solution:
> 0. Start with a given plan 
> 1. Somehow select the first join (this is still an open question).
> 2. Insert a histogram generator for the columns used in the remaining joins into the first MapReduce job.
> 3. Run the MapReduce job.
> 4. Use the histogram information generated in step 2 to order the joins for the rest of the plan.
> 5. Run further MapReduce jobs until the plan finishes.
> There is another open question regarding histograms for joins based on calculated columns.
> In this case, calculating the histogram up front might conflict with the conventional optimization
> technique of "pulling filters up and pushing calculations down".
> I'm not sure how useful this is, because I have never come across a query with 3 joins myself.
> Any opinion?
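
As a rough illustration of steps 2 and 4 in the quoted proposal, here is a minimal Python sketch, not Pig code. It assumes exact value counts are collected for the join columns of the first join's output, and that comparable counts for the other relations are available from somewhere. It then sorts the remaining joins by estimated output size, smallest first. All names (build_histogram, estimate_join_size, order_joins, and the sample relations) are hypothetical, and the greedy smallest-first rule is only one possible ordering heuristic.

```python
from collections import Counter


def build_histogram(rows, column):
    """Count how often each value of `column` occurs in `rows`."""
    return Counter(row[column] for row in rows)


def estimate_join_size(left_hist, right_hist):
    """Row count of an equi-join, given exact value counts for both sides."""
    return sum(n * right_hist.get(key, 0) for key, n in left_hist.items())


def order_joins(first_join_hists, remaining_joins):
    """Sort the remaining joins by estimated output size, smallest first."""
    return sorted(
        remaining_joins,
        key=lambda join: estimate_join_size(
            first_join_hists[join["column"]], join["hist"]
        ),
    )


# Hypothetical output of the first join (the result of step 3).
first_join_output = [
    {"user_id": 1, "region": "eu"},
    {"user_id": 1, "region": "eu"},
    {"user_id": 2, "region": "us"},
    {"user_id": 3, "region": "us"},
]

# Step 2: histograms for the columns used in the remaining joins.
first_join_hists = {
    column: build_histogram(first_join_output, column)
    for column in ("user_id", "region")
}

# Assumed histograms for the join columns of the other relations.
remaining_joins = [
    {"name": "clicks", "column": "user_id", "hist": Counter({1: 100, 2: 50})},
    {"name": "regions", "column": "region", "hist": Counter({"eu": 1, "us": 1})},
]

# Step 4: order the remaining joins using the collected histograms.
plan = order_joins(first_join_hists, remaining_joins)
print([join["name"] for join in plan])  # ['regions', 'clicks']
```

In this sample the join with "regions" is estimated at 4 rows and the join with "clicks" at 250 rows, so "regions" is scheduled first. A real implementation would probably use bucketed or sampled histograms rather than exact counts, and would re-estimate after each join instead of sorting once.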

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
