hadoop-pig-dev mailing list archives

From "Pi Song (JIRA)" <j...@apache.org>
Subject [jira] Created: (PIG-223) Optimization Idea: Dynamic histogram generation for join ordering?
Date Tue, 29 Apr 2008 15:27:56 GMT
Optimization Idea: Dynamic histogram generation for join ordering?

                 Key: PIG-223
                 URL: https://issues.apache.org/jira/browse/PIG-223
             Project: Pig
          Issue Type: Improvement
            Reporter: Pi Song

This idea sprang into my mind when I was implementing explicit casting insertion for Type


Given a query containing 3 or more joins, what is the most efficient join order? (Pig doesn't
have an indexing feature, so no statistics are available.)


0. Start with a given plan.
1. Somehow select the first join (this is still an open question).
2. Insert a histogram generator into the first MapReduce run for the columns used in the remaining joins.
3. Run MapReduce.
4. Use the histogram information generated in (2) to order the joins for the rest of the plan.
5. Run further MapReduce jobs until the plan finishes.
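
Steps 2 to 4 could be sketched roughly as follows. This is only an illustration in plain Python, not Pig code: the names build_histogram, estimate_join_size and order_joins are hypothetical, the histogram is an exact frequency count, and the equi-join size estimate (sum over shared keys of the product of frequencies) is just one simple estimator assumed for the example.

```python
from collections import Counter


def build_histogram(rows, column):
    """Count how often each value of `column` occurs (exact frequency histogram)."""
    return Counter(row[column] for row in rows)


def estimate_join_size(left_hist, right_hist):
    """Estimate equi-join output rows: sum of frequency products over shared keys."""
    return sum(
        count * right_hist[key]
        for key, count in left_hist.items()
        if key in right_hist
    )


def order_joins(first_join_output, remaining_joins):
    """Order the remaining joins by estimated output size, smallest first.

    remaining_joins is a list of (name, column, right_rows) tuples.
    Assumption: every estimate is taken against the output of the first
    join only; it is not refreshed after each later join.
    """
    estimates = []
    for name, column, right_rows in remaining_joins:
        left_hist = build_histogram(first_join_output, column)
        right_hist = build_histogram(right_rows, column)
        estimates.append((estimate_join_size(left_hist, right_hist), name))
    return [name for _, name in sorted(estimates)]


# Toy data standing in for the output of the first MapReduce run.
first_join_output = [
    {"a": 1, "b": 10},
    {"a": 1, "b": 20},
    {"a": 2, "b": 10},
]
remaining_joins = [
    ("join_on_a", "a", [{"a": 1}, {"a": 1}, {"a": 2}]),
    ("join_on_b", "b", [{"b": 10}]),
]
plan = order_joins(first_join_output, remaining_joins)
```

Here join_on_b is estimated at 2 output rows and join_on_a at 5, so the sketch schedules join_on_b first. In a real plan the histograms would be produced as a side output of the first MapReduce job rather than by a separate pass over the data.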

There is another open question regarding histograms for joins on calculated columns. In
this case, calculating the histogram upfront might conflict with the conventional optimization
technique of "pulling filters up and pushing calculations down".
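
To make the conflict concrete, here is a minimal Python sketch (again not Pig code; build_expression_histogram and the price/qty columns are made up for illustration). A histogram on a calculated join key can only be built by evaluating the expression for every row during the first run, which is exactly the early calculation that the optimizer would otherwise try to defer.

```python
from collections import Counter


def build_expression_histogram(rows, key_expr):
    """Histogram of a calculated join key.

    key_expr is evaluated for every row here, i.e. during the first pass,
    even if the plan would otherwise postpone this calculation.
    """
    return Counter(key_expr(row) for row in rows)


# Hypothetical rows; the later join key is the computed value price * qty.
rows = [
    {"price": 5, "qty": 2},
    {"price": 2, "qty": 5},
    {"price": 3, "qty": 1},
]
hist = build_expression_histogram(rows, lambda row: row["price"] * row["qty"])
```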

I am not sure how useful this would be, because I have never come across a query with three joins myself.

Any opinions?
