hadoop-pig-dev mailing list archives

From "Olga Natkovich (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-223) Optimization Idea: Dynamic histogram generation for join ordering?
Date Tue, 29 Apr 2008 20:36:55 GMT

    [ https://issues.apache.org/jira/browse/PIG-223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593104#action_12593104 ]

Olga Natkovich commented on PIG-223:

I think that as Pig becomes more mature, we will start collecting and storing the metadata we need,
such as data sizes, column cardinality, and the sort/partition order of the data.

Trying to dynamically compute the information if it is not available sounds like a good idea.
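
As a rough illustration of that fallback, here is a minimal Python sketch (not Pig code) that returns a stored column cardinality when one exists and computes it from the data otherwise. The function name and the `stored_metadata` dictionary are hypothetical and only stand in for whatever metadata store Pig might eventually have.

```python
def column_cardinality(rows, column, stored_metadata=None):
    """Return the number of distinct values in `column`.

    `stored_metadata` is a hypothetical dict of column name -> known cardinality.
    If it has no entry for the column, the value is computed from the rows.
    """
    if stored_metadata and column in stored_metadata:
        return stored_metadata[column]
    # No stored statistic: fall back to scanning the data.
    return len({row[column] for row in rows})


rows = [{"user": 1}, {"user": 1}, {"user": 2}]
computed = column_cardinality(rows, "user")
from_store = column_cardinality(rows, "user", {"user": 10})
```

In this sketch a stored value always wins over a computed one, so a stale statistic would go unnoticed; whether that trade-off is acceptable is left open here.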

> Optimization Idea: Dynamic histogram generation for join ordering?
> ------------------------------------------------------------------
>                 Key: PIG-223
>                 URL: https://issues.apache.org/jira/browse/PIG-223
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Pi Song
> This idea came to mind when I was implementing explicit cast insertion for
> type checking.
> Problem:
> Given a query containing 3 or more joins, what is the most efficient join order? (Pig
> doesn't have an indexing feature, so statistics are not available.)
> Solution:
> 0. Start with a given plan.
> 1. Somehow select the first join (this is still an open question).
> 2. Insert a histogram generator for the columns used in the remaining joins into the first MapReduce job.
> 3. Run that MapReduce job.
> 4. Use the histogram information generated in step 2 to order the joins for the rest of the plan.
> 5. Run further MapReduce jobs until the plan finishes.
> There is another open question regarding histograms for joins based on calculated columns.
> In this case, calculating the histogram up front might conflict with the conventional optimization
> technique of "pulling filters up and pushing calculations down".
> I'm not sure how useful this is, because I have never come across a 3-way join myself.
> Any opinion?
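
To make step 4 of the quoted proposal concrete, here is a minimal Python sketch (not Pig code) of one way histograms could drive join ordering. It assumes exact per-value counts for both the intermediate result and each remaining relation, estimates each equi-join's output size from those counts, and orders the remaining joins smallest first. The function names, relation names, and sample rows are all hypothetical, and the smallest-first rule is only one possible heuristic; the proposal itself does not specify one.

```python
from collections import Counter


def histogram(rows, column):
    """Count how often each value of `column` occurs (exact, one bucket per value)."""
    return Counter(row[column] for row in rows)


def estimated_join_size(left_hist, right_hist):
    """Estimate equi-join output rows: for each shared key, left count * right count."""
    return sum(n * right_hist[key] for key, n in left_hist.items() if key in right_hist)


def order_remaining_joins(intermediate, remaining):
    """Order the remaining joins by estimated output size, smallest first.

    `intermediate` holds the rows produced by the first join.
    `remaining` maps a relation name to (rows, join_column).
    """
    estimates = {}
    for name, (rows, column) in remaining.items():
        estimates[name] = estimated_join_size(
            histogram(intermediate, column), histogram(rows, column)
        )
    return sorted(estimates, key=estimates.get), estimates


# Hypothetical output of the first join, plus two relations still to be joined.
intermediate = [
    {"user": 1, "city": "a"},
    {"user": 1, "city": "b"},
    {"user": 2, "city": "a"},
]
remaining = {
    "orders": ([{"user": 1}, {"user": 1}, {"user": 2}], "user"),
    "cities": ([{"city": "a"}], "city"),
}

order, estimates = order_remaining_joins(intermediate, remaining)
print(order, estimates)
```

A real implementation would more likely use bucketed or sampled histograms built inside the first MapReduce job, since exact per-value counts can be as large as the data itself.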

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
