hadoop-pig-dev mailing list archives

From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-920) optimizing diamond queries
Date Fri, 30 Oct 2009 00:14:59 GMT

    [ https://issues.apache.org/jira/browse/PIG-920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771712#action_12771712 ]

Pradeep Kamath commented on PIG-920:
------------------------------------

In MultiQueryOptimizer.java (the numbers in the code blocks below are line numbers):

It would be good to add comments to the following code explaining why the plan size should be
2 or 3 and what role the POForEach plays:
{noformat}
   223             if (pl.size() == 2 || pl.size() == 3) {
   224                 PhysicalOperator root = pl.getRoots().get(0);
   225                 PhysicalOperator leaf = pl.getLeaves().get(0);
   226                 if (root instanceof POLoad && leaf instanceof POStore) {
   227                     if (pl.size() == 3) {
   228                         PhysicalOperator mid = pl.getSuccessors(root).get(0);
   229                         if (mid instanceof POForEach) {
   230                             rtn = true;
   231                         }
   232                     } else {
   233                         rtn = true;
   234                     }
   235                 }
   236             }
   237         }
{noformat}
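
As a rough illustration of the comments being asked for, the shape test above can be modeled outside Pig as below. The `Op` enum and `isLoadStoreChain` helper are hypothetical stand-ins, not Pig classes, and the plan is assumed to be a simple chain listed from root to leaf. The comments give only one reading of the check (two operators means a load feeding a store directly, three means a single foreach in between); the patch author would need to confirm the actual intent.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for Pig's physical operators; not the real classes.
enum Op { LOAD, FOREACH, FILTER, STORE }

final class PlanShape {
    // Assumption: the plan is a linear chain, given in root-to-leaf order.
    static boolean isLoadStoreChain(List<Op> plan) {
        int size = plan.size();
        // Only two shapes qualify: load -> store, or load -> foreach -> store.
        if (size != 2 && size != 3) {
            return false;
        }
        // The root must be a load and the leaf must be a store.
        if (plan.get(0) != Op.LOAD || plan.get(size - 1) != Op.STORE) {
            return false;
        }
        // With three operators, the one in the middle must be a foreach.
        return size == 2 || plan.get(1) == Op.FOREACH;
    }

    public static void main(String[] args) {
        System.out.println(isLoadStoreChain(Arrays.asList(Op.LOAD, Op.STORE)));
        System.out.println(isLoadStoreChain(Arrays.asList(Op.LOAD, Op.FOREACH, Op.STORE)));
        System.out.println(isLoadStoreChain(Arrays.asList(Op.LOAD, Op.FILTER, Op.STORE)));
    }
}
```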


Just to be safe, it might be better to check that there is exactly one successor before this line:
{noformat}
 265                 PhysicalOperator opSucc = succ.mapPlan.getSuccessors(op).get(0);
{noformat}
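
A minimal sketch of such a guard, written against plain Java lists rather than the Pig plan classes. `onlySuccessor` is a hypothetical helper, and treating a null list as "no successors" is an assumption about what the successor lookup can return:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

final class SuccessorGuard {
    // Hypothetical helper: yields the successor only when there is exactly one,
    // so callers never call get(0) on a null, empty, or multi-element list.
    static <T> Optional<T> onlySuccessor(List<T> successors) {
        if (successors == null || successors.size() != 1) {
            return Optional.empty();
        }
        return Optional.ofNullable(successors.get(0));
    }

    public static void main(String[] args) {
        // Exactly one successor: safe to proceed.
        System.out.println(onlySuccessor(Arrays.asList("foreach")).isPresent());
        // Zero or several successors: skip the optimization instead of guessing.
        System.out.println(onlySuccessor(Collections.<String>emptyList()).isPresent());
        System.out.println(onlySuccessor(Arrays.asList("filter", "foreach")).isPresent());
    }
}
```

In the optimizer itself the equivalent would be a size check on the successor list before the `get(0)` call, bailing out when it is not 1.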

Is the following by design, even when the splitter has multiple successors?
{noformat}
 309         return 1;
{noformat}


> optimizing diamond queries
> --------------------------
>
>                 Key: PIG-920
>                 URL: https://issues.apache.org/jira/browse/PIG-920
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Richard Ding
>         Attachments: PIG-920.patch
>
>
> The following query
> A = load 'foo';
> B = filter A by $0>1;
> C = filter A by $1 == 'foo';
> D = COGROUP C by $0, B by $0;
> ......
> does not get executed efficiently. Currently, it runs a map-only job that basically reads and writes the same data before doing the query processing.
> A query where the data is loaded twice actually executes more efficiently.
> This is not an uncommon query and we should fix this issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

