hadoop-pig-dev mailing list archives

From "Richard Ding (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
Date Mon, 23 Mar 2009 20:13:50 GMT

     [ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Richard Ding updated PIG-627:

    Attachment: multiquery-phase2_0323.patch

Thanks for reviewing the patch.

In MultiQueryOptimizer:

    * what about mr not being map only and with mr splittee? - is this not handled for now?

    _Yeah. There are two cases where splittees will not be merged into splitter: (1) splitter
is not map only and splittee has reducer, and (2) splittee has multiple roots (loads)_

    * Do the single mapper case and the single map-reduce case arise when the script has an explicit
store 'file' and load 'file'? If so, the store is removed in mergeOnlyMapperSplittee() and
mergeOnlyMapReduceSplittee() - shouldn't the store remain?

    _Explicit store/load combination in a script is transformed into an implicit split, hence
the store remains_

    * There is common code in mergeOnlyMapperSplittee() and mergeOnlyMapReduceSplittee() which
should be moved to a function to reduce the code duplication.
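
The two cases named in the first reply above, where a splittee is not merged into the splitter, can be sketched as a simple predicate. This is only an illustration: the class name, method name, and parameters are hypothetical and not taken from the patch, and it assumes every case outside the two listed ones is mergeable, which the message does not state.

```java
// Hypothetical sketch; names are illustrative and not taken from the patch.
final class SplitteeMergeCheck {
    private SplitteeMergeCheck() {}

    /**
     * Returns false for the two cases listed in the reply and true otherwise.
     * Treating all other cases as mergeable is an assumption of this sketch.
     */
    static boolean canMerge(boolean splitterIsMapOnly,
                            boolean splitteeHasReducer,
                            int splitteeRootCount) {
        // Case (1): splitter is not map only and splittee has a reducer.
        if (!splitterIsMapOnly && splitteeHasReducer) {
            return false;
        }
        // Case (2): splittee has multiple roots (loads).
        if (splitteeRootCount > 1) {
            return false;
        }
        return true;
    }
}
```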


Just want to confirm that the multi-query optimization is only for map reduce mode, since
the optimizer is being called in MapReduceLauncher.


In POForEach when there is POStatus.STATUS_ERR, it is returned to the caller. I noticed that
in POSplit, it causes an exception - I think it should return the error, which would later
be caught in the map() or reduce() - a test to make sure errors do get caught and cause failures
would be good.
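
The suggestion above (return the error status to the caller rather than throwing inside the operator) might look roughly like the following. The types here are simplified stand-ins written for this sketch, not Pig's POStatus, Result, or POSplit classes, and the caller method merely plays the role of map() or reduce().

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-ins for illustration; these are not Pig's classes.
final class SplitErrorSketch {
    enum Status { OK, ERR }

    static final class Result {
        final Status status;
        final Object value;

        Result(Status status, Object value) {
            this.status = status;
            this.value = value;
        }
    }

    /** Runs each nested plan; returns the first error result instead of throwing. */
    static Result runNestedPlans(List<Supplier<Result>> nestedPlans) {
        for (Supplier<Result> plan : nestedPlans) {
            Result r = plan.get();
            if (r.status == Status.ERR) {
                return r; // hand the error back to the caller
            }
        }
        return new Result(Status.OK, null);
    }

    /** Stands in for map() or reduce(): turns an error status into a failure. */
    static void mapLikeCaller(List<Supplier<Result>> nestedPlans) {
        Result r = runNestedPlans(nestedPlans);
        if (r.status == Status.ERR) {
            throw new IllegalStateException("nested plan failed: " + r.value);
        }
    }
}
```

A test along the lines requested would feed a plan that yields an error status and check that the caller fails.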


spawnChildWalker() of ReverseDependencyOrderWalker should return an instance of ReverseDependencyOrderWalker.


> PERFORMANCE: multi-query optimization
> -------------------------------------
>                 Key: PIG-627
>                 URL: https://issues.apache.org/jira/browse/PIG-627
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>            Reporter: Olga Natkovich
>         Attachments: file_cmds-0305.patch, multi-store-0303.patch, multi-store-0304.patch,
multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch,
multiquery_0306.patch, multiquery_explain_fix.patch
> Currently, if your Pig script contains multiple stores and some shared computation, Pig
will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a map-reduce
job that generates output2. As a result, the data is read, parsed and filtered twice, which is
unnecessary and costly.
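
The idea behind sharing the computation in the script above can be shown with a small in-memory toy: the input is scanned and filtered once, and each surviving row feeds both outputs. This is a sketch of the concept only, not how Pig compiles the plan into MapReduce jobs; the class and field names are hypothetical, and rows are modeled as int arrays holding (a, b, c).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration; not part of Pig.
final class SharedScanSketch {
    static final class Outputs {
        final List<int[]> output1 = new ArrayList<>();                 // filtered rows
        final Map<Integer, List<int[]>> output2 = new LinkedHashMap<>(); // rows grouped by b
    }

    static Outputs run(List<int[]> data) {
        Outputs out = new Outputs();
        for (int[] row : data) {              // A: a single pass over the input
            if (row[0] > 5) {                 // B = filter A by a > 5
                out.output1.add(row);         // store B into 'output1'
                out.output2                   // C = group B by b
                   .computeIfAbsent(row[1], k -> new ArrayList<>())
                   .add(row);
            }
        }
        return out;
    }
}
```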

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
