hadoop-pig-dev mailing list archives

From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization
Date Mon, 23 Mar 2009 17:27:50 GMT

    [ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688339#action_12688339 ]

Pradeep Kamath commented on PIG-627:

Comments for Richard's patch - multiquery-phase2_0313.patch

In MultiQueryOptimizer:
- What about an MR job that is not map-only and has an MR splittee? Is this case not handled
for now?
- Are the single-mapper case and the single map-reduce case the ones where the script has an
explicit store 'file' and load 'file'? If so, the store is removed in
mergeOnlyMapperSplittee() and mergeOnlyMapReduceSplittee() - shouldn't the store remain?
- There is common code in mergeOnlyMapperSplittee() and mergeOnlyMapReduceSplittee() which
should be moved to a function to reduce the code duplication.

Just want to confirm that the multi-query optimization applies only to map-reduce mode, since
the optimizer is called in MapReduceLauncher.

In POForEach, when there is a POStatus.STATUS_ERR, it is returned to the caller. I noticed that
in POSplit it causes an exception instead. I think POSplit should return the error, which would
later be caught in map() or reduce(). A test to make sure errors do get caught and cause
failures would be good.
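
A minimal sketch of the behaviour suggested above, in plain Java. All types here (Status,
Result, Operator, SplitSketch, MapDriverSketch) are hypothetical stand-ins, not Pig's actual
POStatus, Result or POSplit classes. The split-like operator hands an error status back to
its caller, and only the map()-level driver turns it into a failure.

```java
import java.util.List;

// Hypothetical stand-ins for Pig's status/result types; not the real classes.
enum Status { OK, ERR }

final class Result {
    final Status status;
    final Object value;
    Result(Status status, Object value) { this.status = status; this.value = value; }
}

// Hypothetical operator interface: one Result per input.
interface Operator {
    Result process(Object input);
}

// A split-like operator that forwards each input to several nested operators.
final class SplitSketch implements Operator {
    private final List<Operator> branches;
    SplitSketch(List<Operator> branches) { this.branches = branches; }

    @Override
    public Result process(Object input) {
        for (Operator branch : branches) {
            Result r = branch.process(input);
            if (r.status == Status.ERR) {
                // Return the error to the caller instead of throwing here.
                return r;
            }
        }
        return new Result(Status.OK, input);
    }
}

// Stand-in for map(): the single place where an error status becomes a failure.
final class MapDriverSketch {
    static void map(Operator root, Object input) {
        Result r = root.process(input);
        if (r.status == Status.ERR) {
            throw new IllegalStateException("operator returned an error: " + r.value);
        }
    }
}
```

A test along the lines requested would feed such an operator a branch that always returns
the error status and check that the driver fails.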

spawnChildWalker() of ReverseDependencyOrderWalker should return an instance of
ReverseDependencyOrderWalker.
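
The point about spawnChildWalker() can be sketched as follows. This is a simplified,
hypothetical hierarchy (the class names ending in Sketch are made up, and Pig's real walkers
also carry a plan); it only illustrates that a child walker spawned by a reverse-order
walker should itself be a reverse-order walker, so nested plans are visited in the same order.

```java
// Hypothetical, simplified walker hierarchy; not Pig's actual classes.
abstract class WalkerSketch {
    // Creates the walker used for a nested plan.
    abstract WalkerSketch spawnChildWalker();
    // Names the traversal order, for illustration only.
    abstract String order();
}

class DependencyOrderWalkerSketch extends WalkerSketch {
    @Override
    WalkerSketch spawnChildWalker() { return new DependencyOrderWalkerSketch(); }
    @Override
    String order() { return "dependency"; }
}

class ReverseDependencyOrderWalkerSketch extends WalkerSketch {
    @Override
    WalkerSketch spawnChildWalker() {
        // Return the same kind of walker, not the forward-order one.
        return new ReverseDependencyOrderWalkerSketch();
    }
    @Override
    String order() { return "reverse-dependency"; }
}
```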

The following comment in BinStorage needs to be clarified:
        if (!FileLocalizer.fileExists(fileName, storage)) {
            // At compile time in batch mode, the file may not exist
            // (such as intermediate file). Just return null - the
            // same way as we could's get a valid record from the input.
            return null;

Does the last sentence of that comment actually mean "the same way as we would if we did not
get a valid record"?


> PERFORMANCE: multi-query optimization
> -------------------------------------
>                 Key: PIG-627
>                 URL: https://issues.apache.org/jira/browse/PIG-627
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>            Reporter: Olga Natkovich
>         Attachments: file_cmds-0305.patch, multi-store-0303.patch, multi-store-0304.patch,
> multiquery-phase2_0313.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch,
>
> Currently, if your Pig script contains multiple stores and some shared computation, Pig
> will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1, followed by a map-reduce
> job that generates output2. As a result, the data is read, parsed and filtered twice, which
> is unnecessary and costly.
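
The shared computation in the quoted script can be illustrated with a small in-memory
sketch in plain Java. This is not how Pig implements the optimization; the class
SharedScanSketch and its fields are hypothetical. It only shows the idea of reading and
filtering each row once while feeding both outputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of the example script; not Pig code.
final class SharedScanSketch {
    // One input row with the fields (a, b, c) from the script.
    static final class Row {
        final int a;
        final String b;
        final String c;
        Row(int a, String b, String c) { this.a = a; this.b = b; this.c = c; }
    }

    final List<Row> output1 = new ArrayList<>();           // store B into 'output1'
    final Map<String, List<Row>> output2 = new TreeMap<>(); // store C into 'output2'
    int rowsScanned = 0;

    void run(List<Row> data) {
        for (Row row : data) {
            rowsScanned++;                 // each row is read and tested exactly once
            if (row.a > 5) {               // B = filter A by a > 5
                output1.add(row);
                // C = group B by b
                output2.computeIfAbsent(row.b, k -> new ArrayList<>()).add(row);
            }
        }
    }
}
```

With independent queries, the loop (load, parse, filter) would run once per store; here a
single pass serves both.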

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.