hadoop-pig-dev mailing list archives

From "Gunther Hagleitner (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
Date Thu, 26 Mar 2009 04:53:51 GMT

     [ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-627:
-----------------------------------

    Attachment: fix_store_prob.patch

This patch fixes a problem in how we handle scripts that store to a location and then load from it:
{{{
...
store a into 'foo';
a = load 'foo';
...
}}}

In the logical plan, this ends up as a split: one branch stores into 'foo' and the
other continues processing after the load. The actual load is removed.
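
To make the rewrite concrete, here is a minimal Python sketch of the idea, not Pig's actual logical plan code. The tuple-based statement format and the names collapse_store_load and "reuse" are hypothetical and exist only for this illustration.

```python
def collapse_store_load(statements):
    """Replace a load of a path that was stored earlier with a direct reuse
    of the stored alias. Statements are hypothetical (kind, alias, path) tuples."""
    stored = {}  # path -> alias that was stored there
    result = []
    for stmt in statements:
        kind = stmt[0]
        if kind == "store":
            _, alias, path = stmt
            stored[path] = alias
            result.append(stmt)
        elif kind == "load" and stmt[2] in stored:
            _, alias, path = stmt
            # The load disappears; the new alias continues from the stored one,
            # which turns the stored alias into a split with two branches.
            result.append(("reuse", alias, stored[path]))
        else:
            result.append(stmt)
    return result


script = [
    ("load", "a", "input"),
    ("store", "a", "foo"),
    ("load", "a", "foo"),
]
plan = collapse_store_load(script)
```

In this toy model the store stays in place and only the second load is rewritten, which mirrors the "split with one branch storing" description above.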

This works well but has an unfortunate side effect. If the store/load pair marks the boundary
between two map-reduce jobs, the MRCompiler has to insert a temporary store-load bridge, which
means that we now end up with two stores.

This fix detects this case in the optimization phase that runs after compilation. It removes the
unnecessary store and makes the following job load from the remaining one.
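
The detection step might look roughly like the following Python sketch. It is a toy model under stated assumptions, not the code in fix_store_prob.patch: the Job class, the (alias, path, is_tmp) store tuples, and drop_redundant_tmp_stores are all hypothetical, and the sketch assumes that the temporary store is the one dropped while the user's store is kept.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Toy stand-in for one map-reduce job: what it loads and what it stores."""
    loads: list = field(default_factory=list)
    # Each store is a hypothetical (source_alias, path, is_tmp) tuple.
    stores: list = field(default_factory=list)


def drop_redundant_tmp_stores(jobs):
    """Drop a tmp store when the same alias is also stored by the user,
    and point later jobs' loads at the user's path instead."""
    for i, job in enumerate(jobs):
        user_paths = {alias: path for alias, path, is_tmp in job.stores if not is_tmp}
        kept, redirect = [], {}
        for alias, path, is_tmp in job.stores:
            if is_tmp and alias in user_paths:
                redirect[path] = user_paths[alias]  # tmp path -> user path
            else:
                kept.append((alias, path, is_tmp))
        job.stores = kept
        for later in jobs[i + 1:]:
            later.loads = [redirect.get(p, p) for p in later.loads]
    return jobs


jobs = [
    Job(loads=["input"], stores=[("a", "foo", False), ("a", "tmp-0", True)]),
    Job(loads=["tmp-0"], stores=[("b", "out", False)]),
]
drop_redundant_tmp_stores(jobs)
```

After the call, the first job in this model writes only 'foo' and the second job reads from 'foo', so the data is written once at the job boundary.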


> PERFORMANCE: multi-query optimization
> -------------------------------------
>
>                 Key: PIG-627
>                 URL: https://issues.apache.org/jira/browse/PIG-627
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>            Reporter: Olga Natkovich
>         Attachments: file_cmds-0305.patch, fix_store_prob.patch, merge_741727_HEAD__0324.patch,
> merge_741727_HEAD__0324_2.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch,
> multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch,
> multiquery_explain_fix.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared computation, Pig
> will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script results in a map-only job that generates output1, followed by a map-reduce
> job that generates output2. As a result, the data is read, parsed, and filtered twice, which is
> unnecessary and costly.
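
The cost described in the quoted issue can be illustrated with a small Python sketch. It is only a toy model of the example script, not how Pig executes queries; the helper names (scan, group_by_b, run_independently, run_shared) and the sample rows are hypothetical.

```python
from collections import defaultdict


def group_by_b(rows):
    """Group rows by their second field, like 'group B by b'."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)
    return dict(groups)


def scan(rows, counter):
    """Stand-in for reading and parsing the input; counts each full pass."""
    counter["scans"] += 1
    return list(rows)


def run_independently(rows):
    """Each store triggers its own read and filter of the input."""
    counter = {"scans": 0}
    output1 = [r for r in scan(rows, counter) if r[0] > 5]
    output2 = group_by_b([r for r in scan(rows, counter) if r[0] > 5])
    return output1, output2, counter["scans"]


def run_shared(rows):
    """The filtered relation B is computed once and feeds both stores."""
    counter = {"scans": 0}
    b = [r for r in scan(rows, counter) if r[0] > 5]
    return b, group_by_b(b), counter["scans"]


data = [(1, "x", 0), (6, "x", 1), (7, "y", 2), (9, "x", 3)]
```

Both functions produce the same two outputs for the sample rows; the difference is that run_independently scans the input twice while run_shared scans it once.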

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

