hadoop-pig-dev mailing list archives

From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization
Date Wed, 11 Mar 2009 19:12:52 GMT

    [ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680997#action_12680997 ]

Pradeep Kamath commented on PIG-627:

Sorry about the misunderstanding; I think I looked at a different patch. After reviewing the
right patch, here are some comments:

The patch throws generic Java exceptions such as IllegalStateException. These should be replaced with
the appropriate Exception class (like MRCompilerException) as specified in http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification.
The exception should be created using the constructor that takes the error code, error source and error message.
New error codes should be introduced only if none of the existing ones in http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification#head-9f71d78d362c3307711f98ec9db3ee12b55e92f6
can be used. If new codes are introduced, the wiki table should be updated.
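
As a rough illustration of the pattern described above, here is a self-contained sketch. It does not use Pig's real classes: SketchCompilerException, its SOURCE_BUG constant, the error code 2999 and the checkSingleRoot helper are all hypothetical stand-ins. The actual exception classes, constructor signature and error codes are whatever the error handling spec linked above defines.

```java
// Hypothetical stand-in for a Pig-specific exception class; NOT the real Pig API.
class SketchCompilerException extends Exception {
    // Hypothetical error-source markers (who is responsible for the error).
    static final byte SOURCE_INPUT = 1;
    static final byte SOURCE_BUG = 2;

    private final int errorCode;
    private final byte errorSource;

    // Constructor carrying the message, the error code and the error source.
    SketchCompilerException(String message, int errorCode, byte errorSource) {
        super(message);
        this.errorCode = errorCode;
        this.errorSource = errorSource;
    }

    int getErrorCode() {
        return errorCode;
    }

    byte getErrorSource() {
        return errorSource;
    }
}

class SketchCompiler {
    // Hypothetical check. Instead of
    //     throw new IllegalStateException("Expected exactly one root");
    // the failure is reported with a code and a source attached.
    static void checkSingleRoot(int rootCount) throws SketchCompilerException {
        if (rootCount != 1) {
            throw new SketchCompilerException(
                    "Expected exactly one root, found " + rootCount,
                    2999, // hypothetical error code, not taken from the wiki table
                    SketchCompilerException.SOURCE_BUG);
        }
    }
}
```

The point of the extra fields is that callers and logs can distinguish errors by code and by source rather than by parsing message strings.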

The following can be used to check for file existence in BinStorage.determineSchema(). Null
should be returned only in the case where the file does not exist:
 public static boolean fileExists(String filename, DataStorage store)
            throws IOException {
        ElementDescriptor elem = store.asElement(filename);
        // True if the element exists as named, or if the name is a glob that matches files.
        return elem.exists() || globMatchesFiles(elem, store);
 }

Instead of introducing a rootsFirst attribute in DependencyOrderWalker, I wonder if we should
have a ReverseDependencyOrderWalker, since that is what the rootsFirst == false case amounts to.
If we are not visiting from roots to leaves, we are not really visiting in dependency order, so
the meaning of "dependency order" is no longer honored, which can be confusing. Explicitly
naming the walker ReverseDependencyOrderWalker makes the intent of walking from leaves to roots
clearer.
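
To make the distinction concrete, here is a minimal self-contained sketch of the two visiting orders on a small DAG. It is not Pig's walker implementation: SketchPlan and its methods are hypothetical, and the reverse order is obtained simply by reversing a topological order computed with Kahn's algorithm, which is one possible approach among several.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal DAG of named operators; NOT Pig's plan classes.
class SketchPlan {
    private final Map<String, List<String>> successors = new LinkedHashMap<>();
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    void addNode(String node) {
        successors.putIfAbsent(node, new ArrayList<>());
        inDegree.putIfAbsent(node, 0);
    }

    // Adds an edge meaning "to depends on from".
    void connect(String from, String to) {
        addNode(from);
        addNode(to);
        successors.get(from).add(to);
        inDegree.merge(to, 1, Integer::sum);
    }

    // Dependency order (roots first): a node is visited only after all of
    // its predecessors have been visited (Kahn's algorithm).
    List<String> dependencyOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.remove();
            order.add(node);
            for (String succ : successors.get(node)) {
                if (remaining.merge(succ, -1, Integer::sum) == 0) {
                    ready.add(succ);
                }
            }
        }
        return order;
    }

    // Reverse dependency order (leaves first): a node is visited only after
    // all of its successors have been visited.
    List<String> reverseDependencyOrder() {
        List<String> order = dependencyOrder();
        Collections.reverse(order);
        return order;
    }
}
```

For a plan shaped like the script in the issue description (load feeding a filter, the filter feeding both a store and a group, and the group feeding a second store), dependencyOrder() visits the load first and reverseDependencyOrder() visits the stores first. Giving the two orders separate names keeps each method's contract unambiguous, which is the same argument made above for a separate walker class.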

In POSplit there is currently a PhysicalPlan representing the merged inner plans (where all
plans are mutually exclusive) and also a List<PhysicalPlan> which holds the same
information in the form of a list. In the rest of the Pig code, inner plans have always been modelled
as List<PhysicalPlan>. For consistency, it is better to have just a List<PhysicalPlan>
to represent the inner plans.

> PERFORMANCE: multi-query optimization
> -------------------------------------
>                 Key: PIG-627
>                 URL: https://issues.apache.org/jira/browse/PIG-627
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>            Reporter: Olga Natkovich
>         Attachments: file_cmds-0305.patch, multi-store-0303.patch, multi-store-0304.patch,
>                      multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch
> Currently, if your Pig script contains multiple stores and some shared computation, Pig
> will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a map-reduce
> job that generates output2. As a result, data is read, parsed and filtered twice, which is
> unnecessary and costly.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
