hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Richard Ding (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1333) API interface to Pig
Date Thu, 10 Jun 2010 21:44:15 GMT

    [ https://issues.apache.org/jira/browse/PIG-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877585#action_12877585
] 

Richard Ding commented on PIG-1333:
-----------------------------------

Thanks for reviewing the patch. My comments are inline.

bq. (1) General comment. This patch is very large and combines multiple different issues that
could have been separated into multiple patches to make it easier to review and test

In this patch JobStats and OutputStats replaced the HJob as the return value of execution
of a pig script. This results in several deprecated/removed classes. I also go through the
unit tests to replace most of the deprecated methods used there and this causes modification
of  many test classes. 

bq. (2) We are missing script level feature collection. (I see the one at job level.) For
each script, we want to collect overall script features such as different operators: join,
order by, etc., is it a multiquery, does it have UDF. Also, we would want to know if combiner
was used and whether the script spilled but maybe both of those can be at the job level.

I'll add the script level features, in addition to the job level feature. The value of job
level feature (pig.job.feature is the key in job xml) will be a string, but the value of script
level feature (pig.script.features) will be a bit set represented by a long.

bq. (3) We need to add separate comment to the JIRA marked as documentation that describes
PigRunner since it is a new interface that we need to include in 0.8.0 documentation.

Will do.

bq. (4) MapReduceLauncher. Why was exception handling and temp store handling code removed?

Exception handling is still there. Handling of temp stores for failed jobs is moved to JobStats.
 

bq. (5) OutputStats assumes that location is a path which might not be true for non-file stores.

It isn't necessary to use path in OutputStats (just trying to get a more readable short name
for location). I'll move it.

bq. (6) ScriptState: There are maps/hashes optimized for enums (http://java.sun.com/j2se/1.5.0/docs/api/java/util/EnumMap.html)

Dmitriy also commented on this. I'll modified the usage of emums based on the comments.

bq. (7) Why JobStats is derived from an operator?

The JobGraph of the script is a derived class from BaseOperatorPlan (in experimental package)
whose elements are Operators. JobStats is derived from Operator in order to use the methods
on the base class. 

bq. (8) Why did JOB_NAME_PREFIX got removed from PigContext?

JOB_NAME_PREFIX is still there. What is removed is the private member 'jobName' which isn't
used in the class and whose getter and setter had been removed long time ago. 

bq. (9) Why do we need to synchronize getTemporaryFile?

I'll remove the synchronize on this method. It was copied from the corresponding deprecated
method.

> API interface to Pig
> --------------------
>
>                 Key: PIG-1333
>                 URL: https://issues.apache.org/jira/browse/PIG-1333
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Richard Ding
>             Fix For: 0.8.0
>
>         Attachments: PIG-1333.patch
>
>
> It would be nice to make Pig more friendly for applications like workflow that would
be executing pig scripts on user behalf.
> Currently, they would have to use pig command line to execute the code; however, this
has limitation on the kind of output that would be delivered. For instance, it is hard to
produce error information that is easy to use programatically or collect statistics.
> The proposal is to create a class that mimics the behavior of the Main but gives users
a status object back. The the main code of pig would look somethig like:
> public static void main(String args[])
> {
>     PigStatus ps = PigMain.exec(args);
>     exit (PigStatus.rc);
> }
> We need to define the following:
> - Content of PigStatus. It should at least include
>    * return code
>    * error string
>    * exception 
>    * statistics
> - A way to propagate the status class through pig code

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message