hadoop-pig-dev mailing list archives

From "Olga Natkovich (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-58) parameterized Pig scripts
Date Wed, 06 Feb 2008 00:17:07 GMT

    [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565971#action_12565971
] 

Olga Natkovich commented on PIG-58:
-----------------------------------

This is in response to Alan's comment: https://issues.apache.org/jira/browse/PIG-58?focusedCommentId=12565958#action_12565958
=========================================================================================================

I think it is an interesting idea and a reasonable approach but I have a few concerns:

(1) I don't think that the C-style preprocessor (CPP) is necessarily that well known and understood
among our users and developers.
(2) CPP is complex to use and implement and might be too heavyweight for what we are trying
to do. I briefly looked at the CPP code and it is very involved. Translating it or writing our
own version of chunks of it would be a fairly large project.
(3) For C, CPP is used to influence how code is compiled. For Pig we are trying to influence
the run-time behavior of the Pig program and there are other ways to do it. One way to do
it is to embed Pig in languages such as Perl, Python or C/C++, which would take care of code
inclusion, conditional execution and more. Similarly, we might decide to have these things
in the Pig language but I don't think they belong in our preprocessor. (What would the difference
between "if" and "#if" be in Pig?) So my approach is: (a) simple things like parameter
substitution can be done in the preprocessor; (b) more complex things happen in the language itself
or in the language in which Pig is embedded.
(4) CPP does not provide support for command execution, which users asked for, and just forcing
them to run it from the command line has limitations in terms of parameterizing the command line and
also harvesting return codes and error messages.

There are a couple of things I like from this proposal and would like to use:

(1) use #define rather than declare
(2) extend #define to also declare commands
(3) We can later further expand #define to include more things as we need them.

This way only variable names would be used outside of #define, which is nice since if Pig later
supports variables, such as for scalars, they would have a consistent representation.

So my examples from the document would now look as follows:

(1)
 A = load '/data/mydata/$date';

(2) 
#define CMD `generate_date`
A = load '/data/mydata/$CMD';

(3)
#define CMD `generate_name $date`
A = load '/data/mydata/$CMD';

(4)
#define CMD `$cmd $date`
A = load '/data/mydata/$CMD';
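
To make the behavior of these four examples concrete, here is a minimal Python sketch of such a preprocessor. It is an illustration under stated assumptions, not Pig's implementation. The names preprocess, substitute and run_shell are hypothetical. A backquoted #define body is assumed to be a shell command whose standard output becomes the value, a non-zero exit code is assumed to be an error, and a plain #define value is assumed to be a default that a command-line parameter overrides. Only names starting with a letter or underscore are substituted, so positional references such as $0 are left alone.

```python
import re
import subprocess

# "#define NAME body", tolerating an optional trailing semicolon.
DEFINE_RE = re.compile(r"^#define\s+(\w+)\s+(.*?);?\s*$")
# "$name" where name starts with a letter or underscore, so "$0" never matches.
PARAM_RE = re.compile(r"\$([A-Za-z_]\w*)")


def run_shell(command):
    """Run a command and return its stdout; surface the return code and stderr on failure."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{command!r} exited with {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout.strip()


def substitute(text, params):
    """Replace each $name with its value; unknown names are left untouched."""
    return PARAM_RE.sub(lambda m: params.get(m.group(1), m.group(0)), text)


def preprocess(script, cli_params, run=run_shell):
    """Return the script with #define lines consumed and $name references expanded."""
    params = dict(cli_params)
    output = []
    for line in script.splitlines():
        match = DEFINE_RE.match(line.strip())
        if match is None:
            output.append(substitute(line, params))
            continue
        name, value = match.groups()
        if value.startswith("`") and value.endswith("`"):
            # Command definition: expand parameters in the command, then execute it.
            params[name] = run(substitute(value[1:-1], params))
        else:
            # Plain default (assumption): a command-line value takes precedence.
            params.setdefault(name, value)
    return "\n".join(output)
```

With this sketch, example (3) would be handled by calling preprocess(script, {"date": "20080110"}), which expands $date inside the command before running it. The run argument exists only so the command runner can be swapped out, for instance in tests.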

I think this also addresses some of the concerns from 

https://issues.apache.org/jira/browse/PIG-58?focusedCommentId=12565959#action_12565959

> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>
> This feature has been requested by several users and would be very useful in conjunction
with streaming. The feature would allow a Pig script to include parameters that are replaced
at run time. For instance, if your script needs to run on a daily basis over the data of the
previous day, you would be able to use the script and provide a date as a run-time parameter
to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will only be supported with Pig script files.
> (1) Parameters are specified on the command line via the -param <param>=<val>
construct. Multiple parameters can be specified. They are applied to the script in the order
they are specified on the command line.
> (2) Default values for the parameters can be specified within the script via the declare statement:
> declare <param>=<value>
> (3) Within the script the parameter will be enclosed in %%. \% can be used to escape.
> Implementation:
> ============
> Use a preprocessor to do the substitution. The preprocessor would be invoked by Main before
grunt is instantiated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifier>% look for a match in the hash. Replace,
if found.  Write the line to the temp file.
> - pass the temp file to grunt.
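
The implementation steps quoted above can be sketched in Python as follows. This is a rough illustration, not the eventual Pig code. The names substitute_params and write_temp_script are hypothetical. Two points the description leaves open are filled in as assumptions: a command-line value overrides a declare default, and an escaped \% is written out as a plain %. Handing the temp file to grunt is out of scope here.

```python
import re
import tempfile

# "declare name=value", tolerating an optional trailing semicolon.
DECLARE_RE = re.compile(r"^\s*declare\s+(\w+)\s*=\s*(.*?)\s*;?\s*$")
# Either an escaped percent sign or a %identifier% pattern, scanned left to right.
TOKEN_RE = re.compile(r"\\%|%(\w+)%")


def substitute_params(script, cli_params):
    """Return the script text with declare lines removed and %name% patterns replaced."""
    lines = script.splitlines()

    # Build the hash of parameters: declare defaults first, then the command line.
    params = {}
    for line in lines:
        match = DECLARE_RE.match(line)
        if match:
            params[match.group(1)] = match.group(2)
    params.update(cli_params)  # assumption: the command line wins over declare

    def replace(match):
        if match.group(0) == "\\%":
            return "%"  # assumption: the escape is removed in the output
        # Replace if found in the hash; otherwise leave the pattern as written.
        return params.get(match.group(1), match.group(0))

    kept = (line for line in lines if not DECLARE_RE.match(line))
    return "\n".join(TOKEN_RE.sub(replace, line) for line in kept)


def write_temp_script(text):
    """Write the substituted script to a new temp file and return its path."""
    with tempfile.NamedTemporaryFile("w", suffix=".pig", delete=False) as handle:
        handle.write(text)
        return handle.name
```

Collecting the declare lines in a first pass means a default can appear anywhere in the script. A single-pass variant would instead require each declare to precede its first use.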

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

