pig-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Update of "TuringCompletePig" by AlanGates
Date Mon, 07 Jun 2010 23:49:10 GMT
The "TuringCompletePig" page has been changed by AlanGates.


New page:
= Making Pig Latin Turing Complete =
== Introduction ==
As more users adopt Pig and begin writing their data processing in Pig Latin and as they use
Pig to process more and more complex
tasks, a consistent request from these users is to add branches, loops, and functions to Pig
Latin.  This will enable Pig Latin to
process a whole new class of problems.  Consider, for example, an algorithm that needs to
iterate until an error estimate is less
than a given threshold.  This might look like (this just suggests logic, not syntax):

    error = 100.0;
    infile = 'original.data';
    while (error > 1.0) {
        A = load 'infile';
        B = group A all;
        C = foreach B generate flatten(doSomeCalculation(A)) as (result, error);
        error = foreach C generate error;
        store C into 'outfile';
        if (error > 1.0) mv 'outfile' 'infile';
    }
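
For readers who want to see the control flow in isolation, here is a minimal plain Python sketch of the same iterate-until-converged pattern.  No Pig is involved.  `do_some_calculation` is a hypothetical stand-in (one Newton step toward a square root), and `max_rounds` is an added safety limit that the snippet above does not have.

```python
def do_some_calculation(estimate, target=2.0):
    """Hypothetical stand-in for the data processing step.

    Performs one Newton iteration toward sqrt(target) and returns
    the new estimate together with its absolute error.
    """
    result = 0.5 * (estimate + target / estimate)
    error = abs(result * result - target)
    return result, error


def iterate_until(threshold, start=10.0, max_rounds=50):
    """Repeat the calculation until the error is no longer above threshold."""
    estimate, error, rounds = start, float("inf"), 0
    while error > threshold and rounds < max_rounds:
        # Each pass feeds the previous output back in as input,
        # mirroring the mv 'outfile' 'infile' step in the snippet.
        estimate, error = do_some_calculation(estimate)
        rounds += 1
    return estimate, error, rounds
```

The loop body depends on a value (`error`) that is only known after the previous pass has finished, which is exactly what a static Pig Latin script cannot express today.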

== Requirements ==
The following should be provided by this Turing complete Pig Latin:
 1. Branching.  This will be satisfied by a standard `if` `else if` `else` functionality
 1. Looping.  This should include standard `while` and some form of `for`.  `for` could be C
style or Python style (foreach).  Care needs to be taken to select syntax that does not cause
confusion with the existing `foreach` operator in Pig Latin.
 1. Functions.  
 1. Modules.
 1. The ability to use local in-memory variables in the Pig Latin script.  For example, in
the snippet given above, `infile` is defined above the `while` and then used inside the loop.
 1. The ability to "store" results into local in-memory variables.  For example, in the snippet
given above, the error calculation from the data processing is stored into `error`
in the line `error = foreach C generate error;`.

== Approach ==
There are two possible approaches to this.  One is to add all of these features to Pig Latin
itself.  This has several advantages:
 * All Pig Latin operations will be first class objects in the language.  There will not be
a need to do quoted programming, like what happens when JDBC is used to write SQL inside a
Java program.
 * There will be opportunities to do optimizations that are not available in embedded programming,
such as loop unrolling, etc.

However, the cost of this approach is very high.  It means turning Pig Latin into a full
scripting language.  It also means that tools such as debuggers will never be available
for Pig Latin users, because the Pig team will not have the resources
or expertise to develop and maintain them.  And finally, does the world need another
scripting language that starts with P?

The second possible approach to this is to embed Pig Latin into an existing scripting language,
such as Perl, Python, Ruby, etc.  The
advantages of this are:
 * Most of the requirements noted above (branching, looping, functions, and modules) are present
in these languages.
 * For any of these languages whole hosts of tools such as debuggers, IDEs, etc. exist and
could be used.
 * Users do not have to learn a new language.

There are a few significant drawbacks to this approach:
 * It leads to a quoted programming style which is unnatural and irritating for developers.
 * Which scripting language to choose?  Perl, Python, and Ruby all have significant adoption
and could make a claim to be the best choice.
 * Syntactic and semantic checking is usually delayed until an embedded bit of code is reached
in the outer control flow.  Given that Pig jobs can run for hours this can mean spending hours
to discover a simple typo.

Consider, for example, what happens if we build a Python class that wraps !PigServer and then
translate the above code snippet:

    error = 100.0
    infile = 'original.data'
    pig = PigServer()
    grunt = Grunt()
    while error > 1.0:
        # every Pig Latin statement has to be passed in as a quoted string
        pig.registerQuery("A = load 'infile'; \
                           B = group A all; \
                           C = foreach B generate flatten(doSomeCalculation(A)) as (result, error);")
        pi = pig.openIterator("C", 'outfile')
        # read the stored result back and pull out the error field
        output = grunt.exec("fs cat 'outfile'")
        bla = output.partition("\t")
        error = float(bla[2])
        if error > 1.0:
            grunt.exec("fs mv 'outfile' 'infile'")

All of these references to `pig` and `grunt` as objects with command strings are undesirable.
So while I believe that embedding is a much better approach due to the lower work load and
the plethora of tools available for other
languages, I do not believe the above is an acceptable way to do it.  Thus I would like to
place three additional requirements on
embedded Pig Latin beyond those given above for Turing complete Pig Latin:
 1.#7 Pig Latin should appear as a natural part of the language it is embedded in, not as
quoted strings.
 1. Syntactic and semantic checking should be done up front before the script is run.
 1. It should be possible to write UDFs in the scripting language that Pig Latin is embedded
in and reference them from Pig Latin.

This overcomes two of the three drawbacks noted above.  It does not provide for a way to do
certain optimizations such as loop
unrolling, but I think that is acceptable.
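
To make the second requirement concrete, here is a rough Python sketch of checking every Pig Latin statement in a script before any of it runs.  Everything in it is hypothetical: `check_statement` is a placeholder that only looks for a terminating semicolon and balanced parentheses, whereas a real implementation would presumably hand each statement to Pig's own parser.

```python
def check_statement(statement):
    """Hypothetical placeholder check: returns an error message or None.

    This stand-in only looks for a terminating semicolon and balanced
    parentheses; it is not Pig's parser.
    """
    if not statement.rstrip().endswith(";"):
        return "missing terminating ';'"
    if statement.count("(") != statement.count(")"):
        return "unbalanced parentheses"
    return None


def check_all(blocks, check=check_statement):
    """Check every statement in every block and collect all problems.

    Returns a list of (block_index, statement_index, message) tuples.
    """
    problems = []
    for b, block in enumerate(blocks):
        for s, statement in enumerate(block):
            message = check(statement)
            if message is not None:
                problems.append((b, s, message))
    return problems


def run_script(blocks, execute):
    """Run blocks only if the whole script passes the up-front check."""
    problems = check_all(blocks)
    if problems:
        raise ValueError("script rejected before execution: %r" % (problems,))
    for block in blocks:
        execute(block)
```

The design point is that `run_script` rejects the whole script before the first block executes, so a typo in a late block is reported immediately rather than after hours of earlier jobs.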

What might this look like?  Again using the script snippet at the top and embedding it in
Jython, this might look like:

    error = 100.0
    infile = 'original.data'
    while error > 1.0:
        PIG:
            A = load '$infile';
            B = group A all;
            C = foreach B generate flatten(doSomeCalculation(A)) as (result, error);
            $error = foreach C generate error;
            store C into 'outfile';
        if error > 1.0:
            PIG:
                fs mv 'outfile' 'infile';
            infile = 'infile';

    def doSomeCalculation(A):
        for x in A:

A preprocessor could then be applied to the above that would convert this to a form of Jython
that the embedding functionality provided as part of
the `pyg.tgz` patch already attached to [[https://issues.apache.org/jira/browse/PIG-928|PIG-928]]
can run.  In other words the above would be
converted into something like the second example that uses the !PigServer interface.  This
preprocessor could also submit the
Pig Latin portions of the script for syntactic and semantic checking.  
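
As a rough illustration of the conversion step, the sketch below turns a list of Pig Latin statements into lines of Python source that call `registerQuery`.  The function name `to_register_calls`, the variable name `pig`, and the `RecordingServer` stand-in are all assumptions made for this example; the actual interface in the PIG-928 patch may differ.

```python
def to_register_calls(pig_statements, indent="", server_name="pig"):
    """Turn Pig Latin statements into lines of Python source.

    Each statement becomes one registerQuery call on a variable named
    `server_name`; repr() handles the quoting so that quotes inside the
    Pig Latin survive.
    """
    return [
        "%s%s.registerQuery(%s)" % (indent, server_name, repr(stmt))
        for stmt in pig_statements
    ]


class RecordingServer:
    """Hypothetical stand-in for a PigServer wrapper; it only records."""

    def __init__(self):
        self.queries = []

    def registerQuery(self, query):
        self.queries.append(query)


statements = ["A = load 'original.data';", "B = group A all;"]
generated = "\n".join(to_register_calls(statements))

# Run the generated source against the stand-in to show it is valid Python.
pig = RecordingServer()
exec(generated, {"pig": pig})
```

Using `repr()` rather than hand-built quoting means the user never has to write the quoted strings, while the generated code still has the quoted form that the embedding layer expects.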

This preprocessor would find the Pig Latin segments of code via the `PIG:` tag.  `PIG:` would
have the same block scoping rules as
other block operators in Python.  Variables from the Python code would be imported to and
exported from the Pig Latin via parameter
substitution syntax (e.g. notice how variables `infile` and `error` appear as `$infile` and
`$error` inside the Pig Latin).
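
A minimal Python sketch of how such a preprocessor might locate `PIG:` blocks by indentation and work out which variables cross the boundary follows.  The function names are invented for illustration, and the rule that a line starting with `$name =` is an export while any other `$name` is an import is an assumption drawn from the example, not a settled design.

```python
import re

PIG_TAG = "PIG:"
EXPORT = re.compile(r"^\$([A-Za-z_]\w*)\s*=")
PARAM = re.compile(r"\$([A-Za-z_]\w*)")


def indent_of(line):
    """Number of leading whitespace characters on a line."""
    return len(line) - len(line.lstrip())


def find_pig_blocks(source):
    """Return one list of stripped Pig Latin lines per PIG: block.

    A block is every non-blank line indented deeper than its PIG: tag,
    the same rule Python applies to other block statements.
    """
    lines = source.splitlines()
    blocks = []
    i = 0
    while i < len(lines):
        if lines[i].strip() != PIG_TAG:
            i += 1
            continue
        tag_indent = indent_of(lines[i])
        body = []
        i += 1
        while i < len(lines):
            line = lines[i]
            if line.strip() and indent_of(line) <= tag_indent:
                break  # a dedent ends the block
            if line.strip():
                body.append(line.strip())
            i += 1
        blocks.append(body)
    return blocks


def classify_variables(pig_lines):
    """Split $name references into imported and exported variables.

    Assumption: a line that starts with `$name =` exports its result
    to the host variable; any other $name is read from the host.
    """
    imports, exports = [], []
    for line in pig_lines:
        match = EXPORT.match(line)
        if match:
            if match.group(1) not in exports:
                exports.append(match.group(1))
            line = line[match.end():]
        for name in PARAM.findall(line):
            if name not in imports:
                imports.append(name)
    return imports, exports


SCRIPT = """\
while error > 1.0:
    PIG:
        A = load '$infile';
        $error = foreach C generate error;
    if error > 1.0:
        PIG:
            fs mv 'outfile' 'infile';
"""
```

Calling `find_pig_blocks(SCRIPT)` yields two blocks, and `classify_variables` on the first reports `infile` as imported and `error` as exported.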

The last drawback, which this proposal does not address, is that we have to pick a particular
scripting language to embed Pig Latin in.  There are two solutions I can see here:
 1. Do our best to pick a good one and live with people's unhappiness.
 1. Write the preprocessor in such a way that it is relatively easy to switch the language it
embeds Pig Latin into.

While option two sounds nice, it greatly complicates the project.  And realistically, how many
people are going to take on all the work to port it to another language?  Based on this I
suggest we go with option one and choose Jython as the scripting language.  I
vote for Jython for two reasons.  One, in order to meet the goal of embedding functions from
the scripting language directly as UDFs,
we have to pick a language that compiles to Java byte code.  That leaves us with Jython, JRuby,
Groovy, or !JavaScript.  Two, out of that
field, we already have half of the implementation we need in Jython with [[https://issues.apache.org/jira/browse/PIG-928|PIG-928]].
