pig-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Update of "TuringCompletePig" by AlanGates
Date Tue, 08 Jun 2010 17:46:31 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The "TuringCompletePig" page has been changed by AlanGates.
http://wiki.apache.org/pig/TuringCompletePig?action=diff&rev1=1&rev2=2

--------------------------------------------------

  = Making Pig Latin Turing Complete =
  == Introduction ==
  As more users adopt Pig and begin writing their data processing in Pig Latin and as they
use Pig to process more and more complex
- tasks, a consistent request from these users is to add branches, loops, and functions to
Pig Latin.  This will enable Pig Latin to
+ tasks, a consistent request from these users has been to add branches, loops, and functions
to Pig Latin.  This will enable Pig Latin to
  process a whole new class of problems.  Consider, for example, an algorithm that needs to
iterate until an error estimate is less
  than a given threshold.  This might look like (this just suggests logic, not syntax):
  
@@ -22, +22 @@

  
  == Requirements ==
  The following should be provided by this Turing complete Pig Latin:
-  1. Branching.  This will be satisfied by a standard `if` `else if` `else` functionality
+  1. Branching.  This will be satisfied by a standard `if / else if / else` functionality
   1. Looping.  This should include standard `while` and some form of `for`.  `for` could be
C style or Python style (foreach).  Care needs to be taken to select syntax that does not
cause confusion with the existing `foreach` operator in Pig Latin.
   1. Functions.  
   1. Modules.
@@ -49, +49 @@

   * Which scripting language to choose?  Perl, Python, and Ruby all have significant adoption
and could make a claim to be the best choice.
   * Syntactic and semantic checking is usually delayed until an embedded bit of code is reached
in the outer control flow.  Given that Pig jobs can run for hours this can mean spending hours
to discover a simple typo.
  
- Consider for example if built a python class that wrapped !PigServer and then translated
the above code snippet.
+ Consider for example if Pig provided a Jython class that wrapped !PigServer and then we
translated the above code snippet.
  
  {{{
      error = 100.0
@@ -68, +68 @@

              grunt.exec("fs mv 'outfile' 'infile'")
  }}}
  
- All of these references to `pig` and `grunt` as objects with command strings is undesirable.
+ All of these references to `pig` and `grunt` as objects with command strings are undesirable.
  So while I believe that embedding is a much better approach due to the lower work load and
the plethora of tools available for other
  languages, I do not believe the above is an acceptable way to do it.  Thus I would like
to place three additional requirements on
  embedded Pig Latin beyond those given above for Turing complete Pig Latin:
@@ -79, +79 @@

  This overcomes two of the three drawbacks noted above.  It does not provide for a way to
do certain optimizations such as loop
  unrolling, but I think that is acceptable.
  
+ Having rejected the quote style of programming we could choose the Domain Specific Language
(DSL) option, where we define Pig operators in the
+ target language.  Again using Python as an example:
+ 
+ {{{
+    error = 100.0
+    infile = 'original.data'
+    pig = PigServer()
+    grunt = Grunt()
+    while error > 1.0:
+        A = pig.load(infile, {'loader': 'piggybank.MyLoader'});
+        B = A.group(pig.ALL);
+        C = B.foreach { 
+               innerBag = doSomeCalculation(:A);
+               generate innerBag.flatten().as(:result,  :error)
+        }
+        
+        pi = pig.openIterator(C, 'outfile');
+        output = grunt.fs.cat('outfile');
+        bla = output.partition("\t");
+        error = float(bla[2])
+        if error >= 1.0:
+            grunt.fs.mv('outfile', 'infile');
+ }}}
+ 
+ This meets requirements 7 and 9 above.  It can partially but not fully meet 8.  It can check
that we use the right operators and pass
+ them the right types.  It cannot check the semantics of the operators, for example that
`infile` exists and is readable.  This might be ok,
+ because it might turn out that things that cannot be checked at script compile time should
not be checked up front anyway.  As an example, it should not 
+ check for `infile` up front because the script may not have created it yet.
+ 
+ This approach has the advantage that it will integrate very nicely with tools from the target
language.  Debuggers, IDEs, etc. will all now
+ view some form of Pig Latin as native to their language.
+ 
+ It does however have a drawback, which is that we would be creating a new dialect of
Pig Latin.  There would be a Pig Latin dialect used when writing it
+ directly, and a different dialect for embedding.  This leads to confusion and duplication
of effort.  So I would like to suggest another
+ requirement:
+ 
+   1.#10 Pig Latin should appear the same in the embedded form as in the non-embedded form.
+     
- What might this look like?  Again using the script snippet at the top and embedding it in
Jython, this might look like:
+ Given all these requirements, what might this look like?  Again using the script snippet
at the top and embedding it in Jython:
  {{{
      error = 100.0
      infile = 'original.data'
@@ -120, +158 @@

  we have to pick a language that compiles to Java byte code.  That leaves us with Jython,
JRuby, Groovy, or !JavaScript.  Out of that
  field we already have half of the implementation we need in Jython with [[https://issues.apache.org/jira/browse/PIG-928|PIG-928]].

  
+ Thoughts?  Preferences for one of the options I did not like?  Comments welcome.
+ 
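
The DSL discussion in the diff above says such an embedding could check operators and their argument types when the script is built, while leaving checks such as whether `infile` exists to run time. A minimal sketch of that split might look like the following. Everything in it is hypothetical and exists only for illustration: `Relation`, `PlanError`, `load`, `halve`, and `iterate` are not part of Pig or of PIG-928, and no Pig job is run.

```python
"""Hypothetical sketch of the DSL option. None of these names exist in Pig."""


class PlanError(TypeError):
    """Raised when a plan is built with the wrong operator argument types."""


class Relation:
    """A lazily built logical plan: a tuple of (operator, argument) steps."""

    def __init__(self, steps):
        self.steps = tuple(steps)

    def group(self, key):
        # Checked while the plan is built, long before any job would run.
        if not isinstance(key, str):
            raise PlanError("group key must be a string, got %r" % (key,))
        return Relation(self.steps + (("group", key),))

    def foreach(self, func):
        if not callable(func):
            raise PlanError("foreach needs a callable, got %r" % (func,))
        return Relation(self.steps + (("foreach", func.__name__),))


def load(path):
    # Only the argument type is checked here. Whether `path` exists is
    # deliberately left to run time, since an earlier step may create it.
    if not isinstance(path, str):
        raise PlanError("load path must be a string, got %r" % (path,))
    return Relation((("load", path),))


def halve(error):
    """Stand-in for one pass of the real computation."""
    return error / 2.0


def iterate(error, threshold=1.0):
    """Rebuild the plan on each pass until the error estimate is small enough."""
    passes = 0
    while error > threshold:
        plan = load("infile").group("all").foreach(halve)  # type-checked here
        error = halve(error)  # a real embedding would execute `plan` instead
        passes += 1
    return error, passes
```

In this sketch `load(42)` fails immediately with `PlanError`, whereas `load("missing.data")` builds without complaint because file existence is not a build-time property. The outer `while` loop is ordinary host-language control flow, which is the point of the embedding approach.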
