hadoop-pig-dev mailing list archives

From "Julien Le Dem (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-928) UDFs in scripting languages
Date Sat, 13 Mar 2010 22:34:27 GMT

     [ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Le Dem updated PIG-928:
------------------------------

    Attachment: pyg.tgz

Hi,
I'm attaching something I implemented last year. I cleaned it up and updated the dependency
to Pig 0.6.0 for the occasion.
There is probably some overlap with previous posts; sorry about the late submission.
Here is my approach.
I wanted to make a few things easier:
 - writing programs that require multiple calls to Pig
 - UDFs
 - parameter passing to Pig
So I integrated Pig with Jython so that the whole program (main program, UDFs, Pig scripts)
can live in a single Python script.
Example: python/tc.py in the attachment

The script defines Python functions that are automatically available to Pig as UDFs. The
@outputSchema decorator is an easy way to specify the output schema of a UDF.
Example (see script): @outputSchema("relationships:{t:(target:chararray, candidate:chararray)}")
Also notice that the UDFs use the standard Python constructs tuple, dictionary and list;
they are converted to Pig constructs on the fly. This makes defining UDFs in Python
very easy. Notice that a UDF takes a list of arguments, not a tuple: the input tuple is
automatically mapped to the arguments.
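
To illustrate the two ideas above, here is a minimal sketch in plain Python. It is not the code from the attachment: this outputSchema is a stand-in written for the example, the output_schema attribute name is an assumption, and related and call_udf are hypothetical names.

```python
def outputSchema(schema):
    """Stand-in decorator: records the Pig schema string on the function."""
    def decorate(func):
        func.output_schema = schema  # attribute name is an assumption
        return func
    return decorate


@outputSchema("relationships:{t:(target:chararray, candidate:chararray)}")
def related(target, candidates):
    # Returns a plain Python list of tuples; a bridge layer would convert
    # it to a Pig bag of tuples.
    return [(target, c) for c in candidates]


def call_udf(func, input_tuple):
    """Hypothetical bridge step: spread the input tuple over the arguments."""
    return func(*input_tuple)


result = call_udf(related, ("a", ["b", "c"]))
```

The point of the sketch is that the UDF body never touches Pig types: the schema travels as metadata on the function, and the input tuple is unpacked into ordinary positional arguments.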

Then the script defines a main() function, which is the main program executed on the client.
In main(), the Python program has access to a global variable named pig that provides two methods
(for now) and is designed to be an equivalent of PigServer:
List<ExecJob> executeScript(String script)
    executes a Pig script inlined in Python
deleteFile(String filename)
    deletes a file
This looks a little like the JDBC approach, where you "query" Pig and can then process
the data.
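
A rough sketch of what a main() using the two methods listed above might look like. The FakePig class is a hypothetical stand-in so the example runs on its own; in the real tool the pig global is supplied by the runtime, and the file names and Pig statements here are made up for illustration.

```python
class FakePig(object):
    """Hypothetical stand-in for the global `pig` object; it only records calls."""

    def __init__(self):
        self.calls = []

    def executeScript(self, script):
        self.calls.append(("executeScript", script))
        return []  # the real method returns a List<ExecJob>

    def deleteFile(self, filename):
        self.calls.append(("deleteFile", filename))


pig = FakePig()  # assumption: the runtime would normally inject this global


def main():
    pig.deleteFile("out")  # clear output left over from a previous run
    jobs = pig.executeScript("""
        a = LOAD 'in' AS (x:chararray);
        STORE a INTO 'out';
    """)
    return jobs


jobs = main()
```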

You can also embed Python expressions in the Pig statements using ${ ... }
Example: ${n - 1}
They are evaluated in the current scope and their results are substituted into the script.
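
One way such a substitution could be implemented is sketched below. This is an assumption about the mechanism, not the attachment's code: the expand function is hypothetical, the scope is passed in explicitly rather than taken from the caller, and the simple pattern does not handle expressions that themselves contain a closing brace.

```python
import re


def expand(script, scope):
    """Replace each ${expr} with str() of expr evaluated against `scope`."""
    def substitute(match):
        return str(eval(match.group(1), {}, scope))
    return re.sub(r"\$\{([^}]*)\}", substitute, script)


n = 3
expanded = expand("b = LOAD 'closure_${n - 1}';", {"n": n})
# expanded is now: b = LOAD 'closure_2';
```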

To run the example (assuming javac, jar and java are in your PATH):
 - tar xzvf pyg.tgz
 - add pig-0.6.0-core.jar to the lib folder
 - ./makejar.sh
 - ./runme.sh

It runs the following:
org.apache.pig.pyg.Pyg local tc.py

tc.py is a Python script that computes the transitive closure of a list of relations using an
iterative algorithm. It defines Python functions that are used as UDFs, as described above.
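
For readers unfamiliar with the algorithm, here is a sketch of an iterative transitive closure on in-memory sets. It shows only the general technique; it is not the content of tc.py, which runs Pig jobs for each iteration, and the function name is hypothetical.

```python
def transitive_closure(relations):
    """Iteratively extend a set of (a, b) pairs until no new pair appears."""
    closure = set(relations)
    while True:
        # Join the closure with itself on the middle element:
        # (a, b) and (b, d) together yield (a, d).
        derived = set((a, d)
                      for (a, b) in closure
                      for (c, d) in closure
                      if b == c)
        if derived <= closure:
            return closure  # fixed point reached, nothing new was derived
        closure |= derived


pairs = transitive_closure([("a", "b"), ("b", "c"), ("c", "d")])
```

Each pass plays the role of one self-join, and the loop stops at the first pass that adds no new pair, which is the kind of multi-call control flow that motivates driving Pig from a host script.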

Limitations:
 - You cannot include other Python scripts yet, but this should be doable.
 - I haven't spent much time testing performance. I suspect the Pig<->Python type conversion
is a little slow because it creates many new objects. It could possibly be improved by making
the Pig objects implement the Python interfaces.

(the attachment contains jython.jar 2.5.0 for simplicity)

Best regards, Julien

> UDFs in scripting languages
> ---------------------------
>
>                 Key: PIG-928
>                 URL: https://issues.apache.org/jira/browse/PIG-928
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Alan Gates
>         Attachments: package.zip, pyg.tgz, scripting.tgz, scripting.tgz
>
>
> It should be possible to write UDFs in scripting languages such as python, ruby, etc.
> This frees users from needing to compile Java, generate a jar, etc.  It also opens Pig to
> programmers who prefer scripting languages over Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

