pig-dev mailing list archives

From "Julien Le Dem (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-928) UDFs in scripting languages
Date Sun, 21 Mar 2010 21:49:27 GMT

    [ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847986#action_12847986 ]

Julien Le Dem commented on PIG-928:


The main advantage of embedding Pig calls in the scripting language is that it enables iterative
algorithms, which Pig is not very good at currently. Why would we limit users to UDFs when
they can have their whole program in their scripting language of choice?

4. Python is a very interesting language to integrate with Pig because it has all the same
native data structures (tuple:tuple, list:bag, dictionary:map) which makes the UDFs compact
and easy to code. That said, in scripting languages whose types do not map onto the Pig
types as cleanly as Python's, using the schema to disambiguate will be a must-have.
When do we need to convert sequences and iterators? Pig has only tuple, bag and map as complex
types AFAIK.
5. Agreed, it should be cached or initialized at the beginning.
3. and 6. I'll investigate passing the main script through the classpath when I have time.
One interpreter would be nice to save memory and initialization time. I'm not sure the shared
state is such an advantage as UDFs should not rely on being run in the same process. Maybe
I'm just missing something.

About the multi language: I'm not against it, but there's not that much code to share.
The scripting<->Pig type conversion is specific to each language, as you mentioned. Also,
calling functions, getting a list of functions, and defining output schemas will be specific.

How I see the multi-language support:

pig local|mapred -script {language} {scriptfile}

main program:
- generic: loads the script file
- generic: makes the script available in the classpath of the tasks (through a jar generated
on the fly?)
- specific: initializes the interpreter for the scripting language
- specific: adds the global variables defined by pig for the main (in my case: decorators,
pig server instance)
- generic: loads the script in the interpreter
- specific: figures out the list of functions and registers them automatically as UDFs in
Pig using a dedicated UDF wrapper class
- specific: runs the main
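
The "figures out the list of functions" step can be sketched in plain Python as below. This is only an illustration under assumptions: load_script and discover_udfs are hypothetical names, the returned dict stands in for the registration with Pig through the wrapper class, and the rule "public functions defined in the script itself" is one possible policy, not something decided in this issue.

```python
import inspect
import types


def load_script(source, name="user_script"):
    """Execute the script source in a fresh module object and return it."""
    module = types.ModuleType(name)
    exec(compile(source, name, "exec"), module.__dict__)
    return module


def discover_udfs(module):
    """Return {name: function} for public functions defined by the script.

    Functions imported from other modules and names starting with an
    underscore are skipped (an assumed policy).
    """
    return {
        name: obj
        for name, obj in vars(module).items()
        if inspect.isfunction(obj)
        and not name.startswith("_")
        and obj.__module__ == module.__name__
    }


SAMPLE_SCRIPT = '''
from os.path import join

def square(x):
    return x * x

def _helper(x):
    return x
'''

sample_udfs = discover_udfs(load_script(SAMPLE_SCRIPT))
```

Here sample_udfs contains only square: join is filtered out because it was imported, and _helper because of its leading underscore.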

Pig execute call from the script:
- generic: parses the Pig string to replace ${expression} with the value of the expression as
evaluated by the interpreter in the local scope.
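
A minimal sketch of that substitution in Python, assuming the caller passes its local variables as a dict. The name bind_expressions is hypothetical, the regular expression does not handle a "}" inside an expression, and eval runs arbitrary code, so this only suits scripts the user wrote themselves.

```python
import re

# Matches ${...} and captures the expression text between the braces.
_PLACEHOLDER = re.compile(r"\$\{([^}]*)\}")


def bind_expressions(pig_text, scope):
    """Replace each ${expression} with str() of its value evaluated in scope."""
    def evaluate(match):
        return str(eval(match.group(1), {}, scope))
    return _PLACEHOLDER.sub(evaluate, pig_text)


i = 3
query = bind_expressions("B = FILTER A BY iteration == ${i + 1};", {"i": i})
# query is now: B = FILTER A BY iteration == 4;
```

Inside a script's main, the scope argument could be locals(), which is what makes a loop variable visible to the Pig string on each iteration.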

UDF init:
- generic: loads the script from the classpath
- specific: initializes the interpreter for the scripting language
- specific: adds the global variables defined by Pig for the UDFs (in my case: decorators)
- generic: loads the script in the interpreter
- specific: figures out how the output schema is determined at runtime: function call or
static schema (the schema parsing itself is generic)
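
One way to picture the "function call or static schema" choice is a decorator that accepts either a schema string or a callable. This is a sketch under assumptions: output_schema and resolve_output_schema are hypothetical names and are not taken from the attached patches, and schemas are kept as plain strings where a real implementation would hand them to the generic schema parser.

```python
def output_schema(schema):
    """Attach a static schema string, or a callable computing one, to a UDF."""
    def decorate(func):
        func.output_schema = schema
        return func
    return decorate


def resolve_output_schema(func, input_schema):
    """Return the UDF's output schema string, or None if none was declared."""
    schema = getattr(func, "output_schema", None)
    if callable(schema):
        # Dynamic case: the schema depends on the input schema.
        return schema(input_schema)
    # Static case: a fixed schema string (or None).
    return schema


@output_schema("length:int")
def name_length(name):
    return len(name)


@output_schema(lambda input_schema: input_schema)
def identity(value):
    return value
```

With these definitions, resolve_output_schema(name_length, "name:chararray") yields the static string, while the same call on identity returns whatever input schema it was given.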

UDF call:
- specific: converts a Pig tuple to a parameter list in the scripting language types
- specific: calls the function with the parameters
- specific: converts the result to Pig types
- generic: returns the result
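
The UDF call steps above can be sketched with the tuple:tuple, list:bag, dictionary:map mapping mentioned earlier. PigTuple and PigBag below are hypothetical stand-ins for the real Pig data classes, so that the sketch runs without Pig on the path; a Pig map is modeled as a plain dict, and a returned list is assumed to contain tuples, as a bag would.

```python
from dataclasses import dataclass


@dataclass
class PigTuple:
    """Stand-in for a Pig tuple: an ordered list of fields."""
    fields: list


@dataclass
class PigBag:
    """Stand-in for a Pig bag: a collection of tuples."""
    tuples: list


def to_python(value):
    """Convert stand-in Pig values to native Python values, recursively."""
    if isinstance(value, PigTuple):
        return tuple(to_python(f) for f in value.fields)
    if isinstance(value, PigBag):
        return [to_python(t) for t in value.tuples]
    if isinstance(value, dict):
        return {k: to_python(v) for k, v in value.items()}
    return value


def to_pig(value):
    """Convert native Python values back to stand-in Pig values."""
    if isinstance(value, tuple):
        return PigTuple([to_pig(f) for f in value])
    if isinstance(value, list):
        return PigBag([to_pig(t) for t in value])
    if isinstance(value, dict):
        return {k: to_pig(v) for k, v in value.items()}
    return value


def call_udf(func, pig_input):
    """Unpack the input tuple into parameters, call func, convert the result."""
    args = to_python(pig_input)
    return to_pig(func(*args))


def explode(word, count):
    return [(word, i) for i in range(count)]


result = call_udf(explode, PigTuple(["a", 2]))
```

In this example result is a PigBag holding the tuples ("a", 0) and ("a", 1). For a language whose native types do not line up this neatly, to_pig would need the declared output schema to pick the target type.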

> UDFs in scripting languages
> ---------------------------
>                 Key: PIG-928
>                 URL: https://issues.apache.org/jira/browse/PIG-928
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Alan Gates
>         Attachments: package.zip, pyg.tgz, scripting.tgz, scripting.tgz
> It should be possible to write UDFs in scripting languages such as python, ruby, etc.
> This frees users from needing to compile Java, generate a jar, etc.  It also opens Pig to
> programmers who prefer scripting languages over Java.

