hadoop-pig-dev mailing list archives

From "Woody Anderson (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-928) UDFs in scripting languages
Date Fri, 05 Mar 2010 22:19:27 GMT

    [ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12842062#action_12842062 ]

Woody Anderson commented on PIG-928:

Java reflection is very doable. It's kind of a pain, I guess, but you could definitely do it.
I think using BeanShell might be a way to use Java syntax if you want to, but Jython and JRuby
are also quite good at letting you call Java code easily and naturally.
What kind of reflection system are you thinking of? Passing a string as input to some function?
Or finding some way to assume you can make certain method calls on the objects that represent
the various data objects in Pig, e.g. $0.split("."), assuming $0 is a chararray/string?
Or are you thinking of something that equates to:
def splitter java.util.regex.Pattern("\.");
A = foreach B generate splitter.split($0);

To have it perform at 'peak', you'd need to do the reflection in the constructor and cache
the java.lang.reflect.Method object.
It wouldn't be too hard to write (the assumed impl uses the constructor args to determine the
correct Method via reflection):
def split org.apache.pig.scripting.Eval('reflect', 'java.util.regex.Pattern', 'split', "\.",
'String', 'b:{tt:(t:chararray)}');
A = foreach B generate split($0);
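
To illustrate the constructor-time caching idea, here is a minimal Java sketch. The class name CachedReflectCall and its constructor shape are hypothetical, not an existing Pig API; the sketch only assumes that the target object and the method's declared parameter types are known when the UDF is constructed, so that exec() does nothing but invoke the cached Method.

```java
import java.lang.reflect.Method;

// Hypothetical helper (not part of Pig): resolves the Method once, at construction.
final class CachedReflectCall {
    private final Object target;  // e.g. a compiled java.util.regex.Pattern
    private final Method method;  // looked up once, reused on every call

    CachedReflectCall(Object target, String methodName, Class<?>... argTypes) {
        this.target = target;
        try {
            // getMethod matches the declared parameter types exactly.
            this.method = target.getClass().getMethod(methodName, argTypes);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("no such method: " + methodName, e);
        }
    }

    // The per-record path: no lookup, only the reflective invocation.
    Object exec(Object... args) {
        try {
            return method.invoke(target, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Under these assumptions, something like new CachedReflectCall(Pattern.compile("\\."), "split", CharSequence.class) would be built once per UDF instance. Note that Pattern.split is declared on CharSequence, so a lookup with String.class would not find it; an implementation driven by type names such as 'String' would need to map them to the declared parameter types.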

To be more 'generic' but less performant, you could do it more like this (the assumed impl
uses less info and simply reflects on a particular object):
def split org.apache.pig.scripting.Eval('reflect', 'java.util.regex.Pattern', 'split', "\.");
A = foreach B generate split('split', $0);

The issue here is that each invocation has to determine the correct Method object (after the
first, it's probably highly cacheable). Also, since the method might change as a result of a
different name or different args, the lookup might also produce a different output schema.
At any rate, I think you could write reasonably performant caching code for this solution,
but it'd be more complicated and a tad slower than the former approach.
Mainly I've tried in all of my impls to do as little as possible in the exec() method, and
to make most objects in use final and immutable (e.g. build them all in the constructor).
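
The per-invocation lookup with caching could be sketched roughly as follows. GenericReflectCall is a hypothetical name, and the matching rule (first public method whose parameters accept the runtime argument types) is a simplifying assumption; it does not handle null arguments or primitive parameters.

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper (not part of Pig): the method name arrives with each call.
final class GenericReflectCall {
    private final Object target;
    // Cache keyed on method name plus runtime argument types.
    private final Map<String, Method> cache = new ConcurrentHashMap<>();

    GenericReflectCall(Object target) {
        this.target = target;
    }

    Object exec(String methodName, Object... args) {
        Class<?>[] types = new Class<?>[args.length];
        for (int i = 0; i < args.length; i++) {
            types[i] = args[i].getClass();  // assumption: no null arguments
        }
        String key = methodName + Arrays.toString(types);
        Method m = cache.computeIfAbsent(key, k -> lookup(methodName, types));
        try {
            return m.invoke(target, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Slow path, taken only on a cache miss.
    private Method lookup(String name, Class<?>[] types) {
        for (Method m : target.getClass().getMethods()) {
            Class<?>[] params = m.getParameterTypes();
            if (!m.getName().equals(name) || params.length != types.length) {
                continue;
            }
            boolean matches = true;
            for (int i = 0; i < params.length; i++) {
                // Simplification: primitive parameters never match boxed arguments.
                matches &= params[i].isAssignableFrom(types[i]);
            }
            if (matches) {
                return m;
            }
        }
        throw new IllegalArgumentException("no matching method: " + name);
    }

    int cachedMethods() {
        return cache.size();
    }
}
```

Every call still pays for building the cache key, which is the extra cost relative to the constructor-time approach. The return type also depends on which method was resolved (String[] for split, String for pattern), which is why the output schema cannot be fixed up front.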

You could of course go so far as to delay the creation of the actual Pattern object (i.e.
to where you first present the split pattern "\."). Again, it lends itself to performance-degrading
coding patterns, but if you're careful, I think you could get most of it back with
appropriately cached objects. Doing this in a completely generic fashion... I'll
think about it, I guess. I think there's more overhead here than in the other approaches, but
if your lib function does more than 'split', the overhead might not be noticeable. Of course,
you could implement each of these abstraction levels and use them judiciously.
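
For the delayed-creation variant, one way to recover most of the cost is to cache the compiled object the first time its argument is seen. The sketch below is specific to Pattern rather than fully generic, and LazySplit is a hypothetical name; it assumes the regex arrives as an argument on each call.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Pig): the regex is supplied per call.
final class LazySplit {
    // Compiled patterns, keyed on the regex string that produced them.
    private final Map<String, Pattern> compiled = new ConcurrentHashMap<>();

    String[] exec(String input, String regex) {
        // Compile on first sight of a regex, reuse the Pattern afterwards.
        return compiled.computeIfAbsent(regex, Pattern::compile).split(input);
    }

    int compiledPatterns() {
        return compiled.size();
    }
}
```

Each call still pays for a map lookup, and the cache grows with the number of distinct regexes, so this is only a reasonable trade when the set of patterns is small.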

Anyway, there are a lot of options here. Are these in line with what you were thinking?

> UDFs in scripting languages
> ---------------------------
>                 Key: PIG-928
>                 URL: https://issues.apache.org/jira/browse/PIG-928
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Alan Gates
>         Attachments: package.zip, scripting.tgz, scripting.tgz
> It should be possible to write UDFs in scripting languages such as python, ruby, etc.
> This frees users from needing to compile Java, generate a jar, etc. It also opens Pig to
> programmers who prefer scripting languages over Java.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
