hadoop-pig-dev mailing list archives

From "Woody Anderson (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-928) UDFs in scripting languages
Date Thu, 04 Mar 2010 23:13:41 GMT

    [ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841599#action_12841599 ]

Woody Anderson commented on PIG-928:

I don't think there is _any_ measurable overhead to the reflection mechanism in the example
I provided. The objects are allocated "a few" times due to the schema interrogation logic
of pig (something that might deserve an entire other bug thread of discussion, as i have no
idea why X copies of a UDF have to be allocated for this).
When it comes time to run (i.e. where it really counts), there is a single invocation of the
factory followed by a "huge" (data-set-derived) number of calls to the resulting function. The
UDF that is called is fully built and fully initialized with final variables etc., facilitating
maximally streamlined execution.
There are certainly things to criticize about the approach I took, but language selection
overhead is not one of them. If you have profiling numbers that suggest otherwise I'd be
suitably surprised.
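
To make the "select once, call many times" point concrete, here is a minimal sketch in Python. It is not the code from the attached package; the names BUILDERS, make_udf and _build_python are hypothetical and only illustrate the shape of the pattern. The language lookup happens exactly once, in the factory, and the per-row loop only ever touches the callable that was returned.

```python
from typing import Callable, Dict, List


def _build_python(source: str, name: str) -> Callable:
    """Hypothetical builder: run the script once and pull out one function."""
    scope: dict = {}
    exec(source, scope)  # the script is evaluated a single time, up front
    return scope[name]


# Hypothetical registry mapping a language name to its builder.
BUILDERS: Dict[str, Callable[[str, str], Callable]] = {
    "python": _build_python,
}


def make_udf(language: str, source: str, name: str) -> Callable:
    """Factory: the only place where the language is looked up."""
    builder = BUILDERS[language.lower()]
    return builder(source, name)


# One factory call ...
split = make_udf("python", "def split(a):\n    return a.split(' ')\n", "split")

# ... followed by a data-set-sized number of plain function calls.
rows: List[str] = ["a b", "c d e"]
results = [split(row) for row in rows]
```

Whatever the per-call cost turns out to be, in this shape it belongs to the engine behind the returned callable, not to the lookup that selected it.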

A secondary point, on why we need some script-language glue code beyond, say, BSF or
javax.script: type coercion. BSF/javax.script is not usable in a drop-in manner.
Each engine unfortunately consumes and produces objects in its own object model. If either
of these frameworks had bothered to mandate converting input/output to java.util types, things
would at least be easier, because we could convert from that to DataBag/Tuple in a unified
manner, but this isn't the case. Thus conversion must be implemented per engine, at which point
a direct conversion from PyArray to Tuple is more appropriate than PyArray -> List -> Tuple,
for performance.
But even for rudimentary correctness, type conversion must be implemented for each engine, at
which point a wrapper that selects an appropriate function factory is a necessary pattern.
Orthogonal to the above point is the question of supporting many script languages vs. just
a few. I guess I am personally not of the same mind as you guys.
I think there is near zero 'overhead' perf cost for supporting some unspecified language.
Languages continually evolve and new languages emerge that utilize the JVM better and better.
I certainly agree that, at this time, jython and jruby seem the best. However, to say that
clojure or javascript, or whatever are not going to move forward and potentially become more
effectively integrated with the JVM is a bit premature.

I would make the sacrifice if the ability to support multiple languages was actually that
hard, or had an actual serious performance cost.
I just don't think those two issues are real.

The performance costs come from the individual scripting engine features with respect to byte-code
compilation, function referencing, string manipulation, execution caching etc., and their
type coercion complexities.
That is completely different than the cost of PIG supporting multiple languages.
Supporting multiple languages is also not that hard. Arnab has thought about this, as
have I. I think his ideas, while not perfect, offer a good avenue of exploration and moving
forward that offers integration of PIG with any script language. It (importantly) offers to
put those languages in PIG instead of the other way around, and it allows for multiple interpreter
contexts and even multiple languages.

I'll quote Arnab's quick description here:
DEFINE CMD `SCRIPTBLOCK` script('javascript')
This is identical to the commandline streaming syntax, and follows gracefully in the style
of the "ship" and "cache" keywords. 

Thus your javascript example becomes
DEFINE JSBlock `
function split(a) {
  return a.split(" ");
}
` script('JAVASCRIPT');
Note the use of backticks is consistent with the current syntax, and is unlikely to occur
in common scripts, so it saves us the escaping. Also it allows newlines in the code. 
The goal is to create namespaces -- you can now call your function as "JSBlock.split(a)".
This allows us to have multiple functions in one block. 

This idea, coupled with the ability to register files and directories directly (e.g. register
foo.py;), provides the ability to load code into an arbitrary namespace/interpreter scope,
load it for an arbitrary language, etc.
The invocation syntax is nice and clean: Block.foo() calls a method named foo in the interpreter.
For the easy invocation syntax to perform well, we would need to make it execute
in the same way as:
  define spig_split org.apache.pig.scripting.Eval('jython','split','b:{tt:(t:chararray)}');

I don't see that as a particularly difficult modification of the function rationalization logic
of pig. Actually, I think it's a general improvement as it cuts down on object allocations.
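
As an illustration only, here is a small Python sketch of how a script block could be parsed into a namespace and how a call such as JSBlock.split could be resolved to a (language, function name, source) triple, i.e. the same information the explicit Eval('jython','split',...) form carries. This is not Pig's parser: DEFINE_RE, parse_blocks and resolve are hypothetical names, the regular expression only handles the simple form quoted above, and the example block is written out in full as I read the proposal.

```python
import re
from typing import Dict, Tuple

# Matches: DEFINE <name> `<script body>` script('<language>');
DEFINE_RE = re.compile(
    r"DEFINE\s+(\w+)\s+`(.*?)`\s+script\('(\w+)'\)\s*;?",
    re.DOTALL | re.IGNORECASE,
)


def parse_blocks(script: str) -> Dict[str, Tuple[str, str]]:
    """Map each block name to (language, source code)."""
    blocks: Dict[str, Tuple[str, str]] = {}
    for name, body, language in DEFINE_RE.findall(script):
        blocks[name] = (language.lower(), body.strip())
    return blocks


def resolve(call: str, blocks: Dict[str, Tuple[str, str]]) -> Tuple[str, str, str]:
    """Turn 'Block.func' into (language, function name, source)."""
    block_name, func_name = call.split(".", 1)
    language, source = blocks[block_name]
    return language, func_name, source


PIG_SCRIPT = """
DEFINE JSBlock `
function split(a) {
  return a.split(" ");
}
` script('JAVASCRIPT');
"""

blocks = parse_blocks(PIG_SCRIPT)
language, func_name, source = resolve("JSBlock.split", blocks)
```

If the resolution is done once at plan time, the per-row path can be the same factory-built callable as in the explicit define form, so the nicer syntax would not need to cost anything at execution.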

In the event that this methodology is adopted, you are then still free to write projects that
stuff PIG inside python or ruby etc. But PIG itself remains an environment that plays well
with multiple script engines.

I see it as quite achievable to support any given language with near zero overhead above the
language's script engine.
I think it's quite doable to do this in a flexible model that allows languages to be mixed
together, even within the same script.
I think that, overall, this is highly preferable to a single or otherwise finite language
situation (though I advocate possibly auto-supporting jython/jruby).

> UDFs in scripting languages
> ---------------------------
>                 Key: PIG-928
>                 URL: https://issues.apache.org/jira/browse/PIG-928
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Alan Gates
>         Attachments: package.zip, scripting.tgz, scripting.tgz
> It should be possible to write UDFs in scripting languages such as python, ruby, etc.
> This frees users from needing to compile Java, generate a jar, etc.  It also opens Pig to
> programmers who prefer scripting languages over Java.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
