pig-dev mailing list archives

From "Daniel Dai (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-2417) Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.
Date Tue, 24 Sep 2013 23:46:03 GMT

    [ https://issues.apache.org/jira/browse/PIG-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13776931#comment-13776931 ]

Daniel Dai commented on PIG-2417:
---------------------------------

I compiled on both my RHEL6 and Windows machines, and it seems fine for me. We can change
the javadoc anyway to fix the issue.
                
> Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.
> -----------------------------------------------------------------------------------------------------
>
>                 Key: PIG-2417
>                 URL: https://issues.apache.org/jira/browse/PIG-2417
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.12.0
>            Reporter: Jeremy Karn
>            Assignee: Jeremy Karn
>             Fix For: 0.12.0
>
>         Attachments: PIG-2417-4.patch, PIG-2417-5.patch, PIG-2417-6.patch, PIG-2417-7.patch,
PIG-2417-8.patch, PIG-2417-9-1.patch, PIG-2417-9-2.patch, PIG-2417-9.patch, PIG-2417-e2e.patch,
streaming2.patch, streaming3.patch, streaming.patch
>
>
> The goal of Streaming UDFs is to allow users to easily write UDFs in scripting languages
with no JVM implementation or a limited JVM implementation.  The initial proposal is outlined
here: https://cwiki.apache.org/confluence/display/PIG/StreamingUDFs.
> In order to implement this we need new syntax to distinguish a streaming UDF from an
embedded JVM UDF.  I'd propose something like the following (although I'm not sure 'language'
is the best term to be using):
> {code}define my_streaming_udfs language('python') ship('my_streaming_udfs.py'){code}
> We'll also need a language-specific controller script that gets shipped to the cluster
and is responsible for reading the input stream, deserializing the input data, passing it
to the user-written script, serializing that script's output, and writing it to the output
stream.
> Finally, we'll need to add a StreamingUDF class that extends EvalFunc.  This class will
likely share some of the existing code in POStream and ExecutableManager (where it makes sense
to pull out shared code) to stream data to/from the controller script.
> One alternative approach to creating the StreamingUDF EvalFunc is to use the POStream
operator directly.  This would involve inserting the POStream operator instead of the POUserFunc
operator whenever we encountered a streaming UDF while building the physical plan.  This approach
seemed problematic because there would need to be a lot of changes in order to support POStream
in all of the places we want to be able to use UDFs (for example, to operate on a single field
inside of a foreach statement).
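
For illustration only, the controller loop described in the issue (read the input stream,
deserialize, call the user-written script, serialize, write to the output stream) might be
sketched in Python roughly as follows. This is not the code from the attached patches. The
tab-delimited, one-record-per-line wire format and the names run_controller, deserialize,
serialize and concat_upper are all assumptions made for the example.

```python
import io

FIELD_DELIM = "\t"  # assumed wire format, not necessarily what Pig uses


def deserialize(line):
    """Split one input line into a list of string fields."""
    return line.rstrip("\n").split(FIELD_DELIM)


def serialize(value):
    """Render one UDF result as a single output line."""
    return str(value) + "\n"


def run_controller(udf, instream, outstream):
    """Read records, call the user function on each, write each result."""
    for line in instream:
        args = deserialize(line)
        outstream.write(serialize(udf(*args)))
        outstream.flush()  # the JVM side would be waiting on each result


def concat_upper(first, second):
    """Hypothetical user-written UDF, standing in for my_streaming_udfs.py."""
    return (first + second).upper()


# In-memory demonstration; a shipped controller would pass sys.stdin and
# sys.stdout here instead of StringIO objects.
demo_out = io.StringIO()
run_controller(concat_upper, io.StringIO("ab\tcd\n"), demo_out)
print(demo_out.getvalue(), end="")
```

Passing the streams in as parameters keeps the loop testable without a cluster. A real
controller would also need typed fields, error reporting, and a way to select which
function in the shipped script to call, all of which this sketch leaves out.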

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
