pig-dev mailing list archives

From "Woody Anderson (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PIG-1942) script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects
Date Tue, 03 May 2011 07:24:03 GMT

     [ https://issues.apache.org/jira/browse/PIG-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Woody Anderson updated PIG-1942:
--------------------------------

    Attachment: 1942.patch

I wanted to get this started, as this is a bit of a change.

It seems that people often misuse the outputSchema annotation, so that the output does not
match the specified schema. At least one unit test did this, and it's possible
that a few users in the wild have this issue as well.

At any rate, this patch includes code in JythonUtils that will coerce Jython object model
output into the schema that the function is annotated with.

It's faster than the existing code and has quite a bit more functionality. It can convert
arrays and many more types than previously. It also makes it much easier and faster to convert
[1,2,3] to a bag, rather than having to create [(1,), (2,), (3,)] in Jython.
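
To illustrate the kind of coercion described above, here is a rough sketch in plain Python.
It is not the code in 1942.patch: the names to_tuple and to_bag are hypothetical, and a Pig
bag is only modeled as a list of tuples, with a Pig tuple modeled as a Python tuple.

```python
import re

# Hypothetical stand-ins, not the JythonUtils code from the patch:
# a bag is modeled as a list of tuples, a Pig tuple as a Python tuple.

def to_tuple(value):
    """Promote a bare value to a 1-tuple; turn a list into a tuple."""
    if isinstance(value, tuple):
        return value
    if isinstance(value, list):
        return tuple(value)
    return (value,)

def to_bag(value):
    """Coerce a list, tuple, or set into a bag, promoting each element to a tuple."""
    if isinstance(value, (list, tuple, set, frozenset)):
        return [to_tuple(item) for item in value]
    raise TypeError("cannot convert %s to a bag" % type(value).__name__)

# A UDF body can then return a plain list and let the declared schema drive the wrapping.
ints_as_bag = to_bag([1, 2, 3])
words_as_bag = to_bag(re.compile(r"\s+").split("a b c"))
```

With something like this applied to a UDF's return value, the function itself only needs to
return [1,2,3] or the result of split(); the tuple wrapping is done once, guided by the schema.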

Given that this changes the behavior of UDFs that use @outputSchema (by coercing the output
to adhere to the schema), we may want to use a different annotation and allow outputSchema to
remain in its previous form, in which it doesn't actually convert the output.


> script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1942
>                 URL: https://issues.apache.org/jira/browse/PIG-1942
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.8.0, 0.9.0
>            Reporter: Woody Anderson
>            Priority: Minor
>              Labels: python, schema, udf
>             Fix For: 0.10
>
>         Attachments: 1942.patch
>
>
> from https://issues.apache.org/jira/browse/PIG-1824
> {code}
> import re
> @outputSchema("y:bag{t:tuple(word:chararray)}")
> def strsplittobag(content,regex):
>         return re.compile(regex).split(content)
> {code}
> This does not work because split returns a list of strings. However, the output schema is
> known, and it would be quite simple to implicitly promote each string element to a tuple.
> Also, a list/array/tuple/set etc. are all equally convertible to a bag, and list/array/tuple
> are equally convertible to a Tuple; this conversion can be done in a much less rigid way with
> the use of the schema.
> This allows much easier re-use of existing Python code and less memory overhead, because it
> avoids creating intermediate objects just to re-convert types.
> I wrote the code to do this a while back as part of my version of the Jython script
> framework; I'll isolate that and attach it.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
