hadoop-pig-dev mailing list archives

From "David Ciemiewicz (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-505) Lineage for UDFs that do not return bytearray
Date Thu, 23 Oct 2008 19:30:44 GMT

    [ https://issues.apache.org/jira/browse/PIG-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642241#action_12642241 ]

David Ciemiewicz commented on PIG-505:
--------------------------------------

Are we overthinking this problem at this time?

As far as I know, the only sources of "undefined" values in user-defined functions right now
are UDFs that return maps.  (I could be wrong.)

Why don't we just make the following simplifying assumptions (or conventions) for right now?

1) UDFs that return maps must return the individual values as bytearray type.  Period.
2) When the lineage code has to cast one of these values, it assumes the value is a bytearray
for conversion purposes.
3) Tell me how to code my UDFs to follow these guidelines and conventions.
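
A minimal sketch of what conventions 1 and 2 could look like, written in plain Java so that it runs on its own. It deliberately does not use Pig's UDF classes: the class MapUdfSketch, the methods parseToMap and castToDouble, the "key=value&key=value" input format, and the UTF-8 text encoding of the bytes are all assumptions made for illustration, not part of Pig or of the lineage design.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a map-returning UDF; it does not use Pig's own classes.
public class MapUdfSketch {

    // Convention 1: every map value is handed back as raw bytes (a "bytearray"),
    // never as an already-typed String, Integer or Double.
    public static Map<String, byte[]> parseToMap(String line) {
        Map<String, byte[]> result = new LinkedHashMap<>();
        for (String pair : line.split("&")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // skip entries with no '=' or with an empty key
            }
            String key = pair.substring(0, eq);
            String value = pair.substring(eq + 1);
            result.put(key, value.getBytes(StandardCharsets.UTF_8));
        }
        return result;
    }

    // Convention 2: the caster may assume the value is a bytearray and convert on demand.
    // Assumption: the bytes hold UTF-8 text; a real cast would depend on the loader's format.
    public static double castToDouble(byte[] value) {
        return Double.parseDouble(new String(value, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        Map<String, byte[]> m = parseToMap("price=2.5&qty=4");
        // Mirrors the "$0 + 2.0" style of expression from the issue description.
        System.out.println(castToDouble(m.get("price")) + 2.0);
    }
}
```

The point of the sketch is only that the function never decides the value type itself; any typed view is produced later by an explicit cast, which is the step the lineage code would have to resolve.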

The other option is to introduce some cast convention that allows me to define whether the
map will adhere to a bytearray convention or a chararray convention to reduce the chance of
redundant conversions.

For example -- (map<bytearray>) or (map<chararray>).  Or maybe this is handled
intrinsically in the function definition.

> Lineage for UDFs that do not return bytearray
> ---------------------------------------------
>
>                 Key: PIG-505
>                 URL: https://issues.apache.org/jira/browse/PIG-505
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Santhosh Srinivasan
>            Assignee: Santhosh Srinivasan
>             Fix For: types_branch
>
>
> In PIG-335, the lineage design states that UDFs that return bytearrays could cause problems
> in tracing the lineage. For UDFs that do not return bytearray, the lineage design should pick up
> the right load function to use as long as there is no ambiguity.  In the current implementation,
> we could have issues with scripts like:
> {code}
> a = load 'input' as (field1);
> b = foreach a generate myudf_to_double(field1);
> c =  foreach b generate $0 + 2.0;
> {code}
> When $0 has to be cast to a double, the lineage code will complain that it hit a UDF
> and hence cannot determine the right load function to use.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

