pig-dev mailing list archives

From "Santhosh Srinivasan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-505) Lineage for UDFs that do not return bytearray
Date Sat, 25 Oct 2008 00:33:44 GMT

    [ https://issues.apache.org/jira/browse/PIG-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642578#action_12642578 ]

Santhosh Srinivasan commented on PIG-505:

Response to David's comments:

The map type in Pig was designed to hold any atomic key type (i.e., string, int, float, long,
double) and any value type. As a result, the natural representation is a Map<Object, Object>.
The UDF has the right outputSchema implementation. UDFs that return maps should return Map<Object,

With the proposal in comment 3 (https://issues.apache.org/jira/browse/PIG-505?focusedCommentId=12642223#action_12642223),
the UDF will work as long as there are no DataByteArray values in the Map that require a cast.
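
To make the condition above concrete, here is a rough sketch in plain Java. It does not use any Pig classes: RawBytes is a hypothetical stand-in for DataByteArray, and MapLineageSketch with all of its methods are names invented for illustration. It only shows the check "does any value in the map still need a cast?", not how the lineage code is actually implemented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Pig's DataByteArray; NOT the real class.
final class RawBytes {
    final byte[] data;
    RawBytes(byte[] data) { this.data = data; }
}

// Illustrative only: models a map returned by a UDF as Map<Object, Object>.
final class MapLineageSketch {
    // A map whose values are already typed: no cast is ever needed.
    static Map<Object, Object> typedMap() {
        Map<Object, Object> m = new HashMap<>();
        m.put("name", "alice"); // string key, string value
        m.put(1, 2.5);          // int key, double value
        return m;
    }

    // The same map plus one value that is still raw bytes.
    static Map<Object, Object> mapWithRawValue() {
        Map<Object, Object> m = typedMap();
        m.put("blob", new RawBytes(new byte[] {0x31, 0x32}));
        return m;
    }

    // True if some value is still raw bytes, i.e. a later cast would
    // need to know which load function produced those bytes.
    static boolean hasValueNeedingCast(Map<Object, Object> m) {
        for (Object v : m.values()) {
            if (v instanceof RawBytes) {
                return true;
            }
        }
        return false;
    }
}
```

In this model the first map is the case where the UDF works under the proposal, and the second map is the case that still needs a cast.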

Response to Pi's comments:

Treating unknowns as bytearrays will lead to run time errors, and those errors will not go
away if we treat unknowns as unknowns. The trade-off is better error handling. Specifically,
in your example, comparing two unknowns can be caught during type checking, whereas making
them bytearrays might result in a run time error iff the two types do not match.

Summary: Treating unknowns as bytearrays will result in coarser error messages. On the other
hand, treating unknowns as unknowns will require significant changes without eliminating the
possibility of run time errors.
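
The trade-off can be sketched in plain Java as follows. This is a toy model, not the Pig type checker: UnknownTypeSketch, FieldType, and every method name are assumptions made up for this example. The strict policy rejects only the case discussed above (two unknowns) and accepts everything else purely to keep the sketch small.

```java
// Illustrative only: a toy model of the two policies, not Pig code.
final class UnknownTypeSketch {
    enum FieldType { UNKNOWN, BYTEARRAY, INT, DOUBLE }

    // Policy A: unknown stays unknown. Comparing two unknowns is
    // rejected while type checking, before the script runs.
    static boolean strictCheckAccepts(FieldType left, FieldType right) {
        return !(left == FieldType.UNKNOWN && right == FieldType.UNKNOWN);
    }

    // Policy B helper: an unknown is treated as a bytearray.
    static FieldType asByteArray(FieldType t) {
        return t == FieldType.UNKNOWN ? FieldType.BYTEARRAY : t;
    }

    // Policy B: after the substitution, two unknowns look like two
    // bytearrays, so the type checker accepts the comparison.
    static boolean lenientCheckAccepts(FieldType left, FieldType right) {
        return asByteArray(left) == asByteArray(right);
    }

    // Under policy B the real types only show up at run time. The
    // comparison fails iff the two actual types do not match.
    static boolean equalAtRunTime(Object left, Object right) {
        if (left.getClass() != right.getClass()) {
            throw new IllegalStateException(
                "run time type mismatch: " + left.getClass().getSimpleName()
                + " vs " + right.getClass().getSimpleName());
        }
        return left.equals(right);
    }
}
```

With policy A the comparison of two unknowns never reaches run time. With policy B it passes type checking and then either succeeds (for example, two ints) or throws (for example, an int and a string).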

> Lineage for UDFs that do not return bytearray
> ---------------------------------------------
>                 Key: PIG-505
>                 URL: https://issues.apache.org/jira/browse/PIG-505
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Santhosh Srinivasan
>            Assignee: Santhosh Srinivasan
>             Fix For: types_branch
> In PIG-335, the lineage design states that UDFs that return bytearrays could cause problems
> in tracing the lineage. For UDFs that do not return bytearray, the lineage design should pick up
> the right load function to use as long as there is no ambiguity. In the current implementation,
> we could have issues with scripts like:
> {code}
> a = load 'input' as (field1);
> b = foreach a generate myudf_to_double(field1);
> c = foreach b generate $0 + 2.0;
> {code}
> When $0 has to be cast to a double, the lineage code will complain that it hit a UDF
> and hence cannot determine the right load function to use.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
