hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-505) Lineage for UDFs that do not return bytearray
Date Thu, 23 Oct 2008 17:24:44 GMT

    [ https://issues.apache.org/jira/browse/PIG-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642206#action_12642206 ]

Alan Gates commented on PIG-505:

A couple of comments:

You say that the long term plan is to have a true unknown type, and then pose the problem
as do we want to start the switch now or later.  (In fairness, I'm pretty sure you're quoting
something I said here, so I'm about to question my own statement.)  I don't know if that's
true or not.  In the original design for types we had an unknown type.  We ended up dropping
it in implementation because it turned out to be so similar to the bytearray.  While there
is some cost to combining byte arrays with unknown types (as you lay out) I'm not sure that
that means we should separate the two.  The long term cost of maintainability may be

I'm a bit confused by the first con of continuing to use byte arrays as unknowns.  Are you saying
that if we do this, in the case where there is only one load function in the script, after
a UDF returns what is really a byte array, we'll use the cast from that load function?  I'm
not certain what the right course is here.  From a correctness viewpoint, we can argue that
pig doesn't know whether that byte array is from the load function or from the UDF.  However,
this is a little burdensome to the user because it means any byte arrays inside complex types
have to be dealt with before going to a UDF.  The con of using the load function where possible
is if the byte array really is from the UDF and not the load function, we may error out or
worse silently produce wrong data.  Since silently producing wrong data is a mortal sin in
data processing I'd come down on the side of not using the load function's cast here.
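
To make the "silently wrong data" risk concrete, here is a minimal sketch in Java. Nothing in it is Pig code: the two encodings and every name (AmbiguousBytesSketch, castAsText, udfEncode, udfDecode) are assumptions made up for illustration. It supposes a loader whose cast reads bytes as UTF-8 decimal text, and a UDF that packs an integer into a single raw byte.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: hypothetical encodings, not Pig's load function or UDF API.
final class AmbiguousBytesSketch {

    // Assumed loader-style cast: interpret the bytes as UTF-8 decimal text.
    static int castAsText(byte[] raw) {
        return Integer.parseInt(new String(raw, StandardCharsets.UTF_8));
    }

    // Assumed UDF-style encoding: one raw byte holding the binary value.
    static byte[] udfEncode(int value) {
        return new byte[] { (byte) value };
    }

    // The matching decoder for the assumed UDF encoding.
    static int udfDecode(byte[] raw) {
        return raw[0];
    }

    public static void main(String[] args) {
        byte[] fromUdf = udfEncode(55); // 55 is 0x37, which is also the character '7'
        System.out.println("value the UDF meant:      " + udfDecode(fromUdf));
        System.out.println("value the text cast read: " + castAsText(fromUdf));
    }
}
```

In this sketch the text cast neither fails nor warns; it returns a different number than the one the UDF encoded, which is the failure mode that argues against applying the load function's cast to bytes of uncertain origin.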

A question: if we allow unknown in this one case, do we truly have to change code everywhere?
Instead of adding a getNext(unknown) to all operators, could we instead add a CastFromUnknown
operator?  The entry point would still be getNext(ByteArray), so from all outside code's viewpoint
the current type system should remain untouched.  And this operator would be written to introspect
the type of object it got and either pass it on as is if it's the right type or cast it to
the right type if it can.  It would never use a load function's cast (assuming we choose as
indicated above), and it wouldn't incur the cost of throwing and catching an exception on
the cast, it could use instanceof instead (which should be much faster).
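
A rough sketch of the introspection step such an operator might perform, written as standalone Java rather than against Pig's operator classes. The class and method names (CastFromUnknownSketch, castToDouble) are hypothetical, and the choice to reject raw byte arrays outright is an assumption that follows from not using the load function's cast.

```java
// Illustration only: a standalone helper, not an actual Pig physical operator.
final class CastFromUnknownSketch {

    // Inspect the runtime type with instanceof instead of attempting a cast
    // and catching ClassCastException.
    static Double castToDouble(Object value) {
        if (value == null) {
            return null;                       // nulls pass through unchanged
        }
        if (value instanceof Double) {
            return (Double) value;             // already the right type: pass it on as is
        }
        if (value instanceof Number) {
            return ((Number) value).doubleValue(); // other numeric types: convert
        }
        // Assumption: raw bytes of unknown origin are rejected here, because
        // decoding them would require guessing which cast (load function or
        // UDF) applies.
        throw new IllegalArgumentException(
                "cannot cast " + value.getClass().getName() + " to double");
    }

    public static void main(String[] args) {
        System.out.println(castToDouble(Double.valueOf(1.5)));
        System.out.println(castToDouble(Integer.valueOf(3)));
    }
}
```

Failing loudly on an unrecognized type is one possible choice here; it trades some convenience for never guessing at an encoding.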

> Lineage for UDFs that do not return bytearray
> ---------------------------------------------
>                 Key: PIG-505
>                 URL: https://issues.apache.org/jira/browse/PIG-505
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Santhosh Srinivasan
>            Assignee: Santhosh Srinivasan
>             Fix For: types_branch
> In PIG-335, the lineage design states that UDFs that return bytearrays could cause problems
> in tracing the lineage. For UDFs that do not return bytearray, the lineage design should pick up
> the right load function to use as long as there is no ambiguity.  In the current implementation,
> we could have issues with scripts like:
> {code}
> a = load 'input' as (field1);
> b = foreach a generate myudf_to_double(field1);
> c =  foreach b generate $0 + 2.0;
> {code}
> When $0 has to be cast to a double, the lineage code will complain that it hit a UDF
> and hence cannot determine the right load function to use.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
