hadoop-pig-dev mailing list archives

From "Santhosh Srinivasan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-505) Lineage for UDFs that do not return bytearray
Date Thu, 23 Oct 2008 18:36:44 GMT

    [ https://issues.apache.org/jira/browse/PIG-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642223#action_12642223 ]

Santhosh Srinivasan commented on PIG-505:

Responses with paragraph numbers:

Paragraph 2: 

The current lineage code barfs if the load function is null when converting a bytearray to a Pig
type. As a result, we have to pick some load function to use, which can lead to run time errors or
erroneous results. Based on your comment, it seems appropriate to relax the rule that the load
function cannot be null for bytearray to Pig type conversions and instead throw an appropriate
error message at run time (assuming no bugs in the lineage code).

Paragraph 3:

The inputs to a cast expression can serve as inputs to any operator that expects expressions.
As a result, setting the return type of an expression operator to unknown will have an across the
board impact. To mitigate this impact, we could introduce a new visitor that changes to bytearray
the type of every expression that is not an input to a cast. However, this introduces
a problem. When do we use this visitor? Before the type checker or after the type checker?
If we use the visitor before the type checker, then we will lose unknown types for casts introduced
by the type checker. If we use the visitor after the type checker, the type checker will barf
if unknown types occur in the graph. As a result, we would have to migrate some of the
functionality of the type checker into the visitor. This approach is complicated and not worth
the benefit.

Based on the discussions and given the cost implications of code complexity, maintenance and
performance, the solution is probably the following:

1. Relax the rule of load function not being null in the lineage code.
2. If a null pointer exception occurs in the back end (POCast, specifically) then we assume
that it was due to a bytearray created by a UDF and report an appropriate error message.

The only constraint to this solution is the assumption that the lineage code is not buggy.
If the lineage code is buggy and we end up with a null load function for the right bytearray
to Pig type conversion, it will require investigation.
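
As a rough illustration of the two steps above, here is a minimal Java sketch. It is not the
actual POCast code: BytesToDouble, CastLineageException and CastSketch are hypothetical names
made up for this example, and it checks for the null load function explicitly instead of
catching the NullPointerException, which is one possible way to produce the same error message.

```java
// Hypothetical stand-in for the part of a load function that converts raw bytes.
interface BytesToDouble {
    Double bytesToDouble(byte[] raw);
}

// Hypothetical error type for this sketch; Pig's real exception classes are not used here.
class CastLineageException extends RuntimeException {
    CastLineageException(String message) {
        super(message);
    }
}

// Simplified stand-in for a back-end cast operator such as POCast.
class CastSketch {
    // May be null: step 1 relaxes the rule that lineage must always supply a load function.
    private final BytesToDouble loadFunc;

    CastSketch(BytesToDouble loadFunc) {
        this.loadFunc = loadFunc;
    }

    Double toDouble(byte[] raw) {
        if (loadFunc == null) {
            // Step 2: assume the bytearray came from a UDF and say so,
            // rather than letting a bare NullPointerException escape.
            throw new CastLineageException(
                "Cannot cast bytearray to double: no load function was determined "
                + "for this field; the bytearray was probably produced by a UDF.");
        }
        return loadFunc.bytesToDouble(raw);
    }
}
```

With a load function present the cast behaves as before; with a null load function the user gets
a message pointing at the UDF instead of a stack trace. If the lineage code is buggy, the same
message would be shown for a bytearray that did come from a load, which is the caveat noted above.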

> Lineage for UDFs that do not return bytearray
> ---------------------------------------------
>                 Key: PIG-505
>                 URL: https://issues.apache.org/jira/browse/PIG-505
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Santhosh Srinivasan
>            Assignee: Santhosh Srinivasan
>             Fix For: types_branch
> In PIG-335, the lineage design states that UDFs that return bytearrays could cause problems
> in tracing the lineage. For UDFs that do not return bytearray, the lineage design should pick up
> the right load function to use as long as there is no ambiguity. In the current implementation,
> we could have issues with scripts like:
> {code}
> a = load 'input' as (field1);
> b = foreach a generate myudf_to_double(field1);
> c = foreach b generate $0 + 2.0;
> {code}
> When $0 has to be cast to a double, the lineage code will complain that it hit a UDF
> and hence cannot determine the right load function to use.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
