hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-335) Casting does not work in certain cases with multiple loads
Date Tue, 30 Sep 2008 22:53:44 GMT

    [ https://issues.apache.org/jira/browse/PIG-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635886#action_12635886 ]

Alan Gates commented on PIG-335:


This new method of tracking lineage of data through the script is complex.  It would be good
to add a couple of paragraphs to the class-level comments in Schema.java describing how it
works.

In LOProject, around line 206, you added mFieldSchema.setParent(null, expressionOperator).  If
I understand the code correctly, this is the case where you are projecting star from a relational
operator.  Why is the parent canonical name null in this case?  And what are the ramifications
of that?

If the user writes a query like:

A = load 'Alpha' using MyLoadFunc;
B = load 'Beta' using TheirLoadFunc;
C = cogroup A by $0, B by $0;
D = foreach C generate group + 1;

they will get "Found more than one load function interface to use: MyLoadFunc, TheirLoadFunc"
as an error message.  That doesn't make clear what the issue is (of course there's more than
one load func interface, I gave you two load funcs!).  Something like:  "Cannot resolve load
function to use for casting $0 to integer, two possibilities:  MyLoadFunc, TheirLoadFunc"
would be much more helpful.

Same with some of the other error messages that just mention load func interface.  They should
at the very least mention that they're trying to find the right cast to use.
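
As a rough illustration of the kind of message suggested above, here is a minimal Java sketch. The class CastErrorMessages and the method ambiguousLoadFunc are hypothetical names invented for this example; they are not part of the patch or of Pig's code base, and the exact wording is only one possibility.

```java
import java.util.List;

// Hypothetical helper, not part of Pig: builds an error message that says
// which cast could not be resolved and lists the candidate load functions.
final class CastErrorMessages {
    private CastErrorMessages() {}

    static String ambiguousLoadFunc(String field, String targetType, List<String> candidates) {
        return "Cannot resolve load function to use for casting " + field
            + " to " + targetType + ", " + candidates.size()
            + " possibilities: " + String.join(", ", candidates);
    }
}
```

The point is only that the message names the field, the target type, and the candidates, so the user can see that the failure is about choosing a cast.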

In TypeCheckingVisitor.getLoadFunc(LogicalOperator, String) I see a list of relational operators
(Filter, etc.).  But I don't see Cogroup, Union, or Cross in that list.  How are you tracing
data that comes through those operators?  Those are the ones with the special case, where
if the load functions match we know how to do the cast, and if they don't match we don't know.
 But I don't see where they're tracing the lineage of their data.
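
To make the special case concrete, here is a small Java sketch of the rule as I understand it: a field that reaches a cast through a multi-input operator can be cast only if every input traces back to the same load function. LoadFuncLineage and resolve are hypothetical names for illustration, and load functions are represented as plain strings; this is not how TypeCheckingVisitor is actually written.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch, not Pig code: decides which load function can cast a
// field whose data arrives through a multi-input operator (cogroup, union, cross).
final class LoadFuncLineage {
    private LoadFuncLineage() {}

    // inputLoadFuncs holds one load function name per input feeding the field.
    // Returns that name if all inputs agree, otherwise an empty Optional.
    static Optional<String> resolve(List<String> inputLoadFuncs) {
        Set<String> distinct = new LinkedHashSet<>(inputLoadFuncs);
        return distinct.size() == 1
            ? Optional.of(distinct.iterator().next())
            : Optional.empty();
    }
}
```

With inputs (MyLoadFunc, MyLoadFunc) this yields MyLoadFunc; with (MyLoadFunc, TheirLoadFunc) it yields nothing, which is the case where the type checker has to report an error.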

> Casting does not work in certain cases with multiple loads
> ----------------------------------------------------------
>                 Key: PIG-335
>                 URL: https://issues.apache.org/jira/browse/PIG-335
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Santhosh Srinivasan
>            Priority: Critical
>             Fix For: types_branch
>         Attachments: PIG_335.patch, PIG_335_1.patch
> Given a script like:
> A = load 'bla' using Loader1() as (x, y);
> B = load 'morebla' using Loader2() as (s, t);
> C = cogroup A by x, B by s;
> D = foreach C generate flatten(A), flatten(B);
> E = foreach D generate x, y, t + 1;
> In this case, in the last foreach, a cast will need to be added to t + 1 to allow t (a
> byte array) to be added to an integer.  We use load functions to handle this late casting.
> The issue is that we do not currently have a way to know whether to use Loader1 or Loader2
> to cast the data.  We need to track the lineage of fields so that the cast operator can select
> the correct loader.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
