hadoop-pig-dev mailing list archives

From "Santhosh Srinivasan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-335) Casting does not work in certain cases with multiple loads
Date Thu, 04 Sep 2008 23:07:44 GMT

    [ https://issues.apache.org/jira/browse/PIG-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628510#action_12628510 ]

Santhosh Srinivasan commented on PIG-335:

The proposed design for computing the lineage for load functions:

The FieldSchema class will include an additional member variable that contains a list
of parent/ancestral canonical names. This list holds the canonical names that the operator
requires in order to compute the field schema.

Canonical names that originate from the load statement will have an empty parent list, and
they remain unchanged as they move from operator to operator. Only expressions (such as
arithmetic expressions) will create new canonical names.

The load operator corresponding to the parent canonical name is required only to cast byte
arrays into Pig types. Other than UDFs, there are no operators that generate byte arrays.
CONCAT can also generate byte arrays, but for now it is a UDF.

To compute the load function associated with a field schema, each canonical name in the parent
list of canonical names is matched against the operator responsible for that canonical name.
If the operator is a UDF, then we throw an exception, as we will not know how to convert a
byte array generated by the UDF into a Pig type. The check bubbles up the graph until we hit
the load operator corresponding to the canonical name in question.
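
The lookup described above could be sketched roughly as follows, again in Python rather
than Pig's Java. Producer, LineageError, find_load_func and the "cn_*" names are all
hypothetical. The handling of an expression whose parents resolve to different loaders
(or to none) is an assumption of this sketch; the proposal does not specify it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Producer:
    """The operator responsible for a canonical name (toy model)."""
    kind: str                      # "load", "udf", or "expr"
    parents: List[str] = field(default_factory=list)
    load_func: Optional[str] = None


class LineageError(Exception):
    """Raised when no single load function can be determined."""


def find_load_func(name: str, producers: Dict[str, Producer]) -> str:
    producer = producers[name]
    if producer.kind == "load":
        return producer.load_func
    if producer.kind == "udf":
        # We do not know how to convert a byte array generated by a UDF.
        raise LineageError(f"{name} is produced by a UDF")
    # Expression: bubble up through the parent canonical names.
    found = {find_load_func(parent, producers) for parent in producer.parents}
    if len(found) != 1:
        # Assumption of this sketch: zero or several loaders is an error.
        raise LineageError(f"cannot pick a single loader for {name}")
    return found.pop()


# Canonical names for the script in the issue description (names are made up).
producers = {
    "cn_x": Producer("load", load_func="Loader1"),
    "cn_y": Producer("load", load_func="Loader1"),
    "cn_s": Producer("load", load_func="Loader2"),
    "cn_t": Producer("load", load_func="Loader2"),
    "cn_sum": Producer("expr", parents=["cn_t"]),   # t + 1
    "cn_udf": Producer("udf"),
}

print(find_load_func("cn_sum", producers))  # Loader2
```

Because fields keep their canonical names as they pass through cogroup and foreach, the
lookup for t reaches Loader2 directly without inspecting the intermediate operators.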

Breakdown of the changes:

1. The logic mentioned in the previous paragraph will reside in the type checker.
2. The changes to the FieldSchema will (of course) be limited to the FieldSchema class.
3. The computation of the list of parent canonical names will happen in each logical operator.

Thoughts/comments on the proposed design are welcome.

> Casting does not work in certain cases with multiple loads
> ----------------------------------------------------------
>                 Key: PIG-335
>                 URL: https://issues.apache.org/jira/browse/PIG-335
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Santhosh Srinivasan
>            Priority: Critical
>             Fix For: types_branch
> Given a script like:
> A = load 'bla' as (x, y) using Loader1();
> B = load 'morebla' as (s, t) using Loader2();
> C = cogroup A by x, B by s;
> D = foreach C generate flatten(A), flatten(B);
> E = foreach D generate x, y, t + 1;
> In this case, in the last foreach, a cast will need to be added to t + 1 to allow t (a
> byte array) to be added to an integer. We use load functions to handle this late casting.
> The issue is that we do not currently have a way to know whether to use Loader1 or Loader2
> to cast the data. We need to track the lineage of fields so that the cast operator can select
> the correct loader.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
