hadoop-pig-dev mailing list archives

From Alan Gates <ga...@yahoo-inc.com>
Subject Re: Semantics of generate *
Date Fri, 03 Oct 2008 17:47:41 GMT
A thought and a question.

The thought:  rather than having each individual operator do the 
translation, could a visitor be written that would walk the tree right 
after parsing and break project( * ) into project(0), project(1), ...  ?  
This visitor could be one of the validators (like the type checker).  
This way all of the logic for this restitching is in one place.
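
A minimal sketch of such a visitor, written in Python for brevity rather than against Pig's actual Java classes. The names `Project`, `Operator`, and `StarExpander` are hypothetical and not part of Pig. The sketch assumes each operator projects from its first input only, and it leaves project( * ) untouched when the input schema is undefined, since the number of columns is then unknown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Project:
    """Projection of one column by position, or of all columns when star is True."""
    column: Optional[int] = None
    star: bool = False


@dataclass
class Operator:
    """Hypothetical relational operator with an inner plan of projections."""
    name: str
    schema: Optional[List[str]] = None  # None models an undefined schema
    inner_plan: List[Project] = field(default_factory=list)
    inputs: List["Operator"] = field(default_factory=list)


class StarExpander:
    """Walks the operator tree once and rewrites project(*) in every inner plan."""

    def visit(self, op: Operator) -> None:
        # Visit inputs first so the whole tree is rewritten in one pass.
        for child in op.inputs:
            self.visit(child)

        # Simplifying assumption: projections refer to the first input only.
        input_schema = op.inputs[0].schema if op.inputs else None

        expanded: List[Project] = []
        for proj in op.inner_plan:
            if proj.star and input_schema is not None:
                # project(*) becomes project(0), ..., project(n - 1).
                expanded.extend(Project(column=i) for i in range(len(input_schema)))
            else:
                # Ordinary projection, or a star over an undefined schema.
                expanded.append(proj)
        op.inner_plan = expanded


# Usage: d = foreach c generate *;
c = Operator("c", schema=["group", "a", "b"])
d = Operator("d", inner_plan=[Project(star=True)], inputs=[c])
StarExpander().visit(d)
print([p.column for p in d.inner_plan])  # [0, 1, 2]
```

Because the rewrite lives in one visitor, operators with inner plans would not each need their own copy of the expansion logic.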

The question:  is the inability to return multiple projections from one 
production a limitation of how the parser is implemented, or of javacc, 
the tool used to generate the parser?


Santhosh Srinivasan wrote:
> In the current implementation of generate * in the front end, a single
> projection operator with the star attribute set to true is created.
> During the schema computation, instead of generating the schema of the
> projection input, a tuple that contains the schema of the projection
> input is created. This results in double wrapping. An example will
> illustrate the problem.
> grunt> a = load 'one' using PigStorage(' ') as (field1, field2, field3);
> grunt> b = load 'two' as (field4, field5, field6);
> grunt> c = cogroup a by $0, b by $0;
> grunt> d = foreach c generate *;
> grunt> describe d;
> d: {c: (group: bytearray,a: {field1: bytearray,field2: bytearray,field3:
> bytearray},b: {field4: bytearray,field5: bytearray,field6: bytearray})}
> In the above example, the schema for operator d should have been
> identical to that of operator c. Instead, the schema of operator c is
> wrapped in a tuple and embedded within the schema of d. As a result, we
> have a couple of issues:
> 1. It is not intuitive to users that the schemas of c and d are not
> identical. They should be identical.
> grunt> e = foreach d generate group;
> 2008-10-02 16:06:11,335 [main] ERROR
> org.apache.pig.tools.grunt.GruntParser - java.io.IOException: Invalid
> alias: group in {c: (group: bytearray,a: {field1: bytearray,field2:
> bytearray,field3: bytearray},b: {field4: bytearray,field5:
> bytearray,field6: bytearray})}
> 2. As a workaround, we could flatten the contents of d and then access
> the contents of c.
> grunt> e = foreach d generate flatten($0);
> grunt> describe e;
> e: {c::group: bytearray,c::a: {field1: bytearray,field2:
> bytearray,field3: bytearray},c::b: {field4: bytearray,field5:
> bytearray,field6: bytearray}}
> However, we will not be able to compute the lineage of the fields of
> the relation, as demonstrated by the following example:
> grunt> f = foreach e generate flatten(a), flatten(b);
> grunt> g = foreach f generate field1 + 1;
> grunt> describe g;
> 2008-10-02 16:26:20,655 [main] WARN  org.apache.pig.PigServer -
> bytearray is implicitly casted to integer under LOAdd Operator
> 2008-10-02 16:26:20,655 [main] ERROR org.apache.pig.PigServer - Problem
> resolving LOForEach schema Cannot resolve load function to use for
> casting from bytearray to integer. Found more than one load function to
> use: [org.apache.pig.builtin.PigStorage,
> org.apache.pig.builtin.BinStorage]
> This problem is contained in the frontend alone. In the backend, the
> double wrapping issue is resolved with the bug PIG-359. In order to
> resolve this issue in the frontend, the project( * ) operator has to be
> translated into project(0), project(1), ..., project(n - 2), project(n -
> 1); where n is the number of columns in the relation.
> The translation of project( * ) into the multiple project operators
> cannot be performed in the parser without major modifications. Each
> relational operator that has an inner plan can perform this
> translation. In the current design, LOForEach, LOCogroup, LOSplitOutput,
> LOSort and LOFilter have inner plans.
> There are corner cases that need to be handled during the translation.
> If the schema of the project's input is not defined then the schema of
> the relation or the column in the relation that contains the projection
> could become undefined.
> a = load 'one';
> b = load 'two';
> c = foreach a generate *, $0, $1; -- schema of c is undefined
> d = cogroup a by *, b by ($0, $1); -- schema of column named group in
> cogroup is undefined; also arity checking cannot be enforced
> Thoughts?
> Thanks,
> Santhosh
