hadoop-pig-dev mailing list archives

From "Santhosh Srinivasan" <...@yahoo-inc.com>
Subject RE: Semantics of generate *
Date Fri, 03 Oct 2008 18:49:10 GMT
[Santhosh] Yes, a visitor is probably a cleaner way to do the
translation.


The question:  is the inability to return multiple projections from one
production a limit of how the parser is implemented or the tool, javacc,
used for the parser?

[Santhosh] It's the design/implementation and not the tool. Compared to
1.x, the types branch does not have the equivalent of a StarSpec. As a
result, we do not distinguish between project(0) and project( * ) during
parse time.

Thanks,
Santhosh
 

-----Original Message-----
From: Alan Gates [mailto:gates@yahoo-inc.com] 
Sent: Friday, October 03, 2008 10:48 AM
To: pig-dev@incubator.apache.org
Subject: Re: Semantics of generate *

A thought and a question.

The thought:  rather than having each individual operator do the 
translation, could a visitor be written that would walk the tree right 
after parsing and break project( * ) into project(1), project(2)...  ?  
This visitor could be one of the validators (like the type checker).  
This way all of the logic for this restitching is in one place.
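
A minimal Java sketch of the visitor idea above, for illustration only. The
classes Project, Operator and StarExpansionVisitor are hypothetical stand-ins,
not Pig's actual logical plan classes. The sketch assumes every input schema
is known and that each operator projects from its first input, and it uses
zero-based column indexes, as in $0.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a projection in an inner plan (not Pig's real class).
class Project {
    final boolean star;  // true for project( * )
    final int column;    // column index; ignored when star is true

    Project(boolean star, int column) {
        this.star = star;
        this.column = column;
    }

    @Override
    public String toString() {
        return star ? "project(*)" : "project(" + column + ")";
    }
}

// Hypothetical stand-in for a relational operator with an inner plan.
class Operator {
    final String alias;
    final int arity;  // number of output columns, assumed known in this sketch
    final List<Operator> inputs = new ArrayList<>();
    List<Project> innerPlan = new ArrayList<>();

    Operator(String alias, int arity) {
        this.alias = alias;
        this.arity = arity;
    }
}

// Walks the plan once after parsing and rewrites every star projection.
class StarExpansionVisitor {
    void visit(Operator op) {
        // Depth-first: rewrite the inputs before the operator itself.
        for (Operator input : op.inputs) {
            visit(input);
        }
        if (op.inputs.isEmpty()) {
            return;  // a load has no input to project from
        }
        int n = op.inputs.get(0).arity;  // assumption: project from the first input
        List<Project> expanded = new ArrayList<>();
        for (Project p : op.innerPlan) {
            if (p.star) {
                // project( * ) becomes project(0) ... project(n - 1)
                for (int i = 0; i < n; i++) {
                    expanded.add(new Project(false, i));
                }
            } else {
                expanded.add(p);
            }
        }
        op.innerPlan = expanded;
    }
}
```

Because the rewrite lives in one class, the individual operators would not
need to know about project( * ) at all.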

The question:  is the inability to return multiple projections from one
production a limit of how the parser is implemented or the tool, javacc,
used for the parser?

Alan.

Santhosh Srinivasan wrote:
> In the current implementation of generate * in the front end, a single
> projection operator with the star attribute set to true is created.
> During the schema computation, instead of generating the schema of the
> projection input, a tuple that contains the schema of the projection
> input is created. This results in double wrapping. An example will
> illustrate the problem.
>
> grunt> a = load 'one' using PigStorage(' ') as (field1, field2, field3);
> grunt> b = load 'two' as (field4, field5, field6);
> grunt> c = cogroup a by $0, b by $0;
> grunt> d = foreach c generate *;
> grunt> describe d;
>
> d: {c: (group: bytearray,a: {field1: bytearray,field2: bytearray,field3:
> bytearray},b: {field4: bytearray,field5: bytearray,field6: bytearray})}
>
> In the above example, the schema for operator d should have been
> identical to that of operator c. Instead, the schema of operator c is
> wrapped in a tuple and embedded within the schema of d. As a result, we
> have a couple of issues:
>
> 1. It is not intuitive to users that the schemas of c and d are not
> identical. They should be identical.
>
> grunt> e = foreach d generate group;
>
> 2008-10-02 16:06:11,335 [main] ERROR
> org.apache.pig.tools.grunt.GruntParser - java.io.IOException: Invalid
> alias: group in {c: (group: bytearray,a: {field1: bytearray,field2:
> bytearray,field3: bytearray},b: {field4: bytearray,field5:
> bytearray,field6: bytearray})}
>
> 2. As a workaround, we could flatten the contents of d and then access
> the contents of c.
>
> grunt> e = foreach d generate flatten($0);
> grunt> describe e;
>
> e: {c::group: bytearray,c::a: {field1: bytearray,field2:
> bytearray,field3: bytearray},c::b: {field4: bytearray,field5:
> bytearray,field6: bytearray}}
>
> However, we will not be able to compute the lineage of the fields of the
> relation, as demonstrated by the following example:
>
> grunt> f = foreach e generate flatten(a), flatten(b);
> grunt> g = foreach f generate field1 + 1;
> grunt> describe g;
>
> 2008-10-02 16:26:20,655 [main] WARN  org.apache.pig.PigServer -
> bytearray is implicitly casted to integer under LOAdd Operator
> 2008-10-02 16:26:20,655 [main] ERROR org.apache.pig.PigServer - Problem
> resolving LOForEach schema Cannot resolve load function to use for
> casting from bytearray to integer. Found more than one load function to
> use: [org.apache.pig.builtin.PigStorage,
> org.apache.pig.builtin.BinStorage]
>
> This problem is contained in the frontend alone. In the backend, the
> double wrapping issue was resolved under PIG-359. In order to
> resolve this issue in the frontend, the project( * ) operator has to be
> translated into project(0), project(1), ..., project(n - 2),
> project(n - 1), where n is the number of columns in the relation.
>
> The translation of project( * ) into the multiple project operators
> cannot be performed in the parser without major modifications. Each
> relational operator that has an inner plan can perform this
> translation. In the current design, LOForEach, LOCogroup, LOSplitOutput,
> LOSort and LOFilter have inner plans.
>
> There are corner cases that need to be handled during the translation.
> If the schema of the project's input is not defined, then the schema of
> the relation or the column in the relation that contains the projection
> could become undefined.
>
> a = load 'one';
> b = load 'two';
> c = foreach a generate *, $0, $1; -- schema of c is undefined
> d = cogroup a by *, b by ($0, $1); -- schema of column named group in
> cogroup is undefined; also arity checking cannot be enforced
>
> Thoughts?
>
> Thanks,
> Santhosh
>   
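
The corner case in the quoted message (a star applied to an input with no
declared schema) can be sketched in Java as follows. This is an illustration
under stated assumptions, not Pig code: ArityResolver and outputArity are
hypothetical names, and an unknown schema is modelled as an empty OptionalInt.

```java
import java.util.List;
import java.util.OptionalInt;

// Hypothetical helper (not part of Pig): computes how many columns a
// generate list produces once every star has been expanded.
class ArityResolver {
    // itemIsStar holds one flag per generate item; true means the item is *.
    // inputArity is empty when the input relation has no declared schema.
    static OptionalInt outputArity(List<Boolean> itemIsStar, OptionalInt inputArity) {
        int total = 0;
        for (boolean isStar : itemIsStar) {
            if (isStar) {
                if (!inputArity.isPresent()) {
                    // The star cannot be expanded, so the output schema is undefined.
                    return OptionalInt.empty();
                }
                total += inputArity.getAsInt();
            } else {
                total += 1;  // a plain column reference such as $0
            }
        }
        return OptionalInt.of(total);
    }
}
```

For `generate *, $0, $1` over an input with no schema, the sketch yields an
empty result, which mirrors the "schema of c is undefined" comment above; with
a three-column input it yields five columns.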
