hadoop-pig-dev mailing list archives

From "Shravan Matthur Narayanamurthy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-359) Semantics of generate * have changed
Date Wed, 27 Aug 2008 04:05:44 GMT

    [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12625967#action_12625967 ]

Shravan Matthur Narayanamurthy commented on PIG-359:

Alan, two things. 
1) The current code isn't enough because of the following:
A = load 'file:/etc/passwd' using PigStorage(':');
B = foreach A generate ARITY(*,*);
dump B;

Trunk emits 14 (2 times the arity of each tuple in A, which is 7). The current code would
emit 2. Another example that the current code doesn't handle is:

A = load 'file:/etc/passwd' using PigStorage(':');
B = foreach A generate ARITY($0, '---', *);
Trunk emits 9 (2 + 7). The current code would emit 3.
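
The two behaviours above can be modelled outside Pig. The following is only a rough Python
sketch, not Pig's implementation. STAR, build_udf_input and the sample row are hypothetical
names, and the sketch assumes that trunk expands each * into the fields of the row while the
current code passes each * as a single nested tuple.

```python
# Hypothetical marker standing in for the `*` argument in a Pig script.
STAR = object()


def build_udf_input(row, args, expand_star):
    """Build the tuple handed to a UDF such as ARITY.

    expand_star=True  models the assumed trunk behaviour (each * is flattened).
    expand_star=False models the assumed current-code behaviour (each * is one field).
    """
    fields = []
    for arg in args:
        if arg is STAR:
            if expand_star:
                fields.extend(row)         # * contributes every field of the row
            else:
                fields.append(tuple(row))  # * contributes one nested tuple
        else:
            fields.append(arg)             # ordinary arguments are single fields
    return tuple(fields)


def arity(t):
    """Model of ARITY: the number of fields in the input tuple."""
    return len(t)


# A made-up 7-field row, standing in for one line of a ':'-separated file.
row = ("f0", "f1", "f2", "f3", "f4", "f5", "f6")

star_star_trunk = arity(build_udf_input(row, [STAR, STAR], expand_star=True))
star_star_current = arity(build_udf_input(row, [STAR, STAR], expand_star=False))
mixed_trunk = arity(build_udf_input(row, [row[0], "---", STAR], expand_star=True))
mixed_current = arity(build_udf_input(row, [row[0], "---", STAR], expand_star=False))

print(star_star_trunk, star_star_current)  # 14 2
print(mixed_trunk, mixed_current)          # 9 3
```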

2) You are right in saying that 'a' will be double wrapped. But that's how trunk works right
now, and I think it's right. Consider this script:

A = load 'myfile' as (a:tuple(...), b:tuple(...));
B = foreach A generate udf(a,b);

We want 'a' and 'b' to stay intact inside the input tuple that is passed to the UDF, so we
would expect the arity to be 2 rather than the combined arity of 'a' and 'b'. Generalizing
this, I think double wrapping should be OK. I tested this behaviour in trunk by writing a
UDF, say TupleOutputUDF, that returns a Tuple and just copies the input tuple to the output.
I tried the following script in trunk:
A = load 'file:/etc/passwd' using PigStorage(':');
B = foreach A generate ARITY(TupleOutputUDF(*));
dump B;

It returned 1 in trunk. The current code returns 7.
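
The wrapping argument can be illustrated with a small Python model. This is a sketch under
stated assumptions, not Pig code: tuple_output_udf stands in for TupleOutputUDF, and the two
helper functions are hypothetical names for the two ways a UDF's tuple result could be
handed to the next UDF.

```python
def tuple_output_udf(input_tuple):
    """Hypothetical stand-in for TupleOutputUDF: copies its input tuple to the output."""
    return tuple(input_tuple)


def as_input_wrapped(value):
    """Assumed trunk behaviour: the result becomes ONE field of the next UDF's input."""
    return (value,)


def as_input_unwrapped(value):
    """Assumed current-code behaviour: a tuple result is used directly as the input."""
    return value if isinstance(value, tuple) else (value,)


# A made-up 7-field row, standing in for one line of a ':'-separated file.
row = ("f0", "f1", "f2", "f3", "f4", "f5", "f6")
inner = tuple_output_udf(row)

# Model of ARITY(TupleOutputUDF(*)) under each behaviour.
trunk_arity = len(as_input_wrapped(inner))
current_arity = len(as_input_unwrapped(inner))
print(trunk_arity, current_arity)  # 1 7

# Model of udf(a, b) where a and b are tuple-typed fields: both stay intact,
# so the UDF sees two fields rather than the fields of a and b concatenated.
a = (1, 2, 3)
b = (4, 5)
udf_input = (a, b)
print(len(udf_input))  # 2
```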

> Semantics of generate * have changed
> ------------------------------------
>                 Key: PIG-359
>                 URL: https://issues.apache.org/jira/browse/PIG-359
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Shravan Matthur Narayanamurthy
>             Fix For: types_branch
>         Attachments: 359-1.patch, 359.patch
> In the main trunk, the script
> A = load 'myfile';
> B = foreach A generate *;
> returns:
> (x, y, z)
> In the types branch, it returns:
> ((x, y, z))
> There is an extra level of tuple in it.  In the main branch generate * seems to include an implicit flatten.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
