hadoop-pig-dev mailing list archives

From "Viraj Bhat (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-772) Semantics of Filter statement inside ForEach should support filtering on aliases used in the Group statement preeceding it
Date Tue, 21 Apr 2009 00:33:47 GMT

    [ https://issues.apache.org/jira/browse/PIG-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701003#action_12701003 ]

Viraj Bhat commented on PIG-772:
--------------------------------

There is a workaround, shown below, but the question is whether it performs better than
the nested Pig script in the original description. There are potentially big performance
advantages if the FILTER statement allowed the semantics in the original description of
this Jira, because that would avoid multiple redundant passes through the data.

{code}
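-- Load (key, val) pairs and group them by key.
A = LOAD 'half.txt' AS (key:CHARARRAY, val:INT);
B = GROUP A BY key;
-- Compute the per-group average N and flatten each val next to it.
C = FOREACH B { N = AVG(A.val); GENERATE group, FLATTEN(A.val), N AS N; };
-- Keep only the values at or above their group's average.
D = FILTER C BY val >= N;
E = FOREACH D GENERATE group, val;
-- Regroup the surviving values by key (a second grouping step).
F = GROUP E BY group;
G = FOREACH F GENERATE group, E;
DUMP G;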
{code}

Input: half.txt
===================
A       1
A       2
A       3
B       1
B       3
====================
Result:
====================
(A,{(A,2),(A,3)})
(B,{(B,3)})
====================
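
For readers who want to check the expected result without running Pig, here is a minimal sketch in plain Python (not Pig Latin) of the semantics being asked for: one grouping step, then a per-group filter against that group's average. The function name `filter_above_group_average` and the in-memory `rows` list are assumptions made for illustration only; this says nothing about how Pig would execute either script.

```python
from collections import defaultdict

# Rows of half.txt as (key, val) pairs, copied from the input above.
rows = [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 3)]


def filter_above_group_average(rows):
    """Keep, per key, the rows whose val is >= that key's average val."""
    # Group the values by key (the equivalent of GROUP A BY key).
    groups = defaultdict(list)
    for key, val in rows:
        groups[key].append(val)

    # For each group, compute the average once and filter against it.
    result = {}
    for key, vals in groups.items():
        avg = sum(vals) / len(vals)
        result[key] = [(key, v) for v in vals if v >= avg]
    return result


result = filter_above_group_average(rows)
for key in sorted(result):
    print(key, result[key])
```

With the five input rows above, both groups have an average of 2, so the sketch keeps (A,2), (A,3) and (B,3), which matches the dumped result.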

> Semantics of Filter statement inside ForEach should support filtering on aliases used
in the Group statement preeceding it
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-772
>                 URL: https://issues.apache.org/jira/browse/PIG-772
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.3.0
>            Reporter: Viraj Bhat
>            Priority: Minor
>             Fix For: 0.3.0
>
>
> I have a Pig script which tries to display, for each group, the values that are greater
than or equal to the average value in the group.
> {code}
> A = LOAD 'half.txt' AS (key:CHARARRAY, val:INT);
> B = GROUP A BY key;
> C = FOREACH B {
>        N = AVG(A.val);
>        HALF = FILTER A by val >= N;
>     GENERATE
>        FLATTEN(GROUP),
>        HALF;
> };
> dump C;
> {code}
> Presently the semantics of the FILTER statement inside the FOREACH do not support these
types of operations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

