hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1633) Using an alias withing Nested Foreach causes indeterminate behaviour
Date Tue, 21 Sep 2010 17:54:34 GMT

    [ https://issues.apache.org/jira/browse/PIG-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913120#action_12913120 ]

Alan Gates commented on PIG-1633:
---------------------------------

This is a design decision we made when implementing nested foreach.  Each expression in the
generate list has its own pipeline.  This had the advantage of being easy to implement.
The disadvantage is that certain operators (like your random function) are invoked multiple
times, once for each expression that references the alias.  This is inefficient, and for
nondeterministic functions it also produces strange results.  We could not think of any use
cases where users would need nondeterministic functions, so we did not worry about that too
much.  If you have a real use case we would be interested.
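
As a rough illustration of the behaviour described above (a simulation in Python, not
Pig's actual implementation), the sketch below evaluates the alias once per generate
expression.  The names nested_foreach and draw are hypothetical; draw stands in for the
reporter's math.RANDOMINT(4).

```python
def nested_foreach(rows, draw):
    """Simulate a nested foreach in which every generate expression
    re-evaluates the alias r through its own pipeline.

    draw is a hypothetical zero-argument callable standing in for
    math.RANDOMINT(4); it is called once per expression that uses r.
    """
    out = []
    for data in rows:
        r_for_random = draw()    # pipeline for "r as random"
        r_for_quarter = draw()   # pipeline for "((r == 3)?1:0) as quarter"
        out.append((data, r_for_random, 1 if r_for_quarter == 3 else 0))
    return out


# Deterministic draws make the effect easy to see: each row consumes two values.
seq = iter([3, 0, 0, 3])
print(nested_foreach(["j", "l"], lambda: next(seq)))
# prints [('j', 3, 0), ('l', 0, 1)]
```

With these draws the simulation yields the same kind of inconsistent rows as the
report, for example a random column of 3 next to a quarter column of 0.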

> Using an alias withing Nested Foreach causes indeterminate behaviour
> --------------------------------------------------------------------
>
>                 Key: PIG-1633
>                 URL: https://issues.apache.org/jira/browse/PIG-1633
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.4.0, 0.5.0, 0.6.0, 0.7.0
>            Reporter: Viraj Bhat
>
> I have created a RANDOMINT function that generates random integers from 0 up to, but not
> including, the specified value. For example, RANDOMINT(4) returns random integers between 0 and 3 (inclusive).
> {code}
> $hadoop fs -cat rand.dat
> f
> g
> h
> i
> j
> k
> l
> m
> {code}
> The pig script is as follows:
> {code}
> register math.jar;
> A = load 'rand.dat' using PigStorage() as (data);
> B = foreach A {
>         r = math.RANDOMINT(4);
>         generate
>                 data,
>                 r as random,
>                 ((r == 3)?1:0) as quarter;
>         };
> dump B;
> {code}
> The results are as follows:
> {code}
> {color:red} 
> (f,0,0)
> (g,3,0)
> (h,0,0)
> (i,2,0)
> (j,3,0)
> (k,2,0)
> (l,0,1)
> (m,1,0)
> {color} 
> {code}
> Note the row (j,3,0): r is defined once in the nested foreach block but is evaluated
> separately for each use in the generate clause, so the two uses yield different values.
> Rewriting the script as shown below solves the issue. The M/R jobs produced by both scripts
> are the same, so the nested form is just a matter of convenience.
> {code}
> A = load 'rand.dat' using PigStorage() as (data);
> B = foreach A generate
>         data,
>         math.RANDOMINT(4) as r;
> C = foreach B generate
>         data,
>         r,
>         ((r == 3)?1:0) as quarter;
> dump C;
> {code}
> Is this issue related to PIG-747?
> Viraj
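
The reporter's two-step workaround can be sketched the same way (again a Python
simulation under assumed names, not Pig code).  Here two_step_foreach and draw are
hypothetical; draw stands in for math.RANDOMINT(4).  Because the first step stores
the value as a field, the second step reads the stored value instead of calling the
function again.

```python
def two_step_foreach(rows, draw):
    """Simulate the workaround: B materializes r once per row,
    then C derives both output columns from the stored field."""
    # B = foreach A generate data, math.RANDOMINT(4) as r;
    b = [(data, draw()) for data in rows]
    # C = foreach B generate data, r, ((r == 3)?1:0) as quarter;
    return [(data, r, 1 if r == 3 else 0) for data, r in b]


# One draw per row, so the two derived columns always agree.
seq = iter([3, 0])
print(two_step_foreach(["j", "l"], lambda: next(seq)))
# prints [('j', 3, 1), ('l', 0, 0)]
```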

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

