hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1289) PIG Join fails while doing a filter on joined data
Date Wed, 17 Mar 2010 00:23:27 GMT

    [ https://issues.apache.org/jira/browse/PIG-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846228#action_12846228 ]

Alan Gates commented on PIG-1289:
---------------------------------

In the case of 

D = filter C by t > 0

the filter will evaluate to null when t is null.  By definition, filters return only records
that evaluate to true.  So t > 0 will have the effect of filtering out all outer records of
A, because t will be null for every one of them.  That is, it turns the join into an inner
join.  However, if the filter is pushed above the join, it will remain an outer join, since
the filter is then applied only to B (keeping the records where t > 0) and never sees the
outer records from A.  Thus this transformation is not output neutral.
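
To make the difference concrete, here is a minimal sketch that models the two plans in
Python rather than Pig Latin.  The relations A and B, their contents, and the helper names
(left_outer_join, t_positive, keep) are hypothetical and chosen only for illustration; they
are not taken from the issue or from Pig's implementation.  None stands in for null.

```python
# Hypothetical sample data: A holds keys only, B holds (key, t).
A = [(1,), (2,), (3,)]
B = [(1, 5), (2, -1)]


def left_outer_join(left, right):
    """Join on the first field; unmatched left rows get None (null) for t."""
    out = []
    for (k,) in left:
        matches = [t for (rk, t) in right if rk == k]
        if matches:
            out.extend((k, t) for t in matches)
        else:
            out.append((k, None))
    return out


def t_positive(t):
    """Three-valued 't > 0': the result is null (None) when t is null."""
    return None if t is None else t > 0


def keep(rows, pred):
    """A filter keeps only the records whose predicate is exactly true."""
    return [r for r in rows if pred(r) is True]


# Plan 1: filter after the join (C = A left outer join B; D = filter C by t > 0).
filter_after = keep(left_outer_join(A, B), lambda r: t_positive(r[1]))

# Plan 2: filter pushed above the join (filter B by t > 0, then join).
filter_before = left_outer_join(A, keep(B, lambda r: t_positive(r[1])))

print(filter_after)   # [(1, 5)]
print(filter_before)  # [(1, 5), (2, None), (3, None)]
```

With this sample data, plan 1 drops every record of A that has no positive t, so it behaves
like an inner join, while plan 2 still returns every record of A.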

> PIG Join fails while doing a filter on joined data
> --------------------------------------------------
>
>                 Key: PIG-1289
>                 URL: https://issues.apache.org/jira/browse/PIG-1289
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Karim Saadah
>            Assignee: Daniel Dai
>            Priority: Minor
>             Fix For: 0.7.0
>
>         Attachments: PIG-1289-1.patch
>
>
> PIG Join fails while doing a filter on joined data
> Here are the steps to reproduce it:
> -bash-3.1$ pig -latest -x local
> grunt> a = load 'first.dat' using PigStorage('\u0001') as (f1:int, f2:chararray);
> grunt> DUMP a;
> (1,A)
> (2,B)
> (3,C)
> (4,D)
> grunt> b = load 'second.dat' using PigStorage() as (f3:chararray);
> grunt> DUMP b;
> (A)
> (D)
> (E)
> grunt> c = join a by f2 LEFT OUTER, b by f3;
> grunt> DUMP c;
> (1,A,A)
> (2,B,)
> (3,C,)
> (4,D,D)
> grunt> describe c;
> c: {a::f1: int,a::f2: chararray,b::f3: chararray}
> grunt> d = filter c by (f3 is null or f3 =='');
> grunt> dump d;
> 2010-03-03 15:00:37,129 [main] INFO  org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for b
> 2010-03-03 15:00:37,129 [main] INFO  org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for b
> 2010-03-03 15:00:37,129 [main] INFO  org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for a
> 2010-03-03 15:00:37,130 [main] INFO  org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for a
> 2010-03-03 15:00:37,130 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1002: Unable to store alias d
> This one is failing too:
> grunt> d = filter c by (b::f3 is null or b::f3 =='');
> or this one not returning results as expected:
> grunt> d = foreach c generate f1 as f1, f2 as f2, f3 as f3;
> grunt> e = filter d by (f3 is null or f3 =='');
> grunt> DUMP e;
> (1,A,)
> (2,B,)
> (3,C,)
> (4,D,)
> while the expected result is
> (2,B,)
> (3,C,)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

