hadoop-pig-dev mailing list archives

From "Scott Carey (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1506) Need to clarify the difference between null handling in JOIN and COGROUP
Date Tue, 31 Aug 2010 22:56:53 GMT

    [ https://issues.apache.org/jira/browse/PIG-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904819#action_12904819 ]

Scott Carey commented on PIG-1506:

The SQL behavior of the above for an outer join would be to output five rows, just
as COGROUP would if flattened.  So that seems fine to me.  A self-join should be the same
as a COGROUP with yourself, which is different from a simple GROUP.

However, there is a problem with inner join and nulls.
Pig JOIN does not behave like SQL with respect to nulls in multi-column joins.  (I have not
tried this on trunk, however.)

In SQL, if ANY of the key columns in a multi-column join is null, the row is not output,
because an equality comparison against null never evaluates to true.


A = load 'small' as (name, age, gpa);
B = load 'small' as (name, age, gpa);
C = join A by (name,age), B by (name,age);
dump C;

The result for SQL would be one row of the form:
joe 5 2.5 joe 5 2.5
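
To make the SQL side of this concrete, here is a minimal sketch that runs the equivalent
inner self-join in SQLite through Python's sqlite3 module.  The rows in ROWS are a
hypothetical stand-in for the 'small' file, whose real contents are not shown in this
message; the table name and the helper sql_inner_self_join are likewise made up for
illustration.  It only demonstrates the SQL rule and says nothing about what Pig prints.

```python
import sqlite3

# Hypothetical stand-in for the 'small' file (assumed data, not from the issue):
# one fully populated row, plus rows with a null in one of the join keys.
ROWS = [
    ("joe", 5, 2.5),
    ("sam", None, 3.0),   # null age
    (None, 7, 3.5),       # null name
]


def sql_inner_self_join(rows):
    """Inner self-join on (name, age) using standard SQL null semantics."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE small (name TEXT, age INTEGER, gpa REAL)")
    conn.executemany("INSERT INTO small VALUES (?, ?, ?)", rows)
    # NULL = NULL is not true in SQL, so any row with a null key column
    # fails the ON condition and is dropped from the inner join.
    result = conn.execute(
        "SELECT a.name, a.age, a.gpa, b.name, b.age, b.gpa "
        "FROM small a JOIN small b "
        "ON a.name = b.name AND a.age = b.age"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in sql_inner_self_join(ROWS):
        print(*row)
```

With this assumed data, only the joe row survives the join; the rows with a null name or
a null age never match, not even against themselves.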

> Need to clarify the difference between null handling in JOIN and COGROUP
> ------------------------------------------------------------------------
>                 Key: PIG-1506
>                 URL: https://issues.apache.org/jira/browse/PIG-1506
>             Project: Pig
>          Issue Type: Improvement
>          Components: documentation
>            Reporter: Olga Natkovich
>            Assignee: Corinne Chandel
>             Fix For: 0.8.0

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
