pig-dev mailing list archives

From "Rohini Palaniswamy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-4695) Using 'replicated' left join results in different result from regular left join.
Date Tue, 13 Oct 2015 04:10:05 GMT

    [ https://issues.apache.org/jira/browse/PIG-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954341#comment-14954341 ]

Rohini Palaniswamy commented on PIG-4695:
-----------------------------------------

With the current trunk code, I get the correct results. I haven't checked with 0.15, though.

> Using 'replicated' left join results in different result from regular left join.
> --------------------------------------------------------------------------------
>
>                 Key: PIG-4695
>                 URL: https://issues.apache.org/jira/browse/PIG-4695
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.15.0
>            Reporter: Zbigniew Rzepka
>
> There seems to be a difference in results between a regular LEFT JOIN and a replicated
> LEFT JOIN. This may happen only with very small data sets, as we're using the piece of
> code shown below in production with correct results.
> EDIT:
> This issue only occurs when running Pig on Tez. (We're using Tez 7.0).
> Example:
> I have two data sets:
> first_period_users:
> {code}
> (108,11,all_users,all_users)
> (108,13,all_users,all_users)
> (108,17,all_users,all_users)
> (138,11,all_users,all_users)
> {code}
> second_period_users:
> {code}
> (108,11,all_users,all_users)
> (108,13,all_users,all_users)
> {code}
> When I use regular LEFT JOIN on these two I get the correct output:
> {code:sql}
> joined_periods_users = JOIN 
> $first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT,
> $second_period_users BY (user_id, gg_id, dimension_name, dimension_value);
> {code}
> output:
> {code}
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (138,11,all_users,all_users,,,,)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,17,all_users,all_users,,,,)
> {code}
> BUT, if I add {{USING 'replicated'}}, the result is completely different:
> {code}
> $joined_periods_users = JOIN 
> $first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT,
> $second_period_users BY (user_id, gg_id, dimension_name, dimension_value) 
> USING 'replicated';
> {code}
> output:
> {code}
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,17,all_users,all_users,,,,)
> (138,11,all_users,all_users,,,,)
> {code}
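
For anyone trying to reproduce this, the report above can be condensed into a single script. This is only a sketch under assumptions that are not in the original report: the file names, the comma-separated input layout, and the field types are hypothetical, and the relations are loaded directly instead of being passed in as {{$}} parameters. In the quoted replicated output, each matched row appears seven times instead of once, so comparing the two DUMP results should show quickly whether a given build is affected.

```pig
-- Hypothetical input files, one comma-separated record per line, for example
--   first_period_users.csv:   108,11,all_users,all_users
-- Contents are the two data sets quoted in the report, without parentheses.
-- Field types are an assumption; the report does not state the schema.

first_period_users = LOAD 'first_period_users.csv' USING PigStorage(',')
    AS (user_id:int, gg_id:int, dimension_name:chararray, dimension_value:chararray);

second_period_users = LOAD 'second_period_users.csv' USING PigStorage(',')
    AS (user_id:int, gg_id:int, dimension_name:chararray, dimension_value:chararray);

-- Regular left outer join: the report shows 4 output rows for this form.
regular_join = JOIN
    first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT,
    second_period_users BY (user_id, gg_id, dimension_name, dimension_value);

-- Replicated left outer join: same keys, with the smaller relation listed last.
replicated_join = JOIN
    first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT,
    second_period_users BY (user_id, gg_id, dimension_name, dimension_value)
    USING 'replicated';

DUMP regular_join;
DUMP replicated_join;
```

Saved as, say, repro.pig (a hypothetical name), the script can be run with {{pig -x tez repro.pig}} and again with {{pig -x mapreduce repro.pig}}. Since the report says the problem only shows up on Tez, the two engines would be expected to disagree only on the replicated join, if at all.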



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
