pig-dev mailing list archives

From "Zbigniew Rzepka (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PIG-4695) Using 'replicated' left join results in different result from regular left join.
Date Wed, 07 Oct 2015 14:03:26 GMT

     [ https://issues.apache.org/jira/browse/PIG-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zbigniew Rzepka updated PIG-4695:
---------------------------------
    Description: 
There seems to be a difference in results between a regular LEFT JOIN and a replicated LEFT JOIN.
This may happen only with very small data sets, as we're using the piece of code shown below
in production with correct results.
EDIT:
This issue only occurs when running Pig on Tez. (We're using Tez 7.0).

Example:
I have two data sets:

first_period_users:
{code}
(108,11,all_users,all_users)
(108,13,all_users,all_users)
(108,17,all_users,all_users)
(138,11,all_users,all_users)
{code}
second_period_users:
{code}
(108,11,all_users,all_users)
(108,13,all_users,all_users)
{code}

When I use regular LEFT JOIN on these two I get the correct output:
{code:sql}
joined_periods_users = JOIN 
$first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT,
$second_period_users BY (user_id, gg_id, dimension_name, dimension_value);
{code}

output:
{code}
(108,11,all_users,all_users,108,11,all_users,all_users)
(138,11,all_users,all_users,,,,)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,17,all_users,all_users,,,,)
{code}

BUT, if I add {{USING 'replicated'}}, the result is completely different:
{code}
$joined_periods_users = JOIN 
$first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT,
$second_period_users BY (user_id, gg_id, dimension_name, dimension_value) 
USING 'replicated';
{code}
output:
{code}
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,17,all_users,all_users,,,,)
(138,11,all_users,all_users,,,,)
{code}
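
For reference, the expected left outer join semantics can be sketched in plain Python. This is only a model of the join semantics, not Pig or Tez code; the helper name left_outer_join and the use of None for null fields are assumptions made for illustration. It shows that the replicated output above contains the same distinct rows as the regular join, but with each matched row repeated 7 times (16 rows instead of 4).

```python
from collections import Counter

# Input relations copied from the report.
first_period_users = [
    (108, 11, "all_users", "all_users"),
    (108, 13, "all_users", "all_users"),
    (108, 17, "all_users", "all_users"),
    (138, 11, "all_users", "all_users"),
]
second_period_users = [
    (108, 11, "all_users", "all_users"),
    (108, 13, "all_users", "all_users"),
]


def left_outer_join(left, right):
    """Reference left outer join keyed on all four fields (hypothetical helper)."""
    index = {}
    for row in right:
        index.setdefault(row, []).append(row)
    joined = []
    for row in left:
        matches = index.get(row)
        if matches:
            joined.extend(row + match for match in matches)
        else:
            # Unmatched left rows are padded with nulls (None here).
            joined.append(row + (None,) * len(row))
    return joined


expected = left_outer_join(first_period_users, second_period_users)

# Output reported for USING 'replicated' on Tez: matched rows appear 7 times each.
matched_11 = (108, 11, "all_users", "all_users") * 2
matched_13 = (108, 13, "all_users", "all_users") * 2
observed = (
    [matched_11] * 7
    + [matched_13] * 7
    + [
        (108, 17, "all_users", "all_users", None, None, None, None),
        (138, 11, "all_users", "all_users", None, None, None, None),
    ]
)

# Rows present in the observed output beyond what the join should produce.
extra = Counter(observed) - Counter(expected)

print("expected rows:", len(expected))
print("observed rows:", len(observed))
print("same distinct rows:", set(observed) == set(expected))
print("surplus duplicates:", sum(extra.values()))
```

Since the distinct rows agree, the discrepancy looks like row duplication on the matched keys rather than incorrect matching, which may help narrow down where the replicated join goes wrong on Tez.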

  was:
There seems to be a difference in results between a regular LEFT JOIN and a replicated LEFT JOIN.
This may happen only with very small data sets, as we're using the piece of code shown below
in production with correct results.

Example:
I have two data sets:

first_period_users:
{code}
(108,11,all_users,all_users)
(108,13,all_users,all_users)
(108,17,all_users,all_users)
(138,11,all_users,all_users)
{code}
second_period_users:
{code}
(108,11,all_users,all_users)
(108,13,all_users,all_users)
{code}

When I use regular LEFT JOIN on these two I get the correct output:
{code:sql}
joined_periods_users = JOIN 
$first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT,
$second_period_users BY (user_id, gg_id, dimension_name, dimension_value);
{code}

output:
{code}
(108,11,all_users,all_users,108,11,all_users,all_users)
(138,11,all_users,all_users,,,,)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,17,all_users,all_users,,,,)
{code}

BUT, if I add {{USING 'replicated'}}, the result is completely different:
{code}
$joined_periods_users = JOIN 
$first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT,
$second_period_users BY (user_id, gg_id, dimension_name, dimension_value) 
USING 'replicated';
{code}
output:
{code}
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,17,all_users,all_users,,,,)
(138,11,all_users,all_users,,,,)
{code}


> Using 'replicated' left join results in different result from regular left join.
> --------------------------------------------------------------------------------
>
>                 Key: PIG-4695
>                 URL: https://issues.apache.org/jira/browse/PIG-4695
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.15.0
>            Reporter: Zbigniew Rzepka



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
