pig-dev mailing list archives

From "Michael Prim (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PIG-4789) Pig on TEZ creates wrong result with replicated join
Date Thu, 28 Jan 2016 14:00:43 GMT

     [ https://issues.apache.org/jira/browse/PIG-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Prim updated PIG-4789:
------------------------------
    Attachment: tez_bug.pig
                tez_bug_input3.csv
                tez_bug_input2.csv
                tez_bug_input1.csv

Attached the sample input files and the Pig script; attaching them was not possible during ticket creation.

> Pig on TEZ creates wrong result with replicated join
> ----------------------------------------------------
>
>                 Key: PIG-4789
>                 URL: https://issues.apache.org/jira/browse/PIG-4789
>             Project: Pig
>          Issue Type: Bug
>          Components: tez
>    Affects Versions: 0.15.0
>            Reporter: Michael Prim
>            Priority: Critical
>         Attachments: tez_bug.pig, tez_bug_input1.csv, tez_bug_input2.csv, tez_bug_input3.csv
>
>
> Below is a minimal Pig script that uses SPLIT and replicated joins and whose output
> differs between the MapReduce and TEZ execution engines. The attachments also contain
> the sample input data.
> The expected output, as produced by the MapReduce engine, is:
> {code}
> (id1,123,A,)
> (id2,234,,B)
> (id3,456,,)
> (id4,567,A,)
> {code}
> whereas TEZ produces:
> {code}
> (id1,123,A,A)
> (id2,234,B,B)
> (id3,456,,)
> (id4,567,A,A)
> {code}
> Removing the {{USING 'replicated'}} clause and using a regular join yields correct results.
> I am not sure whether this is a Pig issue or a TEZ issue. However, because it can silently
> corrupt data, I rated it critical. So far, searching has not turned up a similar bug report
> or anyone else who is aware of it.
> {code}
> classdata = LOAD '/tez_bug_input1.csv' USING PigStorage(',') AS (classid:chararray, class:chararray);
> data = LOAD '/tez_bug_input2.csv' USING PigStorage(',') AS (eventid:chararray, classid:chararray);
> basedata = LOAD '/tez_bug_input3.csv' USING PigStorage(',') AS (eventid:chararray, foo:int);
> dataJclassdata = JOIN classdata BY classid, data BY classid;
> SPLIT dataJclassdata INTO classA IF class == 'A', classB IF class == 'B';
> dataA = JOIN basedata BY eventid LEFT OUTER, classA BY data::eventid USING 'replicated';
> dataA = foreach dataA generate basedata::eventid as eventid
> 	, basedata::foo as foo
> 	, classA::classdata::class as classA;
> dataB = JOIN dataA BY eventid LEFT OUTER, classB BY eventid USING 'replicated';
> dataB = foreach dataB generate dataA::eventid as eventid
> 	, dataA::foo as foo
> 	, dataA::classA as classA
>     , classB::classdata::class as classB;
> DUMP dataB;
> {code}
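
A possible workaround sketch, based only on the observation in the description that a regular join yields correct results: the same script with {{USING 'replicated'}} dropped from both joins. Paths and aliases are those of the attached script, and nothing else is changed. This trades the map-side replicated join for Pig's default shuffle join, so it may be slower on large inputs, and it should be read as a stopgap rather than a fix for the underlying issue.

{code}
-- Same inputs and schemas as in tez_bug.pig.
classdata = LOAD '/tez_bug_input1.csv' USING PigStorage(',') AS (classid:chararray, class:chararray);
data = LOAD '/tez_bug_input2.csv' USING PigStorage(',') AS (eventid:chararray, classid:chararray);
basedata = LOAD '/tez_bug_input3.csv' USING PigStorage(',') AS (eventid:chararray, foo:int);
dataJclassdata = JOIN classdata BY classid, data BY classid;
SPLIT dataJclassdata INTO classA IF class == 'A', classB IF class == 'B';
-- Default (shuffle) left outer join instead of USING 'replicated'.
dataA = JOIN basedata BY eventid LEFT OUTER, classA BY data::eventid;
dataA = foreach dataA generate basedata::eventid as eventid
    , basedata::foo as foo
    , classA::classdata::class as classA;
-- Default (shuffle) left outer join instead of USING 'replicated'.
dataB = JOIN dataA BY eventid LEFT OUTER, classB BY eventid;
dataB = foreach dataB generate dataA::eventid as eventid
    , dataA::foo as foo
    , dataA::classA as classA
    , classB::classdata::class as classB;
DUMP dataB;
{code}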



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
