pig-dev mailing list archives

From "Rohini Palaniswamy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-4789) Pig on TEZ creates wrong result with replicated join
Date Fri, 05 Feb 2016 23:19:39 GMT

    [ https://issues.apache.org/jira/browse/PIG-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135234#comment-15135234 ]

Rohini Palaniswamy commented on PIG-4789:

bq. Also, if you have an idea which patch fixes this, we could think about cherry-picking it;
running trunk in production is unfortunately not an option.
   It should definitely be one of the fixes that went into MultiQueryOptimizerTez. But many
patches have gone in with fixes and with dependencies on other classes, so it might be hard to
cherry-pick a single one anymore, or even to copy that whole class without also copying other
classes like TezCompiler, TezLauncher, UnionOptimizer, TezOperator, etc.

bq. Is there some ETA for a new 0.16.0 or 0.15.1 release soon, which would include these fixes?
  We are planning to release 0.15.1 in the next two weeks and 0.16 in three months. We are
thinking of merging all Tez-related changes into 0.15.1, so it should be fixed for you in 0.15.1.

> Pig on TEZ creates wrong result with replicated join
> ----------------------------------------------------
>                 Key: PIG-4789
>                 URL: https://issues.apache.org/jira/browse/PIG-4789
>             Project: Pig
>          Issue Type: Bug
>          Components: tez
>    Affects Versions: 0.15.0
>            Reporter: Michael Prim
>            Priority: Critical
>         Attachments: tez_bug.pig, tez_bug_input1.csv, tez_bug_input2.csv, tez_bug_input3.csv
> Please find below a minimal example of a Pig script that uses splits and replicated joins
> and where the output differs between MapReduce and TEZ as the execution engine. The attachment
> also contains the sample input data.
> The expected output, as produced by the MapReduce engine, is:
> {code}
> (id1,123,A,)
> (id2,234,,B)
> (id3,456,,)
> (id4,567,A,)
> {code}
> whereas TEZ produces
> {code}
> (id1,123,A,A)
> (id2,234,B,B)
> (id3,456,,)
> (id4,567,A,A)
> {code}
> Removing the {{USING 'replicated'}} and using a regular join yields correct results.
> I am not sure if this is a Pig issue or a TEZ issue. However, because this issue can silently
> lead to data corruption, I rated it critical. So far, searching has not turned up a similar
> bug report or anyone who is aware of it.
> {code}
> classdata = LOAD '/tez_bug_input1.csv' USING PigStorage(',') AS (classid:chararray, class:chararray);
> data = LOAD '/tez_bug_input2.csv' USING PigStorage(',') AS (eventid:chararray, classid:chararray);
> basedata = LOAD '/tez_bug_input3.csv' USING PigStorage(',') AS (eventid:chararray, foo:int);
> dataJclassdata = JOIN classdata BY classid, data BY classid;
> SPLIT dataJclassdata INTO classA IF class == 'A', classB IF class == 'B';
> dataA = JOIN basedata BY eventid LEFT OUTER, classA BY data::eventid USING 'replicated';
> dataA = foreach dataA generate basedata::eventid as eventid
> 	, basedata::foo as foo
> 	, classA::classdata::class as classA;
> dataB = JOIN dataA BY eventid LEFT OUTER, classB BY eventid USING 'replicated';
> dataB = foreach dataB generate dataA::eventid as eventid
> 	, dataA::foo as foo
> 	, dataA::classA as classA
>     , classB::classdata::class as classB;
> DUMP dataB;
> {code}
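
For readers without access to the attachments, the expected MapReduce result can be cross-checked with a small reference model of the script's join logic. The following is only a sketch in Python, not Pig or Tez code. The input rows are hypothetical, chosen so that they reproduce the expected output quoted above (the real attachment contents are not shown in this message), and the helper names left_outer, run_pipeline and dump are illustrative.

```python
# Hypothetical input rows; the actual CSV attachments are not part of this message.
classdata = [("c1", "A"), ("c2", "B")]                      # (classid, class)
data = [("id1", "c1"), ("id2", "c2"), ("id4", "c1")]       # (eventid, classid)
basedata = [("id1", 123), ("id2", 234), ("id3", 456), ("id4", 567)]  # (eventid, foo)


def left_outer(left, right, left_key, right_key, right_value):
    """Left outer join: keep every left row, appending None when nothing matches."""
    out = []
    for row in left:
        matches = [right_value(r) for r in right if right_key(r) == left_key(row)]
        for value in matches or [None]:
            out.append(row + (value,))
    return out


def run_pipeline(classdata, data, basedata):
    # JOIN classdata BY classid, data BY classid  (inner join)
    joined = [(cid, cls, eid)
              for (cid, cls) in classdata
              for (eid, dcid) in data
              if cid == dcid]
    # SPLIT ... INTO classA IF class == 'A', classB IF class == 'B'
    class_a = [r for r in joined if r[1] == "A"]
    class_b = [r for r in joined if r[1] == "B"]
    # First LEFT OUTER join adds the classA column, second adds the classB column.
    data_a = left_outer(basedata, class_a, lambda r: r[0], lambda r: r[2], lambda r: r[1])
    data_b = left_outer(data_a, class_b, lambda r: r[0], lambda r: r[2], lambda r: r[1])
    return data_b


def dump(rows):
    """Format rows the way DUMP prints them, with nulls shown as empty fields."""
    return ["(" + ",".join("" if v is None else str(v) for v in row) + ")" for row in rows]


result = dump(run_pipeline(classdata, data, basedata))
for line in result:
    print(line)
```

With these assumed rows, each event belongs to at most one class, so a row should never carry a value in both the classA and classB columns; the TEZ output quoted above violates exactly that property.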

This message was sent by Atlassian JIRA