pig-dev mailing list archives

From "Daniel Dai (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1831) Indeterministic behavior in local mode due to static variable PigMapReduce.sJobConf
Date Mon, 31 Jan 2011 21:47:28 GMT

    [ https://issues.apache.org/jira/browse/PIG-1831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988910#comment-12988910
] 

Daniel Dai commented on PIG-1831:
---------------------------------

This issue is caused by a race condition on the static variable PigMapReduce.sJobConf in local
mode. In local mode, all mapreduce jobs share a single VM, and Pig overwrites the static variable
PigMapReduce.sJobConf each time it launches a new mapreduce job. When multiple mapreduce jobs
launch simultaneously, one job may use the config of another job, which causes
nondeterministic behavior. Options to fix this issue are:
1. Force local mode to run mapreduce jobs sequentially, if there is a way.
2. Make sJobConf a map keyed by mapreduce job id. However, some UDFs use sJobConf, so
this could break backward compatibility.
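
To make option 2 concrete, here is a minimal Java sketch. It is only an illustration under stated assumptions: the class JobConfSketch, its register and confFor methods, the "output" key, and the job ids are all hypothetical, and a plain Map<String, String> stands in for a real job configuration. None of this is Pig's actual code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch; these names are illustrative and are not Pig's actual API. */
final class JobConfSketch {

    // The hazard: a single static slot shared by every job in the JVM,
    // playing the role that PigMapReduce.sJobConf plays in the comment above.
    static volatile Map<String, String> sharedConf;

    // Option 2: keep one configuration per job id instead of one per JVM.
    private static final Map<String, Map<String, String>> CONF_BY_JOB =
            new ConcurrentHashMap<>();

    /** Stores a read-only copy of the configuration for one job. */
    static void register(String jobId, Map<String, String> conf) {
        CONF_BY_JOB.put(jobId, Collections.unmodifiableMap(new HashMap<>(conf)));
    }

    /** Returns the configuration registered for jobId, or null if there is none. */
    static Map<String, String> confFor(String jobId) {
        return CONF_BY_JOB.get(jobId);
    }

    public static void main(String[] args) {
        Map<String, String> job1 = Collections.singletonMap("output", "tmp-1");
        Map<String, String> job2 = Collections.singletonMap("output", "tmp-2");

        // Two launches in a row overwrite the shared slot, so job 1 would
        // now read job 2's value.
        sharedConf = job1;
        sharedConf = job2;
        System.out.println("shared slot, read by job 1: " + sharedConf.get("output"));

        // Keyed lookup: each job reads back its own configuration.
        register("job_1", job1);
        register("job_2", job2);
        System.out.println("keyed lookup for job 1: " + confFor("job_1").get("output"));
    }
}
```

The backward-compatibility concern still applies to this sketch: a UDF that reads the static field directly has no job id to pass to a lookup like confFor, so the field would have to keep working for existing callers.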

> Indeterministic behavior in local mode due to static variable PigMapReduce.sJobConf
> -----------------------------------------------------------------------------------
>
>                 Key: PIG-1831
>                 URL: https://issues.apache.org/jira/browse/PIG-1831
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Vivek Padmanabhan
>            Assignee: Daniel Dai
>
> The script below gives me different output when run in local mode. It looks like in
> local mode I have to store a relation obtained through streaming in order to use it afterwards.
> For example, consider the script below:
> DEFINE MySTREAMUDF `test.sh`;
> A  = LOAD 'myinput' USING PigStorage() AS (myId:chararray, data2, data3,data4 );
> B = STREAM A THROUGH MySTREAMUDF AS (wId:chararray, num:int);
> --STORE B into 'output.B';
> C = JOIN B by wId LEFT OUTER, A by myId;
> D = FOREACH C GENERATE B::wId,B::num,data4 ;
> D = STREAM D THROUGH MySTREAMUDF AS (f1:chararray,f2:int);
> --STORE D into 'output.D';
> E = foreach B GENERATE wId,num;
> F = DISTINCT E;
> G = GROUP F ALL;
> H = FOREACH G GENERATE COUNT_STAR(F) as TotalCount;
> I = CROSS D,H;
> STORE I  into 'output.I';
> test.sh
> ---------
> #!/bin/bash
> cut -f1,3
> And the input is:
> abcd    label1  11      feature1
> acbd    label2  22      feature2
> adbc    label3  33      feature3
> Here, if I store relations B and D, then every time I get the result:
> acbd            3
> abcd            3
> adbc            3
> But if I don't store relations B and D, then I get empty output. Again, I have
> observed that this behaviour is random, i.e. sometimes (about 1 out of 5 runs) there is output.


-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
