pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Cheolsoo Park (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-3902) PigServer creates cycle
Date Mon, 05 May 2014 17:28:16 GMT

    [ https://issues.apache.org/jira/browse/PIG-3902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989739#comment-13989739

Cheolsoo Park commented on PIG-3902:

{quote}Presumably this happens elsewhere but I only ran the multiquery tests with my patch.{quote}
Here are unit test failures-
>>> org.apache.pig.test.TestGrunt.testIllustrate 	0.23 sec	1
>>> org.apache.pig.test.TestNativeMapReduce.testNativeMRJobSimple 	1 min 27 sec	1
>>> org.apache.pig.test.TestNativeMapReduce.testNativeMRJobMultiStoreOnPred 	1 min
26 sec	1
>>> org.apache.pig.test.TestPigStorage.testSchemaConversion2 	6.6 sec	1
>>> org.apache.pig.test.TestPigStorage.testSchemaConversion 

> PigServer creates cycle
> -----------------------
>                 Key: PIG-3902
>                 URL: https://issues.apache.org/jira/browse/PIG-3902
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.11.1
>            Reporter: Jacob Perkins
>            Assignee: Cheolsoo Park
>             Fix For: 0.11.1
>         Attachments: broken-plan.png, cycle.diff, multiquery-cycle.diff
> Under certain conditions PigServer creates a cycle in the logical plan. Consider the
following pseudocode:
> {code}
> A = load from 'A' using F1;
> ...process...
> B = store X into 'B' using F2;
> C = load from 'B' using F3;
> ...process...
> D = store Y into 'A' using F1;
> {code}
> PigServer will, in ordinary cases, notice that an output path is equal to an input path,
and, if there's no path from the input to the output, make the input a dependency of the output.
However, PigServer orders the loads and stores arbitrarily during that logic. Sometimes, in
the code above, C is correctly wired as a dependency of B and, since that creates a path from
A to D, A won't be made a dependency of D and we're good. On occasion though, the ordering
being arbitrary, A is wired as a dependency of D. That's no good. To be fair, it's not actually
a cycle, since when A is wired to D, there's a path between C and B so the cycle won't actually
get created. But it's still a broken plan.
> The offending PigServer code: https://github.com/apache/pig/blob/branch-0.11/src/org/apache/pig/PigServer.java#L1678-L1693
> And here's some actual pig code that should reproduce the broken plan. Notice I had to
use a store function that wouldn't check the output. If you're just using PigStorage this
won't be reproducible since you can't write to the same location you read from in that case.
> {code}
> A = load '$A' as (line:chararray);
> A = foreach A generate flatten(TOKENIZE(LOWER(line))) as token;
> store A into '$B';
> B = load '$B' as (token:chararray);
> B = filter B by SIZE(token) > 3;
> store B into '$A' using org.apache.pig.piggybank.storage.DBStorage('com.mysql.jdbc.Driver',
'dbc:mysql://localhost/test', 'INSERT INTO foobar (token) VALUES(?)');
> {code}
> As far as a fix goes... I'd love some input. I've got some workarounds in mind for the
specific use case that brought this up, but the general problem is more difficult. 
> As an aside, there's other issues with the PigServer code referenced above. For example,
it should almost certainly be using the full path (after LoadFunc/StoreFunc.relativeToAbsolutePath)
no? Try storing to a relative path then loading from the absolute representation of that path
in the same script... Also, why isn't it checking the FuncSpec as well as the location? Just
trying to open up the discussion.

This message was sent by Atlassian JIRA

View raw message