pig-dev mailing list archives

From "Rohini Palaniswamy (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (PIG-5309) Problem with tez + union + replicated join
Date Fri, 13 Oct 2017 22:42:00 GMT

     [ https://issues.apache.org/jira/browse/PIG-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy reassigned PIG-5309:
---------------------------------------

         Assignee: Rohini Palaniswamy
    Fix Version/s: 0.17.1
                   0.18.0

One of our users ran into this as well. It is not related to PIG-3856, which is an optimization
for the case where the same replicated join data has to be sent to multiple different vertices.
Here, the same replicated join data is being sent to a single vertex twice, which causes the
error (there can be only one edge between two vertices). In this script oldAFeatures,
newAFeatures, and BFeatures all join with the replicated table. The UnionOptimizer ensures there
is a single edge for oldAFeatures + newAFeatures (MultiQuery_Union_3/4 e2e testcases), but
another edge gets added for BFeatures, which is the issue.
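
To make the constraint concrete, here is a toy model of it. This is a sketch, not Tez or Pig code: ToyDag, add_edge, and build are hypothetical names. Reading scope-93 as the vertex that broadcasts the replicated table and scope-83 as the vertex that receives it is an assumption based on the edge direction in the reported exception. The dedupe flag only shows that the failure goes away when each (source, target) pair is added once; it is not a description of the actual fix.

```python
class ToyDag:
    """Toy stand-in for a DAG that allows at most one edge per (source, target) pair."""

    def __init__(self):
        self.edges = set()

    def add_edge(self, source, target):
        # Mirrors the rule quoted above: only one edge between two vertices.
        if (source, target) in self.edges:
            raise ValueError(f"Edge [{source}] -> [{target}] already defined!")
        self.edges.add((source, target))


def build(edge_requests, dedupe):
    """Add every requested edge; optionally skip pairs that were already requested."""
    dag = ToyDag()
    seen = set()
    for source, target in edge_requests:
        if dedupe:
            if (source, target) in seen:
                continue  # same broadcast to the same vertex: reuse the existing edge
            seen.add((source, target))
        dag.add_edge(source, target)
    return dag


# Assumed scenario: the replicated table (scope-93) is requested twice by the
# same consuming vertex (scope-83), once per replicated join.
requests = [("scope-93", "scope-83"), ("scope-93", "scope-83")]
```

Calling build(requests, dedupe=False) raises on the second request, which corresponds to the "already defined!" error in the report, while build(requests, dedupe=True) ends up with a single edge.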

> Problem with tez + union + replicated join
> ------------------------------------------
>
>                 Key: PIG-5309
>                 URL: https://issues.apache.org/jira/browse/PIG-5309
>             Project: Pig
>          Issue Type: Bug
>          Components: tez
>    Affects Versions: 0.17.0
>            Reporter: Will Oberman
>            Assignee: Rohini Palaniswamy
>            Priority: Minor
>             Fix For: 0.18.0, 0.17.1
>
>
> I've been using Pig 0.12.1 for quite some time and am finally upgrading to 0.17. One of my existing scripts failed. I have a workaround (SET pig.tez.opt.union false), but I thought I'd pass on the problem I observed.
> In stdout: 
> {noformat}
> ERROR 2017: Internal error creating job configuration.
> {noformat}
> In the Pig log:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: Edge [scope-93 : org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor] -> [scope-83 : org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor] ({ BROADCAST : org.apache.tez.runtime.library.input.UnorderedKVInput >> PERSISTED >> org.apache.tez.runtime.library.output.UnorderedKVOutput >> NullEdgeManager }) already defined!
> 	at org.apache.tez.dag.api.DAG.addEdge(DAG.java:272)
> 	at org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder.visitTezOp(TezDagBuilder.java:404)
> 	at org.apache.pig.backend.hadoop.executionengine.tez.plan.TezOperator.visit(TezOperator.java:259)
> 	at org.apache.pig.backend.hadoop.executionengine.tez.plan.TezOperator.visit(TezOperator.java:56)
> 	at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:87)
> 	at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:46)
> 	at org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler.buildDAG(TezJobCompiler.java:69)
> 	at org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler.getJob(TezJobCompiler.java:120)
> 	... 20 more
> {noformat}
> I played around with a minimum viable test script and can cause this to fail:
> {noformat}
> weblogs = LOAD '/tmp/in/weblogInfo' as (path:chararray, queryMap:map[chararray]); 
> featureToExtraData = LOAD '/tmp/in/featureToExtraData' as (feature:chararray, extraData:chararray);
> oldA = FILTER weblogs BY path == '/A';
> newA = FILTER weblogs BY path == '/somethingElse';
> B = FILTER weblogs BY path == '/B';
> oldAFeatures = FOREACH oldA GENERATE queryMap#'feature1' as feature1, queryMap#'feature2' as feature2;
> newAFeatures = FOREACH newA GENERATE queryMap#'different1' as feature1, queryMap#'different2' as feature2;
> AFeatures = UNION oldAFeatures, newAFeatures;
> AFeaturesPlusMore = JOIN AFeatures BY feature1 LEFT, featureToExtraData BY feature USING 'replicated';
> BFeatures = FOREACH B GENERATE queryMap#'somethingElseEntirely1' as feature1, queryMap#'somethingElseEntirely2' as feature2;
> BFeaturesPlusMore = JOIN BFeatures BY feature1 LEFT, featureToExtraData BY feature USING 'replicated';
> STORE AFeaturesPlusMore INTO '/tmp/out/1/AFeaturesPlusMore';
> STORE BFeaturesPlusMore INTO '/tmp/out/1/BFeaturesPlusMore';
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
