hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Richard Ding (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-976) Multi-query optimization throws ClassCastException
Date Thu, 08 Oct 2009 16:26:31 GMT

     [ https://issues.apache.org/jira/browse/PIG-976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Richard Ding updated PIG-976:
-----------------------------

    Attachment: PIG-976.patch

It turns out that there is another case that needs to consider.

The following two scripts represent these different cases:

{code}
A = LOAD 'input';
B = GROUP A ALL;
C = FOREACH B GENERATE COUNT(A);
D = STORE C INTO 'output';
{code}

and 

{code}
A = LOAD 'input';
B = GROUP A BY $0;
C = FOREACH B GENERATE COUNT(A),  group;
D = STORE C INTO 'output';
{code}

When running these scripts, the key ('group') value is not the first field of the output tuple
of the combiner package. In the second case, the key ('group') is part of the final output.

In this patch, multi-query optimizer considers both cases when merging combiners and reducers.
 

> Multi-query optimization throws ClassCastException
> --------------------------------------------------
>
>                 Key: PIG-976
>                 URL: https://issues.apache.org/jira/browse/PIG-976
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.4.0
>            Reporter: Ankur
>            Assignee: Richard Ding
>         Attachments: PIG-976.patch, PIG-976.patch
>
>
> Multi-query optimization fails to merge 2 branches when 1 is a result of Group By ALL
and another is a result of Group By field1 where field 1 is of type long. Here is the script
that fails with multi-query on.
> data = LOAD 'test' USING PigStorage('\t') AS (a:long, b:double, c:double); 
> A = GROUP data ALL;
> B = FOREACH A GENERATE SUM(data.b) AS sum1, SUM(data.c) AS sum2;
> C = FOREACH B GENERATE (sum1/sum2) AS rate; 
> STORE C INTO 'result1';
> D = GROUP data BY a; 
> E = FOREACH D GENERATE group AS a, SUM(data.b), SUM(data.c);
> STORE E into 'result2';
>  
> Here is the exception from the logs
> java.lang.ClassCastException: org.apache.pig.data.DefaultTuple cannot be cast to org.apache.pig.data.DataBag
> 	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:399)
> 	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:180)
> 	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:145)
> 	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:197)
> 	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:235)
> 	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254)
> 	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
> 	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
> 	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:240)
> 	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.runPipeline(PODemux.java:264)
> 	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.getNext(PODemux.java:254)
> 	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:196)
> 	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:174)
> 	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:63)
> 	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:906)
> 	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:786)
> 	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:228)
> 	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2206)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message