From: "Daniel Dai (JIRA)"
To: pig-dev@hadoop.apache.org
Reply-To: pig-dev@hadoop.apache.org
Date: Tue, 13 Apr 2010 20:37:58 -0400 (EDT)
Subject: [jira] Updated: (PIG-1374) PushDownForeachFlatten shall not push ForEach below Join if the flattened fields is used in the next statement
Message-ID: <9741091.84961271205478934.JavaMail.jira@thor>
In-Reply-To: <8816902.11731271121469856.JavaMail.jira@thor>

[ https://issues.apache.org/jira/browse/PIG-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1374:
----------------------------
    Summary: PushDownForeachFlatten shall not push ForEach below Join if the flattened fields is used in the next statement  (was: Order by fails with java.lang.String cannot be cast to org.apache.pig.data.DataBag)
Description:

A reproducible sample:
{code}
a = load '2.txt' as (b{t(a0:chararray,a1:int)});
b = foreach a generate flatten($0);
c = order b by $1 desc;
dump c;
{code}

2.txt
{code}
{(a,1),(b,2)}
{code}

Error message:
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.DataBag
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:479)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:197)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:332)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:233)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:228)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)

The problem is that the optimizer pushes the foreach below the order by, even though the order by uses a field generated by that foreach.

Original report:
Subject: Order by fails with java.lang.String cannot be cast to org.apache.pig.data.DataBag

The script loads data from BinStorage(), then flattens columns and then sorts on the second column in descending order.
The order by fails with a ClassCastException:
{code}
register loader.jar;
a = load 'c2' using BinStorage();
b = foreach a generate org.apache.pig.CCMLoader(*);
describe b;
c = foreach b generate flatten($0);
describe c;
d = order c by $1 desc;
dump d;
{code}

The sampling job fails with the following error:
===============================================================================================================
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.DataBag
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:407)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:188)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:329)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:232)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:227)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:52)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:159)
===============================================================================================================

The schemas of b, c and d are as follows:

b: {bag_of_tuples: {tuple: (uuid: chararray,velocity: double)}}
c: {bag_of_tuples::uuid: chararray,bag_of_tuples::velocity: double}
d: {bag_of_tuples::uuid: chararray,bag_of_tuples::velocity: double}

If we modify the script to order on the first column, it seems to work:
{code}
register loader.jar;
a = load 'c2' using BinStorage();
b = foreach a generate org.apache.pig.CCMLoader(*);
describe b;
c = foreach b generate flatten($0);
describe c;
d = order c by $0 desc;
dump d;
{code}

(gc639c60-4267-11df-9879-0800200c9a66,2.4227339503478493)
(ec639c60-4267-11df-9879-0800200c9a66,1.140175425099138)

A workaround is to add a projection before the ORDER:
{code}
register loader.jar;
a = load 'c2' using BinStorage();
b = foreach a generate org.apache.pig.CCMLoader(*);
describe b;
c = foreach b generate flatten($0);
describe c;
newc = foreach c generate $0 as uuid, $1 as velocity;
newd = order newc by velocity desc;
dump newd;
{code}

(gc639c60-4267-11df-9879-0800200c9a66,2.4227339503478493)
(ec639c60-4267-11df-9879-0800200c9a66,1.140175425099138)

The schema for the Loader is as follows:
{code}
public Schema outputSchema(Schema input) {
    try {
        // Fields of each tuple inside the bag: (uuid, velocity).
        List<Schema.FieldSchema> list = new ArrayList<Schema.FieldSchema>();
        list.add(new Schema.FieldSchema("uuid", DataType.CHARARRAY));
        list.add(new Schema.FieldSchema("velocity", DataType.DOUBLE));
        Schema tupleSchema = new Schema(list);
        Schema.FieldSchema tupleFs = new Schema.FieldSchema("tuple", tupleSchema, DataType.TUPLE);
        // Wrap the tuple in a bag named bag_of_tuples.
        Schema bagSchema = new Schema(tupleFs);
        bagSchema.setTwoLevelAccessRequired(true);
        Schema.FieldSchema bagFs = new Schema.FieldSchema("bag_of_tuples", bagSchema, DataType.BAG);
        return new Schema(bagFs);
    } catch (Exception e) {
        return null;
    }
}
{code}

> PushDownForeachFlatten shall not push ForEach below Join if the flattened fields is used in the next statement
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1374
>                 URL: https://issues.apache.org/jira/browse/PIG-1374
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0, 0.7.0
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>             Fix For: 0.7.0
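
The rule stated in the summary can be pictured as a simple guard: only move the flattening foreach past the following operator when that operator reads none of the columns the flatten produces. The sketch below is an illustration of that idea only. The class FlattenPushDownGuard, the method canPushBelow and the use of column positions are hypothetical and are not taken from Pig's optimizer code or from the patch for this issue.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; none of these names exist in Pig.
final class FlattenPushDownGuard {

    private FlattenPushDownGuard() {}

    /**
     * Returns true only when the next operator reads none of the columns
     * produced by the flatten, i.e. when reordering cannot change what
     * that operator sees.
     */
    static boolean canPushBelow(Set<Integer> flattenedColumns,
                                Set<Integer> columnsUsedByNextOperator) {
        // Collections.disjoint is true when the two sets share no element.
        return Collections.disjoint(flattenedColumns, columnsUsedByNextOperator);
    }

    public static void main(String[] args) {
        // b = foreach a generate flatten($0);  assumed to produce columns 0 and 1
        Set<Integer> flattened = new HashSet<>(Arrays.asList(0, 1));
        // c = order b by $1 desc;              the order by reads column 1
        Set<Integer> usedByOrder = new HashSet<>(Arrays.asList(1));
        System.out.println("push below order by $1: "
                + canPushBelow(flattened, usedByOrder));

        // Assumed counter-example: the flatten produces columns 1 and 2,
        // and the next operator only reads column 0.
        Set<Integer> flattenedLater = new HashSet<>(Arrays.asList(1, 2));
        Set<Integer> usedElsewhere = new HashSet<>(Arrays.asList(0));
        System.out.println("push below operator on $0: "
                + canPushBelow(flattenedLater, usedElsewhere));
    }
}
```

With the reproducible sample from the description, the order by reads $1, which the flatten produces, so a guard of this shape would keep the foreach above the order by.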