pig-user mailing list archives

From Bill Graham <billgra...@gmail.com>
Subject Pig ClassCastException trying to cast to org.apache.pig.data.DataBag
Date Thu, 25 Jun 2009 16:28:00 GMT
Hello Pig fans,

I've implemented a collaborative filtering job in Pig using CROSS and
FOREACH with a UDF. It works great until my dataset grows to a certain size,
at which point I start to get Pig ClassCastExceptions in the logs. I know
that CROSS can be expensive and difficult to scale, but it's strange to me
that when things fall over, it's due to a Pig ClassCastException. Any
insights into why this is happening, or how I should go about
troubleshooting it?

Here's the script:

userAssets1 = LOAD 'sample_data/userAssets' AS (user:bytearray,
    userAssetRatings: bag {T: tuple(user:bytearray, asset:chararray, rating:double)});
userAssets2 = LOAD 'sample_data/userAssets' AS (user:bytearray,
    userAssetRatings: bag {T: tuple(user:bytearray, asset:chararray, rating:double)});
X = CROSS userAssets1, userAssets2 PARALLEL 20;
userToUserFilter = FILTER X BY userAssets1::user != userAssets2::user;
REGISTER pearson.jar;
dist = FOREACH userToUserFilter GENERATE userAssets1::user, userAssets2::user,
    cnwk.grahamb.pig.PEARSON(userAssets1::userAssetRatings, userAssets2::userAssetRatings);
similarUsers = FILTER dist BY ($2 != 0.0);
STORE similarUsers INTO 'sample_data/userSimilarityPearson';
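
For reference, the UDF is presumably a standard EvalFunc<Double> that takes the
two rating bags as its arguments. A minimal sketch of its shape follows (the
actual pearson.jar source isn't included in this message, so the names and
internals below are assumptions, and the correlation math is omitted):

// Minimal sketch of the UDF shape; pearson.jar's real internals are assumed.
package cnwk.grahamb.pig;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class PEARSON extends EvalFunc<Double> {
    @Override
    public Double exec(Tuple input) throws IOException {
        if (input == null || input.size() != 2) {
            return null;
        }
        // Per the stack trace below, the failing DataBag cast happens inside
        // Pig's POProject.processInputBag before exec() is ever invoked, so
        // these arguments normally arrive here already typed as DataBags.
        DataBag ratings1 = (DataBag) input.get(0);
        DataBag ratings2 = (DataBag) input.get(1);
        // Pearson correlation over the two rating bags; the math is omitted
        // since it isn't relevant to the cast failure.
        return computePearson(ratings1, ratings2);
    }

    // Hypothetical helper; the real implementation lives in pearson.jar.
    private Double computePearson(DataBag b1, DataBag b2) {
        return 0.0; // placeholder
    }
}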

Once the number of userAssets values grows to about 28K, the map phase
succeeds, but the reduce fails at around 60% complete. There are 558K
input records for the reducer in this case. The exceptions look like this:

2009-06-24 11:34:47,854 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - Error message from task (reduce) task_200906171500_0141_r_000012
java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to org.apache.pig.data.DataBag
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:368)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:171)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:129)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:181)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:235)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:262)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:197)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:226)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:280)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:247)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:216)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:136)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2210)

thanks,
Bill
