crunch-user mailing list archives

From Josh Wills <josh.wi...@gmail.com>
Subject Re: SparkPipeline Aggregators on Avro format
Date Mon, 05 Oct 2015 22:49:52 GMT
Hey Nithin,

I'm assuming this is because there is the possibility for an Avro record to
be null inside of this application, and the UniformHashPartitioner doesn't
check for null records in its input because that can't happen inside the MR
context. I'm trying to decide whether it's better to check for nullability
inside of the Spark app or inside of UniformHashPartitioner, and I'm
leaning a bit towards the Spark side right now...
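
For concreteness, the partitioner-side option would be a guard along these
lines (a rough sketch only, assuming a Hadoop-style getPartition signature,
not the actual Crunch source):

    // Sketch: send null keys to a fixed partition instead of calling
    // hashCode() on them, which is what throws the NPE in the trace below.
    public int getPartition(Object key, Object value, int numPartitions) {
      if (key == null) {
        return 0;
      }
      return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

The Spark-side alternative would be the same null check in the code path
that hands keys to the partitioner (PartitionedMapOutputFunction in the
stack trace below), which would leave the MR path untouched.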

J

On Mon, Oct 5, 2015 at 2:19 PM, Nithin Asokan <anithin19@gmail.com> wrote:

> I have a SparkPipeline that reads an Avro source and aggregates the first
> 20 elements of a PCollection. I'm seeing stages fail with a
> NullPointerException when running the pipeline in yarn-client mode.
>
> Here is the example that I used
>
> https://gist.github.com/nasokan/853ff80ce20ad7a78886
>
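> In short, the pipeline does something like this (simplified sketch with
> placeholder schema and paths; the gist has the complete version):
>
>     SparkPipeline pipeline = new SparkPipeline("yarn-client", "AvroFirstN");
>     Schema schema = new Schema.Parser().parse(schemaJson); // placeholder
>     PCollection<GenericData.Record> records =
>         pipeline.read(From.avroFile("/tmp/input", Avros.generics(schema)));
>     // Aggregators.FIRST_N(20) keeps the first 20 elements of the PCollection.
>     PCollection<GenericData.Record> first20 =
>         records.aggregate(Aggregators.FIRST_N(20));
>     pipeline.write(first20, To.avroFile("/tmp/output"));
>     pipeline.done();
>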
> Here is the stack trace I'm seeing on my driver logs.
>
> 15/10/05 16:02:33 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 0, 123.domain.xyz): java.lang.NullPointerException
>     at org.apache.crunch.impl.mr.run.UniformHashPartitioner.getPartition(UniformHashPartitioner.java:32)
>     at org.apache.crunch.impl.spark.fn.PartitionedMapOutputFunction.call(PartitionedMapOutputFunction.java:62)
>     at org.apache.crunch.impl.spark.fn.PartitionedMapOutputFunction.call(PartitionedMapOutputFunction.java:35)
>     at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002)
>     at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>     at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:366)
>     at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
>     at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>     at org.apache.spark.scheduler.Task.run(Task.scala:64)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
>
> I would also like to mention that I don't see these errors when running
> over Text inputs; the same SparkPipeline works as expected there. Could the
> MR package seen in the stack trace be related to the errors we are seeing?
> I can log a bug if needed.
>
> Thank you!
> Nithin
>
