From: Nithin Asokan
Date: Tue, 06 Oct 2015 15:22:30 +0000
Subject: Re: SparkPipeline Aggregators on Avro format
To: user@crunch.apache.org

Thanks Josh, that makes sense. Logged
https://issues.apache.org/jira/browse/CRUNCH-568

On Mon, Oct 5, 2015 at 5:50 PM Josh Wills wrote:

> Hey Nithin,
>
> I'm assuming this is because there is the possibility for an Avro record
> to be null inside of this application, and the UniformHashPartitioner
> doesn't check for null records in its input, because that can't happen
> inside of the MR context. I'm trying to decide whether it's better to
> check for nullability inside of the Spark app or inside of
> UniformHashPartitioner, and I'm leaning a bit toward the Spark side
> right now...
>
> J
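For reference, a minimal sketch of the partitioner-side option Josh
describes: a null guard in the hash partitioner. This is a hypothetical
illustration, not the actual Crunch source or the fix that landed for
CRUNCH-568; the class name and method body here are assumptions.

    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical sketch of a null-tolerant uniform hash partitioner.
    // The MR runtime never hands the partitioner a null record, but the
    // Spark path can, so a guard like this is one of the two fixes
    // discussed in this thread.
    public class NullSafeUniformHashPartitioner extends Partitioner<Object, Object> {
      @Override
      public int getPartition(Object key, Object value, int numPartitions) {
        if (key == null) {
          // Route null records to a fixed partition instead of throwing an NPE.
          return 0;
        }
        // Mask the sign bit so a negative hashCode cannot yield a negative index.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

The alternative Josh leans toward would apply the same null check in
Crunch's Spark code path, before the partitioner is ever invoked, rather
than in the partitioner itself.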
>
> On Mon, Oct 5, 2015 at 2:19 PM, Nithin Asokan wrote:
>
>> I have a SparkPipeline that reads an Avro source and aggregates the
>> first 20 elements of a PCollection. I notice stages failing with a
>> NullPointerException when running the pipeline in yarn-client mode.
>>
>> Here is the example that I used:
>>
>> https://gist.github.com/nasokan/853ff80ce20ad7a78886
>>
>> Here is the stack trace I'm seeing in my driver logs:
>>
>> 15/10/05 16:02:33 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 0,
>> 123.domain.xyz): java.lang.NullPointerException
>>     at org.apache.crunch.impl.mr.run.UniformHashPartitioner.getPartition(UniformHashPartitioner.java:32)
>>     at org.apache.crunch.impl.spark.fn.PartitionedMapOutputFunction.call(PartitionedMapOutputFunction.java:62)
>>     at org.apache.crunch.impl.spark.fn.PartitionedMapOutputFunction.call(PartitionedMapOutputFunction.java:35)
>>     at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002)
>>     at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002)
>>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>     at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:366)
>>     at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
>>     at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>     at org.apache.spark.scheduler.Task.run(Task.scala:64)
>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>     at java.lang.Thread.run(Thread.java:745)
>>
>> I would also like to mention that I don't see these errors when running
>> over Text inputs; the SparkPipeline works as expected there. Could the
>> org.apache.crunch.impl.mr package seen in the stack trace be related to
>> the errors we are seeing? I can log a bug if needed.
>>
>> Thank you!
>> Nithin
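For readers without access to the gist, here is a minimal sketch of the
kind of pipeline Nithin describes, assuming a hypothetical input path and
a placeholder Avro-generated record class (MyRecord); the actual
reproduction is in the gist linked above.

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.fn.Aggregators;
    import org.apache.crunch.impl.spark.SparkPipeline;
    import org.apache.crunch.io.From;
    import org.apache.crunch.types.avro.Avros;

    public class FirstNFromAvro {
      public static void main(String[] args) {
        // "yarn-client" matches the mode the failure was reported in.
        Pipeline pipeline = new SparkPipeline("yarn-client", "first-n-avro");

        // Hypothetical path and Avro record class; substitute your own.
        PCollection<MyRecord> records = pipeline.read(
            From.avroFile("/tmp/input.avro", Avros.records(MyRecord.class)));

        // Aggregating the first 20 elements forces a shuffle, which is where
        // UniformHashPartitioner.getPartition hit the null record.
        PCollection<MyRecord> firstTwenty =
            records.aggregate(Aggregators.FIRST_N(20));

        for (MyRecord record : firstTwenty.materialize()) {
          System.out.println(record);
        }
        pipeline.done();
      }
    }

If any record in the Avro input is null, the shuffle behind aggregate()
partitions on a null key, which matches the getPartition frame at the top
of the trace above.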