From: Nithin Asokan
Date: Tue, 06 Oct 2015 15:22:30 +0000
Subject: Re: SparkPipeline Aggregators on Avro format
To: user@crunch.apache.org

Thanks Josh, that makes sense. Logged
https://issues.apache.org/jira/browse/CRUNCH-568

On Mon, Oct 5, 2015 at 5:50 PM Josh Wills wrote:

> Hey Nithin,
>
> I'm assuming this is because there is the possibility for an Avro record
> to be null inside of this application, and the UniformHashPartitioner
> doesn't check for null records in its input, because that can't happen
> inside of the MR context. I'm trying to decide whether it's better to
> check for nullability inside of the Spark app or inside of
> UniformHashPartitioner, and I'm leaning a bit toward the Spark side
> right now...
>
> J
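For reference, a minimal sketch of the partitioner-side option Josh
describes: a null guard in the hash partitioner. This is a hypothetical
illustration, not the actual Crunch source or the fix that landed for
CRUNCH-568; the class name and method body here are assumptions.

    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical sketch of a null-tolerant uniform hash partitioner.
    // The MR runtime never hands the partitioner a null record, but the
    // Spark path can, so a guard like this is one of the two fixes
    // discussed in this thread.
    public class NullSafeUniformHashPartitioner extends Partitioner<Object, Object> {
      @Override
      public int getPartition(Object key, Object value, int numPartitions) {
        if (key == null) {
          // Route null records to a fixed partition instead of throwing an NPE.
          return 0;
        }
        // Mask the sign bit so a negative hashCode cannot yield a negative index.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

The alternative Josh leans toward would apply the same null check in
Crunch's Spark code path, before the partitioner is ever invoked, rather
than in the partitioner itself.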
>
> On Mon, Oct 5, 2015 at 2:19 PM, Nithin Asokan wrote:
>
>> I have a SparkPipeline that reads an Avro source and aggregates the
>> first 20 elements of a PCollection. I notice stages failing with a
>> NullPointerException when running the pipeline in yarn-client mode.
>>
>> Here is the example that I used:
>>
>> https://gist.github.com/nasokan/853ff80ce20ad7a78886
>>
>> Here is the stack trace I'm seeing in my driver logs:
>>
>> 15/10/05 16:02:33 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 0,
>> 123.domain.xyz): java.lang.NullPointerException
>>     at org.apache.crunch.impl.mr.run.UniformHashPartitioner.getPartition(UniformHashPartitioner.java:32)
>>     at org.apache.crunch.impl.spark.fn.PartitionedMapOutputFunction.call(PartitionedMapOutputFunction.java:62)
>>     at org.apache.crunch.impl.spark.fn.PartitionedMapOutputFunction.call(PartitionedMapOutputFunction.java:35)
>>     at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002)
>>     at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002)
>>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>     at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:366)
>>     at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
>>     at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>     at org.apache.spark.scheduler.Task.run(Task.scala:64)
>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>     at java.lang.Thread.run(Thread.java:745)
>>
>> I would also like to mention that I don't see these errors when running
>> over Text inputs; the SparkPipeline works as expected there. Could the
>> org.apache.crunch.impl.mr package seen in the stack trace be related to
>> the errors we are seeing? I can log a bug if needed.
>>
>> Thank you!
>> Nithin
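For readers without access to the gist, here is a minimal sketch of the
kind of pipeline Nithin describes, assuming a hypothetical input path and
a placeholder Avro-generated record class (MyRecord); the actual
reproduction is in the gist linked above.

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.fn.Aggregators;
    import org.apache.crunch.impl.spark.SparkPipeline;
    import org.apache.crunch.io.From;
    import org.apache.crunch.types.avro.Avros;

    public class FirstNFromAvro {
      public static void main(String[] args) {
        // "yarn-client" matches the mode the failure was reported in.
        Pipeline pipeline = new SparkPipeline("yarn-client", "first-n-avro");

        // Hypothetical path and Avro record class; substitute your own.
        PCollection<MyRecord> records = pipeline.read(
            From.avroFile("/tmp/input.avro", Avros.records(MyRecord.class)));

        // Aggregating the first 20 elements forces a shuffle, which is where
        // UniformHashPartitioner.getPartition hit the null record.
        PCollection<MyRecord> firstTwenty =
            records.aggregate(Aggregators.FIRST_N(20));

        for (MyRecord record : firstTwenty.materialize()) {
          System.out.println(record);
        }
        pipeline.done();
      }
    }

If any record in the Avro input is null, the shuffle behind aggregate()
partitions on a null key, which matches the getPartition frame at the top
of the trace above.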