From: Yadid Ayzenberg
Date: Sun, 03 Nov 2013 19:27:57 -0500
To: user@spark.incubator.apache.org
Subject: Re: java.io.NotSerializableException on RDD count() in Java

Hi Patrick,

I am in fact using Kryo, and I'm registering BSONObject.class (which is the class holding the data) in my KryoRegistrator. I'm not sure what other classes I should be registering.

Thanks,
Yadid

On 11/3/13 7:23 PM, Patrick Wendell wrote:
> The problem is you are referencing a class that does not "extend
> Serializable" in the data that you shuffle. Spark needs to send all
> shuffle data over the network, so it needs to know how to serialize
> them.
>
> One option is to use Kryo for network serialization as described here
> - you'll have to register all the classes that get serialized, though.
>
> http://spark.incubator.apache.org/docs/latest/tuning.html
>
> Another option is to write a wrapper class that "extends
> Externalizable" and write the serialization yourself.
>
> - Patrick
>
> On Sun, Nov 3, 2013 at 10:33 AM, Yadid Ayzenberg wrote:
>> Hi All,
>>
>> My original RDD contains arrays of doubles. When applying a count() operator
>> to the original RDD, I get the result as expected.
>> However, when I run a map on the original RDD in order to generate a new RDD
>> with only the first element of each array, and try to apply count() to the
>> newly generated RDD, I get the following exception:
>>
>> 19829 [run-main] INFO org.apache.spark.scheduler.DAGScheduler - Failed to
>> run count at AnalyticsEngine.java:133
>> [error] (run-main) org.apache.spark.SparkException: Job failed:
>> java.io.NotSerializableException: edu.mit.bsense.AnalyticsEngine
>> org.apache.spark.SparkException: Job failed:
>> java.io.NotSerializableException: edu.mit.bsense.AnalyticsEngine
>>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:760)
>>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:758)
>>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
>>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:758)
>>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:556)
>>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:503)
>>     at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:361)
>>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:441)
>>     at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:149)
>>
>> If I run a take() operation on the new RDD, I receive the results as
>> expected. Here is my code:
>>
>> JavaRDD<Double> rdd2 = rdd.flatMap(
>>         new FlatMapFunction<Tuple2<Object, BSONObject>, Double>() {
>>     @Override
>>     public Iterable<Double> call(Tuple2<Object, BSONObject> e) {
>>         BSONObject doc = e._2();
>>         List<List<Double>> vals = (List<List<Double>>) doc.get("data");
>>         List<Double> results = new ArrayList<Double>();
>>         for (int i = 0; i < vals.size(); i++)
>>             results.add((Double) vals.get(i).get(0));
>>         return results;
>>     }
>> });
>>
>> logger.info("Take: {}", rdd2.take(100));
>> logger.info("Count: {}", rdd2.count());
>>
>> Any ideas on what I am doing wrong?
>>
>> Thanks,
>>
>> Yadid
>>
>> --
>> Yadid Ayzenberg
>> Graduate Student and Research Assistant
>> Affective Computing
>> Phone: 617-866-7226
>> Room: E14-274G
>> MIT Media Lab
>> 75 Amherst St, Cambridge, MA 02139

--
Yadid Ayzenberg
Graduate Student and Research Assistant
Affective Computing
Phone: 617-866-7226
Room: E14-274G
MIT Media Lab
75 Amherst St, Cambridge, MA 02139
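A detail worth noting in the trace above: the exception names edu.mit.bsense.AnalyticsEngine rather than any of the data classes, which suggests the anonymous FlatMapFunction is capturing its enclosing instance. In Java, an anonymous inner class that reads a field of its enclosing object holds a hidden reference to that object, so serializing the function tries to serialize the whole enclosing class too. A minimal, Spark-free sketch of that mechanism (all class names here are hypothetical, not from the thread):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class CaptureDemo {

    interface Fn extends Serializable {
        double apply(double x);
    }

    // Stand-in for a non-serializable driver-side class such as AnalyticsEngine.
    static class Engine {
        private double scale = 2.0; // instance state, deliberately not a compile-time constant

        // Anonymous inner class: it reads `scale`, so the compiler stores a
        // hidden reference to the enclosing Engine, which is not Serializable.
        Fn innerFn() {
            return new Fn() {
                public double apply(double x) { return x * scale; }
            };
        }

        // Static nested class: carries no hidden reference to Engine.
        static class ScaleFn implements Fn {
            public double apply(double x) { return x * 2.0; }
        }
    }

    // Attempts Java serialization and reports the outcome.
    static String trySerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return "ok";
        } catch (NotSerializableException e) {
            return "NotSerializableException: " + e.getMessage();
        } catch (IOException e) {
            return e.toString();
        }
    }

    public static void main(String[] args) {
        Engine engine = new Engine();
        // Fails: serializing the function drags the enclosing Engine with it.
        System.out.println(trySerialize(engine.innerFn()));
        // Succeeds: the static nested class stands on its own.
        System.out.println(trySerialize(new Engine.ScaleFn()));
    }
}
```

Moving the function into a static nested (or top-level) class, so it no longer references the enclosing object, is one way to make this kind of exception go away.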
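Patrick's second option, a wrapper that implements java.io.Externalizable, could look like the following minimal sketch. The wrapper name and payload type are made up for illustration; the point is that you hand-write exactly which fields cross the wire:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

// Hypothetical wrapper that ships only a primitive payload, so serialization
// never touches a surrounding non-serializable object graph.
public class DoubleArrayWrapper implements Externalizable {
    private double[] values;

    public DoubleArrayWrapper() { }  // Externalizable requires a public no-arg constructor
    public DoubleArrayWrapper(double[] values) { this.values = values; }

    public double[] values() { return values; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(values.length);
        for (double v : values) out.writeDouble(v);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        values = new double[in.readInt()];
        for (int i = 0; i < values.length; i++) values[i] = in.readDouble();
    }

    // Round-trips the wrapper through Java serialization, roughly what a
    // shuffle does (demonstration helper, not part of the wrapper API).
    public static DoubleArrayWrapper roundTrip(DoubleArrayWrapper w) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            oos.writeObject(w);
            oos.flush();
            ObjectInputStream ois =
                new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
            return (DoubleArrayWrapper) ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        DoubleArrayWrapper back = roundTrip(new DoubleArrayWrapper(new double[] {1.5, 2.5}));
        System.out.println(back.values()[0] + " " + back.values()[1]);
    }
}
```

The no-arg constructor matters: Java creates the object with it first and only then calls readExternal() to fill in the fields.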