From: Yadid Ayzenberg
Date: Sun, 03 Nov 2013 19:27:57 -0500
To: user@spark.incubator.apache.org
Subject: Re: java.io.NotSerializableException on RDD count() in Java

Hi Patrick,

I am in fact using Kryo, and I'm registering BSONObject.class (which is the class holding the data) in my KryoRegistrator. I'm not sure what other classes I should be registering.

Thanks,
Yadid

On 11/3/13 7:23 PM, Patrick Wendell wrote:
> The problem is you are referencing a class that does not "extend
> Serializable" in the data that you shuffle. Spark needs to send all
> shuffle data over the network, so it needs to know how to serialize
> them.
>
> One option is to use Kryo for network serialization as described here
> - you'll have to register all the classes that get serialized, though.
>
> http://spark.incubator.apache.org/docs/latest/tuning.html
>
> Another option is to write a wrapper class that "extends
> Externalizable" and write the serialization yourself.
>
> - Patrick
>
> On Sun, Nov 3, 2013 at 10:33 AM, Yadid Ayzenberg wrote:
>> Hi All,
>>
>> My original RDD contains arrays of doubles. When applying a count() operator
>> to the original RDD, I get the result as expected.
>> However, when I run a map on the original RDD in order to generate a new RDD
>> with only the first element of each array, and try to apply count() to the
>> newly generated RDD, I get the following exception:
>>
>> 19829 [run-main] INFO org.apache.spark.scheduler.DAGScheduler - Failed to
>> run count at AnalyticsEngine.java:133
>> [error] (run-main) org.apache.spark.SparkException: Job failed:
>> java.io.NotSerializableException: edu.mit.bsense.AnalyticsEngine
>> org.apache.spark.SparkException: Job failed:
>> java.io.NotSerializableException: edu.mit.bsense.AnalyticsEngine
>>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:760)
>>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:758)
>>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
>>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:758)
>>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:556)
>>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:503)
>>     at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:361)
>>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:441)
>>     at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:149)
>>
>> If I run a take() operation on the new RDD, I receive the results as
>> expected. Here is my code:
>>
>> JavaRDD<Double> rdd2 = rdd.flatMap(
>>         new FlatMapFunction<Tuple2<Object, BSONObject>, Double>() {
>>     @Override
>>     public Iterable<Double> call(Tuple2<Object, BSONObject> e) {
>>         BSONObject doc = e._2();
>>         List<List<Double>> vals = (List<List<Double>>) doc.get("data");
>>         List<Double> results = new ArrayList<Double>();
>>         for (int i = 0; i < vals.size(); i++)
>>             results.add((Double) vals.get(i).get(0));
>>         return results;
>>     }
>> });
>>
>> logger.info("Take: {}", rdd2.take(100));
>> logger.info("Count: {}", rdd2.count());
>>
>> Any ideas on what I am doing wrong?
>>
>> Thanks,
>>
>> Yadid
>>
>> --
>> Yadid Ayzenberg
>> Graduate Student and Research Assistant
>> Affective Computing
>> Phone: 617-866-7226
>> Room: E14-274G
>> MIT Media Lab
>> 75 Amherst St, Cambridge, MA 02139

--
Yadid Ayzenberg
Graduate Student and Research Assistant
Affective Computing
Phone: 617-866-7226
Room: E14-274G
MIT Media Lab
75 Amherst St, Cambridge, MA 02139
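A detail worth noting in the trace above: the exception names edu.mit.bsense.AnalyticsEngine rather than any of the data classes, which suggests the anonymous FlatMapFunction is capturing its enclosing instance. In Java, an anonymous inner class that reads a field of its enclosing object holds a hidden reference to that object, so serializing the function tries to serialize the whole enclosing class too. A minimal, Spark-free sketch of that mechanism (all class names here are hypothetical, not from the thread):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class CaptureDemo {

    interface Fn extends Serializable {
        double apply(double x);
    }

    // Stand-in for a non-serializable driver-side class such as AnalyticsEngine.
    static class Engine {
        private double scale = 2.0; // instance state, deliberately not a compile-time constant

        // Anonymous inner class: it reads `scale`, so the compiler stores a
        // hidden reference to the enclosing Engine, which is not Serializable.
        Fn innerFn() {
            return new Fn() {
                public double apply(double x) { return x * scale; }
            };
        }

        // Static nested class: carries no hidden reference to Engine.
        static class ScaleFn implements Fn {
            public double apply(double x) { return x * 2.0; }
        }
    }

    // Attempts Java serialization and reports the outcome.
    static String trySerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return "ok";
        } catch (NotSerializableException e) {
            return "NotSerializableException: " + e.getMessage();
        } catch (IOException e) {
            return e.toString();
        }
    }

    public static void main(String[] args) {
        Engine engine = new Engine();
        // Fails: serializing the function drags the enclosing Engine with it.
        System.out.println(trySerialize(engine.innerFn()));
        // Succeeds: the static nested class stands on its own.
        System.out.println(trySerialize(new Engine.ScaleFn()));
    }
}
```

Moving the function into a static nested (or top-level) class, so it no longer references the enclosing object, is one way to make this kind of exception go away.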
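Patrick's second option, a wrapper that implements java.io.Externalizable, could look like the following minimal sketch. The wrapper name and payload type are made up for illustration; the point is that you hand-write exactly which fields cross the wire:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

// Hypothetical wrapper that ships only a primitive payload, so serialization
// never touches a surrounding non-serializable object graph.
public class DoubleArrayWrapper implements Externalizable {
    private double[] values;

    public DoubleArrayWrapper() { }  // Externalizable requires a public no-arg constructor
    public DoubleArrayWrapper(double[] values) { this.values = values; }

    public double[] values() { return values; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(values.length);
        for (double v : values) out.writeDouble(v);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        values = new double[in.readInt()];
        for (int i = 0; i < values.length; i++) values[i] = in.readDouble();
    }

    // Round-trips the wrapper through Java serialization, roughly what a
    // shuffle does (demonstration helper, not part of the wrapper API).
    public static DoubleArrayWrapper roundTrip(DoubleArrayWrapper w) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            oos.writeObject(w);
            oos.flush();
            ObjectInputStream ois =
                new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
            return (DoubleArrayWrapper) ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        DoubleArrayWrapper back = roundTrip(new DoubleArrayWrapper(new double[] {1.5, 2.5}));
        System.out.println(back.values()[0] + " " + back.values()[1]);
    }
}
```

The no-arg constructor matters: Java creates the object with it first and only then calls readExternal() to fill in the fields.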