spark-user mailing list archives

From: Andrew Or <and...@databricks.com>
Subject: Re: Spark serialization in closure
Date: Thu, 09 Jul 2015 20:58:40 GMT
Hi Chen,

I believe the issue is that `object foo` is a member of `object testing`,
so the only way to access `object foo` is to first pull `object testing`
into the closure, then access a pointer to get to `object foo`. There are
two workarounds that I'm aware of:

(1) Move `object foo` outside of `object testing`. This is only a problem
because the objects are nested. A flatter design is also simpler to reason
about, but that's a separate discussion.

(2) Create a local variable for `foo.v`. If all your closure cares about is
the integer, it makes sense to add a `val v = foo.v` inside `func` and use
that in your closure instead (see the sketch below). This keeps `$outer`
pointers out of your closure entirely, since the closure then references
only local variables.
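
For example, here is a quick sketch of (2) applied to your failing snippet
(untested, but it shows the shape of the fix):

object testing {
    // foo doesn't even need to be Serializable anymore, since it is
    // never shipped to the executors.
    object foo {
      val v = 42
    }
    val list = List(1,2,3)
    val rdd = sc.parallelize(list)
    def func = {
      // Copy the value into a local val on the driver. The closure
      // below captures only this Int, so no $outer pointer to
      // `testing` gets pulled in.
      val v = foo.v
      val after = rdd.foreachPartition {
        it => println(v)
      }
    }
  }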

As others have commented, I think this is more of a Scala problem than a
Spark one.
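
To see that there is nothing Spark-specific going on, here is a minimal
Spark-free sketch (names are mine; in the spark-shell your `object testing`
is itself nested inside the non-serializable `$iwC` REPL wrappers you see in
the stack trace, so a plain class plays that role here):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

class Wrapper {
    // Not Serializable, standing in for the REPL's $iwC wrappers.
    object foo extends Serializable {
      val v = 42
    }
    // Referencing `foo` forces this function to capture Wrapper's
    // `this` as its $outer, even though foo itself is serializable.
    val f: Int => Unit = i => println(foo.v + i)
  }

object Repro extends App {
    val oos = new ObjectOutputStream(new ByteArrayOutputStream)
    // Throws java.io.NotSerializableException: Wrapper
    oos.writeObject(new Wrapper().f)
  }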

Let me know if these work,
-Andrew

2015-07-09 13:36 GMT-07:00 Richard Marscher <rmarscher@localytics.com>:

> Reading that article and applying it to your observations of what happens
> at runtime:
>
> shouldn't the closure require serializing testing? The foo singleton
> object is a member of testing, and you reference that foo value in the
> closure func and further in the foreachPartition closure. So, following
> that article, Scala will attempt to serialize the containing object/class
> testing in order to reach the foo instance.
>
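> Concretely, the closure desugars to roughly this (a sketch, not literal
> scalac output):
>
>   // A nested class with an $outer field pointing back at testing;
>   // that field is what drags testing into the serialization graph.
>   class anonfun(val $outer: testing.type)
>       extends (Iterator[Int] => Unit) with Serializable {
>     def apply(it: Iterator[Int]): Unit = println($outer.foo.v)
>   }
>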
> On Thu, Jul 9, 2015 at 4:11 PM, Chen Song <chen.song.82@gmail.com> wrote:
>
>> Reposting the code example:
>>
>> object testing extends Serializable {
>>     object foo {
>>       val v = 42
>>     }
>>     val list = List(1,2,3)
>>     val rdd = sc.parallelize(list)
>>     def func = {
>>       val after = rdd.foreachPartition {
>>         it => println(foo.v)
>>       }
>>     }
>>   }
>>
>> On Thu, Jul 9, 2015 at 4:09 PM, Chen Song <chen.song.82@gmail.com> wrote:
>>
>>> Thanks Erik. I saw the document too. That is why I am confused: per the
>>> article, it should be fine as long as *foo* is serializable. However,
>>> what I have seen is that it works if *testing* is serializable, even
>>> when foo is not, as shown below. I don't know if there is something
>>> specific to Spark.
>>>
>>> For example, the code example below works.
>>>
>>> object testing extends Serializable {
>>>     object foo {
>>>       val v = 42
>>>     }
>>>     val list = List(1,2,3)
>>>     val rdd = sc.parallelize(list)
>>>     def func = {
>>>       val after = rdd.foreachPartition {
>>>         it => println(foo.v)
>>>       }
>>>     }
>>>   }
>>>
>>> On Thu, Jul 9, 2015 at 12:11 PM, Erik Erlandson <eje@redhat.com> wrote:
>>>
>>>> I think you have stumbled across this idiosyncrasy:
>>>>
>>>> http://erikerlandson.github.io/blog/2015/03/31/hygienic-closures-for-scala-function-serialization/
>>>>
>>>> ----- Original Message -----
>>>> > I am not sure whether this is more of a question for Spark or just
>>>> > Scala, but I am posting my question here.
>>>> >
>>>> > The code snippet below shows an example of passing a reference to a
>>>> > closure in the rdd.foreachPartition method.
>>>> >
>>>> > ```
>>>> > object testing {
>>>> >     object foo extends Serializable {
>>>> >       val v = 42
>>>> >     }
>>>> >     val list = List(1,2,3)
>>>> >     val rdd = sc.parallelize(list)
>>>> >     def func = {
>>>> >       val after = rdd.foreachPartition {
>>>> >         it => println(foo.v)
>>>> >       }
>>>> >     }
>>>> >   }
>>>> > ```
>>>> > When running this code, I got an exception
>>>> >
>>>> > ```
>>>> > Caused by: java.io.NotSerializableException:
>>>> > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$
>>>> > Serialization stack:
>>>> > - object not serializable (class:
>>>> > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$, value:
>>>> > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$@10b7e824)
>>>> > - field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$$anonfun$1,
>>>> >   name: $outer, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$)
>>>> > - object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$$anonfun$1,
>>>> >   <function1>)
>>>> > ```
>>>> >
>>>> > It looks like Spark needs to serialize the `testing` object. Why is it
>>>> > serializing testing even though I only reference foo (another
>>>> > serializable object) in the closure?
>>>> >
>>>> > A more general question: how can I prevent Spark from serializing the
>>>> > parent class where the RDD is defined, while still supporting passing
>>>> > in functions defined in other classes?
>>>> >
>>>> > --
>>>> > Chen Song
>>>> >
>>>>
>>>
>>>
>>>
>>> --
>>> Chen Song
>>>
>>>
>>
>>
>> --
>> Chen Song
>>
>>
>
>
> --
> *Richard Marscher*
> Software Engineer
> Localytics
> Localytics.com <http://localytics.com/> | Our Blog
> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> |
> Facebook <http://facebook.com/localytics> | LinkedIn
> <http://www.linkedin.com/company/1148792?trk=tyah>
>
