spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Rosen <rosenvi...@gmail.com>
Subject Re: PySpark - Dill serialization
Date Fri, 06 Dec 2013 19:28:36 GMT
I tried replacing cloudpickle with Dill (
https://github.com/JoshRosen/incubator-spark/commit/2ac8986f3009f0dc133b11d16887fc8ddb33c3d1)
but I ran into a few issues.

It looks like Dill pickles function closures differently for functions
defined in doctests versus in module code / the shell, which breaks
PySpark's test suite; I opened an issue for this in the Dill repo:
https://github.com/uqfoundation/dill/issues/18.

Its closure cleaning may work differently than cloudpickle's because I've
also encountered some examples that also fail in the shell (accumulators).

Many simple cases, like PySpark's wordcount.py example, work fine, so I'm
hoping we'll be able to make the switch to Dill if those doctest issues are
resolved.


On Thu, Dec 5, 2013 at 8:15 PM, Matei Zaharia <matei.zaharia@gmail.com>wrote:

> Looks cool! Josh, if you replace CloudPickle with this, make sure to also
> update the LICENSE file, which is supposed to contain third-party licenses.
>
> Matei
>
> On Dec 5, 2013, at 8:02 PM, Josh Rosen <rosenville@gmail.com> wrote:
>
> > Thanks for the link!  I wasn't aware of Dill, but it looks like a nice
> > library.  I like that it's being actively developed:
> > https://github.com/uqfoundation/dill
> >
> > It also seems to work correctly for a few edge-cases that cloudpickle
> > didn't handle properly, such as serializing operator.itemgetter instances
> > (see https://spark-project.atlassian.net/browse/SPARK-791).
> >
> > I'll put together a pull request to replace CloudPickle with Dill.  Dill
> > uses a 3-clause BSD license, so we should be able to package it into an
> > .egg in the python/lib/ folder like we did for Py4J.  It will be
> > interesting to see whether the change has any performance impact,
> although
> > the recent custom serializers pull request should help with that since it
> > would let us use Dill for serializing functions and a faster serializer
> for
> > serializing data.
> >
> > - Josh
> >
> >
> >
> >
> > On Thu, Dec 5, 2013 at 4:49 AM, Nick Pentreath <nick.pentreath@gmail.com
> >wrote:
> >
> >> Hi devs
> >>
> >> I came across Dill (
> >> http://trac.mystic.cacr.caltech.edu/project/pathos/wiki/dill) for
> Python
> >> serialization. Was wondering if it may be a replacement to the
> cloudpickle
> >> stuff (and remove that piece of code that needs to be maintained within
> >> PySpark)?
> >>
> >> Josh have you looked into Dill? Any thoughts?
> >>
> >> N
> >>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message