spark-issues mailing list archives

From "Jonas Amrich (JIRA)" <>
Subject [jira] [Commented] (SPARK-22674) PySpark breaks serialization of namedtuple subclasses
Date Tue, 05 Dec 2017 08:25:00 GMT


Jonas Amrich commented on SPARK-22674:

No worries, I'm glad we're on the same page now. However, the fix I've suggested is just a
workaround, and I'm not 100% sure it won't break something else.
It would be nicer to have a solution that doesn't involve hijacking at all.
But, again, that would require larger design changes.
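For reference, the kind of guard under discussion can be sketched in plain Python. The names `_restore` and `_hack_namedtuple` are assumed stand-ins for what pyspark.serializers does; this is an illustrative sketch, not the actual Spark patch:

```python
import collections
import pickle

# Sketch of the workaround (assumed names, not the real pyspark.serializers
# code): apply the __reduce__ hack only to direct instances of the patched
# namedtuple class, and fall back to ordinary pickling for subclasses.

def _restore(name, fields, values):
    # Rebuild the namedtuple class by name on the deserializing side.
    return collections.namedtuple(name, fields)(*values)

def _hack_namedtuple(cls):
    name, fields = cls.__name__, cls._fields

    def __reduce__(self):
        if type(self) is not cls:
            # Subclass: reconstruct via its own class so its methods survive.
            # Assumes the subclass keeps the namedtuple constructor signature.
            return (type(self), tuple(self))
        return (_restore, (name, fields, tuple(self)))

    cls.__reduce__ = __reduce__
    return cls

Point = _hack_namedtuple(collections.namedtuple("Point", "x y"))

class PointSubclass(Point):
    def sum(self):
        return self.x + self.y

p = pickle.loads(pickle.dumps(PointSubclass(1, 1)))
print(type(p).__name__, p.sum())  # PointSubclass 2
```

With the `type(self) is not cls` check, only direct namedtuple instances take the rebuild-by-name path, so subclasses round-trip with their own type intact.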

> PySpark breaks serialization of namedtuple subclasses
> -----------------------------------------------------
>                 Key: SPARK-22674
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0
>            Reporter: Jonas Amrich
> PySpark monkey-patches the namedtuple class to make it serializable; however, this breaks
> serialization of its subclasses. With the current implementation, any subclass will be
> serialized (and deserialized) as its parent namedtuple. Consider this code, which will fail
> with {{AttributeError: 'Point' object has no attribute 'sum'}}:
> {code}
> from collections import namedtuple
> Point = namedtuple("Point", "x y")
> class PointSubclass(Point):
>     def sum(self):
>         return self.x + self.y
> rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]])
> rdd.collect()[0][0].sum()
> {code}
> Moreover, as PySpark hijacks all namedtuples in the main module, importing pyspark breaks
> serialization of namedtuple subclasses even in code that is not related to Spark or
> distributed execution. I don't see any clean solution to this; a possible workaround may be
> to limit the serialization hack to direct namedtuple subclasses, like in
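The reported behaviour can be reproduced without Spark, using plain pickle. The `_restore` / `_hack_namedtuple` names below are an assumed simplification of the pyspark.serializers hijack, kept only to show the mechanism:

```python
import collections
import pickle

# Minimal stand-in for PySpark's namedtuple hijack (assumed simplification;
# names are illustrative). The patched __reduce__ always rebuilds the *parent*
# namedtuple, so subclass information is lost on the round trip.

def _restore(name, fields, values):
    # Deserialization helper: recreate the namedtuple class by name.
    return collections.namedtuple(name, fields)(*values)

def _hack_namedtuple(cls):
    name, fields = cls.__name__, cls._fields

    def __reduce__(self):
        # Bug: subclasses inherit this and are reduced to the parent class.
        return (_restore, (name, fields, tuple(self)))

    cls.__reduce__ = __reduce__
    return cls

Point = _hack_namedtuple(collections.namedtuple("Point", "x y"))

class PointSubclass(Point):
    def sum(self):
        return self.x + self.y

p = pickle.loads(pickle.dumps(PointSubclass(1, 1)))
print(type(p).__name__)       # Point -- the subclass type is gone
print(hasattr(p, "sum"))      # False -> the AttributeError from the report
```

Because `__reduce__` is defined on the parent and captures the parent's name and fields, every subclass instance deserializes as a bare `Point`, which is exactly the failure shown in the {{code}} block above.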

This message was sent by Atlassian JIRA

