spark-issues mailing list archives

From "Jonas Amrich (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-22674) PySpark breaks serialization of namedtuple subclasses
Date Wed, 14 Mar 2018 16:33:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-22674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398842#comment-16398842
] 

Jonas Amrich commented on SPARK-22674:
--------------------------------------

[~lebedev] I completely agree; removing it is IMHO the cleanest solution.

I think namedtuples were originally used to represent structured data in RDDs - see
-SPARK-2010- and -SPARK-1687-. This mechanism was later replaced by the Row class in the
DataFrame API, and it seems to me there is no use of it in PySpark internals. However, removing
it would still be a backward-compatibility break.
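For context, the hijack works roughly as follows. This is a minimal sketch of the mechanism only, modeled on the spirit of the namedtuple patch in pyspark/serializers.py - the helper names {{_restore}} and {{_hack_namedtuple}} here are illustrative, not Spark's exact code:

```python
# Sketch (assumption: mirrors the idea of PySpark's namedtuple hijack,
# not its actual implementation). The patched __reduce__ pickles every
# instance as a plain namedtuple, which is why subclasses break.
import pickle
from collections import namedtuple


def _restore(name, fields, values):
    # Rebuild a *plain* namedtuple class and instantiate it.
    # Any subclass identity (and its methods) is lost at this point.
    return namedtuple(name, fields)(*values)


def _hack_namedtuple(cls):
    name, fields = cls.__name__, cls._fields

    def __reduce__(self):
        return (_restore, (name, fields, tuple(self)))

    cls.__reduce__ = __reduce__
    return cls


Point = _hack_namedtuple(namedtuple("Point", "x y"))


class PointSubclass(Point):
    def total(self):
        return self.x + self.y


# The subclass inherits Point's patched __reduce__, so it round-trips
# through pickle as a bare Point and loses the total() method.
p = pickle.loads(pickle.dumps(PointSubclass(1, 2)))
print(type(p).__name__)     # Point
print(hasattr(p, "total"))  # False
```

This also illustrates why restricting the patch to classes created directly by {{namedtuple(...)}} (as in the workaround commit) would leave user-defined subclasses with their normal pickle behavior.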

> PySpark breaks serialization of namedtuple subclasses
> -----------------------------------------------------
>
>                 Key: SPARK-22674
>                 URL: https://issues.apache.org/jira/browse/SPARK-22674
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0
>            Reporter: Jonas Amrich
>            Priority: Major
>
> PySpark monkey-patches the namedtuple class to make it serializable; however, this breaks
serialization of its subclasses. With the current implementation, any subclass will be serialized
(and deserialized) as its parent namedtuple. Consider this code, which fails with {{AttributeError:
'Point' object has no attribute 'sum'}}:
> {code}
> from collections import namedtuple
> Point = namedtuple("Point", "x y")
> class PointSubclass(Point):
>     def sum(self):
>         return self.x + self.y
> rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]])
> rdd.collect()[0][0].sum()
> {code}
> Moreover, as PySpark hijacks all namedtuples in the main module, importing pyspark breaks
serialization of namedtuple subclasses even in code that is not related to Spark / distributed
execution. I don't see any clean solution to this; a possible workaround may be to limit the serialization
hack to direct namedtuple subclasses, as in https://github.com/JonasAmrich/spark/commit/f3efecee28243380ecf6657fe54e1a165c1b7204



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


