spark-dev mailing list archives

From "Franklyn D'souza" <franklyn.dso...@shopify.com>
Subject Operations on DataFrames with User Defined Types in pyspark
Date Thu, 11 Feb 2016 21:42:03 GMT
I'm using the UDT API to work with a custom Money datatype in DataFrames.
Here's how I have it set up:

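from pyspark.sql.types import UserDefinedType, StringType


class StringUDT(UserDefinedType):
    """Base UDT that stores the value as a string column."""

    @classmethod
    def sqlType(cls):
        return StringType()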

    @classmethod
    def module(cls):
        return cls.__module__

    @classmethod
    def scalaUDT(cls):
        return ''

    def serialize(self, obj):
        return str(obj)

    def deserialize(self, datum):
        return Money(datum)


class MoneyUDT(StringUDT):
    pass

Money.__UDT__ = MoneyUDT()
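
The Money class itself is not shown above. As a rough sketch only, assuming it wraps a Decimal and that str() yields something its constructor accepts back (both are assumptions; the real class may differ), the round trip that serialize and deserialize rely on would look something like this, with no Spark involved:

```python
from decimal import Decimal


class Money(object):
    """Hypothetical stand-in for the Money type; not the real class."""

    def __init__(self, amount):
        # Accept a string such as "25.0" and keep it as an exact Decimal.
        self.amount = Decimal(str(amount))

    def __str__(self):
        return str(self.amount)

    def __repr__(self):
        return "Money(%r)" % str(self.amount)

    def __eq__(self, other):
        return isinstance(other, Money) and self.amount == other.amount

    def __hash__(self):
        return hash(self.amount)


def serialize(obj):
    # Same logic as StringUDT.serialize: store the value as a string.
    return str(obj)


def deserialize(datum):
    # Same logic as StringUDT.deserialize: rebuild Money from the string.
    return Money(datum)


original = Money("25.0")
stored = serialize(original)    # what would be written to the string column
restored = deserialize(stored)  # what would be handed back to Python
```

If this round trip holds for the real Money class, the serialization side of the UDT is probably not the source of the problems described below.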

I then create a DataFrame like so:

df = sc.sql.createDataFrame([[Money("25.0")], [Money("100.0")]], spark_schema)

However, I've run into a few snags with this. DataFrames created using this
UDT cannot be ordered by the UDT column (orderBy), and I can't union two
DataFrames that have this UDT in one of their columns.

Is this expected behaviour, or is my UDT setup wrong?
