spark-issues mailing list archives

From "Fangshi Li (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple
Date Sat, 12 May 2018 05:07:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Fangshi Li updated SPARK-24256:
-------------------------------
    Description: 
Right now, ExpressionEncoder supports ser/de of primitive types, as well as Scala case classes,
tuples and Java bean classes. Spark's Dataset natively supports these types, but we find it
is not flexible enough for other user-defined types and encoders.

For example, spark-avro has an AvroEncoder for ser/de of Avro types in Dataset. Although we can
use AvroEncoder to define a Dataset whose type is an Avro GenericRecord or SpecificRecord, using
such an Avro-typed Dataset has many limitations:
 1. We cannot use joinWith on this Dataset, since the result is a tuple and Avro types cannot
be fields of this tuple.
 2. We cannot use some type-safe aggregation methods on this Dataset, such as KeyValueGroupedDataset's
reduceGroups, since the result is also a tuple.
 3. We cannot augment an Avro SpecificRecord with additional primitive fields in a case class,
which we find is a very common use case.

These limitations exist because Spark does not support defining a Scala case class/tuple whose
subfields are of any other user-defined type: ExpressionEncoder does not discover the implicit
Encoder for the user-defined field types, and thus cannot use any Encoder to ser/de the
user-defined fields in the case class/tuple.

To address this issue, we propose a trait as a contract (between ExpressionEncoder and any
other user-defined Encoder) that enables the ExpressionEncoder of a case class/tuple/Java bean
to discover the serializer/deserializer/schema from the Encoder of the user-defined type.
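
As a rough illustration of the contract idea described above, here is a minimal, Spark-free
sketch in Java. It is not the patch: FieldEncoderContract, PairEncoder, Temperature and the
string-based "schema" and "row" are all hypothetical stand-ins, chosen only to show how a
composite (tuple-like) encoder could obtain schema, serializer and deserializer from the
encoders of its fields instead of having to understand the field types itself.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Spark-free model of the proposed contract. Every name here is hypothetical;
// none comes from the patch or from Spark's API.
class EncoderContractSketch {

    // The contract: what a composite encoder needs from a field's encoder.
    interface FieldEncoderContract<T> {
        String schema();                     // stand-in for a real schema
        Function<T, String> serializer();    // stand-in for a serializer
        Function<String, T> deserializer();  // stand-in for a deserializer
    }

    // A user-defined type that a built-in encoder would not know about.
    static final class Temperature {
        final double celsius;
        Temperature(double celsius) { this.celsius = celsius; }
    }

    // The user-defined encoder exposes its pieces through the contract.
    static final FieldEncoderContract<Temperature> TEMPERATURE =
        new FieldEncoderContract<Temperature>() {
            public String schema() { return "double"; }
            public Function<Temperature, String> serializer() {
                return t -> Double.toString(t.celsius);
            }
            public Function<String, Temperature> deserializer() {
                return s -> new Temperature(Double.parseDouble(s));
            }
        };

    // An encoder for a primitive-like field, exposed the same way.
    static final FieldEncoderContract<Integer> INT =
        new FieldEncoderContract<Integer>() {
            public String schema() { return "int"; }
            public Function<Integer, String> serializer() { return i -> Integer.toString(i); }
            public Function<String, Integer> deserializer() { return Integer::parseInt; }
        };

    // Tuple-like composite encoder: it only talks to the contract, so any
    // field type with a contract-implementing encoder can be a field.
    static final class PairEncoder<A, B> {
        private final FieldEncoderContract<A> first;
        private final FieldEncoderContract<B> second;

        PairEncoder(FieldEncoderContract<A> first, FieldEncoderContract<B> second) {
            this.first = first;
            this.second = second;
        }

        String schema() {
            return "struct<_1:" + first.schema() + ",_2:" + second.schema() + ">";
        }

        List<String> toRow(A a, B b) {
            return Arrays.asList(first.serializer().apply(a), second.serializer().apply(b));
        }

        A firstFromRow(List<String> row) { return first.deserializer().apply(row.get(0)); }

        B secondFromRow(List<String> row) { return second.deserializer().apply(row.get(1)); }
    }

    public static void main(String[] args) {
        PairEncoder<Temperature, Integer> enc = new PairEncoder<>(TEMPERATURE, INT);
        List<String> row = enc.toRow(new Temperature(21.5), 7);
        System.out.println(enc.schema());                  // struct<_1:double,_2:int>
        System.out.println(row);                           // [21.5, 7]
        System.out.println(enc.firstFromRow(row).celsius); // 21.5
    }
}
```

The point of the design is that PairEncoder never inspects Temperature directly; supporting a
new field type only requires an encoder that implements the contract.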

With this proposed patch and our minor modification to AvroEncoder, we remove these limitations
by setting the cluster-default conf
spark.expressionencoder.org.apache.avro.specific.SpecificRecord = com.databricks.spark.avro.AvroEncoder$
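
The conf key above appears to pair a base type (the suffix after spark.expressionencoder.)
with the encoder object to use for it. How the patch resolves that mapping is not described
here, so the following Java sketch is only one plausible reading: it assumes the encoder is
chosen by checking whether the configured type is a supertype of the field's type.
EncoderConfLookupSketch, encoderFor and com.example.NumberEncoder$ are hypothetical names,
and java.lang.Number stands in for SpecificRecord so the example runs without Avro on the
classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical model of a conf-driven encoder lookup; not taken from the patch.
class EncoderConfLookupSketch {

    static final String PREFIX = "spark.expressionencoder.";

    // Returns the encoder name of the first conf entry whose type is the
    // same as, or a supertype of, fieldType (an assumed matching rule).
    static Optional<String> encoderFor(Map<String, String> conf, Class<?> fieldType) {
        for (Map.Entry<String, String> entry : conf.entrySet()) {
            if (!entry.getKey().startsWith(PREFIX)) {
                continue; // unrelated conf entry
            }
            String typeName = entry.getKey().substring(PREFIX.length());
            try {
                Class<?> configuredType = Class.forName(typeName);
                if (configuredType.isAssignableFrom(fieldType)) {
                    return Optional.of(entry.getValue());
                }
            } catch (ClassNotFoundException e) {
                // Configured type is not on the classpath: skip this entry.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        // Entry from the issue description; skipped here unless Avro is present.
        conf.put("spark.expressionencoder.org.apache.avro.specific.SpecificRecord",
                 "com.databricks.spark.avro.AvroEncoder$");
        // Hypothetical entry using a JDK type so the lookup can be demonstrated.
        conf.put("spark.expressionencoder.java.lang.Number", "com.example.NumberEncoder$");

        System.out.println(encoderFor(conf, Integer.class)); // Optional[com.example.NumberEncoder$]
        System.out.println(encoderFor(conf, String.class));  // Optional.empty
    }
}
```

Under this reading, a single entry for SpecificRecord would cover every generated Avro class
that implements it, which would explain why the key names the interface rather than a
concrete record type.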

This is a patch we implemented internally and have used for a few quarters. We would like
to propose it upstream, as we think it is a useful feature that makes Dataset more flexible
for user types.

 

  was:
Right now, ExpressionEncoder supports ser/de of primitive types, as well as scala case class,
tuple and java bean class. Spark's Dataset natively supports these mentioned types, but we
find it is not flexible for other user-defined types and encoders.

For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. Although we can
use AvroEncoder to define Dataset with types being the Avro Generic or Specific Record, using
such Avro typed Dataset has many limitations: 
1. We can not use joinWith on this Dataset since the result is a tuple, but Avro types cannot
be the field of this tuple.
2. We can not use some type-safe aggregation methods on this Dataset, such as KeyValueGroupedDataset's
reduceGroups, since the result is also a tuple.
3. We cannot augment an Avro SpecificRecord with additional primitive fields together in a
case class, which we find is a very common use case.

The root cause for these limitations is that Spark does not support simply define a Scala
case class/tuple with subfields being any other user-defined type, since ExpressionEncoder
does not discover the implicit Encoder for the user-defined fields, thus can not use any Encoder
to serde the user-defined fields in case class/tuple.

To address this issue, we propose a trait as a contract(between ExpressionEncoder and any
other user-defined Encoder) to enable case class/tuple/java bean's ExpressionEncoder to discover
the serializer/deserializer/schema from the Encoder of the user-defined type.

With this proposed patch and our minor modification in AvroEncoder, we make it work with conf

spark.expressionencoder.org.apache.avro.specific.SpecificRecord = com.databricks.spark.avro.AvroEncoder$

This is a patch we have implemented internally and has been used for a few quarters. We want
to propose to upstream as we think this is a useful feature to make Dataset more flexible
to user types.

 


> ExpressionEncoder should support user-defined types as fields of Scala case class and
tuple
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24256
>                 URL: https://issues.apache.org/jira/browse/SPARK-24256
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Fangshi Li
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


