spark-issues mailing list archives

From "Furcy Pin (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-20384) supporting value classes over primitives in DataSets
Date Thu, 29 Mar 2018 10:11:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-20384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418691#comment-16418691 ]

Furcy Pin edited comment on SPARK-20384 at 3/29/18 10:10 AM:
-------------------------------------------------------------

+1 on this issue.

I think the general use case is that Spark's Encoder magic, which automatically transforms a
DataFrame into a Dataset of case classes, currently only works when the fields are base types.

This is great if you have a 
{code:java}
case class Table(id: Long, attribute: String)
{code}
with simple attributes,

BUT if you want to wrap your attribute in another simple class like this
{code:java}
case class Attribute(value: String) {
  // some specific methods...
}
case class Table(id: Long, attribute: Attribute)
{code}
Then this won't work automatically, unless the "attribute" column in your DataFrame is itself
a struct.
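
To make the struct case concrete, here is a minimal sketch (not a proposed fix for this issue). It assumes a local SparkSession and two made-up sample rows; the object name StructWrapSketch and the helper toTables are hypothetical. It wraps the flat string column into a one-field struct named "value", so that the nested case class can be matched by field name:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, struct}

// Same shapes as in the comment above; defined at top level so Spark can derive encoders.
case class Attribute(value: String)
case class Table(id: Long, attribute: Attribute)

object StructWrapSketch {
  // Hypothetical helper: expects a flat DataFrame with columns (id: Long, attribute: String).
  def toTables(spark: SparkSession, flat: DataFrame): Dataset[Table] = {
    import spark.implicits._
    flat
      // Replace the string column with a struct whose single field is called "value",
      // which is the field name of the Attribute case class.
      .withColumn("attribute", struct(col("attribute").as("value")))
      .as[Table]
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, only for illustration.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("struct-wrap-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data.
    val flat = Seq((1L, "a"), (2L, "b")).toDF("id", "attribute")
    toTables(spark, flat).show()

    spark.stop()
  }
}
```

This has to be repeated by hand for every wrapped column, which is exactly the boilerplate one would like the Encoder to take care of.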

 

The problem is that currently there doesn't seem to be any simple way to achieve this, which
really limits the usefulness of the whole Encoder magic. 

And if a nice, simple way to achieve this exists, please document it, as I did not find it.

EDIT: after giving it some thought, I tried to do this:
{code:java}
implicit class Attribute(value: String)
case class Table(id: Long, attribute: Attribute)
{code}
But it does not work either. If this were possible, it would be a nice way to do it.



> supporting value classes over primitives in DataSets
> ----------------------------------------------------
>
>                 Key: SPARK-20384
>                 URL: https://issues.apache.org/jira/browse/SPARK-20384
>             Project: Spark
>          Issue Type: Improvement
>          Components: Optimizer, SQL
>    Affects Versions: 2.1.0
>            Reporter: Daniel Davis
>            Priority: Minor
>
> As a Spark user who uses value classes in Scala for modelling domain objects, I would also
> like to make use of them in Datasets.
> For example, I would like to use the {{User}} case class, which uses a value class
> for its {{id}}, as the type for a Dataset:
> - the underlying primitive should be mapped to the value-class column
> - functions on the column (for example, comparison) should only work if defined on the
> value class, and should use that implementation
> - show() should pick up the toString method of the value class
> {code}
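> // needed for toDS() and as[User]; assumes a SparkSession named spark
> import spark.implicits._
> case class Id(value: Long) extends AnyVal {
>   // toString already exists on Any, so it must be marked override
>   override def toString: String = value.toHexString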
> }
> case class User(id: Id, name: String)
> val ds = spark.sparkContext
>   .parallelize(0L to 12L).map(i => (i, f"name-$i")).toDS()
>   .withColumnRenamed("_1", "id")
>   .withColumnRenamed("_2", "name")
> // mapping should work
> val usrs = ds.as[User]
> // show should use toString
> usrs.show()
> // comparison with long should throw exception, as not defined on Id
> usrs.col("id") > 0L
> {code}
> For example, {{.show()}} should use the toString of the {{Id}} value class:
> {noformat}
> +---+-------+
> | id|   name|
> +---+-------+
> |  0| name-0|
> |  1| name-1|
> |  2| name-2|
> |  3| name-3|
> |  4| name-4|
> |  5| name-5|
> |  6| name-6|
> |  7| name-7|
> |  8| name-8|
> |  9| name-9|
> |  a|name-10|
> |  b|name-11|
> |  c|name-12|
> +---+-------+
> {noformat}
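
For comparison, here is what the requested behaviour looks like in plain Scala, without Spark involved. This is only a sketch of the semantics the issue asks Datasets to preserve; the object name ValueClassSketch is made up for illustration:

```scala
// Value class over a primitive, as in the issue description.
final case class Id(value: Long) extends AnyVal {
  // Render the id as (lowercase) hexadecimal.
  override def toString: String = value.toHexString
}

final case class User(id: Id, name: String)

object ValueClassSketch {
  def main(args: Array[String]): Unit = {
    val users = (0L to 12L).map(i => User(Id(i), s"name-$i"))

    // String interpolation goes through Id.toString, so ids 10 to 12 print as a, b, c.
    users.foreach(u => println(s"${u.id} ${u.name}"))

    // The following line would not compile, because Id defines no `>`:
    // Id(1L) > 0L
  }
}
```

In plain Scala the compiler enforces all of this statically. On a Dataset column the same checks are lost, because the column is only known to hold a long.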



