spark-issues mailing list archives

From "Farooq Qaiser (Jira)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-30319) Adds a stricter version of as[T]
Date Thu, 26 Dec 2019 18:13:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-30319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003718#comment-17003718 ]

Farooq Qaiser commented on SPARK-30319:
---------------------------------------

I have written similar variants of this feature (using Scala's implicit-conversion technique
to monkey-patch the Dataset class) across multiple organizations and codebases, and wanted
to share my thoughts in case it's helpful to the discussion.

I can affirm that this would be a valuable feature to have in Spark. Without it,
our developers would nearly always have to pair an {{as}} operation with a {{select}} operation. As
such, my preference would be to change the existing Dataset {{as[T]}} method to apply this
strictness by default when {{T}} is a class. This would be a breaking change, but since
the next version of Spark is a major release (3.0.0), this should be okay.

Also, I saw that your PR includes eager casting of Column types. I'm not sure this
is a good idea, although I can't think of any concrete objections. In my own implementations
of this feature, I've always just raised an exception if the column types don't match what's
specified in {{T}}, leaving it to the developer to explicitly cast Columns to the correct
types before using this feature.
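
For illustration, a minimal sketch of the approach described above (an implicit class that extends Dataset, fails on type mismatches, and selects only the columns of {{T}}). The names {{StrictAs}}, {{StrictDatasetOps}} and {{asStrict}} are hypothetical and not part of Spark's API or of the PR; matching is by exact column name and exact data type, which is an assumption of this sketch.

```scala
import org.apache.spark.sql.{Dataset, Encoder}
import org.apache.spark.sql.functions.col

object StrictAs {
  // Hypothetical extension method: `asStrict` is NOT part of Spark's API.
  implicit class StrictDatasetOps[A](ds: Dataset[A]) {
    def asStrict[T](implicit enc: Encoder[T]): Dataset[T] = {
      // Schema that the encoder for T expects.
      val expected = enc.schema
      // Top-level columns actually present in the input, keyed by name.
      val actual = ds.schema.fields.map(f => f.name -> f.dataType).toMap

      // Fail fast instead of casting: the caller must cast explicitly beforehand.
      expected.fields.foreach { field =>
        actual.get(field.name) match {
          case None =>
            throw new IllegalArgumentException(s"Missing column '${field.name}'")
          case Some(dataType) if dataType != field.dataType =>
            throw new IllegalArgumentException(
              s"Column '${field.name}' is $dataType, expected ${field.dataType}")
          case _ => // name and type match
        }
      }

      // Keep only the columns of T, in T's field order, then apply the encoder.
      val columns = expected.fieldNames.map(name => col(name))
      ds.select(columns: _*).as[T]
    }
  }
}
```

With {{import StrictAs._}} and the session's implicits in scope, a call would look like {{df.asStrict[Person]}} for some case class {{Person}}. The exact type comparison is deliberately strict: nested types that differ only in nullability would also be rejected, so a real implementation may want a more lenient comparison.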

> Adds a stricter version of as[T]
> --------------------------------
>
>                 Key: SPARK-30319
>                 URL: https://issues.apache.org/jira/browse/SPARK-30319
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.4.4
>            Reporter: Enrico Minack
>            Priority: Major
>             Fix For: 3.0.0
>
>
> The behaviour of as[T] is not intuitive when you read code like df.as[T].write.csv("data.csv").
> The result depends on the actual schema of df, where def as[T](): Dataset[T] should be agnostic
> to the schema of df. The expected behaviour is not provided elsewhere:
>  * Extra columns that are not part of the type {{T}} are not dropped.
>  * Order of columns is not aligned with schema of {{T}}.
>  * Columns are not cast to the types of {{T}}'s fields. They have to be cast explicitly.
> A method that enforces the schema of T on a given Dataset would be very convenient and would allow
> you to articulate and guarantee the above assumptions about your data with the native Spark Dataset
> API. This method plays a more explicit and enforcing role than as[T] with respect to columns,
> column order and column type.
> Possible naming of a stricter version of {{as[T]}}:
>  * {{as[T](strict = true)}}
>  * {{toDS[T]}} (as in {{toDF}})
>  * {{selectAs[T]}} (as this is merely selecting the columns of schema {{T}})
> The naming {{toDS[T]}} is chosen here.
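
The three points in the quoted description can be sketched as follows. This is only an illustration, assuming a local SparkSession; the case class {{Record}}, the object {{AsDemo}} and the helper {{alignToRecord}} are hypothetical names that appear in neither the issue nor Spark.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical record type, used only for illustration.
case class Record(id: Long, label: String)

object AsDemo {
  // The manual pairing of `select` (with explicit casts) and `as`:
  // drops extra columns, fixes the column order, and casts `id` to long.
  def alignToRecord(df: DataFrame): Dataset[Record] =
    df.select(col("id").cast("long").as("id"), col("label"))
      .as[Record](Encoders.product[Record])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("as-demo").getOrCreate()
    import spark.implicits._

    // Columns are in a different order than Record, `id` is an Int, and there is an extra column.
    val df = Seq(("a", 1, true)).toDF("label", "id", "extra")

    // as[T] alone leaves the columns of the underlying data untouched.
    val loose = df.as[Record]
    println(loose.columns.mkString(", ")) // label, id, extra

    // select + as yields exactly the columns of Record, in Record's order.
    val aligned = alignToRecord(df)
    println(aligned.columns.mkString(", ")) // id, label

    spark.stop()
  }
}
```

Writing {{loose}} to CSV would therefore emit three columns in the input order, while {{aligned}} emits the two columns of {{Record}}.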



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
