spark-issues mailing list archives

From "Josh Rosen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-14584) Improve recognition of non-nullability in Dataset transformations
Date Wed, 17 May 2017 20:44:04 GMT

    [ https://issues.apache.org/jira/browse/SPARK-14584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16014728#comment-16014728 ]

Josh Rosen commented on SPARK-14584:
------------------------------------

[~maropu], yep, I think we can close it. I'll mark it as fixed by SPARK-18284. Thanks for
tracking down the PR / tests.

> Improve recognition of non-nullability in Dataset transformations
> -----------------------------------------------------------------
>
>                 Key: SPARK-14584
>                 URL: https://issues.apache.org/jira/browse/SPARK-14584
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Josh Rosen
>
> There are many cases where we can statically know that a field will never be null. For
> instance, a field in a case class with a primitive type will never return null. However,
> there are currently several cases in the Dataset API where we do not properly recognize
> this non-nullability. For instance:
> {code}
> case class MyCaseClass(foo: Int)
> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
> {code}
> claims that the {{foo}} field is nullable even though this is impossible.
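> To make the symptom concrete, here is a sketch of the output {{printSchema}} produces in
> this case (exact formatting may vary by Spark version):
> {code}
> root
>  |-- foo: integer (nullable = true)
> {code}
> The expected schema would instead report {{nullable = false}}, since {{foo}} is a
> primitive {{Int}}.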
> I believe that this is due to the way that we reason about nullability when constructing
> serializer expressions in ExpressionEncoders. The following assertion will currently fail
> if added to ExpressionEncoder:
> {code}
>   require(schema.size == serializer.size)
>   schema.fields.zip(serializer).foreach { case (field, fieldSerializer) =>
>     require(field.dataType == fieldSerializer.dataType, s"Field ${field.name}'s data type is " +
>       s"${field.dataType} in the schema but ${fieldSerializer.dataType} in its serializer")
>     require(field.nullable == fieldSerializer.nullable, s"Field ${field.name}'s nullability is " +
>       s"${field.nullable} in the schema but ${fieldSerializer.nullable} in its serializer")
>   }
> {code}
> Most often, the schema claims that a field is non-nullable while the encoder allows for
> nullability, but occasionally we see a mismatch in the datatypes due to disagreements over
> the nullability of nested structs' fields (or fields of structs in arrays).
> I think the problem is that when we're reasoning about nullability in a struct's schema,
> we consider its fields' nullability to be independent of the nullability of the struct
> itself, whereas in the serializer expressions we consider those field extraction
> expressions to be nullable if the input objects themselves can be nullable.
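> As a minimal illustration of that disagreement (the case classes below are my own, purely
> for exposition):
> {code}
> // Hypothetical example: Inner/Outer are illustrative names.
> case class Inner(x: Int)
> case class Outer(inner: Inner)
> // Schema-side reasoning: x is a primitive Int, so it is non-nullable,
> // independent of whether the inner struct itself is nullable.
> // Serializer-side reasoning: extracting x must navigate through the
> // inner object reference, which could be null, so the extraction
> // expression is treated as nullable -- hence the nested-field mismatch.
> sc.parallelize(Seq(Outer(Inner(0)))).toDS.printSchema
> {code}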
> I'm not sure what's the simplest way to fix this. One proposal would be to leave the
> serializers unchanged and have ObjectOperator derive its output attributes from an
> explicitly-passed schema rather than using the serializers' attributes. However, I worry
> that this might introduce bugs in case the serializer and schema disagree.
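> For what it's worth, a rough sketch of the first half of that proposal, assuming a
> hypothetical helper (the {{AttributeReference}} constructor is Catalyst's, but how
> ObjectOperator would consume this is not settled):
> {code}
> import org.apache.spark.sql.catalyst.expressions.AttributeReference
> import org.apache.spark.sql.types.StructType
>
> // Hypothetical helper: derive output attributes from an explicitly-passed
> // schema instead of from the serializer expressions' own nullability.
> def outputAttributes(schema: StructType): Seq[AttributeReference] =
>   schema.fields.toSeq.map { f =>
>     AttributeReference(f.name, f.dataType, f.nullable)()
>   }
> {code}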



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

