spark-issues mailing list archives

From "Maxim Gekk (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-24269) Infer nullability rather than declaring all columns as nullable
Date Mon, 14 May 2018 12:32:00 GMT
Maxim Gekk created SPARK-24269:
----------------------------------

             Summary: Infer nullability rather than declaring all columns as nullable
                 Key: SPARK-24269
                 URL: https://issues.apache.org/jira/browse/SPARK-24269
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Maxim Gekk


Currently, the CSV and JSON datasources set the *nullable* flag to true for every column
during schema inference, regardless of the data itself.

JSON: https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala#L126
CSV: https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L51

For example, suppose the source dataset has the schema:
{code}
root
 |-- item_id: integer (nullable = false)
 |-- country: string (nullable = false)
 |-- state: string (nullable = false)
{code}

If we save it and read it back, the schema of the inferred dataset is:
{code}
root
 |-- item_id: integer (nullable = true)
 |-- country: string (nullable = true)
 |-- state: string (nullable = true)
{code}
This ticket aims to set the nullable flag more precisely during schema inference, based on
the data actually read.
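
As a rough illustration of the idea (not Spark's implementation, and not the change proposed for JsonInferSchema or CSVInferSchema), the sketch below infers a per-column nullable flag from JSON records in plain Python. The function name infer_nullability and the sample records are hypothetical. It assumes a column should be nullable only if some record has an explicit null for it or omits it entirely.

```python
import json
from typing import Any, Dict, Iterable, List


def infer_nullability(records: Iterable[Dict[str, Any]]) -> Dict[str, bool]:
    """Map each column name to True if any record has it null or missing."""
    rows: List[Dict[str, Any]] = list(records)

    # Collect column names in first-seen order across all records.
    columns: List[str] = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)

    # dict.get returns None both for an explicit null and for a missing key,
    # so one check covers both cases.
    return {name: any(row.get(name) is None for row in rows) for name in columns}


# Hypothetical sample data: only "state" contains a null.
lines = [
    '{"item_id": 1, "country": "US", "state": "CA"}',
    '{"item_id": 2, "country": "DE", "state": null}',
]
flags = infer_nullability(json.loads(line) for line in lines)

for name, nullable in flags.items():
    print(f" |-- {name} (nullable = {str(nullable).lower()})")
```

Here item_id and country come out as non-nullable and state as nullable. A real datasource would have to fold this check into its distributed inference pass, and a flag inferred from a sample of the data would only be a guess about rows that were never inspected.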
