spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-16472) Inconsistent nullability in schema after being read
Date Wed, 15 Mar 2017 00:05:41 GMT

     [ https://issues.apache.org/jira/browse/SPARK-16472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-16472:
---------------------------------
    Summary: Inconsistent nullability in schema after being read  (was: Inconsistent nullability in schema after being read in SQL API.)

> Inconsistent nullability in schema after being read
> ---------------------------------------------------
>
>                 Key: SPARK-16472
>                 URL: https://issues.apache.org/jira/browse/SPARK-16472
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Minor
>
> It seems that data sources implementing {{FileFormat}} load data with all fields forced to be nullable. This appears to have been officially documented in SPARK-11360 and was discussed here: https://www.mail-archive.com/user@spark.apache.org/msg39230.html
> However, I realised that several APIs do not follow this. For example,
> {code}
> DataFrameReader.json(jsonRDD: RDD[String])
> {code}
> So, the code below:
> {code}
> import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
>
> val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : null}"))
> val schema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)
> val df = spark.read.schema(schema).json(rdd)
> df.printSchema()
> {code}
> prints the following:
> {code}
> root
>  |-- a: integer (nullable = false)
> {code}
> This API keeps the user-specified schema as it is after loading. However, the schema differs (its fields become nullable) when the data is loaded through the APIs below:
> {code}
> spark.read.format("json").schema(...).load(path).printSchema()
> {code}
> {code}
> spark.read.schema(...).load(path).printSchema()
> {code}
> both produce the following:
> {code}
> root
>  |-- a: integer (nullable = true)
> {code}
> In addition, this happens for structured streaming as well (even when we read, in batch mode, data that was written by structured streaming).
> While testing, I wrote some test code and patches. Please see the following PR for more cases.
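
The comparison described in the report can be sketched as a small standalone program. This is only an illustration and is not taken from the referenced PR: the object name `NullabilityCheck`, the helper `nullability`, the local master setting and the temporary path are all assumptions made for the example. It reads the same JSON records once from an RDD and once from a path, with the same non-nullable schema, and prints the nullable flag each read path reports. No expected output is shown because the result depends on the Spark version.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object NullabilityCheck {
  // Hypothetical helper: maps each top-level field name to its nullable flag.
  def nullability(schema: StructType): Map[String, Boolean] =
    schema.fields.map(f => f.name -> f.nullable).toMap

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to observe the schema difference.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("nullability-check")
      .getOrCreate()

    // The user-specified schema marks "a" as non-nullable, as in the report.
    val schema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)
    val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}"))

    // Path 1: read from an RDD[String] (deprecated in later Spark 2.x releases).
    val fromRdd = spark.read.schema(schema).json(rdd)

    // Path 2: write the same records to a fresh directory and read them back
    // through the file-based JSON data source.
    val path = Files.createTempDirectory("spark-16472").resolve("json").toString
    rdd.saveAsTextFile(path)
    val fromPath = spark.read.schema(schema).json(path)

    println(s"requested:  ${nullability(schema)}")
    println(s"json(rdd):  ${nullability(fromRdd.schema)}")
    println(s"json(path): ${nullability(fromPath.schema)}")

    spark.stop()
  }
}
```

If the two printed maps differ for the same requested schema, the inconsistency described above is present in the Spark version under test.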



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
