spark-issues mailing list archives

From "Michel Lemay (JIRA)" <>
Subject [jira] [Commented] (SPARK-12436) If all values of a JSON field is null, JSON's inferSchema should return NullType instead of StringType
Date Thu, 31 Mar 2016 18:12:25 GMT


Michel Lemay commented on SPARK-12436:

This example fails to illustrate the issue because the order of the values matters.  It
first sees a StructType with fields, then an empty StructType, and finally an empty StringType,
which works as expected.  Reverse that order and you are doomed.

Worse, consider Spark Streaming, where you get a bunch of lines and not all of the
fields are populated.  That is easy to imagine with short 1-2 second batches, where your
sample is really small.  You end up with multiple incompatible schemas, and they are not
mergeable because of that StringType thing.  Preserving NullTypes where needed won't work
either, because of Parquet serialization.  (See my other comment below.)
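
To make the merging problem concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's inference or merge code: the type names (`NULL`, `STRING`, `LONG`), the `merge` and `collapse_nulls` helpers, and the `payload` field are all hypothetical stand-ins. It assumes a merge rule where a null type combines with anything, while a string and a struct cannot be combined.

```python
# Toy model of schema merging; NOT Spark's implementation.
# Struct types are dicts (field name -> type); everything else is a string tag.
NULL, STRING, LONG = "null", "string", "long"


class IncompatibleTypes(Exception):
    """Raised when two toy types cannot be merged."""


def merge(a, b):
    """Merge two toy types; NULL merges with anything."""
    if a == NULL:
        return b
    if b == NULL:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for name, typ in b.items():
            # A field missing on one side is treated as NULL.
            merged[name] = merge(merged.get(name, NULL), typ)
        return merged
    if a == b:
        return a
    raise IncompatibleTypes(f"cannot merge {a!r} with {b!r}")


def collapse_nulls(t):
    """Mimic the behaviour under discussion: all-null fields become STRING."""
    if t == NULL:
        return STRING
    if isinstance(t, dict):
        return {k: collapse_nulls(v) for k, v in t.items()}
    return t


batch1 = {"payload": NULL}           # small batch where payload was always null
batch2 = {"payload": {"id": LONG}}   # later batch with a populated struct

# With the null type preserved, the two batch schemas merge in either order.
preserved = merge(batch1, batch2)

# With nulls collapsed to STRING, the same two schemas no longer merge.
try:
    merge(collapse_nulls(batch1), collapse_nulls(batch2))
    collapsed_ok = True
except IncompatibleTypes:
    collapsed_ok = False
```

In this model the preserved-null schemas merge regardless of which batch arrives first, while the collapsed ones fail as soon as one batch saw only nulls and another saw a struct.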

> If all values of a JSON field is null, JSON's inferSchema should return NullType instead of StringType
> ------------------------------------------------------------------------------------------------------
>                 Key: SPARK-12436
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Reynold Xin
>              Labels: starter
> Right now, JSON's inferSchema returns {{StringType}} for a field that always has
null values, or {{ArrayType(StringType)}} for a field that always has empty array values.
Although this behavior makes writing JSON data to other data sources easy (i.e. when writing
data, we do not need to remove those {{NullType}} or {{ArrayType(NullType)}} columns), it
makes it hard for downstream applications to reason about the actual schema of the data, and
thus makes schema merging hard. We should allow JSON's inferSchema to return {{NullType}} and
{{ArrayType(NullType)}}. Also, when we write data out, we need to make sure that we remove
those {{NullType}} or {{ArrayType(NullType)}} columns first.
> Besides {{NullType}} and {{ArrayType(NullType)}}, we may need to do the same thing for
empty {{StructType}}s (i.e. a {{StructType}} having 0 fields).
> To finish this work, we need to finish the following sub-tasks:
> * Allow JSON's inferSchema to return {{NullType}} and {{ArrayType(NullType)}}.
> * Determine whether we need to add the operation of removing {{NullType}} and {{ArrayType(NullType)}}
columns from the data that will be written out for all data sources (i.e. data sources based
on our data source API and Hive tables), or whether we should add this operation only for
certain data sources (e.g. Parquet). For example, we may not need this operation for Hive
because Hive has VoidObjectInspector.
> * Implement the change and get it merged to Spark master.
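
The "remove those columns before writing" step from the sub-tasks above could look roughly like the following sketch. It is plain Python over a toy schema representation, not Spark code: `NULL`, `array_of`, `is_null_like`, `prune`, and the sample field names are hypothetical, and whether empty structs should also be dropped is left open, as in the description.

```python
# Toy sketch; the type tags are stand-ins, not Spark classes.
# Struct types are dicts (field name -> type); arrays are ("array", element_type).
NULL = "null"


def array_of(elem):
    """Build a toy array type with the given element type."""
    return ("array", elem)


def is_null_like(t):
    """True for NULL and for arrays whose element type is null-like."""
    if t == NULL:
        return True
    return isinstance(t, tuple) and t[0] == "array" and is_null_like(t[1])


def prune(schema):
    """Return a copy of a struct schema without null-like columns.

    Nested structs are pruned recursively; structs inside arrays are left as-is.
    """
    pruned = {}
    for name, typ in schema.items():
        if is_null_like(typ):
            continue
        pruned[name] = prune(typ) if isinstance(typ, dict) else typ
    return pruned


schema = {
    "id": "long",
    "comment": NULL,                # always null in the sampled data
    "tags": array_of(NULL),         # always an empty array in the sampled data
    "user": {"name": "string", "nickname": NULL},
}

# Schema that would be handed to a writer that cannot store null-typed columns.
writable = prune(schema)
```

The point of keeping this as a separate step is that inference can stay honest about null-only fields, while only the data sources that need it pay the cost of dropping them.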
