Mailing-List: contact issues-help@spark.apache.org; run by ezmlm
Precedence: bulk
Date: Thu, 31 Mar 2016 18:12:25 +0000 (UTC)
From: "Michel Lemay (JIRA)" <jira@apache.org>
To: issues@spark.apache.org
Message-ID: <JIRA.12923186.1450487052000.103326.1459447945459@Atlassian.JIRA>
In-Reply-To: <JIRA.12923186.1450487052000@Atlassian.JIRA>
References: <JIRA.12923186.1450487052000@Atlassian.JIRA>
 <JIRA.12923186.1450487052171@arcas>
Subject: [jira] [Commented] (SPARK-12436) If all values of a JSON field is
 null, JSON's inferSchema should return NullType instead of StringType
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/SPARK-12436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220353#comment-15220353 ] 

Michel Lemay commented on SPARK-12436:
--------------------------------------

This example fails to illustrate the issue, since the order of the values is important.  It first sees a StructType with fields, then an empty StructType, and finally an empty StringType, which works as expected.  Reverse that order and you are doomed.

Worse, consider Spark Streaming, where you get a bunch of lines and not all of the fields are populated.  With short 1-2 second batches this is easy to imagine, because the sample is really small.  You end up with multiple incompatible schemas that are not mergeable because of that StringType thing.  Preserving NullTypes where needed won't work either, because of Parquet serialization.  (See my other comment below.)
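
To make the merging problem concrete, here is a minimal toy model in plain Python.  It is not Spark's implementation: the type names, merge() and canonicalize() are all made up for illustration, and the only assumption taken from this issue is that an all-null field ends up reported as a string type.  It sketches why keeping a null type until the end merges cleanly in any order, while converting to string per batch does not.

```python
# Toy model of JSON schema merging; NOT Spark's actual implementation.
# A type is "null", "string", "long", or a dict (struct: field name -> type).

def merge(a, b):
    """Merge two inferred types; "null" gives way to any concrete type."""
    if a == "null":
        return b
    if b == "null":
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        # Structs merge field by field; a field missing on one side is "null".
        return {k: merge(a.get(k, "null"), b.get(k, "null"))
                for k in sorted(set(a) | set(b))}
    if a == b:
        return a
    raise TypeError(f"incompatible types: {a!r} vs {b!r}")


def canonicalize(t):
    """Mimic the reported behavior: an all-null field is reported as string."""
    if t == "null":
        return "string"
    if isinstance(t, dict):
        return {k: canonicalize(v) for k, v in t.items()}
    return t


# Two small batches: "payload" is always null in the first one.
batch1 = {"id": "long", "payload": "null"}
batch2 = {"id": "long", "payload": {"x": "long"}}

# Keeping null until the end merges cleanly, in either order.
merged = merge(batch1, batch2)
assert merged == merge(batch2, batch1)

# Canonicalizing each batch first bakes in "string", and the merge fails.
try:
    merge(canonicalize(batch1), canonicalize(batch2))
    conflict = False
except TypeError:
    conflict = True
```

In this sketch the conflict only appears once "null" has been replaced by "string" inside a single batch, which is the situation a small streaming batch runs into.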

> If all values of a JSON field is null, JSON's inferSchema should return NullType instead of StringType
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12436
>                 URL: https://issues.apache.org/jira/browse/SPARK-12436
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Reynold Xin
>              Labels: starter
>
> Right now, JSON's inferSchema will return {{StringType}} for a field that always has null values or an {{ArrayType(StringType)}} for a field that always has empty array values. Although this behavior makes writing JSON data to other data sources easy (i.e. when writing data, we do not need to remove those {{NullType}} or {{ArrayType(NullType)}} columns), it makes it hard for downstream applications to reason about the actual schema of the data and thus makes schema merging hard. We should allow JSON's inferSchema to return {{NullType}} and {{ArrayType(NullType)}}. Also, when we write data out, we need to make sure that we remove those {{NullType}} or {{ArrayType(NullType)}} columns first. 
> Besides  {{NullType}} and {{ArrayType(NullType)}}, we may need to do the same thing for empty {{StructType}}s (i.e. a {{StructType}} having 0 fields). 
> To finish this work, we need to finish the following sub-tasks:
> * Allow JSON's inferSchema to return {{NullType}} and {{ArrayType(NullType)}}.
> * Determine whether we need to add the operation of removing {{NullType}} and {{ArrayType(NullType)}} columns from the data that will be written out for all data sources (i.e. data sources based on our data source API and Hive tables). Or, we should just add this operation for certain data sources (e.g. Parquet). For example, we may not need this operation for Hive because Hive has VoidObjectInspector.
> * Implement the change and get it merged to Spark master.
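
The write-side step in the quoted description (removing null-typed columns before writing) could look roughly like the following.  This is again a plain-Python toy, not a Spark API: is_null_like() and prune() are hypothetical helpers, and treating an empty struct as droppable is an assumption based on the description's remark about StructTypes with 0 fields.

```python
# Toy schema pruning; NOT a Spark API. Types: "null", "string", "long",
# ("array", element_type), or a dict for a struct (field name -> type).

def is_null_like(t):
    """True for null, arrays of null-like elements, and structs with no usable field."""
    if t == "null":
        return True
    if isinstance(t, tuple) and t[0] == "array":
        return is_null_like(t[1])
    if isinstance(t, dict):
        # An empty struct has no fields, so all() is True and it is dropped.
        return all(is_null_like(v) for v in t.values())
    return False


def prune(schema):
    """Drop null-like fields from a struct, recursing into nested structs."""
    out = {}
    for name, t in schema.items():
        if is_null_like(t):
            continue
        out[name] = prune(t) if isinstance(t, dict) else t
    return out


schema = {
    "id": "long",
    "note": "null",
    "tags": ("array", "null"),
    "meta": {},
    "user": {"name": "string", "nick": "null"},
}
writable = prune(schema)
```

The sketch does not descend into arrays of structs; a real implementation would have to decide what to do with those as well.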

