Date: Thu, 31 Mar 2016 18:12:25 +0000 (UTC)
From: "Michel Lemay (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Commented] (SPARK-12436) If all values of a JSON field is null, JSON's inferSchema should return NullType instead of StringType

[ https://issues.apache.org/jira/browse/SPARK-12436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220353#comment-15220353 ]

Michel Lemay commented on SPARK-12436:
--------------------------------------

This example fails to illustrate the issue, since the order of the values is important. It first sees a StructType with fields, then an empty StructType, and finally an empty StringType, which works as expected. Reverse that order and you are doomed.

Worse than that, consider Spark Streaming, where you get a bunch of lines and not all of the fields are populated. That is easy to imagine with short 1-2 second batches, where your sample is really small. You end up with multiple incompatible schemas, and they are not mergeable because of that StringType thing. Preserving NullTypes where needed won't work either, because of Parquet serialization. (See my other comment below.)

> If all values of a JSON field is null, JSON's inferSchema should return NullType instead of StringType
> ------------------------------------------------------------------------------------------------------
>
> Key: SPARK-12436
> URL: https://issues.apache.org/jira/browse/SPARK-12436
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Reynold Xin
> Labels: starter
>
> Right now, JSON's inferSchema will return {{StringType}} for a field that always has null values, or an {{ArrayType(StringType)}} for a field that always has empty array values. Although this behavior makes writing JSON data to other data sources easy (i.e. when writing data, we do not need to remove those {{NullType}} or {{ArrayType(NullType)}} columns), it makes it hard for downstream applications to reason about the actual schema of the data and thus makes schema merging hard. We should allow JSON's inferSchema to return {{NullType}} and {{ArrayType(NullType)}}. Also, we need to make sure that when we write data out, we remove those {{NullType}} or {{ArrayType(NullType)}} columns first.
> Besides {{NullType}} and {{ArrayType(NullType)}}, we may need to do the same thing for empty {{StructType}}s (i.e. a {{StructType}} having 0 fields).
> To finish this work, we need to finish the following sub-tasks:
> * Allow JSON's inferSchema to return {{NullType}} and {{ArrayType(NullType)}}.
> * Determine whether we need to add the operation of removing {{NullType}} and {{ArrayType(NullType)}} columns from the data that will be written out for all data sources (i.e. data sources based on our data source API and Hive tables), or whether we should just add this operation for certain data sources (e.g. Parquet). For example, we may not need this operation for Hive because Hive has VoidObjectInspector.
> * Implement the change and get it merged to Spark master.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
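
As a rough illustration of the merging problem discussed in this thread, here is a small Python sketch. It is a toy model and not Spark's code: the type names and the helpers `infer_type`, `widen_nulls`, `merge` and `infer_schema` are hypothetical and only mimic the idea. Each "batch" is inferred separately, as in the streaming scenario from the comment, and an all-null field is either kept as a null type or widened to a string type before the per-batch schemas are merged.

```python
# Toy model of per-batch schema inference and merging.
# Assumption: this is NOT Spark's implementation; all names are made up.
NULL, STRING, LONG = "null", "string", "long"


def infer_type(value):
    """Infer a toy type for one parsed JSON value (structs are dicts)."""
    if value is None:
        return NULL
    if isinstance(value, dict):
        return {name: infer_type(v) for name, v in value.items()}
    if isinstance(value, str):
        return STRING
    if isinstance(value, int):
        return LONG
    raise TypeError(f"unsupported value in this sketch: {value!r}")


def merge(a, b):
    """Merge two toy types; the null type is compatible with anything."""
    if a == NULL:
        return b
    if b == NULL:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for name, t in b.items():
            merged[name] = merge(merged[name], t) if name in merged else t
        return merged
    if a == b:
        return a
    raise TypeError(f"cannot merge {a!r} with {b!r}")


def widen_nulls(t):
    """Replace every remaining null type with the string type."""
    if t == NULL:
        return STRING
    if isinstance(t, dict):
        return {name: widen_nulls(v) for name, v in t.items()}
    return t


def infer_schema(records, null_as_string):
    """Infer one schema for a batch of records."""
    schema = NULL
    for record in records:
        schema = merge(schema, infer_type(record))
    return widen_nulls(schema) if null_as_string else schema


# Two small batches: "payload" is always null in the first one.
batch1 = [{"id": 1, "payload": None}]
batch2 = [{"id": 2, "payload": {"x": 3}}]

# Keeping the null type lets the per-batch schemas merge.
merged_with_nulls = merge(infer_schema(batch1, False), infer_schema(batch2, False))

# Widening null to string in each batch makes the same merge fail.
try:
    merge(infer_schema(batch1, True), infer_schema(batch2, True))
    conflict = False
except TypeError:
    conflict = True
```

In this sketch the first merge yields a struct for `payload`, while the second raises because a string type meets a struct type. It says nothing about the Parquet serialization concern raised in the comment, which would still apply to any null-typed columns that remain at write time.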