Date: Wed, 4 Jul 2018 13:51:00 +0000 (UTC)
From: "Wenchen Fan (JIRA)" <jira@apache.org>
To: issues@spark.apache.org
Message-ID: <JIRA.13163043.1527728179000.72845.1530712260157@Atlassian.JIRA>
In-Reply-To: <JIRA.13163043.1527728179000@Atlassian.JIRA>
References: <JIRA.13163043.1527728179000@Atlassian.JIRA> <JIRA.13163043.1527728179005@jira-lw-us.apache.org>
Subject: [jira] [Commented] (SPARK-24438) Empty strings and null strings are
 written to the same partition


    [ https://issues.apache.org/jira/browse/SPARK-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532775#comment-16532775 ] 

Wenchen Fan commented on SPARK-24438:
-------------------------------------

AFAIK this is the same behavior as Hive. Null and the empty string are both invalid partition values, so they are treated as the same value when used as partition values. cc [~gatorsmile] [~dongjoon]
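
To illustrate the point, here is a minimal sketch of how a partition value might be mapped to a directory name and read back. It models the behavior described in this issue and is not Spark's actual source: the object and method names are hypothetical, and path escaping of special characters is left out. The only real identifier is `__HIVE_DEFAULT_PARTITION__`, the default partition name used by Hive and Spark.

```scala
object PartitionPathSketch {
  // Default partition directory name used by Hive and Spark for invalid partition values.
  val DefaultPartitionName = "__HIVE_DEFAULT_PARTITION__"

  // Hypothetical helper: map a partition value to the directory it is written under.
  // Null and the empty string are both treated as invalid, so they share one directory.
  def partitionDir(column: String, value: String): String = {
    val dirValue = if (value == null || value.isEmpty) DefaultPartitionName else value
    s"$column=$dirValue"
  }

  // Hypothetical helper: recover the partition value from a directory name.
  // The default partition is read back as null, so an original "" cannot be recovered.
  def partitionValue(dir: String): String = {
    val raw = dir.substring(dir.indexOf('=') + 1)
    if (raw == DefaultPartitionName) null else raw
  }
}
```

Under this model, `partitionDir("b", "")` and `partitionDir("b", null)` produce the same directory, which is consistent with the report below, where rows 1, 2, 3 and 5 all come back with `b` as null.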

> Empty strings and null strings are written to the same partition
> ----------------------------------------------------------------
>
>                 Key: SPARK-24438
>                 URL: https://issues.apache.org/jira/browse/SPARK-24438
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Mukul Murthy
>            Priority: Major
>
> When you partition on a string column that has empty strings and nulls, they are both written to the same default partition. When you read the data back, all those values get read back as null.
> {code:java}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val data = Seq(Row(1, ""), Row(2, ""), Row(3, ""), Row(4, "hello"), Row(5, null))
> val schema = new StructType().add("a", IntegerType).add("b", StringType)
> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> display(df) 
> => 
> a b
> 1 
> 2 
> 3 
> 4 hello
> 5 null
> df.write.mode("overwrite").partitionBy("b").save("/home/mukul/weird_test_data4")
> val df2 = spark.read.load("/home/mukul/weird_test_data4")
> display(df2)
> => 
> a b
> 4 hello
> 3 null
> 2 null
> 1 null
> 5 null
> {code}
> Seems to affect multiple types of tables.

