Date: Wed, 4 Jul 2018 13:51:00 +0000 (UTC)
From: "Wenchen Fan (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Commented] (SPARK-24438) Empty strings and null strings are written to the same partition

    [ https://issues.apache.org/jira/browse/SPARK-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532775#comment-16532775 ]

Wenchen Fan commented on SPARK-24438:
-------------------------------------

AFAIK this is the same behavior as in Hive. Null and empty string are both invalid partition values, so they are treated the same when used as partition values.

cc [~gatorsmile] [~dongjoon]

> Empty strings and null strings are written to the same partition
> ----------------------------------------------------------------
>
>                 Key: SPARK-24438
>                 URL: https://issues.apache.org/jira/browse/SPARK-24438
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Mukul Murthy
>            Priority: Major
>
> When you partition on a string column that has empty strings and nulls, they are both written to the same default partition. When you read the data back, all those values get read back as null.
> {code:java}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.encoders.RowEncoder
>
> // Three empty strings, one regular value, and one null in column "b"
> val data = Seq(Row(1, ""), Row(2, ""), Row(3, ""), Row(4, "hello"), Row(5, null))
> val schema = new StructType().add("a", IntegerType).add("b", StringType)
> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> display(df)
> =>
> a  b
> 1
> 2
> 3
> 4  hello
> 5  null
>
> // Write partitioned by "b", then read the data back
> df.write.mode("overwrite").partitionBy("b").save("/home/mukul/weird_test_data4")
> val df2 = spark.read.load("/home/mukul/weird_test_data4")
> display(df2)
> =>
> a  b
> 4  hello
> 3  null
> 2  null
> 1  null
> 5  null
> {code}
> Seems to affect multiple types of tables.
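
The round trip described above can be sketched in plain Scala without a Spark dependency. This is only an illustration of the reported behavior, not Spark's actual code: the object and helper names (`PartitionValueSketch`, `toPartitionDir`, `fromPartitionDir`) are hypothetical, and the sketch assumes the default partition directory is named `__HIVE_DEFAULT_PARTITION__` and ignores the path escaping a real writer would apply.

```scala
// Plain-Scala sketch of the behavior reported in SPARK-24438.
// All names here are illustrative and are not Spark APIs.
object PartitionValueSketch {
  // Assumed name of the default partition directory (the Hive-compatible default).
  val DefaultPartitionName = "__HIVE_DEFAULT_PARTITION__"

  // Hypothetical helper: build the directory name for one partition value.
  // Null and the empty string both fall back to the default partition.
  def toPartitionDir(column: String, value: String): String = {
    val encoded =
      if (value == null || value.isEmpty) DefaultPartitionName else value
    s"$column=$encoded"
  }

  // Hypothetical helper: recover the partition value from a directory name.
  // The default partition can only be read back as null, so the distinction
  // between null and "" is lost at this point.
  def fromPartitionDir(dir: String): String = {
    val raw = dir.substring(dir.indexOf('=') + 1)
    if (raw == DefaultPartitionName) null else raw
  }

  def main(args: Array[String]): Unit = {
    // Same data as in the issue: three empty strings, one value, one null.
    val rows: Seq[(Int, String)] =
      Seq((1, ""), (2, ""), (3, ""), (4, "hello"), (5, null))

    // Simulate "write partitioned by b, then read back".
    val readBack = rows.map { case (a, b) =>
      (a, fromPartitionDir(toPartitionDir("b", b)))
    }
    readBack.foreach(println)
  }
}
```

In this sketch, rows 1, 2, 3, and 5 all map to the same directory and come back with `b` set to null, while row 4 keeps "hello", which matches the output shown in the issue.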