spark-issues mailing list archives

From "Marco Gaido (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-24438) Empty strings and null strings are written to the same partition
Date Mon, 09 Jul 2018 08:38:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16536682#comment-16536682 ]

Marco Gaido edited comment on SPARK-24438 at 7/9/18 8:37 AM:
-------------------------------------------------------------

IIRC, Hive has a placeholder string (\_\_HIVE_DEFAULT_PARTITION\_\_) for null value in partitions.


was (Author: mgaido):
IIRC, Hive has a placeholder string (__HIVE_DEFAULT_PARTITION__) for null value in partitions.

> Empty strings and null strings are written to the same partition
> ----------------------------------------------------------------
>
>                 Key: SPARK-24438
>                 URL: https://issues.apache.org/jira/browse/SPARK-24438
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Mukul Murthy
>            Priority: Major
>
> When you partition on a string column that has empty strings and nulls, they are both
> written to the same default partition. When you read the data back, all those values get read
> back as null.
> {code:java}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.encoders.RowEncoder
> val data = Seq(Row(1, ""), Row(2, ""), Row(3, ""), Row(4, "hello"), Row(5, null))
> val schema = new StructType().add("a", IntegerType).add("b", StringType)
> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> display(df) 
> => 
> a b
> 1 
> 2 
> 3 
> 4 hello
> 5 null
> df.write.mode("overwrite").partitionBy("b").save("/home/mukul/weird_test_data4")
> val df2 = spark.read.load("/home/mukul/weird_test_data4")
> display(df2)
> => 
> a b
> 4 hello
> 3 null
> 2 null
> 1 null
> 5 null
> {code}
> Seems to affect multiple types of tables.
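
The collapse described in the report can be illustrated with a minimal sketch in plain Scala, without Spark. It assumes that null and empty-string partition values are both encoded with the Hive placeholder mentioned in the comment above, and that the placeholder is always decoded as null on read. The names `PartitionPathSketch`, `partitionDir`, and `decode` are hypothetical and are not Spark APIs; this is a simplified model, not Spark's actual implementation.

```scala
// Simplified model of how a partition value might become a directory name.
// All names here are illustrative assumptions, not Spark's own code.
object PartitionPathSketch {
  // Placeholder string cited in the comment for null partition values.
  val DefaultPartitionName = "__HIVE_DEFAULT_PARTITION__"

  // Assumption: null and the empty string both map to the placeholder,
  // so the two values end up in the same directory.
  def partitionDir(column: String, value: String): String = {
    val encoded =
      if (value == null || value.isEmpty) DefaultPartitionName else value
    column + "=" + encoded
  }

  // Assumption: on read, the placeholder is decoded as "no value" (null),
  // so the original empty string cannot be recovered.
  def decode(dir: String): Option[String] = {
    val raw = dir.substring(dir.indexOf('=') + 1)
    if (raw == DefaultPartitionName) None else Some(raw)
  }

  def main(args: Array[String]): Unit = {
    // Same three kinds of values as in the report: empty, non-empty, null.
    for (v <- Seq("", "hello", null)) {
      val dir = partitionDir("b", v)
      println(dir + " reads back as " + decode(dir))
    }
  }
}
```

Under these assumptions the empty string and null produce the same directory name, which is why the round trip in the report returns null for rows 1, 2, 3, and 5.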



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

