spark-issues mailing list archives

From "Maxim Gekk (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-15125) CSV data source recognizes empty quoted strings in the input as null.
Date Fri, 25 May 2018 19:52:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-15125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maxim Gekk resolved SPARK-15125.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.0

The issue has been fixed by https://github.com/apache/spark/commit/7a2d4895c75d4c232c377876b61c05a083eab3c8

> CSV data source recognizes empty quoted strings in the input as null. 
> ----------------------------------------------------------------------
>
>                 Key: SPARK-15125
>                 URL: https://issues.apache.org/jira/browse/SPARK-15125
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Suresh Thalamati
>            Priority: Major
>             Fix For: 2.4.0
>
>
> The CSV data source does not differentiate between empty quoted strings and empty fields; it reads both as null. In some scenarios users want to differentiate between these values, especially in the context of SQL, where NULL and the empty string have different meanings. If the input data happens to be a dump from a traditional relational data source, users will see different results for their SQL queries.
> {code}
> Repro:
> Test Data: (test.csv)
> year,make,model,comment,price
> 2017,Tesla,Mode 3,looks nice.,35000.99
> 2016,Chevy,Bolt,"",29000.00
> 2015,Porsche,"",,
> scala> val df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").option("nullValue", null).load("/tmp/test.csv")
> df: org.apache.spark.sql.DataFrame = [year: int, make: string ... 3 more fields]
> scala> df.show
> +----+-------+------+-----------+--------+
> |year|   make| model|    comment|   price|
> +----+-------+------+-----------+--------+
> |2017|  Tesla|Mode 3|looks nice.|35000.99|
> |2016|  Chevy|  Bolt|       null| 29000.0|
> |2015|Porsche|  null|       null|    null|
> +----+-------+------+-----------+--------+
> Expected:
> +----+-------+------+-----------+--------+
> |year|   make| model|    comment|   price|
> +----+-------+------+-----------+--------+
> |2017|  Tesla|Mode 3|looks nice.|35000.99|
> |2016|  Chevy|  Bolt|           | 29000.0|
> |2015|Porsche|      |       null|    null|
> +----+-------+------+-----------+--------+
> {code}
> I am testing a fix for this issue and will try to submit a PR for it soon.
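
The distinction the report asks for can be illustrated outside Spark with a small sketch. The code below is not Spark's CSV parser and not the committed fix; `QuotedEmptyVsNull` and `parseLine` are hypothetical names introduced only for illustration. It assumes comma-separated fields, double-quote quoting, and `""` as the escape for a quote inside a quoted field. An unquoted empty field maps to `None` (standing in for SQL NULL), while a quoted empty string maps to `Some("")`.

```scala
// Hypothetical helper for illustration only; this is NOT Spark's CSV parser.
// It splits one CSV line, mapping an unquoted empty field to None (null)
// and a quoted empty field ("") to Some("") (empty string).
object QuotedEmptyVsNull {
  def parseLine(line: String): Vector[Option[String]] = {
    val fields = Vector.newBuilder[Option[String]]
    val current = new StringBuilder
    var inQuotes = false   // true while scanning inside a quoted section
    var wasQuoted = false  // true if the current field contained quotes
    var i = 0
    while (i < line.length) {
      val c = line.charAt(i)
      if (inQuotes) {
        if (c == '"') {
          if (i + 1 < line.length && line.charAt(i + 1) == '"') {
            // A doubled quote inside a quoted section is an escaped quote.
            current.append('"')
            i += 1
          } else {
            inQuotes = false
          }
        } else {
          current.append(c)
        }
      } else {
        c match {
          case '"' =>
            inQuotes = true
            wasQuoted = true
          case ',' =>
            fields += finish(current, wasQuoted)
            current.clear()
            wasQuoted = false
          case other =>
            current.append(other)
        }
      }
      i += 1
    }
    // Emit the last field, which has no trailing comma.
    fields += finish(current, wasQuoted)
    fields.result()
  }

  // An empty field is null only if it was never quoted.
  private def finish(sb: StringBuilder, wasQuoted: Boolean): Option[String] =
    if (sb.isEmpty && !wasQuoted) None else Some(sb.toString)
}
```

Applied to the third data row of the test file, `QuotedEmptyVsNull.parseLine("2015,Porsche,\"\",,")` yields an empty string for `model` and `None` for both `comment` and `price`, which matches the "Expected" table in the report.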




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
