spark-issues mailing list archives

From "Liwei Lin (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-16460) Spark 2.0 CSV ignores NULL value in Date format
Date Sat, 09 Jul 2016 14:33:11 GMT

    [ https://issues.apache.org/jira/browse/SPARK-16460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15369085#comment-15369085 ]

Liwei Lin edited comment on SPARK-16460 at 7/9/16 2:33 PM:
-----------------------------------------------------------

Hi, [~marcelboldt]. Thanks for reporting this! I will submit a patch shortly.

A Scala reproducer (for reviewers):
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object SPARK_16460 extends App {

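  // Read test.csv (assumed to hold the row from the issue description: 1|1998-01-01||)
  // with an explicit schema, so the empty third field has to be cast to DateType.
  val sdf = SparkSession.builder().master("local").getOrCreate().read
    .schema(StructType(List(
      StructField("id", IntegerType),
      StructField("d", DateType),
      StructField("dtwo", DateType))))
    .option("inferSchema", false.toString)
    .option("delimiter", "|")
    .option("dateFormat", "yyyy-MM-dd")
    .option("nullValue", "") // empty fields are expected to be read as null
    .option("mode", "PERMISSIVE")
    .csv("test.csv")

  // On the affected version this is expected to fail with
  // java.text.ParseException: Unparseable date: ""
  sdf.show(1)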

}
{code}
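
For illustration only, the failure in the quoted stack trace (java.text.ParseException: Unparseable date: "") can be shown outside Spark with plain java.text. The sketch below is not the patch for this issue and not Spark's API: the class NullSafeDateCast and its methods castToDate and unguardedParseFails are hypothetical names. It only shows that handing an empty field straight to a date parser throws, and that comparing the field against the configured nullValue first avoids the call.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical illustration class; none of these names exist in Spark.
class NullSafeDateCast {

    // Returns null when the field equals the configured nullValue;
    // only non-null fields are handed to the date parser.
    static Date castToDate(String field, String nullValue, String dateFormat) {
        if (field == null || field.equals(nullValue)) {
            return null;
        }
        try {
            return new SimpleDateFormat(dateFormat).parse(field);
        } catch (ParseException e) {
            // Rethrow unchecked so callers do not need a throws clause.
            throw new IllegalArgumentException("Unparseable date: \"" + field + "\"", e);
        }
    }

    // Returns true if parsing the raw field with no null check throws ParseException.
    static boolean unguardedParseFails(String field, String dateFormat) {
        try {
            new SimpleDateFormat(dateFormat).parse(field);
            return false;
        } catch (ParseException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String format = "yyyy-MM-dd";
        // The empty third column from the reported row 1|1998-01-01||
        System.out.println("unguarded parse of \"\" fails: " + unguardedParseFails("", format));
        System.out.println("guarded cast of \"\": " + castToDate("", "", format));
        System.out.println("guarded cast of 1998-01-01 is non-null: "
                + (castToDate("1998-01-01", "", format) != null));
    }
}
```

With nullValue set to the empty string, the guarded version yields null for the empty column, which matches the NA that the reporter sees on Spark 1.6.2.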


was (Author: proflin):
Hi, [~marcelboldt]. Thanks for reporting this! I will submit a patch shortly.

> Spark 2.0 CSV ignores NULL value in Date format
> -----------------------------------------------
>
>                 Key: SPARK-16460
>                 URL: https://issues.apache.org/jira/browse/SPARK-16460
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>         Environment: SparkR
>            Reporter: Marcel Boldt
>            Priority: Minor
>
> Trying to read a CSV file into Spark (using SparkR) that contains just this data row:
> {code}
>     1|1998-01-01||
> {code}
> Using Spark 1.6.2 (Hadoop 2.6) gives me 
> {code}
>     > head(sdf)
>       id          d dtwo
>     1  1 1998-01-01   NA
> {code}
> Spark 2.0 preview (Hadoop 2.7, Rev. 14308) fails with error: 
> {panel}
> > Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.text.ParseException: Unparseable date: ""
> 	at java.text.DateFormat.parse(DateFormat.java:357)
> 	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:289)
> 	at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:98)
> 	at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:74)
> 	at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
> 	at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
> 	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> 	at scala.collection.Iterator$$anon$12.hasNext(Itera...
> {panel}
> The problem does seem to be the NULL value here: with a valid date in the third CSV column, the read works.
> R code:
> {code}
>     #Sys.setenv(SPARK_HOME = 'c:/spark/spark-1.6.2-bin-hadoop2.6') 
>     Sys.setenv(SPARK_HOME = 'C:/spark/spark-2.0.0-preview-bin-hadoop2.7')
>     .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
>     library(SparkR)
>     
>     sc <-
>         sparkR.init(
>             master = "local",
>             sparkPackages = "com.databricks:spark-csv_2.11:1.4.0"
>         )
>     sqlContext <- sparkRSQL.init(sc)
>     
>     
>     st <- structType(structField("id", "integer"), structField("d", "date"), structField("dtwo", "date"))
>     
>     sdf <- read.df(
>         sqlContext,
>         path = "d:/date_test.csv",
>         source = "com.databricks.spark.csv",
>         schema = st,
>         inferSchema = "false",
>         delimiter = "|",
>         dateFormat = "yyyy-MM-dd",
>         nullValue = "",
>         mode = "PERMISSIVE"
>     )
>     
>     head(sdf)
>     
>     sparkR.stop()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


