spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly
Date Wed, 23 Aug 2017 13:49:05 GMT

    [ https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138365#comment-16138365
] 

Hyukjin Kwon commented on SPARK-21820:
--------------------------------------

I think the preferred format is {{format("csv")}} for the built-in Spark CSV source. {{.format("com.databricks.spark.csv")}}
refers to the third-party CSV library in the Databricks repository, which is not meant for Spark
2.x, although we had to make some changes within Spark to map it to Spark's internal source in
such cases, e.g., SPARK-20590. Please avoid reporting JIRAs with {{"com.databricks.spark.csv"}}
in the future, to prevent confusion.
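For reference, the built-in source recommended here would be invoked as below (a sketch only, reusing the path and options from the report further down; it assumes a running {{SparkSession}} named {{spark}}, e.g. in the spark-shell, and is not a snippet from this thread):

```scala
// Sketch: the built-in "csv" source instead of "com.databricks.spark.csv".
// The "parserLib" option from the report is dropped, since univocity is
// already the parser used by the built-in CSV source in Spark 2.x.
val csvFile = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiLine", "true")
  .load("/home/kumar/Desktop/windows_CRLF.csv")
```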

For {{multiLine}} in CSV, the newline is OS-dependent, whereas the TEXT, JSON and CSV datasources
by default handle the common newlines, such as those on Windows and Linux, together via Hadoop's
library, to my knowledge. I proposed a change for a configurable newline - https://github.com/apache/spark/pull/18581.
I expect this will address this problem as well.
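To illustrate the mechanism (a plain-Scala sketch of my own, not Spark's actual parsing code): if records in a CRLF file are split on {{"\n"}} alone, the carriage return stays attached to the last field, which matches the stray quote/shifted header seen in the report below.

```scala
// Sketch: why an LF-only record separator leaves "\r" glued to the last
// column of a Windows (CRLF) CSV header line.
object CrLfSketch {
  def main(args: Array[String]): Unit = {
    val header = "Sales_Dollars,Created_Date,Order_Delivered\r\n"

    // Naive split on "\n" only: the trailing "\r" sticks to the last field.
    val naive = header.split("\n").head.split(",")
    println(naive.last.endsWith("\r")) // true

    // CRLF-aware handling strips the full "\r\n" terminator first.
    val aware = header.stripSuffix("\r\n").split(",")
    println(aware.last) // Order_Delivered
  }
}
```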

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-21820
>                 URL: https://issues.apache.org/jira/browse/SPARK-21820
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Kumaresh C R
>              Labels: features
>         Attachments: windows_CRLF.csv
>
>
> With multiLine=true, Windows CR LF is not parsed properly. If I set multiLine=false, it parses correctly. Could you please help here?
> Attached the CSV used in the below commands for your reference.
> scala> val csvFile = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("parserLib", "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("parserLib", "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

