spark-issues mailing list archives

From "Ondrej Kokes (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-22236) CSV I/O: does not respect RFC 4180
Date Sat, 19 May 2018 15:04:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16481655#comment-16481655 ]

Ondrej Kokes commented on SPARK-22236:
--------------------------------------

There is one more setting that puts Spark's default CSV parser in violation of RFC 4180:
multiLine. It defaults to false, so newlines are always treated as row separators, even though
the CSV format allows newlines inside a field as long as that field is enclosed in double quotes.
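
To illustrate (a minimal sketch, assuming an active SparkSession bound to spark, as in the
reproduction below, and a hypothetical file multiline.csv):

{code}
import csv

# write an RFC 4180-compliant file with a newline inside a quoted field
with open('multiline.csv', 'w', newline='') as f:
    cw = csv.writer(f)
    cw.writerow(['a field\nspanning two lines', 'second column'])

# default (multiLine=False): the quoted newline is treated as a row separator,
# so the single logical record comes back split across two rows
spark.read.csv('multiline.csv').collect()

# multiLine=True parses the record as a single row, at the cost of easy input splitting
spark.read.option('multiLine', True).csv('multiline.csv').collect()
{code}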

Sadly, I think changing multiLine to true by default is less feasible than changing the escape
setting, because multiLine=false keeps the parser easily parallelisable while still handling
the majority of CSV data correctly. Combined with mode=PERMISSIVE, however, this default makes
the parser a landmine: mis-split records pass through silently instead of failing the read.

> CSV I/O: does not respect RFC 4180
> ----------------------------------
>
>                 Key: SPARK-22236
>                 URL: https://issues.apache.org/jira/browse/SPARK-22236
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.2.0
>            Reporter: Ondrej Kokes
>            Priority: Minor
>
> When reading or writing CSV files with Spark, double quotes are escaped with a backslash
by default. However, the appropriate behaviour as set out by RFC 4180 (and adhered to by many
software packages) is to escape using a second double quote.
> This piece of Python code demonstrates the issue:
> {code}
> import csv
> with open('testfile.csv', 'w') as f:
>     cw = csv.writer(f)
>     cw.writerow(['a 2.5" drive', 'another column'])
>     cw.writerow(['a "quoted" string', '"quoted"'])
>     cw.writerow([1,2])
> with open('testfile.csv') as f:
>     print(f.read())
> # "a 2.5"" drive",another column
> # "a ""quoted"" string","""quoted"""
> # 1,2
> spark.read.csv('testfile.csv').collect()
> # [Row(_c0='"a 2.5"" drive"', _c1='another column'),
> #  Row(_c0='"a ""quoted"" string"', _c1='"""quoted"""'),
> #  Row(_c0='1', _c1='2')]
> # explicitly stating the escape character fixed the issue
> spark.read.option('escape', '"').csv('testfile.csv').collect()
> # [Row(_c0='a 2.5" drive', _c1='another column'),
> #  Row(_c0='a "quoted" string', _c1='"quoted"'),
> #  Row(_c0='1', _c1='2')]
> {code}
> The same applies to writes, where reading the file written by Spark may result in garbage.
> {code}
> df = spark.read.option('escape', '"').csv('testfile.csv') # reading the file correctly
> df.write.format("csv").save('testout.csv')
> with open('testout.csv/part-....csv') as f:
>     cr = csv.reader(f)
>     print(next(cr))
>     print(next(cr))
> # ['a 2.5\\ drive"', 'another column']
> # ['a \\quoted\\" string"', '\\quoted\\""']
> {code}
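> A possible workaround on the write side (a sketch, assuming the same SparkSession and testfile.csv
as above, and writing to a hypothetical path testout_rfc.csv): setting the escape option to a double
quote when writing as well produces output that Python's csv module reads back cleanly.
> {code}
> import csv, glob
> # read and write with '"' as the escape character (RFC 4180 style)
> df = spark.read.option('escape', '"').csv('testfile.csv')
> df.write.option('escape', '"').format('csv').save('testout_rfc.csv')
> # the output directory holds part files; a small input typically yields one
> with open(glob.glob('testout_rfc.csv/part-*.csv')[0]) as f:
>     for row in csv.reader(f):
>         print(row)
> # expected output:
> # ['a 2.5" drive', 'another column']
> # ['a "quoted" string', '"quoted"']
> # ['1', '2']
> {code}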
> The culprit is in [CSVOptions.scala|https://github.com/apache/spark/blob/7d0a3ef4ced9684457ad6c5924c58b95249419e1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L91],
where the default escape character is overridden.
> While it's possible to work with CSV files in a "compatible" manner by setting these options
explicitly, it would be useful if Spark had sensible defaults that conform to the above-mentioned
RFC (as well as W3C recommendations). I realise this would be a breaking change, so if accepted it
would probably need to start with a deprecation warning before the default is switched.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
