spark-dev mailing list archives

From Reynold Xin <r...@databricks.com>
Subject Re: changed behavior for csv datasource and quoting in spark 2.0.0-SNAPSHOT
Date Thu, 26 May 2016 23:07:05 GMT
This is unfortunately due to the way we handle default values in
Python. I agree it doesn't follow the principle of least astonishment.

Maybe the best thing to do here is to put the actual default values in the
Python API for csv (and json, parquet, etc), rather than using None in
Python. This would require us to duplicate the default values (once in the
data source options and once in the Python API), but that's probably OK
given they shouldn't change often.
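
To make the trade-off concrete, here is a minimal sketch in plain Python. It is not PySpark's actual code: `DATA_SOURCE_DEFAULTS`, `resolve_with_none_default`, and `resolve_with_explicit_default` are hypothetical names used only for illustration. It shows why a `None` default cannot be told apart from "option not set", and how spelling out the default in the signature frees `None` to be passed through as a distinct value. What the data source then does with `None` (for example, disabling quoting) is an assumption, not something the sketch decides.

```python
# Hypothetical sketch; these names are illustrative, not PySpark's API.
DATA_SOURCE_DEFAULTS = {"sep": ",", "quote": '"'}


def resolve_with_none_default(quote=None):
    """None doubles as 'not set', so the data source default always wins."""
    options = dict(DATA_SOURCE_DEFAULTS)
    if quote is not None:
        options["quote"] = quote
    return options


def resolve_with_explicit_default(quote='"'):
    """The default is duplicated in the signature, so an explicit None
    is forwarded as-is instead of being swallowed."""
    options = dict(DATA_SOURCE_DEFAULTS)
    options["quote"] = quote
    return options


# Passing None explicitly silently falls back to the default quote character.
print(resolve_with_none_default(quote=None))
# Passing None explicitly is preserved and can be interpreted downstream.
print(resolve_with_explicit_default(quote=None))
```

The cost is the one described above: the default `'"'` now lives in two places and has to be kept in sync by hand.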

Ticket: https://issues.apache.org/jira/browse/SPARK-15585




On Thu, May 26, 2016 at 3:35 PM, Koert Kuipers <koert@tresata.com> wrote:

> in spark 1.6.1 we used:
>  sqlContext.read
>       .format("com.databricks.spark.csv")
>       .option("delimiter", "~")
>       .option("quote", null)
>
> this effectively turned off quoting, which is a necessity for certain data
> formats where quoting is not supported and "\"" is a valid character itself
> in the data.
>
> in spark 2.0.0-SNAPSHOT we did the same thing:
>  sqlContext.read
>       .format("csv")
>       .option("delimiter", "~")
>       .option("quote", null)
>
> but this did not work. we got weird blowups where spark was trying to
> parse thousands of lines as if they were one record. the reason was that a
> (valid) quote character ("\"") was present in the data. for example
> a~b"c~d
>
> as it turns out, setting quote to null does not turn off quoting anymore.
> instead it means to use the default quote character.
>
> does anyone know how to turn off quoting now?
>
> our current workaround is:
>  sqlContext.read
>       .format("csv")
>       .option("delimiter", "~")
>       .option("quote", "☃")
>
> (we assume there are no unicode snowmen in our data...)
>
>
>
