spark-issues mailing list archives

From "Kuba Tyszko (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-18906) CSV parser should return null for empty (or with "") numeric columns.
Date Fri, 16 Dec 2016 21:53:58 GMT

    [ https://issues.apache.org/jira/browse/SPARK-18906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15755597#comment-15755597 ]

Kuba Tyszko edited comment on SPARK-18906 at 12/16/16 9:53 PM:
---------------------------------------------------------------

Well, in CSV a null can be either an empty field or, as in this case, a dedicated value (NA), but
some data providers also use an empty string ("") to indicate an empty value.

I've looked at JIRA and there were a few requests to allow multiple nullValue settings - but
that seems to be a challenging task.

The patch I'm proposing here enables handling of such "empty integers" in a predictable way.

I understand this may look unclean, but unfortunately some reputable data providers do that,
and there is nothing we can do to stop them.
In fact, Excel, for example, can be set to always quote columns when exporting to CSV. The quoting
can be limited to text columns only, but I don't think we can assume that users won't put numbers
in a text column.

We're dealing with a completely untyped data source, so it's better to be robust.



> CSV parser should return null for empty (or with "") numeric columns.
> ---------------------------------------------------------------------
>
>                 Key: SPARK-18906
>                 URL: https://issues.apache.org/jira/browse/SPARK-18906
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.1
>            Reporter: Kuba Tyszko
>            Priority: Minor
>
> Spark allows the user to set a nullValue that indicates that a certain value should be
> translated to null; for example, the string "NA" could be such a value.
> Data sources that use such a nullValue but also have other columns that may contain empty
> values may not be parsed correctly.
> The change resolves that by assuming that,
> when a column is inferred as numeric,
> its field will be set to null when parsing fails, for example upon seeing an empty value
> or an empty string.
> Example:
> ---------------
> |char|int1|int2|
> ---------------
> |a|1|2|
> ---------------
> |a|  |0|
> ---------------
> |NA|""|""|
> ----------------
> This example illustrates that the column "char" may contain an empty value indicated as "NA";
> column int1 has a "true null" value (second data row), but both int1 and int2 also have an
> empty string set as their value (third data row).
> In such a situation parsing will fail.
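
The rule described in the issue above can be sketched outside Spark to make the intended behaviour concrete. This is only an illustration in plain Python using the standard csv module; it is not Spark's CSV parser and not the proposed patch. The helper name parse_int_or_null, the comma-delimited sample text and the NULL_VALUE constant are assumptions made for this example.

```python
import csv
import io
from typing import Optional

# Assumed null marker, mirroring the "NA" nullValue used in the issue's example.
NULL_VALUE = "NA"

def parse_int_or_null(field: str, null_value: str = NULL_VALUE) -> Optional[int]:
    """Hypothetical helper: return None for the null marker or for any
    field that cannot be parsed as an integer (e.g. an empty string)."""
    if field == null_value:
        return None
    try:
        return int(field)
    except ValueError:
        # Empty fields and quoted empty strings both end up here.
        return None

# Comma-delimited rendering of the table in the issue (an assumption;
# the issue shows the rows with '|' separators).
SAMPLE = 'char,int1,int2\na,1,2\na,,0\nNA,"",""\n'

def load(text: str) -> list:
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({
            # String column: only the null marker becomes None.
            "char": None if rec["char"] == NULL_VALUE else rec["char"],
            # Numeric columns: null marker or parse failure becomes None.
            "int1": parse_int_or_null(rec["int1"]),
            "int2": parse_int_or_null(rec["int2"]),
        })
    return rows

rows = load(SAMPLE)
for row in rows:
    print(row)
```

One trade-off of this approach is that a genuinely malformed value (say, "abc" in a numeric column) is also turned into a null rather than reported as an error, which is the "robust rather than strict" choice argued for in the comment.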



