spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-17219) QuantileDiscretizer should handle NaN values gracefully
Date Tue, 18 Oct 2016 17:53:58 GMT

     [ https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-17219:
--------------------------------------
    Shepherd: Joseph K. Bradley

> QuantileDiscretizer should handle NaN values gracefully
> -------------------------------------------------------
>
>                 Key: SPARK-17219
>                 URL: https://issues.apache.org/jira/browse/SPARK-17219
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Barry Becker
>            Assignee: Vincent
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that contains NaNs,
it will create (possibly more than one) NaN split(s) before the final PositiveInfinity value.
> I am using the attached titanic csv data and trying to bin the "age" column using the
QuantileDiscretizer with 10 bins specified. The age column has a lot of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive number and less
than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket all their
own. My suggestion would be to include an initial NaN split value that is always there, just
like the sentinel Infinities are. If that were the case, then the splits for the example above
might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't make much
sense. Not sure if the NaN bucket counts toward numBins or not. I do think it should always
be there though in case future data has null even though the fit data did not. Thoughts?
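
For illustration only, here is a minimal Python sketch of the behaviour described above and of one way to give NaN its own bucket. It is not Spark's code: `toy_splits`, `bucket_index`, and the sample `ages` list are hypothetical names and data, and the exact-quantile routine stands in for whatever approximate algorithm QuantileDiscretizer actually uses. It assumes NaN is ordered after every other double (the total order that java.lang.Double.compare defines) and that duplicate splits are dropped with a plain `!=` check, which never treats NaN as equal to NaN.

```python
import math
from bisect import bisect_right


def nan_last_key(x):
    # Assumption: NaN sorts after every real value, including +inf.
    return (1, 0.0) if math.isnan(x) else (0, x)


def toy_splits(values, num_bins, skip_nan):
    """Exact-quantile split points framed by -inf/+inf sentinels (toy version)."""
    data = sorted(
        (v for v in values if not (skip_nan and math.isnan(v))),
        key=nan_last_key,
    )
    inner = []
    for i in range(1, num_bins):
        q = data[(i * len(data)) // num_bins]
        # Naive de-duplication: NaN != NaN is always True, so repeated
        # NaN quantiles are all kept.
        if not inner or inner[-1] != q:
            inner.append(q)
    return [-math.inf] + inner + [math.inf]


def bucket_index(value, splits):
    """Bucket for `value`; NaN gets one extra bucket after the regular ones."""
    regular = len(splits) - 1
    if math.isnan(value):
        return regular
    # bisect_right yields [lo, hi) buckets; clamp so +inf lands in the last one.
    return min(bisect_right(splits, value) - 1, regular - 1)


# Hypothetical data: 8 known ages and 8 missing ones encoded as NaN.
ages = [22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0] + [math.nan] * 8

naive = toy_splits(ages, 4, skip_nan=False)  # NaN leaks into the upper splits
clean = toy_splits(ages, 4, skip_nan=True)   # splits fitted on real values only

print(naive)
print(clean)
print(bucket_index(30.0, clean), bucket_index(math.nan, clean))
```

In this sketch the NaN bucket is appended after the regular buckets and does not count toward `num_bins`; it exists even when the fitting data contains no NaN, which matches the "always there" idea in the description. Whether the real implementation should count it toward numBins is the open question raised above.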



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


