spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data
Date Thu, 18 Aug 2016 11:07:20 GMT

    [ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426270#comment-15426270 ]

Sean Owen commented on SPARK-17086:
-----------------------------------

I suppose it depends on the desired semantics of QuantileDiscretizer. It sounds like it
would already return fewer buckets than requested in some cases. (That could, or should, be documented.)

That makes it sound like it tries to make the buckets match quantiles of the input, even if
it doesn't guarantee it. The bins you describe here would result in pretty lopsided binning,
but any consistent scheme would behave the same way.
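To see where the lopsidedness comes from, here is an illustrative sketch (plain Scala, not Spark's actual implementation) computing exact deciles of the reporter's sample. With only three distinct values in the data and ten requested buckets, the candidate interior splits repeat:

```scala
// Illustrative only: exact deciles of the reporter's 12-element sample.
val data = Array(1, 3, 2, 1, 1, 2, 3, 2, 2, 2, 1, 3).sorted.map(_.toDouble)
val numBuckets = 10

// Candidate interior split at each i/numBuckets quantile position.
val candidates = (1 until numBuckets).map { i =>
  data((i * data.length) / numBuckets)
}

// candidates is full of repeated 1.0s and 2.0s; only three distinct
// values exist, so most of the ten requested buckets cannot be filled.
println(candidates.mkString(", "))
println(candidates.distinct.mkString(", "))
```

This is why the splits array in the reported error contains duplicates: the quantile positions land on the same values over and over.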

OK, I think I would agree with matching the 1.6.2 behavior then, and documenting that the number
of buckets may be smaller than requested, rather than returning buckets some of which will always
be empty. Let's document it and add a test for it.

I don't think the test should involve the number of distinct input elements (which could be
expensive to compute); you just want to collapse adjacent splits that are equal, right? That
will cover more cases too.
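A minimal sketch of the suggested collapse, assuming the candidate splits are already sorted (the object and method names here are hypothetical, not Spark's actual API):

```scala
// Hypothetical helper: collapse adjacent equal splits so the resulting
// boundaries are strictly increasing, as Bucketizer requires.
object SplitUtil {
  def collapseDuplicateSplits(splits: Array[Double]): Array[Double] =
    // The splits are sorted, so removing adjacent duplicates is the
    // same as keeping the distinct values in order.
    splits.distinct
}
```

Applied to the splits from the reported error, `[-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, Infinity]` would collapse to `[-Infinity, 1.0, 2.0, 3.0, Infinity]`, i.e. four buckets instead of the ten requested.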

> QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value)
on valid data
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17086
>                 URL: https://issues.apache.org/jira/browse/SPARK-17086
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Barry Becker
>
> I discovered this bug when working with a build from the master branch (which I believe
is 2.1.0). This used to work fine when running spark 1.6.2.
> I have a dataframe with an "intData" column that has values like 
> {code}
> 1 3 2 1 1 2 3 2 2 2 1 3
> {code}
> I have a stage in my pipeline that uses the QuantileDiscretizer to produce equal-weight
splits like this:
> {code}
> new QuantileDiscretizer()
>         .setInputCol("intData")
>         .setOutputCol("intData_bin")
>         .setNumBuckets(10)
>         .fit(df)
> {code}
> But when that gets run it (incorrectly) throws this error:
> {code}
> parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, Infinity]
> {code}
> I don't think there should be duplicate splits generated, should there?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
