spark-issues mailing list archives

From "Fu Shanshan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-21359) frequency discretizer
Date Tue, 11 Jul 2017 01:23:02 GMT

    [ https://issues.apache.org/jira/browse/SPARK-21359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081467#comment-16081467 ]

Fu Shanshan commented on SPARK-21359:
-------------------------------------

But consider this example input:
Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2), (5, 1.0), (6, 9.1), (7, 10.1),
(8, 1.1), (9, 16.0), (10, 20.0), (11, 20.0))

QuantileDiscretizer result:
+---+----+------+
| id|hour|result|
+---+----+------+
|  0|18.0|   3.0|
|  1|19.0|   3.0|
|  2| 8.0|   1.0|
|  3| 5.0|   1.0|
|  4| 2.2|   1.0|
|  5| 1.0|   0.0|
|  6| 9.1|   2.0|
|  7|10.1|   2.0|
|  8| 1.1|   0.0|
|  9|16.0|   2.0|
| 10|20.0|   3.0|
| 11|20.0|   3.0|
+---+----+------+

Here the value 18 is assigned to bin 3. I thought that was because it makes equal-width bins:
the bin array would be (0, 5, 10, 15, 20), so 18 falls in the last bin.
In my result, however, 18 should be in bin 2. Under the equal-frequency definition the bin
array is (-inf, 5.0, 10.1, 19, inf or 20), so 18 falls in bin 2 rather than the last bin.
I am not sure whether I have misunderstood the question. Thank you for your patience.
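
To make the comparison concrete, here is a minimal sketch in plain Python (not Spark) that bins the
twelve values above three ways: with equal-width splits, with the equal-frequency bin array proposed
in this comment, and with splits taken at rank ceil(i * n / k) of the sorted values. The helper names,
the rank rule, and the half-open [lower, upper) bin convention are assumptions made for illustration;
this is not a statement of how QuantileDiscretizer is implemented.

```python
import bisect
import math

# Input (id, hour) pairs copied from the example above.
data = [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2), (5, 1.0),
        (6, 9.1), (7, 10.1), (8, 1.1), (9, 16.0), (10, 20.0), (11, 20.0)]
hours = [h for _, h in data]


def bin_index(value, splits):
    """Return i such that splits[i] <= value < splits[i + 1] (half-open bins)."""
    return bisect.bisect_right(splits, value) - 1


def equal_width_splits(values, k):
    """k bins of identical width between min and max, open-ended at both ends."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    inner = [lo + i * width for i in range(1, k)]
    return [-math.inf] + inner + [math.inf]


def rank_quantile_splits(values, k):
    """Assumed rule: inner splits are the sorted values at rank ceil(i * n / k)."""
    s = sorted(values)
    n = len(s)
    inner = [s[-((-i * n) // k) - 1] for i in range(1, k)]
    return [-math.inf] + inner + [math.inf]


# Bin array proposed in the comment, with the last edge taken as +inf.
proposed_splits = [-math.inf, 5.0, 10.1, 19.0, math.inf]

by_width = [bin_index(h, equal_width_splits(hours, 4)) for h in hours]
by_proposed = [bin_index(h, proposed_splits) for h in hours]
by_rank = [bin_index(h, rank_quantile_splits(hours, 4)) for h in hours]

print("equal width :", by_width)
print("proposed    :", by_proposed)
print("rank-based  :", by_rank)
```

Under these assumptions the rank-based splits come out as (-inf, 2.2, 9.1, 18.0, inf), and the
resulting bin indices match the result column in the table above. Because 18.0 is itself a split
point and each bin includes its lower edge, 18 lands in bin 3. The equal-width splits do not
reproduce the table (for example, they put 5.0 in bin 0), and the proposed bin array puts 18 in
bin 2 as described.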

> frequency discretizer
> ---------------------
>
>                 Key: SPARK-21359
>                 URL: https://issues.apache.org/jira/browse/SPARK-21359
>             Project: Spark
>          Issue Type: New JIRA Project
>          Components: ML
>    Affects Versions: 2.1.1
>            Reporter: Fu Shanshan
>
> Typically data is discretized into partitions of K equal lengths/width (equal intervals)
> or K% of the total data (equal frequencies)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


