spark-issues mailing list archives

From Yan Facai (颜发才) (JIRA) <j...@apache.org>
Subject [jira] [Updated] (SPARK-16957) Use weighted midpoints for split values.
Date Sun, 23 Apr 2017 05:04:04 GMT

     [ https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Facai (颜发才) updated SPARK-16957:
------------------------------------
    Description: 
We should be using weighted split points rather than the actual continuous binned feature
values. For instance, in a dataset containing binary features (that are fed in as continuous
ones), our splits are selected as {{x <= 0.0}} and {{x > 0.0}}. For any real data with some
smoothness properties, this is asymptotically bad compared to GBM's approach. The split point
should instead be a weighted midpoint between the two values of the "innermost" feature bins;
e.g., with 30 samples at {{x = 0}} and 10 at {{x = 1}}, the above split should land at {{0.75}}.

Example:
{code}
+--------+--------+-----+-----+
|feature0|feature1|label|count|
+--------+--------+-----+-----+
|     0.0|     0.0|  0.0|   23|
|     1.0|     0.0|  0.0|    2|
|     0.0|     0.0|  1.0|    2|
|     0.0|     1.0|  0.0|    7|
|     1.0|     0.0|  1.0|   23|
|     0.0|     1.0|  1.0|   18|
|     1.0|     1.0|  1.0|    7|
|     1.0|     1.0|  0.0|   18|
+--------+--------+-----+-----+

DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
  If (feature 0 <= 0.0)
   If (feature 1 <= 0.0)
    Predict: -0.56
   Else (feature 1 > 0.0)
    Predict: 0.29333333333333333
  Else (feature 0 > 0.0)
   If (feature 1 <= 0.0)
    Predict: 0.56
   Else (feature 1 > 0.0)
    Predict: -0.29333333333333333
{code}
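
For illustration, here is a minimal Scala sketch of the weighted-midpoint calculation proposed above. The object and method names are hypothetical and not part of Spark's actual split-finding code; it only demonstrates the arithmetic.
{code}
// Hypothetical sketch, not Spark's internal API: compute candidate split
// thresholds as count-weighted points between adjacent distinct feature values,
// instead of placing the threshold on the lower value itself.
object WeightedSplits {

  /** valueCounts: distinct values of a continuous feature paired with their
    * sample counts, e.g. Seq((0.0, 30L), (1.0, 10L)).
    * Returns one candidate threshold per adjacent pair of values. */
  def weightedSplitPoints(valueCounts: Seq[(Double, Long)]): Seq[Double] = {
    val sorted = valueCounts.sortBy(_._1)
    sorted.sliding(2).collect { case Seq((vLo, nLo), (vHi, nHi)) =>
      // Weight each value by the count of the *other* bin, so the threshold
      // lands farther from the denser bin: with 30 samples at 0.0 and 10 at
      // 1.0 this gives (30 * 1.0 + 10 * 0.0) / 40 = 0.75, matching the
      // example in the description.
      (nLo * vHi + nHi * vLo) / (nLo + nHi).toDouble
    }.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Prints List(0.75) for the 30-vs-10 example above.
    println(weightedSplitPoints(Seq((0.0, 30L), (1.0, 10L))))
  }
}
{code}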

  was:
Just like R's gbm, we should be using weighted split points rather than the actual continuous
binned feature values. For instance, in a dataset containing binary features (that are fed
in as continuous ones), our splits are selected as {{x <= 0.0}} and {{x > 0.0}}. For
any real data with some smoothness qualities, this is asymptotically bad compared to GBM's
approach. The split point should be a weighted split point of the two values of the "innermost"
feature bins; e.g., if there are 30 {{x = 0}} and 10 {{x = 1}}, the above split should be
at {{0.75}}.

Example:
{code}
+--------+--------+-----+-----+
|feature0|feature1|label|count|
+--------+--------+-----+-----+
|     0.0|     0.0|  0.0|   23|
|     1.0|     0.0|  0.0|    2|
|     0.0|     0.0|  1.0|    2|
|     0.0|     1.0|  0.0|    7|
|     1.0|     0.0|  1.0|   23|
|     0.0|     1.0|  1.0|   18|
|     1.0|     1.0|  1.0|    7|
|     1.0|     1.0|  0.0|   18|
+--------+--------+-----+-----+

DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
  If (feature 0 <= 0.0)
   If (feature 1 <= 0.0)
    Predict: -0.56
   Else (feature 1 > 0.0)
    Predict: 0.29333333333333333
  Else (feature 0 > 0.0)
   If (feature 1 <= 0.0)
    Predict: 0.56
   Else (feature 1 > 0.0)
    Predict: -0.29333333333333333
{code}


> Use weighted midpoints for split values.
> ----------------------------------------
>
>                 Key: SPARK-16957
>                 URL: https://issues.apache.org/jira/browse/SPARK-16957
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Vladimir Feinberg
>            Priority: Trivial



