spark-issues mailing list archives

From "Xiangrui Meng (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
Date Wed, 16 Mar 2016 07:40:33 GMT

     [ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-10788:
----------------------------------
    Target Version/s: 2.0.0

> Decision Tree duplicates bins for unordered categorical features
> ----------------------------------------------------------------
>
>                 Key: SPARK-10788
>                 URL: https://issues.apache.org/jira/browse/SPARK-10788
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Joseph K. Bradley
>            Assignee: Seth Hendrickson
>            Priority: Minor
>
> Decision trees in spark.ml (RandomForest.scala) communicate twice as much data as needed
> for unordered categorical features.  Here's an example.
> Say there are 3 categories A, B, C.  We consider 3 splits:
> * A vs. B, C
> * A, B vs. C
> * A, C vs. B
> Currently, we collect statistics for each of the 6 subsets of categories (3 * 2 = 6).
> However, we could instead collect statistics for the 3 subsets on the left-hand side of the
> 3 possible splits: A and A,B and A,C.  If we also have stats for the entire node, then we
> can compute the stats for the 3 subsets on the right-hand side of the splits. In pseudomath:
> {{stats(B,C) = stats(A,B,C) - stats(A)}}.
> We should eliminate these extra bins within the spark.ml implementation since the spark.mllib
> implementation will be removed before long (and will instead call into spark.ml).
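
The bookkeeping described in the issue can be illustrated with a minimal Python sketch. It is not the RandomForest.scala code: the helper names (left_subsets, stats, right_stats) are hypothetical, and the statistic is assumed to be a plain vector of per-class label counts. It collects stats only for the left-hand side of each split and recovers the right-hand side by subtracting from the node total.

```python
from itertools import combinations


def left_subsets(categories):
    """Enumerate the left-hand side of every unordered split.

    Pinning the first category to the left side keeps each split from
    being listed twice (once per orientation).
    """
    first, rest = categories[0], categories[1:]
    subsets = []
    # Stop before len(rest): the left side must not contain every category.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            subsets.append(frozenset((first,) + extra))
    return subsets


def stats(rows, subset, num_classes):
    """Per-class label counts over rows whose category is in `subset`."""
    counts = [0] * num_classes
    for category, label in rows:
        if category in subset:
            counts[label] += 1
    return counts


def right_stats(node_stats, left_stats):
    """stats(right) = stats(node) - stats(left), element-wise."""
    return [n - l for n, l in zip(node_stats, left_stats)]


# Toy data (made up for illustration): (category, class label) pairs.
categories = ["A", "B", "C"]
rows = [("A", 0), ("A", 1), ("B", 0), ("C", 1), ("C", 1)]

node = stats(rows, frozenset(categories), num_classes=2)
for left in left_subsets(categories):
    left_counts = stats(rows, left, num_classes=2)
    # Only left_counts and node need to be collected; the right side is derived.
    print(sorted(left), left_counts, right_stats(node, left_counts))
```

With 3 categories this enumerates the 3 left-hand subsets named in the issue (A; A,B; A,C), so 3 bins plus the node total stand in for the 6 bins collected today.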



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


