[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joseph K. Bradley updated SPARK-10788:

Description:
Decision trees in spark.ml (RandomForest.scala) communicate twice as much data as needed for
unordered categorical features. Here's an example.
Say there are 3 categories A, B, C. We consider 3 splits:
* A vs. B, C
* A, B vs. C
* A, C vs. B
Currently, we collect statistics for each of the 6 subsets of categories (3 splits * 2 sides = 6).
However, we could instead collect statistics for only the 3 subsets on the left-hand side of the
3 possible splits: A; A,B; and A,C. If we also have stats for the entire node, then we can compute
the stats for the 3 subsets on the right-hand side of the splits by subtraction. In pseudo-math:
{{stats(B,C) = stats(A,B,C) - stats(A)}}.
We should eliminate these extra bins within the spark.ml implementation since the spark.mllib
implementation will be removed before long (and will instead call into spark.ml).
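
The subtraction described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the spark.ml code: the function names, the per-category label counts, and the use of plain lists as "stats" are all assumptions made for the example.

```python
from itertools import combinations

def left_hand_subsets(categories):
    """List one left-hand subset per distinct split (hypothetical helper).

    Every subset contains the first category and is a proper subset,
    so M categories yield 2^(M-1) - 1 subsets.
    """
    first, rest = categories[0], categories[1:]
    subsets = []
    for size in range(len(rest)):  # stops before the full set
        for extra in combinations(rest, size):
            subsets.append((first,) + extra)
    return subsets

def subset_stats(per_category, subset):
    """Sum the per-category label counts over the categories in `subset`."""
    num_classes = len(next(iter(per_category.values())))
    totals = [0] * num_classes
    for category in subset:
        for k, count in enumerate(per_category[category]):
            totals[k] += count
    return totals

def right_hand_stats(node_stats, left_stats):
    """stats(right) = stats(node) - stats(left), element by element."""
    return [n - l for n, l in zip(node_stats, left_stats)]

# Made-up label counts [class 0, class 1] for each category at one node.
per_category = {"A": [3, 1], "B": [2, 2], "C": [0, 4]}
node_stats = subset_stats(per_category, ("A", "B", "C"))

for left in left_hand_subsets(["A", "B", "C"]):
    left_stats = subset_stats(per_category, left)
    print(left, left_stats, right_hand_stats(node_stats, left_stats))
```

With this scheme only the 3 left-hand bins plus the node total would need to be aggregated, and the 3 right-hand bins are derived locally.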
was:
Decision trees in spark.ml (RandomForest.scala) effectively create a second copy of each
split. E.g., if there are 3 categories A, B, C, then we should consider 3 splits:
* A vs. B, C
* A, B vs. C
* A, C vs. B
Currently, we also consider the 3 flipped splits:
* B,C vs. A
* C vs. A, B
* B vs. A, C
This means we communicate twice as much data as needed for these features.
We should eliminate these duplicate splits within the spark.ml implementation since the spark.mllib
implementation will be removed before long (and will instead call into spark.ml).
> Decision Tree duplicates bins for unordered categorical features
> 
>
> Key: SPARK-10788
> URL: https://issues.apache.org/jira/browse/SPARK-10788
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Joseph K. Bradley

