spark-issues mailing list archives

From "Xiangrui Meng (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-3156) DecisionTree: Order categorical features adaptively
Date Mon, 08 Sep 2014 16:49:30 GMT

     [ https://issues.apache.org/jira/browse/SPARK-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-3156.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 1.2.0

> DecisionTree: Order categorical features adaptively
> ---------------------------------------------------
>
>                 Key: SPARK-3156
>                 URL: https://issues.apache.org/jira/browse/SPARK-3156
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>             Fix For: 1.2.0
>
>
> Improvement: accuracy
> Currently, ordered categorical features use a fixed bin ordering that is chosen before
> training from a subsample of the data. (See the code using centroids in findSplitsBins().)
> Proposal: Choose the ordering adaptively for every split. This would require a bit more
> computation on the master, but could improve results by splitting more intelligently.
> Required changes: The result of aggregation is used in findAggForOrderedFeatureClassification()
> to compute running totals over the pre-set ordering of categorical feature values. The stats
> should instead be used to choose a new ordering of categories before computing running totals.
> Clarification:
> It is actually more accurate to choose a new ordering at every node (and doing so is required
> for the method to have guarantees, rather than being a heuristic, for regression and binary
> classification). It does mean that a different set of splits may be considered at each node,
> but that set is tailored specifically to that node and should give better results.
> As for computation, it does require a sort, but that should be cheap as long as the number
> of categories for any feature is not too large. In my tests, much more time (10x - 100x) is
> spent on the aggregation than on the master, so this is not an issue for categorical
> features with a smallish number of categories.
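
The per-node ordering described above can be illustrated with a minimal sketch. This is not MLlib code: the function names, the dictionary layout of the aggregated stats, and the example counts are all hypothetical. It assumes binary labels, Gini impurity, and at least one example per category at the node. Categories are sorted by their positive-label fraction at this node, running totals are scanned over that ordering, and the result is compared against an exhaustive search over all category subsets.

```python
from itertools import combinations


def gini(n0, n1):
    """Gini impurity of a node holding n0 negatives and n1 positives."""
    n = n0 + n1
    if n == 0:
        return 0.0
    p = n1 / n
    return 2.0 * p * (1.0 - p)


def weighted_impurity(left, right):
    """Size-weighted Gini impurity of a split; each side is an (n0, n1) pair."""
    n_left, n_right = sum(left), sum(right)
    return (n_left * gini(*left) + n_right * gini(*right)) / (n_left + n_right)


def best_ordered_split(stats):
    """stats maps category -> (n0, n1), aggregated at ONE node (hypothetical layout).

    Orders categories by positive fraction at this node, then evaluates the
    k - 1 prefix splits using running totals. Assumes every category is non-empty.
    """
    order = sorted(stats, key=lambda c: stats[c][1] / sum(stats[c]))
    tot0 = sum(s[0] for s in stats.values())
    tot1 = sum(s[1] for s in stats.values())
    best = (float("inf"), None)
    run0 = run1 = 0
    for i, cat in enumerate(order[:-1]):
        run0 += stats[cat][0]
        run1 += stats[cat][1]
        imp = weighted_impurity((run0, run1), (tot0 - run0, tot1 - run1))
        if imp < best[0]:
            best = (imp, frozenset(order[: i + 1]))
    return best


def best_exhaustive_split(stats):
    """Reference search over every non-trivial subset of categories."""
    cats = list(stats)
    tot0 = sum(s[0] for s in stats.values())
    tot1 = sum(s[1] for s in stats.values())
    best = (float("inf"), None)
    for r in range(1, len(cats)):
        for left in combinations(cats, r):
            l0 = sum(stats[c][0] for c in left)
            l1 = sum(stats[c][1] for c in left)
            imp = weighted_impurity((l0, l1), (tot0 - l0, tot1 - l1))
            if imp < best[0]:
                best = (imp, frozenset(left))
    return best


# Made-up per-node counts for a four-category feature.
node_stats = {"a": (8, 2), "b": (1, 9), "c": (5, 5), "d": (3, 7)}
ordered_impurity, ordered_left = best_ordered_split(node_stats)
exhaustive_impurity, _ = best_exhaustive_split(node_stats)
print(ordered_impurity, sorted(ordered_left), exhaustive_impurity)
```

The ordered scan needs one sort plus k - 1 impurity evaluations for k categories, while the exhaustive search grows exponentially in k. For binary labels the two should reach the same best impurity, provided the ordering is computed from the stats of the node being split.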



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

