spark-issues mailing list archives

From "Xiangrui Meng (JIRA)" <>
Subject [jira] [Resolved] (SPARK-3156) DecisionTree: Order categorical features adaptively
Date Mon, 08 Sep 2014 16:49:30 GMT


Xiangrui Meng resolved SPARK-3156.
       Resolution: Fixed
    Fix Version/s: 1.2.0

> DecisionTree: Order categorical features adaptively
> ---------------------------------------------------
>                 Key: SPARK-3156
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>             Fix For: 1.2.0
> Improvement: accuracy
> Currently, ordered categorical features use a fixed bin ordering chosen before training
> based on a subsample of the data.  (See the code using centroids in findSplitsBins().)
> Proposal: Choose the ordering adaptively for every split.  This would require a bit more
> computation on the master, but could improve results by splitting more intelligently.
> Required changes: The result of aggregation is used in findAggForOrderedFeatureClassification()
> to compute running totals over the pre-set ordering of categorical feature values.  The stats
> should instead be used to choose a new ordering of categories, before computing running totals.
> Clarification:
> Choosing a new ordering at every node is actually more accurate, and it is required for
> the method to carry guarantees, rather than being a heuristic, for regression and binary
> classification.  It does mean that a different set of splits may be considered at each node,
> but that set is tailored specifically to the node and should give better results.
> As for computation, it does require a sort, but that should be cheap as long as the
> number of categories for any feature is not too large.  In my tests, much more time
> (10x - 100x) is spent on the aggregation than on the master, so the sort is not an issue
> for categorical features with a smallish number of categories.
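
The per-node reordering described above can be illustrated with a minimal sketch. This is not the MLlib implementation: the function names, the shape of the per-node statistics (category -> (count, sum of labels)), and the example numbers are all assumptions made up for illustration. It sorts the categories by their mean label at the node and only then computes running totals, so two nodes can end up with different orderings.

```python
from typing import Dict, List, Tuple

# Hypothetical per-node aggregate: category -> (count, sum of labels).
Stats = Dict[int, Tuple[float, float]]


def order_categories(stats: Stats) -> List[int]:
    """Order categories by mean label at this node.

    Categories with no samples are treated as having mean 0.0 (an
    assumption of this sketch); ties are broken by category id.
    """
    def mean(cat: int) -> float:
        count, total = stats[cat]
        return total / count if count > 0 else 0.0

    return sorted(stats, key=lambda cat: (mean(cat), cat))


def running_totals(stats: Stats, ordering: List[int]) -> List[Tuple[float, float]]:
    """Cumulative (count, label sum) over the categories in the given order."""
    totals = []
    count_acc = 0.0
    sum_acc = 0.0
    for cat in ordering:
        count, total = stats[cat]
        count_acc += count
        sum_acc += total
        totals.append((count_acc, sum_acc))
    return totals


def candidate_splits(stats: Stats):
    """Candidate split i sends the first i+1 ordered categories left."""
    ordering = order_categories(stats)
    totals = running_totals(stats, ordering)
    return [
        (ordering[: i + 1], ordering[i + 1 :], totals[i])
        for i in range(len(ordering) - 1)
    ]


# Made-up statistics for the same feature at two different nodes.
node_a: Stats = {0: (10, 2.0), 1: (5, 4.0), 2: (8, 4.0)}
node_b: Stats = {0: (4, 3.6), 1: (6, 0.6), 2: (2, 1.0)}

order_a = order_categories(node_a)  # means 0.2, 0.8, 0.5
order_b = order_categories(node_b)  # means 0.9, 0.1, 0.5
splits_a = candidate_splits(node_a)
```

With k categories this costs one O(k log k) sort per feature per node and yields k - 1 candidate splits, which is consistent with the remark that the sort stays cheap when the number of categories is small.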
