mahout-dev mailing list archives

From "Hudson (JIRA)" <>
Subject [jira] [Commented] (MAHOUT-1419) Random decision forest is excessively slow on numeric features
Date Tue, 25 Feb 2014 15:28:25 GMT


Hudson commented on MAHOUT-1419:

SUCCESS: Integrated in Mahout-Quality #2492
MAHOUT-1419: Random decision forest is excessively slow on numeric features (srowen: rev 1571704)
* /mahout/trunk/CHANGELOG
* /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/split/
* /mahout/trunk/core/src/test/java/org/apache/mahout/classifier/df/split/
* /mahout/trunk/core/src/test/java/org/apache/mahout/classifier/df/tools/
* /mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/df/mapreduce/

> Random decision forest is excessively slow on numeric features
> --------------------------------------------------------------
>                 Key: MAHOUT-1419
>                 URL:
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>    Affects Versions: 0.7, 0.8, 0.9
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>             Fix For: 1.0
>         Attachments: MAHOUT-1419.patch,,
> Follow-up to MAHOUT-1417. There's a customer running this and observing it take an unreasonably
long time on about 2GB of data -- like, >24 hours when other RDF M/R implementations take
9 minutes. The difference is big enough to probably be considered a defect. MAHOUT-1417 got
that down to about 5 hours. I am trying to further improve it.
> One key issue seems to be how splits are evaluated over numeric features. A split is
tried for every distinct numeric value of the feature in the whole data set. Since these are
floating point values, they could all be distinct (and in the customer's case they are). 200K rows
then means 200K splits to evaluate every time a node is built on the feature.
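To make the cost concrete, here is a minimal sketch of exhaustive split evaluation as described above. It is not Mahout's actual split code; the class and method names (`ExhaustiveSplitSketch`, `candidateSplits`, `informationGain`, `bestSplit`) are hypothetical, and it assumes a binary "value < threshold" split scored by entropy-based information gain.

```java
import java.util.Arrays;

// Illustrative only: hypothetical names, not the Mahout implementation.
final class ExhaustiveSplitSketch {

  /** Shannon entropy (in bits) of a vector of class counts. */
  static double entropy(int[] counts) {
    int total = 0;
    for (int c : counts) {
      total += c;
    }
    if (total == 0) {
      return 0.0;
    }
    double h = 0.0;
    for (int c : counts) {
      if (c > 0) {
        double p = (double) c / total;
        h -= p * Math.log(p) / Math.log(2.0);
      }
    }
    return h;
  }

  /** One candidate threshold per distinct feature value. */
  static double[] candidateSplits(double[] values) {
    return Arrays.stream(values).distinct().sorted().toArray();
  }

  /** Information gain of splitting rows into value < threshold and value >= threshold. */
  static double informationGain(double[] values, int[] labels, int numLabels, double threshold) {
    int[] all = new int[numLabels];
    int[] lo = new int[numLabels];
    int[] hi = new int[numLabels];
    int nLo = 0;
    for (int i = 0; i < values.length; i++) {
      all[labels[i]]++;
      if (values[i] < threshold) {
        lo[labels[i]]++;
        nLo++;
      } else {
        hi[labels[i]]++;
      }
    }
    int n = values.length;
    double weighted = ((double) nLo / n) * entropy(lo)
        + ((double) (n - nLo) / n) * entropy(hi);
    return entropy(all) - weighted;
  }

  /** Scans every candidate; each scan is a full pass over the rows. */
  static double bestSplit(double[] values, int[] labels, int numLabels) {
    double best = Double.NaN;
    double bestIg = -1.0;
    for (double t : candidateSplits(values)) {
      double ig = informationGain(values, labels, numLabels, t);
      if (ig > bestIg) {
        bestIg = ig;
        best = t;
      }
    }
    return best;
  }
}
```

Written this naively, the work per node is (number of candidates) x (number of rows), so when every value is distinct the number of candidates grows with the number of rows.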
> A better approach is to sample percentiles out of the feature and evaluate only those
as splits. Doing that really efficiently would require a substantial rewrite. However, there are
some modest changes possible which get some of the benefit, and appear to make it run about
3x faster. That is on a data set that exhibits this problem, meaning one whose numeric
features are generally distinct, which is not exotic.
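The percentile idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of the attached patch: `PercentileSplitSketch`, `percentileSplits`, and the `maxSplits` parameter are hypothetical names, and the sketch simply sorts a copy of the feature values and picks evenly spaced quantiles as candidate thresholds.

```java
import java.util.Arrays;

// Illustrative only: hypothetical names, not the code in MAHOUT-1419.patch.
final class PercentileSplitSketch {

  /**
   * Returns at most maxSplits candidate thresholds taken at evenly spaced
   * quantiles of the feature values, instead of one per distinct value.
   */
  static double[] percentileSplits(double[] values, int maxSplits) {
    double[] sorted = values.clone();
    Arrays.sort(sorted);
    int n = sorted.length;
    if (n <= maxSplits) {
      // Few enough rows: fall back to every distinct value.
      return Arrays.stream(sorted).distinct().toArray();
    }
    double[] picked = new double[maxSplits];
    for (int i = 0; i < maxSplits; i++) {
      // Index of the (i + 1) / (maxSplits + 1) quantile; always < n.
      int idx = (int) ((long) (i + 1) * n / (maxSplits + 1));
      picked[i] = sorted[idx];
    }
    // Repeated values can land on several quantiles; keep each once.
    return Arrays.stream(picked).distinct().toArray();
  }
}
```

With a fixed `maxSplits`, the number of thresholds evaluated per node no longer grows with the number of rows. The trade-off is that the chosen split may differ from the one an exhaustive scan would pick.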
> There are comparable but different problems with handling of categorical features, but
that's for a different patch.
> I have a patch, but it changes behavior to some extent since it evaluates only a
sample of splits instead of every single possible one. In particular, it makes the output of
"OptIgSplit" no longer match that of "DefaultIgSplit". I think the point, though, is that "optimized"
may mean making different choices of split here, which could yield differing trees. So that
test probably has to go.
> (Along the way I found a number of micro-optimizations in this part of the code that
added up to maybe a 3% speedup. I also fixed an NPE.)
> I will propose a patch shortly with all of this for thoughts.

This message was sent by Atlassian JIRA