spark-issues mailing list archives

From "Vincent Botta (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-4210) Add Extra-Trees algorithm to MLlib
Date Wed, 05 Nov 2014 12:26:33 GMT

    [ https://issues.apache.org/jira/browse/SPARK-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198321#comment-14198321 ]

Vincent Botta edited comment on SPARK-4210 at 11/5/14 12:25 PM:
----------------------------------------------------------------

[~manishamde]: Indeed, it will lead to interesting implementation tradeoffs. There are two
levels in the split choices:
- *level 1*: for each tested variable, we only have to pick a valid random threshold (one
that cannot lead to an empty partition) instead of searching for THE best one, which has a
positive impact on the algorithm's complexity. I am not yet sure of the best way to do this
with Spark; it can probably be done in several ways, and we will have to evaluate the
different strategies and see which works best in the different scenarios. I will need to dig
further into the current MLlib code before deciding. Suggestions are welcome.
- *level 2*: among the thresholds picked at random, keep the one that maximizes a given score.
I expect this can be done as in the current Random Forest (RF); for now I propose to rely on
what has been done in the RF implementation, and we will see where that leads us.
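Taken together, the two levels can be sketched as follows. This is a minimal standalone
Python sketch of the idea, not MLlib code; all function names are illustrative, and Gini
impurity is assumed as the score:

```python
import random

def pick_random_split(values, rng):
    """Level 1: draw a random threshold strictly inside the observed
    range, so neither side of the split can be empty (no sorting needed)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return None  # constant feature: no valid split exists
    return rng.uniform(lo, hi)

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_score(feature, labels, threshold):
    """Impurity decrease obtained by splitting `feature` at `threshold`."""
    left = [y for x, y in zip(feature, labels) if x < threshold]
    right = [y for x, y in zip(feature, labels) if x >= threshold]
    n = len(labels)
    return (gini(labels)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

def best_of_random_splits(features, labels, rng):
    """Level 2: among the random candidate splits (one per tested
    feature), keep the one that maximizes the impurity decrease."""
    best = None
    for j, feature in enumerate(features):
        t = pick_random_split(feature, rng)
        if t is None:
            continue  # skip features with no valid split
        s = split_score(feature, labels, t)
        if best is None or s > best[2]:
            best = (j, t, s)
    return best  # (feature index, threshold, score), or None
```

In a distributed setting the interesting question is precisely where these two steps run;
the sketch only fixes the logic, not the Spark execution strategy.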

Here is a link to the original [Extremely randomized trees article|http://www.montefiore.ulg.ac.be/~ernst/uploads/news/id63/extremely-randomized-trees.pdf].
That said, I see ensembles of decision trees as a more general framework made of many small
building blocks that can be fine-tuned. See [this flowchart|https://www.dropbox.com/s/ignnt0wqxw4sg9c/flowchart-tree.pdf?dl=0],
where each box corresponds to a step that can be customized/particularized to produce a single
decision tree, Random Forests, Extra-Trees, or whatever suits your needs.



> Add Extra-Trees algorithm to MLlib
> ----------------------------------
>
>                 Key: SPARK-4210
>                 URL: https://issues.apache.org/jira/browse/SPARK-4210
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Vincent Botta
>
> This task will add Extra-Trees support to Spark MLlib. The implementation could be inspired
by the current Random Forest algorithm. Extra-Trees is expected to be particularly well suited
here, since, unlike the original Random Forest approach, it does not require sorting of the
attributes (with similar and/or better predictive power).
> The task involves:
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation
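The "sorting of attributes is not required" point can be made concrete: an exhaustive
best-split search tests midpoints between consecutive sorted distinct values, while the
Extra-Trees random threshold needs only the per-feature min and max, which a single O(n)
pass (or one distributed aggregation in Spark) provides. A hedged Python sketch with
illustrative names, not MLlib code:

```python
import random

def candidate_thresholds_exhaustive(values):
    """Classic best-split search: sort the distinct values and test
    every midpoint between consecutive ones (O(n log n) per feature)."""
    xs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(xs, xs[1:])]

def candidate_threshold_extra_trees(values, rng):
    """Extra-Trees: one random threshold per tested feature; only the
    min and max are needed, computable in a single linear pass."""
    lo, hi = min(values), max(values)
    return rng.uniform(lo, hi) if lo < hi else None
```

The asymptotic gap per feature (linear scan versus sort) is where the complexity
advantage mentioned in the description comes from.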



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
