Mailing-List: contact issues-help@spark.apache.org; run by ezmlm
Precedence: bulk
Date: Wed, 5 Nov 2014 12:26:33 +0000 (UTC)
From: "Vincent Botta (JIRA)" <jira@apache.org>
To: issues@spark.apache.org
Message-ID: <JIRA.12752485.1415040054000.422536.1415190393722@Atlassian.JIRA>
In-Reply-To: <JIRA.12752485.1415040054000@Atlassian.JIRA>
References: <JIRA.12752485.1415040054000@Atlassian.JIRA>
 <JIRA.12752485.1415040054097@arcas>
Subject: [jira] [Comment Edited] (SPARK-4210) Add Extra-Trees algorithm to
 MLlib
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/SPARK-4210?page=3Dcom.atlassian=
.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=3D1419=
8321#comment-14198321 ]=20

Vincent Botta edited comment on SPARK-4210 at 11/5/14 12:25 PM:
----------------------------------------------------------------

[~manishamde]: Indeed it will lead to interesting implementation tradeoffs.=
 There are two levels in the split choices:
- *level 1*: at each tested variable, we just have to pick a valid (meaning=
 it cannot lead to an empty partition) random threshold instead of searchin=
g for THE best one, which has an (positive) impact on algorithm complexity.=
 I=E2=80=99m not aware of the many possible ways to do that with Spark,  bu=
t I suppose this can be done in many ways. We will have to evaluate the dif=
ferent strategies and see what=E2=80=99s best given the different scenarios=
. Not sure yet how I plan to do this, I will need to do some more digging i=
nto the current MLLib code. Suggestions are welcome.
- *level 2*: among the ones we picked at random, evaluate the one that maxi=
mize a given score. I guess that can be done as in the current Random Fores=
t (RF). ATM I propose to rely on what has been done in the RF. We will see =
where that leads us.

Here is a link to the original [Extremely randomized trees article|http://w=
ww.montefiore.ulg.ac.be/~ernst/uploads/news/id63/extremely-randomized-trees=
.pdf]. Even though I see the ensemble of decision tree methods as a more ge=
neral framework in which there are many little boxes that can be fine tuned=
. See [this flowchart|https://www.dropbox.com/s/ignnt0wqxw4sg9c/flowchart-t=
ree.pdf?dl=3D0] where each boxes corresponds to steps that can be customize=
d/particularized to produce Single Decision Tree, Random Forests, Extra-Tre=
es or whatever that will suit your needs.


was (Author: 0asa):
[~manishamde]: Indeed it will lead to interesting implementation tradeoffs.=
 There are two levels in the split choices:
- *level 1*: at each tested variable, we just have to pick a valid (meaning=
 it cannot lead to an empty partition) random threshold instead of searchin=
g for THE best one, which has an (positive) impact on algorithm complexity.=
 I=E2=80=99m not aware of the many possible ways to do that with Spark,  bu=
t I suppose this can be done in many ways. We will have to evaluate the dif=
ferent strategies and see what=E2=80=99s best given the different scenarios=
. Not sure yet how I plan to do this, I will need to do some more digging i=
nto the current MLLib code.
- *level 2*: among the ones we picked at random, evaluate the one that maxi=
mize a given score. I guess that can be done as in the current Random Fores=
t (RF). ATM I propose to rely on what has been done in the RF. We will see =
where that leads us.

Here is a link to the original [Extremely randomized trees article|http://w=
ww.montefiore.ulg.ac.be/~ernst/uploads/news/id63/extremely-randomized-trees=
.pdf]. Even though I see the ensemble of decision tree methods as a more ge=
neral framework in which there are many little boxes that can be fine tuned=
. See [this flowchart|https://www.dropbox.com/s/ignnt0wqxw4sg9c/flowchart-t=
ree.pdf?dl=3D0] where each boxes corresponds to steps that can be customize=
d/particularized to produce Single Decision Tree, Random Forests, Extra-Tre=
es or whatever that will suit your needs.

> Add Extra-Trees algorithm to MLlib
> ----------------------------------
>
>                 Key: SPARK-4210
>                 URL: https://issues.apache.org/jira/browse/SPARK-4210
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Vincent Botta
>
> This task will add Extra-Trees support to Spark MLlib. The implementation=
 could be inspired from the current Random Forest algorithm. This algorithm=
 is expected to be particularly suited as sorting of attributes is not requ=
ired as opposed to to the original Random Forest approach (with similar and=
/or better predictive power).=20
> The tasks involves:
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org