mahout-dev mailing list archives

From deneche abdelhakim <a_dene...@yahoo.fr>
Subject Re: [gsoc] random forests
Date Sat, 28 Mar 2009 15:14:03 GMT

I'm currently writing my work plan, and it looks like this:

*****************************************************************
1. Reference implementation of the Random/Regression Forest building algorithm:
 . Build a forest of trees. The basic algorithm (described in the wiki) takes a subset of
the dataset as a training set and builds one decision tree; this basic algorithm is repeated
for each tree of the forest (see the first sketch after the plan).
 . The forest is stored in a file so it can be used later to classify new cases.

2a. Distributed implementation A:
 . For when the dataset can be replicated to all computing nodes.
 . Each mapper has access to the whole dataset; if the forest contains N trees and we have
M mappers, each mapper runs the basic building algorithm N/M times (see the second sketch
after the plan).
 . This implementation is relatively easy, given that the reference implementation is
available, because each mapper runs the basic building algorithm "as it is".

2b. Distributed implementation B:
 . For when the dataset is so big that it can no longer fit on every computing node; it must
then be distributed over the cluster.
 . Each mapper has access to only a subset of the dataset, so all the mappers collaborate to
build each tree of the forest.
 . In this case, the basic algorithm must be rewritten to fit the map-reduce paradigm.

3. Run the random forest with a real dataset on EC2:
 . This step is important because running the RF on a local dual-core machine is very
different from running it on a real cluster with a real dataset.
 . This could make a good demo for Mahout.

4. If there is still time, implement one or two other important features of RFs, such as
variable importance and proximity estimation.
*****************************************************************
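
To make step 1 concrete, here is a minimal sketch (in Java) of the basic building loop.
Instance, DecisionTree and buildDecisionTree() are placeholders for whatever the real
implementation will use, and sampling with replacement is my assumption about how the
training subset is drawn:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ReferenceForestBuilder {

  interface Instance {}       // placeholder: one labelled record
  interface DecisionTree {}   // placeholder: one trained tree

  // placeholder for the basic tree-growing algorithm described in the wiki
  static DecisionTree buildDecisionTree(List<Instance> sample, Random rng) {
    throw new UnsupportedOperationException("sketch only");
  }

  // Repeats the basic algorithm once per tree, each time on a fresh
  // training subset sampled (with replacement) from the dataset.
  public static List<DecisionTree> buildForest(List<Instance> dataset,
                                               int nTrees, long seed) {
    Random rng = new Random(seed);
    List<DecisionTree> forest = new ArrayList<DecisionTree>();
    for (int t = 0; t < nTrees; t++) {
      List<Instance> sample = new ArrayList<Instance>(dataset.size());
      for (int i = 0; i < dataset.size(); i++) {
        sample.add(dataset.get(rng.nextInt(dataset.size())));
      }
      forest.add(buildDecisionTree(sample, rng));
    }
    return forest;
  }
}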

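Here is also a rough idea of what a 2a mapper could look like, using Hadoop's mapred API.
It assumes one tree id per input line, and that the whole replicated dataset is read locally
by the builder; TreeBuilder is a hypothetical stand-in for the step 1 reference
implementation:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WholeDatasetMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, Text> {

  // hypothetical stand-in for the reference implementation, which reads
  // the whole (replicated) dataset from the local node
  static class TreeBuilder {
    static String buildTree(int treeId) {
      throw new UnsupportedOperationException("sketch only");
    }
  }

  // Input: one tree id per line. With N input records spread over M map
  // tasks, each mapper runs the basic algorithm roughly N/M times.
  public void map(LongWritable offset, Text line,
                  OutputCollector<IntWritable, Text> output, Reporter reporter)
      throws IOException {
    int treeId = Integer.parseInt(line.toString().trim());
    // seeding the builder with the tree id keeps trees different but reproducible
    String serializedTree = TreeBuilder.buildTree(treeId);
    output.collect(new IntWritable(treeId), new Text(serializedTree));
  }
}
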
It is clear from this plan that I won't be able to do all of the steps, and I must choose
just one implementation (2a or 2b). Implementation 2a should take less time than 2b, and I'm
quite sure I could then get as far as step 4, adding other features to the RF. BUT
implementation 2b is the only one capable of dealing with very large distributed datasets
(see the sketch after this paragraph).
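
To make the difference concrete, here is a hedged sketch of the kind of map step 2b would
need: since a mapper only sees its own partition, it can only emit partial statistics, and a
reducer has to aggregate them to choose the best split. For simplicity the sketch covers a
single node (the root); a real version would also put a node id in the key. The record
layout (comma-separated attributes, class label last) is an assumption for illustration:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SplitStatsMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);

  // For each record, emit one count per (attribute, value, class) triple;
  // a reducer sums these counts into the frequency tables the split
  // criterion (e.g. information gain) needs to pick the best split.
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String[] fields = line.toString().split(",");
    String label = fields[fields.length - 1];
    for (int attr = 0; attr < fields.length - 1; attr++) {
      output.collect(new Text(attr + ":" + fields[attr] + ":" + label), ONE);
    }
  }
}

Something like one job per tree level (or per batch of frontier nodes) would then drive the
tree-growing loop, which is the price of not having the whole dataset on every node.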

My question is: when Mahout's RF is used in a real application, what are the odds that the
dataset will be so large that it can't fit on every machine of the cluster?

The answer to this question should help me decide which implementation to choose.

--- On Sun, 22 Mar 2009, Ted Dunning <ted.dunning@gmail.com> wrote:

> From: Ted Dunning <ted.dunning@gmail.com>
> Subject: Re: [gsoc] random forests
> To: mahout-dev@lucene.apache.org
> Date: Sunday, 22 March 2009, 00:36
> Great expression!
> 
> You may be right about the nose-bleed tendency between the
> two methods.
> 
> On Sat, Mar 21, 2009 at 4:46 AM, deneche abdelhakim <a_deneche@yahoo.fr> wrote:
> 
> > I can't find a no-nose-bleeding algorithm
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve
> 

