mahout-dev mailing list archives

From deneche abdelhakim <a_dene...@yahoo.fr>
Subject Re: [gsoc] random forests
Date Wed, 01 Apr 2009 07:13:05 GMT

Should be visible now on both the GSoC app and the Apache Wiki.

--- On Wed, 4/1/09, Ted Dunning <ted.dunning@gmail.com> wrote:

> From: Ted Dunning <ted.dunning@gmail.com>
> Subject: Re: [gsoc] random forests
> To: mahout-dev@lucene.apache.org
> Date: Wednesday, April 1, 2009, 3:30 AM
> Deneche,
> 
> I don't see your application on the GSoC web site,
> nor on the Apache wiki.
> 
> Time is running out and I would hate to not see you in the
> program.  Is it
> just that I can't see the application yet?
> 
> On Tue, Mar 31, 2009 at 1:05 PM, deneche abdelhakim
> <a_deneche@yahoo.fr> wrote:
> 
> >
> > Here is a draft of my proposal
> >
> > **************************************************
> > Title/Summary: [Apache Mahout] Implement parallel
> > Random/Regression Forests
> >
> > Student: AbdelHakim Deneche
> > Student e-mail: ...
> >
> > Student Major: PhD in Computer Science
> > Student Degree: Master in Computer Science
> > Student Graduation: Spring 2011
> >
> > Organization: The Apache Software Foundation
> > Assigned Mentor:
> >
> >
> > Abstract:
> >
> > My goal is to add the power of random/regression forests to
> > Mahout. At the end of this summer, one should be able to build
> > random/regression forests for large, possibly distributed,
> > datasets, store the forest, and reuse it to classify new data.
> > In addition, a demo on EC2 is planned.
> >
> >
> > Detailed Description:
> >
> > This project is all about random/regression forests. The core
> > component is the algorithm that builds a tree from a random
> > bootstrap sample of the whole dataset. I already wrote a detailed
> > description on the Mahout wiki [RandomForests]. Given the size of
> > the dataset, two distributed implementations are possible:
> >
> > 1. The most straightforward one deals with relatively small
> > datasets. By small, I mean a dataset that can be replicated on
> > every node of the cluster. Basically, each mapper has access to
> > the whole dataset, so if the forest contains N trees and we have
> > M mappers, each mapper runs the core building algorithm N/M
> > times (a rough sketch follows this list). This implementation is
> > relatively easy because each mapper runs the basic building
> > algorithm "as is". It is also of great interest if the user
> > wants to try different parameters when building the forest. An
> > out-of-core implementation is also possible, to deal with
> > datasets that cannot fit into a node's memory.
> >
> > 2. The second implementation, which is the most difficult, is
> > concerned with very large datasets that cannot fit on every
> > machine of the cluster. In this case the mappers work
> > differently: each mapper has access to a subset of the dataset,
> > so all the mappers collaborate to build each tree of the forest.
> > The core building algorithm must thus be rewritten in a
> > map-reduce form (one common scheme is sketched at the end of
> > this section). This implementation can deal with datasets of any
> > size, as long as they are stored on the cluster.
> >
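> > To make the first approach concrete, here is a minimal sketch of
> > what a mapper would do. buildTree() and emit() appear only in
> > comments as placeholders for the real code; the bootstrap and
> > N/M logic is the point:
> >
> >   import java.util.Random;
> >
> >   // Sketch only: each of the M mappers builds roughly N/M of
> >   // the N trees, each from its own bootstrap sample of the
> >   // replicated dataset.
> >   class MapperSketch {
> >
> >     // Bootstrap: draw numRows indices with replacement.
> >     static int[] bootstrap(int numRows, Random rng) {
> >       int[] sample = new int[numRows];
> >       for (int i = 0; i < numRows; i++) {
> >         sample[i] = rng.nextInt(numRows);
> >       }
> >       return sample;
> >     }
> >
> >     // Mapper m of M handles trees m, m + M, m + 2M, ...
> >     static void map(int m, int M, int N, int numRows) {
> >       Random rng = new Random(m); // per-mapper seed
> >       for (int t = m; t < N; t += M) {
> >         int[] sample = bootstrap(numRows, rng);
> >         // Tree tree = buildTree(data, sample); // core algorithm "as is"
> >         // emit(t, tree);                       // collected by the reducer
> >       }
> >     }
> >   }
> >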
> > Although the first approach is easier to implement, the CPU and
> > IO overhead of its out-of-core variant are still unknown. A
> > reference, non-parallel implementation should thus be built to
> > better understand the effects of the out-of-core implementation,
> > especially for large datasets. This reference implementation is
> > also useful to assess the correctness of the distributed
> > implementation.
> >
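> > For the second approach, one common scheme is to compute the
> > statistics needed for each split in parallel: every mapper scans
> > its own data partition and emits partial class counts for each
> > candidate split, and a reducer sums them before scoring. A
> > minimal sketch of the merge step (scoring left out):
> >
> >   // Sketch only: partial class counts for one candidate split
> >   // of one tree node. Each mapper produces one of these from
> >   // its partition; the reducer merges them, then scores the
> >   // split (e.g. by information gain).
> >   class SplitCounts {
> >     final long[] leftClassCounts;
> >     final long[] rightClassCounts;
> >
> >     SplitCounts(int numClasses) {
> >       leftClassCounts = new long[numClasses];
> >       rightClassCounts = new long[numClasses];
> >     }
> >
> >     // Reducer-side merge: counts from disjoint partitions add up.
> >     void merge(SplitCounts other) {
> >       for (int c = 0; c < leftClassCounts.length; c++) {
> >         leftClassCounts[c] += other.leftClassCounts[c];
> >         rightClassCounts[c] += other.rightClassCounts[c];
> >       }
> >     }
> >   }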
> >
> > Working Plan and list of deliverables
> >
> > Must-Have:
> > 1. Reference implementation of the Random/Regression Forests
> > building algorithm:
> >  . Build a forest of trees: the basic algorithm (described in
> > the wiki) takes a subset of the dataset as a training set and
> > builds a decision tree; this is repeated for each tree of the
> > forest.
> >  . The forest is stored in a file so it can be re-used, at any
> > time, to classify new cases.
> >  . At this step, the necessary changes to Mahout's Classifier
> > interface are made to extend its use beyond Text datasets.
> >
> > 2. Study the effects of large datasets on the reference
> > implementation:
> >  . This step should guide our choice of the proper parallel
> > implementation.
> >
> > 3. Parallel implementation, choose one of the following:
> >  3a. Parallel implementation A:
> >  . When the dataset can be replicated to all computing nodes.
> >  . Each mapper has access to the whole dataset; if the forest
> > contains N trees and we have M mappers, each mapper runs the
> > basic building algorithm N/M times. The mapper is also
> > responsible for computing the out-of-bag (oob) error estimation
> > (a sketch follows this list).
> >  . The reducer stores the trees in the RF file and merges the
> > oob error estimations.
> >  3b. Parallel implementation B:
> >  . When the dataset is so big that it can no longer fit on every
> > computing node, it must be distributed over the cluster.
> >  . Each mapper has access to a subset of the dataset; thus all
> > the mappers collaborate to build each tree of the forest.
> >  . In this case, the basic algorithm must be rewritten to fit
> > the map-reduce paradigm.
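> >
> > As a reminder of how the oob estimate works: each training case
> > is left out of the bootstrap samples of roughly one third of the
> > trees; those trees vote on it, and the error is the fraction of
> > cases whose oob vote is wrong. A minimal sketch of the
> > per-mapper tally that the reducer would merge, as planned in 3a
> > (the vote bookkeeping itself is left out):
> >
> >   // Sketch only: a mapper's partial out-of-bag tally. Only
> >   // trees whose bootstrap sample excluded a case vote on it; a
> >   // case counts as an error when its oob vote is wrong.
> >   class OobTally {
> >     long errors;  // oob-misclassified cases
> >     long counted; // cases that received at least one oob vote
> >
> >     void add(boolean oobVoteWasWrong) {
> >       if (oobVoteWasWrong) {
> >         errors++;
> >       }
> >       counted++;
> >     }
> >
> >     // Reducer-side merge of the mappers' partial tallies.
> >     void merge(OobTally other) {
> >       errors += other.errors;
> >       counted += other.counted;
> >     }
> >
> >     double errorEstimate() {
> >       return counted == 0 ? 0.0 : (double) errors / counted;
> >     }
> >   }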
> >
> > Should-Have:
> > 4. Run the Random Forest with a real dataset on EC2:
> >  . This step is important because running the RF on a local
> > dual-core machine is different from running it on a real
> > cluster with a real dataset.
> >  . This can make a good demo for Mahout.
> >  . Amazon has put some interesting datasets to play with
> > [PublicDatasets]. The US Census dataset comes in various sizes
> > ranging from 2 GB to 200 GB, and should make a very good
> > example.
> >  . At this stage it may be useful to implement [MAHOUT-71]
> > (Dataset to Matrix Reader).
> >
> > Wanna-Have:
> > 5. If there is still time, implement one or two other important
> > features of RFs, such as variable importance and proximity
> > estimation (a rough sketch of permutation-based variable
> > importance follows).
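> >
> > For reference, the usual formulation of variable importance
> > (Breiman's permutation test) compares a tree's oob accuracy
> > before and after permuting one variable's values; the average
> > drop over the trees is the variable's importance. Sketch only,
> > with the accuracy evaluation stubbed out:
> >
> >   import java.util.List;
> >   import java.util.Random;
> >
> >   // Sketch only: permutation-based variable importance. In the
> >   // real thing each tree is scored on its own oob cases;
> >   // oobAccuracy() is just a placeholder here.
> >   class ImportanceSketch {
> >
> >     // Importance of variable v = mean drop in oob accuracy
> >     // after randomly permuting column v, over all trees.
> >     static double importance(List<double[]> oobCases, int v,
> >                              int numTrees, Random rng) {
> >       double[] original = columnCopy(oobCases, v);
> >       double drop = 0.0;
> >       for (int t = 0; t < numTrees; t++) {
> >         double before = oobAccuracy(t, oobCases);
> >         permuteColumn(oobCases, v, rng);
> >         double after = oobAccuracy(t, oobCases);
> >         drop += before - after;
> >         restoreColumn(oobCases, v, original); // undo for next tree
> >       }
> >       return drop / numTrees;
> >     }
> >
> >     // Fisher-Yates shuffle of column v across the cases.
> >     static void permuteColumn(List<double[]> cases, int v, Random rng) {
> >       for (int i = cases.size() - 1; i > 0; i--) {
> >         int j = rng.nextInt(i + 1);
> >         double tmp = cases.get(i)[v];
> >         cases.get(i)[v] = cases.get(j)[v];
> >         cases.get(j)[v] = tmp;
> >       }
> >     }
> >
> >     static double[] columnCopy(List<double[]> cases, int v) {
> >       double[] col = new double[cases.size()];
> >       for (int i = 0; i < col.length; i++) col[i] = cases.get(i)[v];
> >       return col;
> >     }
> >
> >     static void restoreColumn(List<double[]> cases, int v, double[] col) {
> >       for (int i = 0; i < col.length; i++) cases.get(i)[v] = col[i];
> >     }
> >
> >     // Placeholder: would run tree t over the cases and return
> >     // the fraction classified correctly.
> >     static double oobAccuracy(int t, List<double[]> cases) {
> >       return 0.0;
> >     }
> >   }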
> >
> >
> > Additional Information:
> > I am a PhD student at the University Mentouri of Constantine. My
> > primary research goal is a framework to help build intelligent
> > adaptive systems. For my Master's, I worked on artificial immune
> > systems, applying them to handwritten digit recognition
> > [PatternRecognition] and multiple sequence alignment
> > (bioinformatics) [BioInformatics]. Last year I participated in
> > GSoC as an Apache student; I integrated an evolutionary
> > computing framework into Mahout [Mahout-56].
> >
> > References
> >
> > [BioInformatics] A. Layeb, A. Deneche, "Multiple Sequence
> > Alignment by Immune Artificial System", ACS/IEEE International
> > Conference on Computer Systems and Applications AICCSA’07,
> > Jordan, 2007.
> >
> > [Mahout-56] https://issues.apache.org/jira/browse/MAHOUT-56
> >
> > [Mahout-71] http://issues.apache.org/jira/browse/MAHOUT-71
> >
> > [PatternRecognition] S. Meshoul, A. Deneche, M. Batouche,
> > "Combining an Artificial Immune System with a Clustering Method
> > for Effective Pattern Recognition", International Conference on
> > Machine Intelligence ICMI’05, pp. 782-787, Tunis, 2005.
> >
> > [PublicDatasets] http://aws.amazon.com/publicdatasets/
> >
> > [RandomForests] http://cwiki.apache.org/MAHOUT/random-forests.html
> >
> >
> ***************************************************************************
> >
> >
> >
> >
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve
> 
> 111 West Evelyn Ave. Ste. 202
> Sunnyvale, CA 94086
> www.deepdyve.com
> 858-414-0013 (m)
> 408-773-0220 (fax)
> 


