mahout-dev mailing list archives

From deneche abdelhakim <a_dene...@yahoo.fr>
Subject [GSOC] Thoughts about Random forests map-reduce implementation
Date Wed, 17 Jun 2009 10:45:30 GMT

As we talked about in the discussion (A) linked below, I'm considering two ways to implement
a distributed map-reduce builder.

Given the reference implementation, the easiest approach is the following:

* the data is distributed to the slave nodes using the DistributedCache
* each mapper loads the data in memory in JobConfigurable.configure()
* each tree is built by one mapper (a rough sketch follows this list)
* the Job doesn't really need any input data; it may be possible to implement our own InputFormat
that generates InputSplits from the configuration parameters (number of trees)
* the mapper uses DecisionTree.toString() to output the tree as a String
* the main program rebuilds the forest by calling DecisionTree.parse(String) on each tree
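To make the idea concrete, here is a rough sketch (old Hadoop "mapred" API) of what such a mapper
could look like. The Data, DataLoader, TreeBuilder and DecisionTree classes, their method
signatures, and the "mahout.rf.seed" parameter are assumptions standing in for the ref.
implementation's code, not existing Mahout classes:

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TreeBuildingMapper extends MapReduceBase
    implements Mapper<LongWritable, NullWritable, LongWritable, Text> {

  private Data data;   // the whole training set, loaded once per mapper
  private long seed;

  public void configure(JobConf conf) {
    try {
      // the dataset was pushed to every slave node through the DistributedCache
      Path[] cached = DistributedCache.getLocalCacheFiles(conf);
      data = DataLoader.loadData(cached[0]);      // hypothetical loader from the ref. code
      seed = conf.getLong("mahout.rf.seed", 1L);  // hypothetical config parameter
    } catch (IOException e) {
      throw new RuntimeException("unable to load the cached dataset", e);
    }
  }

  public void map(LongWritable treeId, NullWritable value,
      OutputCollector<LongWritable, Text> output, Reporter reporter) throws IOException {
    // one tree per input record: bag the data, grow the tree, emit it as a String
    Random rng = new Random(seed + treeId.get());
    DecisionTree tree = TreeBuilder.build(rng, data.bagging(rng));  // hypothetical builder
    output.collect(treeId, new Text(tree.toString()));
  }
}

On the driver side, the main program would simply read the job output back and call
DecisionTree.parse(String) on each emitted String to assemble the forest.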

Pros:
* the easiest implementation, because the actual ref. code can be used as-is, and the distributed
implementation can benefit from future optimizations of the ref. code

Cons:
* because it's based on the ref. implementation, it will be very slow when dealing with large
datasets. For example, with half of the KDD 99 dataset (a file of about 350 MB), building
a single tree took more than 12 hours (in fact I stopped the program after 12 hours) on a
Core 2 2 GHz laptop with 3 GB of RAM!
* if the slave nodes have many cores, it would be interesting to launch parallel mappers on
every slave node to exploit multi-threading. But as I haven't found a way to share an
in-memory variable between the mappers, each mapper must load its own copy of the data
into memory!

Important:
* The ref. implementation's memory usage will probably change once the Information Gain
computation is optimized; in that case an out-of-core approach could become viable

So I'm asking: what should we do next?
* go on with this implementation
* optimize the IG computation
* consider the "Distributed Implementation B", see (A), which deals with BIG datasets and
should not need any IG optimization, but whose code has nearly nothing to do with the ref. implementation

(A) http://markmail.org/message/mgap2nuhnl4kokeu
