mahout-user mailing list archives

From deneche abdelhakim <a_dene...@yahoo.fr>
Subject Re : Reg: Random Forest in mahout 0.3
Date Tue, 15 Jun 2010 06:47:54 GMT
Q1. The "InMem Mapred Implementation" should be used when the whole dataset can fit into memory
(in-mem). In this case every mapper trains a subset of the trees over the whole dataset.

The "Partial Mapred Implementation" should be used if the dataset is too big to fit into
memory, or if training the trees over the entire dataset takes too long. This implementation
splits the training data into as many partitions as there are available mappers. Each mapper
grows a subset of the trees using only its partition (partial access to the training data).
If possible, you should use the "InMem Mapred Implementation", because every tree is grown using
all available training data. But if you are using Mahout, it's probably because the training
data is big, so you have no other choice than to use the "Partial Mapred Implementation". That
being said, I've found that the partial implementation gives similar results to the InMem
implementation and runs a lot faster, because each mapper uses only a subset of the training data.
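
As a rough illustration of the difference between the two strategies, the sketch below
plans which rows and how many trees each mapper would handle. It is not Mahout code: the
function names are hypothetical, and the even split of trees and rows across mappers is an
assumption made for the example.

```python
def split_evenly(total, parts):
    """Split `total` items into `parts` counts that differ by at most one."""
    base, extra = divmod(total, parts)
    return [base + (1 if i < extra else 0) for i in range(parts)]


def plan_inmem(num_rows, num_trees, num_mappers):
    """In-memory style: every mapper sees all rows and grows its share of the trees."""
    all_rows = range(num_rows)
    return [(all_rows, trees) for trees in split_evenly(num_trees, num_mappers)]


def plan_partial(num_rows, num_trees, num_mappers):
    """Partial style: every mapper sees only its own partition of the rows."""
    plan, start = [], 0
    tree_counts = split_evenly(num_trees, num_mappers)
    for size, trees in zip(split_evenly(num_rows, num_mappers), tree_counts):
        plan.append((range(start, start + size), trees))
        start += size
    return plan


if __name__ == "__main__":
    for rows, trees in plan_inmem(1000, 10, 4):
        print("inmem  ", len(rows), "rows,", trees, "trees")
    for rows, trees in plan_partial(1000, 10, 4):
        print("partial", len(rows), "rows,", trees, "trees")
```

In the partial plan each tree is grown from a fraction of the rows, which is why each mapper
has less work to do but no single tree sees the full training set.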

Q2. Yes, all the attributes are considered at each node.
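
One way to picture this outcome: if the number of attributes to try at each node is capped
at the number of attributes that actually exist, asking for 20 out of 18 simply selects all
18. The sketch below shows that idea only. The function name is hypothetical, and the
capping behaviour is an assumption for illustration, not a description of how Mahout
handles the -sl value internally.

```python
import random


def candidate_attributes(num_attributes, m, rng):
    """Pick the attribute indices to evaluate at one node.

    Assumption: m is capped at num_attributes, so an oversized m
    degenerates to "consider every attribute".
    """
    k = min(m, num_attributes)
    return sorted(rng.sample(range(num_attributes), k))


if __name__ == "__main__":
    rng = random.Random(42)
    print(candidate_attributes(18, 20, rng))  # all 18 indices
    print(candidate_attributes(18, 5, rng))   # a random subset of 5 indices
```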

Q3. The current implementation uses Information Gain to select the best split at each node.
But the code is modular enough to let you plug in your own TreeBuilder when growing the
trees (for now, only one implementation of TreeBuilder is available).
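
For reference, information gain is itself defined in terms of entropy: it is the entropy of
the labels at a node minus the size-weighted entropy of the groups produced by a split. The
sketch below is the textbook computation, not Mahout's code, and the function names are
hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy of `labels` minus the weighted entropy of the split `groups`."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups if g)
    return entropy(labels) - remainder


if __name__ == "__main__":
    labels = ["a", "a", "b", "b"]
    # A split that separates the classes perfectly versus one that does not.
    print(information_gain(labels, [["a", "a"], ["b", "b"]]))
    print(information_gain(labels, [["a", "b"], ["a", "b"]]))
```

The candidate split with the highest gain is the one chosen at the node.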

--- On Mon 14.6.10, Karan Jindal <karan_jindal@students.iiit.ac.in> wrote:

> From: Karan Jindal <karan_jindal@students.iiit.ac.in>
> Subject: Reg: Random Forest in mahout 0.3
> To: user@mahout.apache.org
> Date: Monday 14 June 2010, 14:01
> 
> Hi all,
> I have a few questions about random forest. Can anyone throw light on
> the following questions?
> 
> Q1. What's the difference between the "InMem Mapred implementation" and
> the "Partial Mapred implementation"? Is there any performance trade-off
> (in terms of the efficiency of the random forest) between the two?
> 
> Q2. In training, the total number of attributes is 18, and by mistake I gave
> 20 attributes (-sl 20) on the command line during the training phase.
> In this case, does the implementation consider all the attributes while
> making the decision at a node?
> 
> Q3. Which approach (information gain or entropy model) is used to classify
> the data at a given node?
> 
> --Karan
> 


      
