Random Forests (MAHOUT) edited by abdelhakim deneche
Page: http://cwiki.apache.org/confluence/display/MAHOUT/Random+Forests
Changes: http://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=112565&originalVersion=3&revisedVersion=4
Content:

h3. How to grow a Decision Tree

source: [3]

LearnUnprunedTree(*X*, *Y*)

Input: *X*, a matrix of *R* rows and *M* columns, where X~ij~ = the value of the *j*'th attribute in the *i*'th input datapoint. Each column consists of either all real values or all categorical values.
Input: *Y*, a vector of *R* elements, where Y~i~ = the output class of the *i*'th datapoint. The Y~i~ values are categorical.
Output: an unpruned decision tree

If all records in *X* have identical values in all their attributes (this includes the case where *R* < 2), return a Leaf Node predicting the majority output, breaking ties randomly.
If all values in *Y* are the same, return a Leaf Node predicting this value as the output.
Else
  Select *m* variables at random out of the *M* variables
  For *j* = 1 .. *m*
    If the *j*'th attribute is categorical
      IG~j~ = IG(Y|X~j~) (see Information Gain)
    Else (the *j*'th attribute is real-valued)
      IG~j~ = IG*(Y|X~j~) (see Information Gain)
  Let *j\** = argmax~j~ IG~j~ (this is the splitting attribute we'll use)
  If the *j\**'th attribute is categorical then
    For each value *v* of the *j\**'th attribute
      Let X^v^ = the subset of rows of *X* in which X~ij\*~ = *v*. Let Y^v^ = the corresponding subset of *Y*
      Let Child^v^ = LearnUnprunedTree(X^v^, Y^v^)
    Return a decision tree node splitting on the *j\**'th attribute. The number of children equals the number of values of the *j\**'th attribute, and the *v*'th child is Child^v^
  Else (the *j\**'th attribute is real-valued); let *t* be the best split threshold
    Let X^LO^ = the subset of rows of *X* in which X~ij\*~ <= *t*. Let Y^LO^ = the corresponding subset of *Y*
    Let Child^LO^ = LearnUnprunedTree(X^LO^, Y^LO^)
    Let X^HI^ = the subset of rows of *X* in which X~ij\*~ > *t*. Let Y^HI^ = the corresponding subset of *Y*
    Let Child^HI^ = LearnUnprunedTree(X^HI^, Y^HI^)
    Return a decision tree node splitting on the *j\**'th attribute. It has two children corresponding to whether the *j\**'th attribute is above or below the given threshold.

*Note*: there are alternatives to Information Gain for splitting nodes.
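As a rough illustration (this is not Mahout's implementation), the recursive procedure can be sketched in Python. To keep the sketch short it handles categorical attributes only, omits the real-valued threshold branch, and breaks majority ties arbitrarily rather than randomly; the function and helper names are my own.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """H(Y) = -sum p_j log2 p_j over the class distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(xcol, y):
    """IG(Y|X) = H(Y) - sum_v P(X=v) H(Y|X=v) for one categorical column."""
    n = len(y)
    remainder = 0.0
    for v in set(xcol):
        ys = [y[i] for i in range(n) if xcol[i] == v]
        remainder += len(ys) / n * entropy(ys)
    return entropy(y) - remainder

def learn_unpruned_tree(X, Y, m):
    """Grow an unpruned tree on rows X (categorical attributes) and labels Y,
    choosing the best of m randomly selected attributes at each node.
    A leaf is a label; an internal node is (attribute_index, children_by_value)."""
    # Leaf cases: all rows identical (includes R < 2) or a pure node.
    if len(set(map(tuple, X))) <= 1 or len(set(Y)) == 1:
        return Counter(Y).most_common(1)[0][0]          # majority label
    M = len(X[0])
    candidates = random.sample(range(M), min(m, M))
    j = max(candidates, key=lambda j: info_gain([row[j] for row in X], Y))
    col = [row[j] for row in X]
    if len(set(col)) == 1:                              # no useful split left
        return Counter(Y).most_common(1)[0][0]
    children = {}
    for v in set(col):
        idx = [i for i in range(len(X)) if X[i][j] == v]
        children[v] = learn_unpruned_tree([X[i] for i in idx],
                                          [Y[i] for i in idx], m)
    return (j, children)
```

For example, on four rows whose label is determined by the first attribute, the tree splits on attribute 0 and each child is a pure leaf.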
h3. Information Gain

source: [3]

h4. Nominal attributes
Suppose X can take one of m values V~1~, V~2~, ..., V~m~, with
P(X=V~1~) = p~1~, P(X=V~2~) = p~2~, ..., P(X=V~m~) = p~m~.
H(X) = -sum~j=1..m~ p~j~ log~2~ p~j~ (the entropy of X)
H(Y|X=v) = the entropy of Y among only those records in which X has value v
H(Y|X) = sum~j~ p~j~ H(Y|X=v~j~)
IG(Y|X) = H(Y) - H(Y|X)

h4. Real-valued attributes
Suppose X is real-valued.
Define IG(Y|X:t) = H(Y) - H(Y|X:t)
where H(Y|X:t) = H(Y|X < t) P(X < t) + H(Y|X >= t) P(X >= t)
Define IG*(Y|X) = max~t~ IG(Y|X:t)
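The real-valued definitions can be made concrete with a small sketch (my own helper names, not part of any library): IG*(Y|X) is found by scanning candidate thresholds, here taken as midpoints between consecutive distinct values of X, which is one common choice but an assumption of this sketch.

```python
import math
from collections import Counter

def entropy(y):
    """H(Y) = -sum p log2 p over the class distribution of y."""
    n = len(y)
    return -sum((c / n) * math.log2(c / n) for c in Counter(y).values())

def ig_star(x, y):
    """IG*(Y|X) = max_t [ H(Y) - (H(Y|X<t) P(X<t) + H(Y|X>=t) P(X>=t)) ]
    for a real-valued attribute x, scanning midpoints between consecutive
    distinct values of x as candidate thresholds t."""
    n = len(y)
    best = 0.0
    vals = sorted(set(x))
    for a, b in zip(vals, vals[1:]):
        t = (a + b) / 2
        lo = [y[i] for i in range(n) if x[i] < t]
        hi = [y[i] for i in range(n) if x[i] >= t]
        conditional = len(lo) / n * entropy(lo) + len(hi) / n * entropy(hi)
        best = max(best, entropy(y) - conditional)
    return best
```

For x = [1, 2, 3, 4] with labels ['a', 'a', 'b', 'b'], the threshold t = 2.5 separates the classes perfectly, so IG* equals H(Y) = 1 bit.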
h3. How to grow a Random Forest

source: [1]

Each tree is grown as follows:
# If the number of cases in the training set is *N*, sample *N* cases at random, but with replacement, from the original data. This sample will be the training set for growing the tree.
# If there are *M* input variables, a number *m* << *M* is specified such that at each node, *m* variables are selected at random out of the *M*, and the best split on these *m* is used to split the node. The value of *m* is held constant while the forest is grown.
# Each tree is grown to the largest extent possible. There is no pruning.
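Step 1 above can be sketched as follows (an illustrative sketch, not Mahout code; `learn_tree` is a caller-supplied stand-in for the tree learner). Incidentally, the expected fraction of distinct cases a bootstrap sample leaves out is (1 - 1/N)^N, which approaches 1/e ≈ 0.368 — the "about one-third" used later by the out-of-bag estimate.

```python
import random

def bootstrap_sample(n):
    """Draw n row indices at random *with replacement* (step 1)."""
    return [random.randrange(n) for _ in range(n)]

def grow_forest(n_trees, X, Y, learn_tree):
    """Grow each tree on its own bootstrap sample of the training set.
    `learn_tree(X, Y)` is any tree learner; the per-node parameter m
    (step 2) would live inside it and stay fixed for the whole forest."""
    forest = []
    for _ in range(n_trees):
        idx = bootstrap_sample(len(X))
        forest.append(learn_tree([X[i] for i in idx], [Y[i] for i in idx]))
    return forest
```

With N = 1000, a bootstrap sample typically misses roughly 37% of the distinct rows.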
h3. Random Forest parameters

source: [2]

Random Forests are easy to use; the only two parameters a user of the technique has to determine are the number of trees and the number of variables (*m*) to be randomly selected at each node from the available set of variables. Breiman's recommendations are to pick a large number of trees, and to set *m* to the square root of the number of variables.
h3. How to predict the label of a case
Classify(*node*,*V*)
Input: *node* from the decision tree, if *node.attribute =
j* then the split is done on the *j*'th attribute
Input: *V* a vector of *M* columns where *V{*}{*}{~}j{~}* =
the value of the *j*'th attribute.
Output: label of *V*
If *node* is a Leaf then
Return the value
predicted by *node*
Else
Let *j = node.attribute*
If *j* is categorical then
Let *v* = *V{*}{*}{~}j{~}*
Let *child{*}{*}{^}v{^}* = child node corresponding to the attribute's value *v*
Return Classify(*child{*}{*}{^}v{^}*,*V*)
Else *j* is realvalued
Let *t = node.threshold* (split threshold)
If Vj < t then
Let *child{*}{*}{^}LO{^}*
= child node corresponding to (*<t*)
Return Classify(*child{*}{*}{^}LO{^}*,*V*)
Else
Let *child{*}{*}{^}HI{^}*
= child node corresponding to (*>=t*)
Return Classify(*child{*}{*}{^}HI{^}*,*V*)
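The traversal can be sketched in Python; the tuple node layout below is an assumption of this sketch, not a Mahout data structure: ('cat', j, children_by_value) for categorical splits, ('num', j, t, lo_child, hi_child) for real-valued splits, and anything else is a leaf label.

```python
def classify(node, V):
    """Walk the tree from the root until a leaf label is reached.
    Internal nodes are tuples as described in the lead-in; leaves are labels."""
    while isinstance(node, tuple):
        if node[0] == 'cat':
            _, j, children = node
            node = children[V[j]]            # follow the child for value V[j]
        else:
            _, j, t, lo, hi = node
            node = lo if V[j] <= t else hi   # <= t goes LO, matching the split rule
    return node
```

For instance, a root that splits on attribute 0 at t = 2.5, whose LO child splits categorically on attribute 1, routes [1.0, 'y'] to the LO subtree and then to the child for 'y'.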
h3. The out-of-bag (oob) error estimation

source: [1]

In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows:
* Each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the _k_th tree.
* Put each case left out of the construction of the _k_th tree down the _k_th tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees. At the end of the run, take *j* to be the class that got most of the votes every time case *n* was _oob_. The proportion of times that *j* is not equal to the true class of *n*, averaged over all cases, is the _oob error estimate_. This has proven to be unbiased in many tests.
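The oob procedure can be sketched as below. The input layout is an assumption of this sketch: `samples[k]` is the set of row indices that tree k's bootstrap sample contains, and `predictions[k][i]` is tree k's vote for case i (in practice a tree only needs to be evaluated on its oob cases).

```python
from collections import Counter

def oob_error(samples, predictions, true_labels):
    """For each case i, take a majority vote over only the trees whose
    bootstrap sample left i out, then return the fraction of cases whose
    oob majority vote disagrees with the true label."""
    wrong = 0
    voted = 0
    for i, truth in enumerate(true_labels):
        votes = [predictions[k][i]
                 for k in range(len(samples)) if i not in samples[k]]
        if not votes:
            continue                  # case never oob: no estimate for it
        voted += 1
        if Counter(votes).most_common(1)[0][0] != truth:
            wrong += 1
    return wrong / voted
```

In the toy test below, each case is oob for exactly one tree, and one of the three oob votes is wrong, giving an estimate of 1/3.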
h3. Other RF uses
source : \[1\]
* variable importance
* gini importance
* proximities
* scaling
* prototypes
* missing values replacement for the training set
* missing values replacement for the test set
* detecting mislabeled cases
* detecting outliers
* detecting novelties
* unsupervised learning
* balancing prediction error
Please refer to [1] for a detailed description.
h3. References

[1] Random Forests - Classification Description
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
[2] B. Larivière & D. Van Den Poel, 2004. "Predicting Customer Retention and Profitability by Using Random Forests and Regression Forests Techniques," Working Papers of Faculty of Economics and Business Administration, Ghent University, Belgium 04/282. Available online:
http://ideas.repec.org/p/rug/rugwps/04282.html
[3] Decision Trees - Andrew W. Moore
http://www.cs.cmu.edu/~awm/tutorials
[4] Information Gain - Andrew W. Moore
http://www.cs.cmu.edu/~awm/tutorials

