mahout-commits mailing list archives

Subject [CONF] Apache Lucene Mahout: Random Forests (page edited)
Date Tue, 17 Mar 2009 09:21:02 GMT
Random Forests (MAHOUT) edited by abdelhakim deneche


h3. How to grow a Decision Tree

source : \[3\]


LearnUnprunedTree(*X*, *Y*)

Input: *X* a matrix of *R* rows and *M* columns where *X*~ij~ = the value of the *j*'th attribute in the *i*'th input datapoint. Each column consists of either all real values or all categorical values.
Input: *Y* a vector of *R* elements, where *Y*~i~ = the output class of the *i*'th datapoint. The *Y*~i~ values are categorical.
Output: An unpruned decision tree

If all records in *X* have identical values in all their attributes (this includes the case where *R* < 2), return a Leaf Node predicting the majority output, breaking ties randomly
If all values in *Y* are the same, return a Leaf Node predicting this value as the output
Else
    select *m* variables at random out of the *M* variables
    For *j* = 1 .. *m*
        If the *j*'th attribute is categorical
            IG~j~ = IG(*Y*\|*X*~j~) (see Information Gain)
        Else (the *j*'th attribute is real-valued)
            IG~j~ = IG*(*Y*\|*X*~j~) (see Information Gain)
    Let *j\** = argmax~j~ IG~j~ (this is the splitting attribute we'll use)
    If *j\** is categorical then
        For each value *v* of the *j\**'th attribute
            Let *X*^v^ = subset of rows of *X* in which *X*~ij\*~ = *v*. Let *Y*^v^ = corresponding subset of *Y*
            Let *Child*^v^ = LearnUnprunedTree(*X*^v^, *Y*^v^)
        Return a decision tree node, splitting on the *j\**'th attribute. The number of children equals the number of values of the *j\**'th attribute, and the *v*'th child is *Child*^v^
    Else (*j\** is real-valued) let *t* be the best split threshold
        Let *X*^LO^ = subset of rows of *X* in which *X*~ij\*~ <= *t*. Let *Y*^LO^ = corresponding subset of *Y*
        Let *Child*^LO^ = LearnUnprunedTree(*X*^LO^, *Y*^LO^)
        Let *X*^HI^ = subset of rows of *X* in which *X*~ij\*~ > *t*. Let *Y*^HI^ = corresponding subset of *Y*
        Let *Child*^HI^ = LearnUnprunedTree(*X*^HI^, *Y*^HI^)
        Return a decision tree node, splitting on the *j\**'th attribute. It has two children corresponding to whether the *j\**'th attribute is above or below the given threshold.

*Note*: There are alternatives to Information Gain for splitting nodes
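
The inner loop above (score *m* randomly chosen attributes and keep the arg max) can be sketched in Java roughly as follows. This is an illustration only, not Mahout's implementation; the *gainOfAttribute* callback is an assumption standing in for IG(*Y*\|*X*~j~) or IG*(*Y*\|*X*~j~).

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.IntToDoubleFunction;

public class SplitSelection {

  /**
   * Chooses the splitting attribute for one node: take m of the M attribute
   * indices at random (without replacement), score each candidate with the
   * supplied gain function, and return the index with the highest score.
   */
  static int selectSplitAttribute(int numAttributes, int m,
                                  IntToDoubleFunction gainOfAttribute,
                                  Random rng) {
    List<Integer> indices = new ArrayList<>();
    for (int j = 0; j < numAttributes; j++) {
      indices.add(j);
    }
    Collections.shuffle(indices, rng);                // the first m entries form the random subset
    int best = -1;
    double bestGain = Double.NEGATIVE_INFINITY;
    for (int k = 0; k < Math.min(m, numAttributes); k++) {
      int j = indices.get(k);
      double gain = gainOfAttribute.applyAsDouble(j); // IG(Y|Xj) or IG*(Y|Xj)
      if (gain > bestGain) {                          // keep the arg max
        bestGain = gain;
        best = j;
      }
    }
    return best;                                      // j*, the splitting attribute
  }
}
{code}

The recursion itself then partitions *X* and *Y* on *j\** exactly as the pseudocode describes and calls LearnUnprunedTree on each partition.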

h3. Information gain

source : \[3\]
h4. Nominal attributes

suppose X can have one of m values V~1~, V~2~, ..., V~m~
P(X=V~1~) = p~1~, P(X=V~2~) = p~2~, ..., P(X=V~m~) = p~m~
H(X) = \-sum~j=1~^m^ p~j~ log~2~ p~j~ (the entropy of X)
H(Y\|X=v) = the entropy of Y among only those records in which X has value v
H(Y\|X) = sum~j~ p~j~ H(Y\|X=V~j~)
IG(Y\|X) = H(Y) \- H(Y\|X)
h4. Real-valued attributes

suppose X is real valued
define IG(Y\|X:t) as H(Y) - H(Y\|X:t)
define H(Y\|X:t) = H(Y\|X<t) P(X<t) + H(Y\|X>=t) P(X>=t)
define IG*(Y\|X) = max~t~ IG(Y\|X:t)
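
These definitions translate almost directly into code. Below is a minimal, hypothetical sketch (not code from Mahout) that computes H(Y), H(Y\|X) and IG(Y\|X) for a nominal attribute, where *x\[i\]* is the attribute value of the *i*'th record and *y\[i\]* its class.

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InformationGain {

  /** H(Y) = -sum_j p_j log2 p_j over the class distribution of y. */
  static double entropy(int[] y) {
    Map<Integer, Integer> counts = new HashMap<>();
    for (int label : y) {
      counts.merge(label, 1, Integer::sum);
    }
    double h = 0.0;
    for (int count : counts.values()) {
      double p = (double) count / y.length;
      h -= p * (Math.log(p) / Math.log(2));
    }
    return h;
  }

  /** IG(Y|X) = H(Y) - sum_v P(X=v) H(Y|X=v) for a nominal attribute x. */
  static double informationGain(int[] x, int[] y) {
    // group record indices by the attribute value v they hold
    Map<Integer, List<Integer>> rowsByValue = new HashMap<>();
    for (int i = 0; i < x.length; i++) {
      rowsByValue.computeIfAbsent(x[i], v -> new ArrayList<>()).add(i);
    }
    double conditional = 0.0;                        // H(Y|X)
    for (List<Integer> rows : rowsByValue.values()) {
      int[] ySubset = new int[rows.size()];
      for (int k = 0; k < rows.size(); k++) {
        ySubset[k] = y[rows.get(k)];
      }
      double pv = (double) rows.size() / x.length;   // P(X = v)
      conditional += pv * entropy(ySubset);          // P(X = v) * H(Y|X = v)
    }
    return entropy(y) - conditional;                 // H(Y) - H(Y|X)
  }
}
{code}

For a real-valued attribute, IG*(Y\|X) would additionally loop over candidate thresholds *t* and keep the best IG(Y\|X:t).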

h3. How to grow a Random Forest

source : \[1\]

Each tree is grown as follows:
# if the number of cases in the training set is *N*, sample *N* cases at random, but with replacement, from the original data. This sample will be the training set for growing the tree (see the sketch after this list)
# if there are *M* input variables, a number *m << M* is specified such that at each node, *m* variables are selected at random out of the *M* and the best split on these *m* is used to split the node. The value of *m* is held constant while the forest is grown
# each tree is grown to the largest extent possible. There is no pruning.
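
A minimal sketch of this outer loop, assuming a *TreeBuilder* callback that stands in for LearnUnprunedTree above (illustrative only, not Mahout's API):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ForestGrower {

  /** Placeholder for whatever representation a grown, unpruned tree has. */
  interface Tree { }

  /** Placeholder for the tree-growing procedure (LearnUnprunedTree above). */
  interface TreeBuilder {
    Tree buildTree(int[][] x, int[] y, int m, Random rng);
  }

  /** Grow numTrees trees, each on a bootstrap sample of the N training cases. */
  static List<Tree> growForest(int[][] x, int[] y, int numTrees, int m,
                               TreeBuilder builder, Random rng) {
    int n = x.length;                                  // N cases in the training set
    List<Tree> forest = new ArrayList<>(numTrees);
    for (int t = 0; t < numTrees; t++) {
      int[][] sampleX = new int[n][];
      int[] sampleY = new int[n];
      for (int i = 0; i < n; i++) {
        int pick = rng.nextInt(n);                     // sample N cases WITH replacement
        sampleX[i] = x[pick];
        sampleY[i] = y[pick];
      }
      forest.add(builder.buildTree(sampleX, sampleY, m, rng));
    }
    return forest;
  }
}
{code}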

h3. Random Forest parameters

source : \[2\]
Random Forests are easy to use; the only two parameters a user of the technique has to determine are the number of trees to be used and the number of variables (*m*) to be randomly selected from the available set of variables.
Breiman's recommendations are to pick a large number of trees, and to use the square root of the number of variables for *m*.
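
For example, a data set with *M* = 100 attributes would use *m* = 10 variables at each split. A one-method rendering of that rule of thumb (my own naming, not a Mahout default):

{code:java}
public class ForestDefaults {
  /** Breiman's rule of thumb: select m = sqrt(M) variables at each node. */
  static int defaultNumSelectedAttributes(int numAttributes) {
    return (int) Math.round(Math.sqrt(numAttributes));
  }
}
{code}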

h3. How to predict the label of a case

Input: *node* from the decision tree; if *node.attribute = j* then the split is done on the *j*'th attribute
Input: *V* a vector of *M* columns where *V*~j~ = the value of the *j*'th attribute
Output: label of *V*

If *node* is a Leaf then
    Return the value predicted by *node*
Else
    Let *j = node.attribute*
    If the *j*'th attribute is categorical then
        Let *v* = *V*~j~
        Let *child*^v^ = the child node corresponding to the attribute's value *v*
        Return Classify(*child*^v^, *V*)
    Else (the *j*'th attribute is real-valued)
        Let *t = node.threshold* (split threshold)
        If *V*~j~ < *t* then
            Let *child*^LO^ = the child node corresponding to (< *t*)
            Return Classify(*child*^LO^, *V*)
        Else
            Let *child*^HI^ = the child node corresponding to (>= *t*)
            Return Classify(*child*^HI^, *V*)
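
The same traversal written out as a small, self-contained Java sketch; the *Node* fields are assumptions made for illustration, not Mahout's node representation:

{code:java}
import java.util.Map;

public class TreeClassifier {

  /** Minimal node representation assumed for this sketch. */
  static class Node {
    boolean leaf;
    int label;                           // predicted class, only meaningful for leaves
    int attribute;                       // j, the attribute this node splits on
    boolean categorical;                 // true: one child per value; false: threshold split
    Map<Integer, Node> childrenByValue;  // children of a categorical split, keyed by value v
    double threshold;                    // t, for real-valued splits
    Node childLo;                        // subtree for V_j < t
    Node childHi;                        // subtree for V_j >= t
  }

  /** Walk from node down to a leaf and return the predicted label of case v. */
  static int classify(Node node, double[] v) {
    if (node.leaf) {
      return node.label;
    }
    int j = node.attribute;
    if (node.categorical) {
      Node child = node.childrenByValue.get((int) v[j]);  // child for value V_j
      return classify(child, v);
    } else if (v[j] < node.threshold) {
      return classify(node.childLo, v);                    // V_j < t
    } else {
      return classify(node.childHi, v);                    // V_j >= t
    }
  }
}
{code}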

h3. The out of bag (oob) error estimation

source : \[1\]

In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows:
* each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the _kth_ tree.
* put each case left out in the construction of the _kth_ tree down the _kth_ tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees. At the end of the run, take *j* to be the class that got most of the votes every time case *n* was _oob_. The proportion of times that *j* is not equal to the true class of *n*, averaged over all cases, is the _oob error estimate_. This has proven to be unbiased in many tests. (A minimal sketch of this vote counting follows.)
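
A rough sketch of the final vote-counting step, assuming each tree's out-of-bag predictions have already been collected (the array layout is an assumption for illustration, not Mahout code): *oobPredictions\[k\]\[i\]* is tree *k*'s prediction for case *i*, or null when case *i* was in tree *k*'s bootstrap sample.

{code:java}
import java.util.HashMap;
import java.util.Map;

public class OobError {

  /**
   * oobPredictions[k][i] is tree k's prediction for case i when case i was
   * out-of-bag for tree k, or null when case i was in tree k's bootstrap
   * sample. y[i] is the true class of case i.
   */
  static double oobErrorEstimate(Integer[][] oobPredictions, int[] y) {
    int wrong = 0;
    int counted = 0;
    for (int i = 0; i < y.length; i++) {
      Map<Integer, Integer> votes = new HashMap<>();
      for (Integer[] treePredictions : oobPredictions) {
        Integer p = treePredictions[i];
        if (p != null) {
          votes.merge(p, 1, Integer::sum);             // count this tree's oob vote
        }
      }
      if (votes.isEmpty()) {
        continue;                                      // case i was never out-of-bag
      }
      int majority = votes.entrySet().stream()         // class j with the most votes
          .max(Map.Entry.comparingByValue())
          .get().getKey();
      counted++;
      if (majority != y[i]) {
        wrong++;                                       // majority oob vote disagrees with true class
      }
    }
    return counted == 0 ? 0.0 : (double) wrong / counted;
  }
}
{code}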

h3. Other RF uses

source : \[1\]
* variable importance
* gini importance
* proximities
* scaling
* prototypes
* missing values replacement for the training set
* missing values replacement for the test set
* detecting mislabeled cases
* detecting outliers
* detecting novelties
* unsupervised learning
* balancing prediction error
Please refer to \[1\] for a detailed description.

h3. References

\[1\] Random Forests - Classification Description
\[2\] B. Larivière & D. Van Den Poel, 2004. "Predicting Customer Retention and Profitability by Using Random Forests and Regression Forests Techniques," Working Papers of Faculty of Economics and Business Administration, Ghent University, Belgium 04/282.
\[3\] Decision Trees - Andrew W. Moore
\[4\] Information Gain - Andrew W. Moore
