mahout-commits mailing list archives

From build...@apache.org
Subject svn commit: r907091 - in /websites/staging/mahout/trunk/content: ./ users/classification/breiman-example.html
Date Sun, 27 Apr 2014 22:00:50 GMT
Author: buildbot
Date: Sun Apr 27 22:00:50 2014
New Revision: 907091

Log:
Staging update by buildbot for mahout

Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    websites/staging/mahout/trunk/content/users/classification/breiman-example.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Sun Apr 27 22:00:50 2014
@@ -1 +1 @@
-1590488
+1590502

Modified: websites/staging/mahout/trunk/content/users/classification/breiman-example.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/classification/breiman-example.html (original)
+++ websites/staging/mahout/trunk/content/users/classification/breiman-example.html Sun Apr 27 22:00:50 2014
@@ -226,85 +226,60 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <p><a name="BreimanExample-Introduction"></a></p>
-<h1 id="introduction">Introduction</h1>
-<p>This quick start page shows how to run the Breiman example. It implements
-the test procedure described in Breiman's paper <a href="1.html">1</a>
-. 
-The basic algorithm is as follows :
-<em> repeat I iterations
-</em> foreach iteration do
- <strong> 10% of the dataset is kept apart as a testing set 
- </strong> build two forests using the training set, one with m=int(log2(M)+1)
-(called Random-Input) and one with m=1 (called Single-Input)
- <strong> choose the forest that gave the lowest oob error estimation to compute
-the test set error
- </strong> compute the test set error using the Single Input Forest (test error),
-this demonstrates that even with m=1, Decision Forests give comparable
-results to greater values of m
- <em><em> compute the mean test set error using every tree of the chosen forest
-(tree error). This should indicate how well a single Decision Tree performs
-</em> compute the mean test error for all iterations
-</em> compute the mean tree error for all iterations</p>
-<p><a name="BreimanExample-Steps"></a></p>
-<h1 id="steps">Steps</h1>
-<p><a name="BreimanExample-Downloadthedata"></a></p>
-<h2 id="download-the-data">Download the data</h2>
+    <h1 id="breiman-example">Breiman Example</h1>
+<h4 id="introduction">Introduction</h4>
+<p>This page describes how to run the Breiman example, which implements the test procedure
described in <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.23.3999&amp;rep=rep1&amp;type=pdf">Leo
Breiman's paper</a>. The basic algorithm is as follows:</p>
 <ul>
-<li>The current implementation is compatible with the UCI repository file
-format. Here are links to some of the datasets used in Breiman's paper:
- <strong> glass : http://archive.ics.uci.edu/ml/datasets/Glass+Identification
- </strong> breast cancer :
-http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
- <strong> diabetes : http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
- </strong> sonar :
-http://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)
- <strong> ionosphere : http://archive.ics.uci.edu/ml/datasets/Ionosphere
- </strong> vehicle : <a href="http://archive.ics.uci.edu/ml/datasets/Statlog+(Vehicle+Silhouettes)">http://archive.ics.uci.edu/ml/datasets/Statlog+(Vehicle+Silhouettes)</a>
- ** german : <a href="http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)">http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)</a></li>
-<li>Put the data in HDFS: {code}$HADOOP_HOME/bin/hadoop fs -put <PATH TO
-DATA> testdata{code}</li>
+<li>repeat <em>I</em> iterations</li>
+<li>in each iteration do</li>
+<li>keep 10% of the dataset apart as a testing set </li>
+<li>build two forests using the training set, one with <em>m = int(log2(M) +
1)</em> (called Random-Input) and one with <em>m = 1</em> (called Single-Input)</li>
+<li>choose the forest that gave the lowest out-of-bag (oob) error estimate to compute
+the test set error</li>
+<li>compute the test set error using the Single Input Forest (test error),
+this demonstrates that even with <em>m = 1</em>, Decision Forests give comparable
+results to greater values of <em>m</em></li>
+<li>compute the mean test set error using every tree of the chosen forest
+(tree error). This should indicate how well a single Decision Tree performs</li>
+<li>compute the mean test error for all iterations</li>
+<li>compute the mean tree error for all iterations</li>
 </ul>
-<p><a name="BreimanExample-BuildtheJobfiles"></a></p>
-<h2 id="build-the-job-files">Build the Job files</h2>
-<ul>
-<li>In $MAHOUT_HOME/ run: {code}mvn install -DskipTests{code}</li>
-</ul>
-<p><a name="BreimanExample-Generateafiledescriptorforthedataset:"></a></p>
-<h2 id="generate-a-file-descriptor-for-the-dataset">Generate a file descriptor for
the dataset:</h2>
-<p>for the glass dataset (glass.data), run :</p>
-<div class="codehilite"><pre>$<span class="n">HADOOP_HOME</span><span
class="o">/</span><span class="n">bin</span><span class="o">/</span><span
class="n">hadoop</span> <span class="n">jar</span>
+<h4 id="running-the-example">Running the Example</h4>
+<p>The current implementation is compatible with the <a href="http://archive.ics.uci.edu/ml/">UCI
repository</a> file format. We'll show how to run this example on two datasets:</p>
+<p>First, we deal with <a href="http://archive.ics.uci.edu/ml/datasets/Glass+Identification">Glass
Identification</a>: download the <a href="http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data">dataset</a>
file called <strong>glass.data</strong> and save it to your local machine.
Next, we must generate the descriptor file <strong>glass.info</strong> for this
dataset with the following command:</p>
+<div class="codehilite"><pre><span class="n">bin</span><span class="o">/</span><span
class="n">mahout</span> <span class="n">org</span><span class="p">.</span><span
class="n">apache</span><span class="p">.</span><span class="n">mahout</span><span
class="p">.</span><span class="n">classifier</span><span class="p">.</span><span
class="n">df</span><span class="p">.</span><span class="n">tools</span><span
class="p">.</span><span class="n">Describe</span> <span class="o">-</span><span
class="n">p</span> <span class="o">/</span><span class="n">path</span><span
class="o">/</span><span class="n">to</span><span class="o">/</span><span
class="n">glass</span><span class="p">.</span><span class="n">data</span>
<span class="o">-</span><span class="n">f</span> <span class="o">/</span><span
class="n">path</span><span class="o">/</span><span class="n">to</span><span
class="o">/</span><span class="n">glass</span><span class="p">.</span><span
class="n">info</span> <span class="o">-</span><span class="n">d</span> <span class="n">I</span> 9 <span class="n">N</span>
<span class="n">L</span>
 </pre></div>
 
 
-<p>$MAHOUT_HOME/core/target/mahout-core-<VERSION>-job.jar
-org.apache.mahout.classifier.df.tools.Describe -p testdata/glass.data -f
-testdata/glass.info -d I 9 N L</p>
-<p>The "I 9 N L" string indicates the nature of the variables. which means 1
-ignored(I) attribute, followed by 9 numerical(N) attributes, followed by
-the label(L)
-* you can also use C for categorical (nominal) attributes</p>
-<p><a name="BreimanExample-Runtheexample"></a></p>
-<h2 id="run-the-example">Run the example</h2>
-<div class="codehilite"><pre>$<span class="n">HADOOP_HOME</span><span
class="o">/</span><span class="n">hadoop</span> <span class="n">jar</span>
+<p>Substitute <em>/path/to/</em> with the folder where you downloaded the
dataset. The argument "I 9 N L" describes the type of each column: here it means 1
+ignored (I) attribute, followed by 9 numerical (N) attributes, followed by
+the label (L).</p>
+<p>Finally, we build and evaluate our random forest classifier as follows:</p>
+<div class="codehilite"><pre><span class="n">bin</span><span class="o">/</span><span
class="n">mahout</span> <span class="n">org</span><span class="p">.</span><span
class="n">apache</span><span class="p">.</span><span class="n">mahout</span><span
class="p">.</span><span class="n">classifier</span><span class="p">.</span><span
class="n">df</span><span class="p">.</span><span class="n">BreimanExample</span>
<span class="o">-</span><span class="n">d</span> <span class="o">/</span><span
class="n">path</span><span class="o">/</span><span class="n">to</span><span
class="o">/</span><span class="n">glass</span><span class="p">.</span><span
class="n">data</span> <span class="o">-</span><span class="n">ds</span>
<span class="o">/</span><span class="n">path</span><span class="o">/</span><span
class="n">to</span><span class="o">/</span><span class="n">glass</span><span
class="p">.</span><span class="n">info</span> <span class="o">-</span><span
class="nb">i</span> 10 <span class="o">-</span><span class="n">t</span> 100
 </pre></div>
 
 
-<p>$MAHOUT_HOME/examples/target/mahout-examples-<VERSION>-job.jar
-org.apache.mahout.classifier.df.BreimanExample -d testdata/glass.data -ds
-testdata/glass.info -i 10 -t 100</p>
 <p>which builds 100 trees (-t argument) and repeats the test 10 iterations (-i
-argument) 
-<em> The example outputs the following results:
-<strong> Selection error : mean test error for the selected forest on all
-iterations
-</strong> Single Input error : mean test error for the single input forest on all
-iterations
-<strong> One Tree error : mean single tree error on all iterations
-</strong> Mean Random Input Time : mean build time for random input forests on all
-iterations
-</em>* Mean Single Input Time : mean build time for single input forests on all
-iterations</p>
+argument) </p>
+<p>The example outputs the following results:</p>
+<ul>
+<li>Selection error: mean test error for the selected forest on all iterations</li>
+<li>Single Input error: mean test error for the single input forest on all
+iterations</li>
+<li>One Tree error: mean single tree error on all iterations</li>
+<li>Mean Random Input Time: mean build time for random input forests on all
+iterations</li>
+<li>Mean Single Input Time: mean build time for single input forests on all
+iterations</li>
+</ul>
+<p>We can repeat this for a <a href="http://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+%28Sonar,+Mines+vs.+Rocks%29">Sonar</a>
use case: download the <a href="http://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data">dataset</a>
file called <strong>sonar.all-data</strong> and save it to your local machine.
Generate the descriptor file <strong>sonar.info</strong> for this dataset with
the following command:</p>
+<div class="codehilite"><pre><span class="n">bin</span><span class="o">/</span><span
class="n">mahout</span> <span class="n">org</span><span class="p">.</span><span
class="n">apache</span><span class="p">.</span><span class="n">mahout</span><span
class="p">.</span><span class="n">classifier</span><span class="p">.</span><span
class="n">df</span><span class="p">.</span><span class="n">tools</span><span
class="p">.</span><span class="n">Describe</span> <span class="o">-</span><span
class="n">p</span> <span class="o">/</span><span class="n">path</span><span
class="o">/</span><span class="n">to</span><span class="o">/</span><span
class="n">sonar</span><span class="p">.</span><span class="n">all</span><span
class="o">-</span><span class="n">data</span> <span class="o">-</span><span
class="n">f</span> <span class="o">/</span><span class="n">path</span><span
class="o">/</span><span class="n">to</span><span class="o">/</span><span
class="n">sonar</span><span class="p">.</span><span class="n">info</span> <span class="o">-</span><span class="n">d</span>
60 <span class="n">N</span> <span class="n">L</span>
+</pre></div>
+
+
+<p>The argument "60 N L" means 60 numerical (N) attributes, followed by the label (L).
As in the previous case, we run the evaluation as follows:</p>
+<div class="codehilite"><pre><span class="n">bin</span><span class="o">/</span><span
class="n">mahout</span> <span class="n">org</span><span class="p">.</span><span
class="n">apache</span><span class="p">.</span><span class="n">mahout</span><span
class="p">.</span><span class="n">classifier</span><span class="p">.</span><span
class="n">df</span><span class="p">.</span><span class="n">BreimanExample</span>
<span class="o">-</span><span class="n">d</span> <span class="o">/</span><span
class="n">path</span><span class="o">/</span><span class="n">to</span><span
class="o">/</span><span class="n">sonar</span><span class="p">.</span><span
class="n">all</span><span class="o">-</span><span class="n">data</span>
<span class="o">-</span><span class="n">ds</span> <span class="o">/</span><span
class="n">path</span><span class="o">/</span><span class="n">to</span><span
class="o">/</span><span class="n">sonar</span><span class="p">.</span><span
class="n">info</span> <span class="o">-</span><span class="nb">i</span> 10 <span class="o">-</span><span class="n">t</span>
100
+</pre></div>
    </div>
   </div>     
 </div> 
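
For readers of this commit, the descriptor strings passed to Describe ("I 9 N L" and "60 N L") and the Random-Input setting m = int(log2(M) + 1) can be illustrated with a small sketch. This is not Mahout's code: the function names expand_descriptor and random_input_m are hypothetical, and the sketch assumes that M counts only the input attributes (numerical N or categorical C), excluding ignored columns and the label.

```python
import math


def expand_descriptor(descriptor):
    """Expand a compact descriptor such as "I 9 N L" into one type code per column.

    Hypothetical helper for illustration; it only mirrors the notation
    described on the page (a number repeats the type code that follows it).
    """
    columns = []
    repeat = 1
    for token in descriptor.split():
        if token.isdigit():
            # A number applies to the next type code.
            repeat = int(token)
        else:
            columns.extend([token] * repeat)
            repeat = 1
    return columns


def random_input_m(columns):
    """Return m = int(log2(M) + 1).

    Assumption: M is the number of input attributes, i.e. columns
    marked N (numerical) or C (categorical).
    """
    num_inputs = sum(1 for code in columns if code in ("N", "C"))
    return int(math.log2(num_inputs) + 1)


glass = expand_descriptor("I 9 N L")
sonar = expand_descriptor("60 N L")
print(len(glass), random_input_m(glass))  # 11 columns, m = 4
print(len(sonar), random_input_m(sonar))  # 61 columns, m = 6
```

Under these assumptions, the glass descriptor covers the 11 columns of glass.data (one ignored id, nine numerical attributes, one label), and the Random-Input forest would consider 4 attributes per split, against 1 for the Single-Input forest.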


