mahout-commits mailing list archives

From build...@apache.org
Subject svn commit: r908048 - in /websites/staging/mahout/trunk/content: ./ users/classification/twenty-newsgroups.html
Date Mon, 05 May 2014 04:08:39 GMT
Author: buildbot
Date: Mon May  5 04:08:39 2014
New Revision: 908048

Log:
Staging update by buildbot for mahout

Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    websites/staging/mahout/trunk/content/users/classification/twenty-newsgroups.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Mon May  5 04:08:39 2014
@@ -1 +1 @@
-1592031
+1592443

Modified: websites/staging/mahout/trunk/content/users/classification/twenty-newsgroups.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/classification/twenty-newsgroups.html (original)
+++ websites/staging/mahout/trunk/content/users/classification/twenty-newsgroups.html Mon May  5 04:08:39 2014
@@ -243,121 +243,198 @@
 newsgroup documents, partitioned (nearly) evenly across 20 different
 newsgroups. The 20 newsgroups collection has become a popular data set for
 experiments in text applications of machine learning techniques, such as
-text classification and text clustering. We will use Mahout Bayes
-Classifier to create a model that would classify a new document into one of
+text classification and text clustering. We will use the <a href="http://mahout.apache.org/users/classification/bayesian.html">Mahout
CBayes</a>
+classifier to create a model that would classify a new document into one of
 the 20 newsgroups.</p>
 <p><a name="TwentyNewsgroups-Prerequisites"></a></p>
-<h2 id="prerequisites">Prerequisites</h2>
+<h3 id="prerequisites">Prerequisites</h3>
 <ul>
 <li>Mahout has been downloaded (<a href="http://apache.osuosl.org/mahout/">instructions
here</a>)</li>
 <li>Maven is available</li>
-<li>Your environment has the following variables:
-<table>
-<tr><td> <em>HADOOP_HOME</em> </td><td> Environment variables
refers to where Hadoop lives </td></tr>
-<tr><td> <em>MAHOUT_HOME</em> </td><td> Environment variables
refers to where Mahout lives </td></tr>
-</table></li>
+<li>Your environment has the following variables:</li>
+<li><strong>HADOOP_HOME</strong>: environment variable that points to where Hadoop lives</li>
+<li><strong>MAHOUT_HOME</strong>: environment variable that points to where Mahout lives</li>
 </ul>
 <p><a name="TwentyNewsgroups-Instructionsforrunningtheexample"></a></p>
-<h2 id="instructions-for-running-the-example">Instructions for running the example</h2>
+<h3 id="instructions-for-running-the-example">Instructions for running the example</h3>
 <ol>
 <li>
-<p>Start the hadoop daemons by executing the following commands</p>
-<p>$ cd $HADOOP_HOME/bin
-$ ./start-all.sh</p>
+<p>If running Hadoop in cluster mode, start the Hadoop daemons by executing the following commands:</p>
+<div class="codehilite"><pre>    $ <span class="n">cd</span> $<span
class="n">HADOOP_HOME</span><span class="o">/</span><span class="n">bin</span>
+    $ <span class="o">./</span><span class="n">start</span><span
class="o">-</span><span class="n">all</span><span class="p">.</span><span
class="n">sh</span>
+</pre></div>
+
+
+<p>Otherwise:</p>
+<div class="codehilite"><pre>    $ <span class="n">export</span>
<span class="n">MAHOUT_LOCAL</span><span class="p">=</span><span
class="n">true</span>
+</pre></div>
+
+
 </li>
 <li>
-<p>In the trunk directory of mahout, compile everything and create the
-mahout job:</p>
-<p>$ cd $MAHOUT_HOME
-$ mvn install</p>
+<p>In the trunk directory of mahout, compile and install mahout:</p>
+<div class="codehilite"><pre>    $ <span class="n">cd</span> $<span
class="n">MAHOUT_HOME</span>
+    $ <span class="n">mvn</span> <span class="n">install</span>
+</pre></div>
+
+
 </li>
 <li>
-<p>Run the 20 newsgroup example by executing:</p>
-<p>$ ./examples/bin/classify-20newsgroups.sh</p>
+<p>Run the <a href="http://svn.apache.org/repos/asf/mahout/trunk/examples/bin/classify-20newsgroups.sh">20
newsgroup example script</a> by executing:</p>
+<div class="codehilite"><pre>    $ <span class="o">./</span><span
class="n">examples</span><span class="o">/</span><span class="n">bin</span><span
class="o">/</span><span class="n">classify</span><span class="o">-</span>20<span
class="n">newsgroups</span><span class="p">.</span><span class="n">sh</span>
+</pre></div>
+
+
+</li>
+<li>
+<p>You will be prompted to select a classification algorithm:</p>
+<div class="codehilite"><pre>    1<span class="p">.</span> <span
class="n">Complement</span> <span class="n">Naive</span> <span class="n">Bayes</span>
+    2<span class="p">.</span> <span class="n">Naive</span> <span
class="n">Bayes</span>
+    3<span class="p">.</span> <span class="n">Stochastic</span> <span
class="n">Gradient</span> <span class="n">Descent</span>
+</pre></div>
+
+
 </li>
 </ol>
-<p>The script performs the following</p>
+<p>Select 1 and the script will perform the following:</p>
 <ol>
-<li>Asks you to select an classification algorithm: Complementary Naive Bayes, Naive
Bayes or Stochastic Gradient Descent.</li>
-<li>Downloads <em>20news-bydate.tar.gz</em> from the <a href="http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz">20newsgroups
dataset</a></li>
-<li>Extracts dataset</li>
-<li>Generates input dataset for training classifier</li>
-<li>Generates input dataset for testing classifier</li>
-<li>Trains the classifier</li>
-<li>Tests the classifier</li>
+<li>Create a working directory for the dataset and all input/output.</li>
+<li>Download and extract the <em>20news-bydate.tar.gz</em> from the <a
href="http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz">20newsgroups
dataset</a> to the working directory.</li>
+<li>Convert the full 20newsgroups dataset into a &lt; Text, Text &gt; sequence
file. </li>
+<li>Convert and preprocess the dataset into a &lt; Text, VectorWritable &gt; sequence file containing term frequencies for each document.</li>
+<li>Split the preprocessed dataset into training and testing sets. </li>
+<li>Train the classifier.</li>
+<li>Test the classifier.</li>
 </ol>
 <p>Output might look like:</p>
-<div class="codehilite"><pre><span class="o">======================================================</span><span
class="p">=</span>
-<span class="n">Confusion</span> <span class="n">Matrix</span>
+<div class="codehilite"><pre><span class="o">=======================================================</span>
+Confusion Matrix
 <span class="o">-------------------------------------------------------</span>
-<span class="n">a</span>   <span class="n">b</span>   <span class="n">c</span>
  <span class="n">d</span>   <span class="n">e</span>   <span class="n">f</span>
  <span class="n">g</span>   <span class="n">h</span>   <span class="nb">i</span>
  <span class="nb">j</span>   <span class="n">k</span>   <span
class="n">l</span>   <span class="n">m</span>   <span class="n">n</span>
  <span class="n">o</span>   <span class="n">p</span>   <span class="n">q</span>
  <span class="n">r</span>   <span class="n">s</span>
+a   b   c   d    e   f    g   h    i   j    k   l   m   n   o   p   q   r   s   t  <span
class="o">&lt;--</span>Classified as
+<span class="m">381</span> <span class="m">0</span>   <span class="m">0</span>
  <span class="m">0</span>    <span class="m">0</span>   <span
class="m">9</span>    <span class="m">1</span>   <span class="m">0</span>
   <span class="m">0</span>   <span class="m">0</span>    <span
class="m">1</span>   <span class="m">0</span>   <span class="m">0</span>
  <span class="m">2</span>   <span class="m">0</span>   <span class="m">1</span>
  <span class="m">0</span>   <span class="m">0</span>   <span class="m">3</span>
  <span class="m">0</span>    <span class="o">|</span>  <span class="m">398</span>
 a <span class="o">=</span> rec.motorcycles
+<span class="m">1</span>   <span class="m">284</span> <span class="m">0</span>
  <span class="m">0</span>    <span class="m">0</span>   <span
class="m">0</span>    <span class="m">1</span>   <span class="m">0</span>
   <span class="m">6</span>   <span class="m">3</span>    <span
class="m">11</span>  <span class="m">0</span>   <span class="m">66</span>
 <span class="m">3</span>   <span class="m">0</span>   <span class="m">1</span>
  <span class="m">6</span>   <span class="m">0</span>   <span class="m">4</span>
  <span class="m">9</span>    <span class="o">|</span>  <span class="m">395</span>
 b <span class="o">=</span> comp.windows.x
+<span class="m">2</span>   <span class="m">0</span>   <span class="m">339</span>
<span class="m">2</span>    <span class="m">0</span>   <span class="m">3</span>
   <span class="m">5</span>   <span class="m">1</span>    <span
class="m">0</span>   <span class="m">0</span>    <span class="m">0</span>
  <span class="m">0</span>   <span class="m">1</span>   <span class="m">1</span>
  <span class="m">12</span>  <span class="m">1</span>   <span class="m">7</span>
  <span class="m">0</span>   <span class="m">2</span>   <span class="m">0</span>
   <span class="o">|</span>  <span class="m">376</span>  c <span
class="o">=</span> talk.politics.mideast
+<span class="m">4</span>   <span class="m">0</span>   <span class="m">1</span>
  <span class="m">327</span>  <span class="m">0</span>   <span
class="m">2</span>    <span class="m">2</span>   <span class="m">0</span>
   <span class="m">0</span>   <span class="m">2</span>    <span
class="m">1</span>   <span class="m">1</span>   <span class="m">0</span>
  <span class="m">5</span>   <span class="m">1</span>   <span class="m">4</span>
  <span class="m">12</span>  <span class="m">0</span>   <span class="m">2</span>
  <span class="m">0</span>    <span class="o">|</span>  <span class="m">364</span>
 d <span class="o">=</span> talk.politics.guns
+<span class="m">7</span>   <span class="m">0</span>   <span class="m">4</span>
  <span class="m">32</span>   <span class="m">27</span>  <span
class="m">7</span>    <span class="m">7</span>   <span class="m">2</span>
   <span class="m">0</span>   <span class="m">12</span>   <span
class="m">0</span>   <span class="m">0</span>   <span class="m">6</span>
  <span class="m">0</span>   <span class="m">100</span> <span class="m">9</span>
  <span class="m">7</span>   <span class="m">31</span>  <span class="m">0</span>
  <span class="m">0</span>    <span class="o">|</span>  <span class="m">251</span>
 e <span class="o">=</span> talk.religion.misc
+<span class="m">10</span>  <span class="m">0</span>   <span class="m">0</span>
  <span class="m">0</span>    <span class="m">0</span>   <span
class="m">359</span>  <span class="m">2</span>   <span class="m">2</span>
   <span class="m">0</span>   <span class="m">1</span>    <span
class="m">3</span>   <span class="m">0</span>   <span class="m">1</span>
  <span class="m">6</span>   <span class="m">0</span>   <span class="m">1</span>
  <span class="m">0</span>   <span class="m">0</span>   <span class="m">11</span>
 <span class="m">0</span>    <span class="o">|</span>  <span class="m">396</span>
 f <span class="o">=</span> rec.autos
+<span class="m">0</span>   <span class="m">0</span>   <span class="m">0</span>
  <span class="m">0</span>    <span class="m">0</span>   <span
class="m">1</span>    <span class="m">383</span> <span class="m">9</span>
   <span class="m">1</span>   <span class="m">0</span>    <span
class="m">0</span>   <span class="m">0</span>   <span class="m">0</span>
  <span class="m">0</span>   <span class="m">0</span>   <span class="m">0</span>
  <span class="m">0</span>   <span class="m">0</span>   <span class="m">3</span>
  <span class="m">0</span>    <span class="o">|</span>  <span class="m">397</span>
 g <span class="o">=</span> rec.sport.baseball
+<span class="m">1</span>   <span class="m">0</span>   <span class="m">0</span>
  <span class="m">0</span>    <span class="m">0</span>   <span
class="m">0</span>    <span class="m">9</span>   <span class="m">382</span>
 <span class="m">0</span>   <span class="m">0</span>    <span class="m">0</span>
  <span class="m">0</span>   <span class="m">1</span>   <span class="m">1</span>
  <span class="m">1</span>   <span class="m">0</span>   <span class="m">2</span>
  <span class="m">0</span>   <span class="m">2</span>   <span class="m">0</span>
   <span class="o">|</span>  <span class="m">399</span>  h <span
class="o">=</span> rec.sport.hockey
+<span class="m">2</span>   <span class="m">0</span>   <span class="m">0</span>
  <span class="m">0</span>    <span class="m">0</span>   <span
class="m">4</span>    <span class="m">3</span>   <span class="m">0</span>
   <span class="m">330</span> <span class="m">4</span>    <span
class="m">4</span>   <span class="m">0</span>   <span class="m">5</span>
  <span class="m">12</span>  <span class="m">0</span>   <span class="m">0</span>
  <span class="m">2</span>   <span class="m">0</span>   <span class="m">12</span>
 <span class="m">7</span>    <span class="o">|</span>  <span class="m">385</span>
 i <span class="o">=</span> comp.sys.mac.hardware
+<span class="m">0</span>   <span class="m">3</span>   <span class="m">0</span>
  <span class="m">0</span>    <span class="m">0</span>   <span
class="m">0</span>    <span class="m">1</span>   <span class="m">0</span>
   <span class="m">0</span>   <span class="m">368</span> <span
class="m">0</span>   <span class="m">0</span>    <span class="m">10</span>
 <span class="m">4</span>   <span class="m">1</span>   <span class="m">3</span>
  <span class="m">2</span>   <span class="m">0</span>   <span class="m">2</span>
  <span class="m">0</span>    <span class="o">|</span>  <span class="m">394</span>
 j <span class="o">=</span> sci.space
+<span class="m">0</span>   <span class="m">0</span>   <span class="m">0</span>
  <span class="m">0</span>    <span class="m">0</span>   <span
class="m">3</span>    <span class="m">1</span>   <span class="m">0</span>
   <span class="m">27</span>  <span class="m">2</span>    <span
class="m">291</span> <span class="m">0</span>   <span class="m">11</span>
 <span class="m">25</span>  <span class="m">0</span>   <span class="m">0</span>
  <span class="m">1</span>   <span class="m">0</span>   <span class="m">13</span>
 <span class="m">18</span>   <span class="o">|</span>  <span class="m">392</span>
 k <span class="o">=</span> comp.sys.ibm.pc.hardware
+<span class="m">8</span>   <span class="m">0</span>   <span class="m">1</span>
  <span class="m">109</span>  <span class="m">0</span>   <span
class="m">6</span>    <span class="m">11</span>  <span class="m">4</span>
   <span class="m">1</span>   <span class="m">18</span>   <span
class="m">0</span>   <span class="m">98</span>  <span class="m">1</span>
  <span class="m">3</span>   <span class="m">11</span>  <span class="m">10</span>
 <span class="m">27</span>  <span class="m">1</span>   <span class="m">1</span>
  <span class="m">0</span>    <span class="o">|</span>  <span class="m">310</span>
 l <span class="o">=</span> talk.politics.misc
+<span class="m">0</span>   <span class="m">11</span>  <span class="m">0</span>
  <span class="m">0</span>    <span class="m">0</span>   <span
class="m">3</span>    <span class="m">6</span>   <span class="m">0</span>
   <span class="m">10</span>  <span class="m">6</span>    <span
class="m">11</span>  <span class="m">0</span>   <span class="m">299</span>
<span class="m">13</span>  <span class="m">0</span>   <span class="m">2</span>
  <span class="m">13</span>  <span class="m">0</span>   <span class="m">7</span>
  <span class="m">8</span>    <span class="o">|</span>  <span class="m">389</span>
 m <span class="o">=</span> comp.graphics
+<span class="m">6</span>   <span class="m">0</span>   <span class="m">1</span>
  <span class="m">0</span>    <span class="m">0</span>   <span
class="m">4</span>    <span class="m">2</span>   <span class="m">0</span>
   <span class="m">5</span>   <span class="m">2</span>    <span
class="m">12</span>  <span class="m">0</span>   <span class="m">8</span>
  <span class="m">321</span> <span class="m">0</span>   <span class="m">4</span>
  <span class="m">14</span>  <span class="m">0</span>   <span class="m">8</span>
  <span class="m">6</span>    <span class="o">|</span>  <span class="m">393</span>
 n <span class="o">=</span> sci.electronics
+<span class="m">2</span>   <span class="m">0</span>   <span class="m">0</span>
  <span class="m">0</span>    <span class="m">0</span>   <span
class="m">0</span>    <span class="m">4</span>   <span class="m">1</span>
   <span class="m">0</span>   <span class="m">3</span>    <span
class="m">1</span>   <span class="m">0</span>   <span class="m">3</span>
  <span class="m">1</span>   <span class="m">372</span> <span class="m">6</span>
  <span class="m">0</span>   <span class="m">2</span>   <span class="m">1</span>
  <span class="m">2</span>    <span class="o">|</span>  <span class="m">398</span>
 o <span class="o">=</span> soc.religion.christian
+<span class="m">4</span>   <span class="m">0</span>   <span class="m">0</span>
  <span class="m">1</span>    <span class="m">0</span>   <span
class="m">2</span>    <span class="m">3</span>   <span class="m">3</span>
   <span class="m">0</span>   <span class="m">4</span>    <span
class="m">2</span>   <span class="m">0</span>   <span class="m">7</span>
  <span class="m">12</span>  <span class="m">6</span>   <span class="m">342</span>
<span class="m">1</span>   <span class="m">0</span>   <span class="m">9</span>
  <span class="m">0</span>    <span class="o">|</span>  <span class="m">396</span>
 p <span class="o">=</span> sci.med
+<span class="m">0</span>   <span class="m">1</span>   <span class="m">0</span>
  <span class="m">1</span>    <span class="m">0</span>   <span
class="m">1</span>    <span class="m">4</span>   <span class="m">0</span>
   <span class="m">3</span>   <span class="m">0</span>    <span
class="m">1</span>   <span class="m">0</span>   <span class="m">8</span>
  <span class="m">4</span>   <span class="m">0</span>   <span class="m">2</span>
  <span class="m">369</span> <span class="m">0</span>   <span class="m">1</span>
  <span class="m">1</span>    <span class="o">|</span>  <span class="m">396</span>
 q <span class="o">=</span> sci.crypt
+<span class="m">10</span>  <span class="m">0</span>   <span class="m">4</span>
  <span class="m">10</span>   <span class="m">1</span>   <span
class="m">5</span>    <span class="m">6</span>   <span class="m">2</span>
   <span class="m">2</span>   <span class="m">6</span>    <span
class="m">2</span>   <span class="m">0</span>   <span class="m">2</span>
  <span class="m">1</span>   <span class="m">86</span>  <span class="m">15</span>
 <span class="m">14</span>  <span class="m">152</span> <span class="m">0</span>
  <span class="m">1</span>    <span class="o">|</span>  <span class="m">319</span>
 r <span class="o">=</span> alt.atheism
+<span class="m">4</span>   <span class="m">0</span>   <span class="m">0</span>
  <span class="m">0</span>    <span class="m">0</span>   <span
class="m">9</span>    <span class="m">1</span>   <span class="m">1</span>
   <span class="m">8</span>   <span class="m">1</span>    <span
class="m">12</span>  <span class="m">0</span>   <span class="m">3</span>
  <span class="m">6</span>   <span class="m">0</span>   <span class="m">2</span>
  <span class="m">0</span>   <span class="m">0</span>   <span class="m">341</span>
<span class="m">2</span>    <span class="o">|</span>  <span class="m">390</span>
 s <span class="o">=</span> misc.forsale
+<span class="m">8</span>   <span class="m">5</span>   <span class="m">0</span>
  <span class="m">0</span>    <span class="m">0</span>   <span
class="m">1</span>    <span class="m">6</span>   <span class="m">0</span>
   <span class="m">8</span>   <span class="m">5</span>    <span
class="m">50</span>  <span class="m">0</span>   <span class="m">40</span>
 <span class="m">2</span>   <span class="m">1</span>   <span class="m">0</span>
  <span class="m">9</span>   <span class="m">0</span>   <span class="m">3</span>
  <span class="m">256</span>  <span class="o">|</span>  <span class="m">394</span>
 t <span class="o">=</span> comp.os.ms<span class="o">-</span>windows.misc
+<span class="o">=======================================================</span>
+Statistics
+<span class="o">-------------------------------------------------------</span>
+Kappa                                       <span class="m">0.8808</span>
+Accuracy                                   <span class="m">90.8596</span><span
class="o">%</span>
+Reliability                                <span class="m">86.3632</span><span
class="o">%</span>
+Reliability <span class="p">(</span>standard deviation<span class="p">)</span>
           <span class="m">0.2131</span>
 </pre></div>
 
 
-<p>t   u   &lt;--Classified as
-    381 0   0   0   0   9   1   0   0   0   1   0   0   2   0   1   0   0   3<br />
-0   0    |  398  a     = rec.motorcycles
-    1   284 0   0   0   0   1   0   6   3   11  0   66  3   0   1   6   0   4<br />
-9   0    |  395  b     = comp.windows.x
-    2   0   339 2   0   3   5   1   0   0   0   0   1   1   12  1   7   0   2<br />
-0   0    |  376  c     = talk.politics.mideast
-    4   0   1   327 0   2   2   0   0   2   1   1   0   5   1   4   12  0   2<br />
-0   0    |  364  d     = talk.politics.guns
-    7   0   4   32  27  7   7   2   0   12  0   0   6   0   100 9   7   31  0<br />
-0   0    |  251  e     = talk.religion.misc
-    10  0   0   0   0   359 2   2   0   1   3   0   1   6   0   1   0   0   11 
-0   0    |  396  f     = rec.autos
-    0   0   0   0   0   1   383 9   1   0   0   0   0   0   0   0   0   0   3<br />
-0   0    |  397  g     = rec.sport.baseball
-    1   0   0   0   0   0   9   382 0   0   0   0   1   1   1   0   2   0   2<br />
-0   0    |  399  h     = rec.sport.hockey
-    2   0   0   0   0   4   3   0   330 4   4   0   5   12  0   0   2   0   12 
-7   0    |  385  i     = comp.sys.mac.hardware
-    0   3   0   0   0   0   1   0   0   368 0   0   10  4   1   3   2   0   2<br />
-0   0    |  394  j     = sci.space
-    0   0   0   0   0   3   1   0   27  2   291 0   11  25  0   0   1   0   13 
-18  0    |  392  k     = comp.sys.ibm.pc.hardware
-    8   0   1   109 0   6   11  4   1   18  0   98  1   3   11  10  27  1   1<br />
-0   0    |  310  l     = talk.politics.misc
-    0   11  0   0   0   3   6   0   10  6   11  0   299 13  0   2   13  0   7<br />
-8   0    |  389  m     = comp.graphics
-    6   0   1   0   0   4   2   0   5   2   12  0   8   321 0   4   14  0   8<br />
-6   0    |  393  n     = sci.electronics
-    2   0   0   0   0   0   4   1   0   3   1   0   3   1   372 6   0   2   1<br />
-2   0    |  398  o     = soc.religion.christian
-    4   0   0   1   0   2   3   3   0   4   2   0   7   12  6   342 1   0   9<br />
-0   0    |  396  p     = sci.med
-    0   1   0   1   0   1   4   0   3   0   1   0   8   4   0   2   369 0   1<br />
-1   0    |  396  q     = sci.crypt
-    10  0   4   10  1   5   6   2   2   6   2   0   2   1   86  15  14  152 0<br />
-1   0    |  319  r     = alt.atheism
-    4   0   0   0   0   9   1   1   8   1   12  0   3   6   0   2   0   0   341
-2   0    |  390  s     = misc.forsale
-    8   5   0   0   0   1   6   0   8   5   50  0   40  2   1   0   9   0   3<br />
-256 0    |  394  t     = comp.os.ms-windows.misc
-    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0<br />
-0   0    |  0    u     = unknown</p>
 <p><a name="TwentyNewsgroups-ComplementaryNaiveBayes"></a></p>
-<h2 id="complementary-naive-bayes">Complementary Naive Bayes</h2>
-<p>To Train a CBayes Classifier using bi-grams</p>
-<div class="codehilite"><pre>$<span class="o">&gt;</span> $<span
class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span
class="o">/</span><span class="n">mahout</span> <span class="n">trainclassifier</span>
<span class="o">\</span>
-  <span class="o">-</span><span class="nb">i</span> 20<span class="n">news</span><span
class="o">-</span><span class="n">input</span> <span class="o">\</span>
-  <span class="o">-</span><span class="n">o</span> <span class="n">newsmodel</span>
<span class="o">\</span>
-  <span class="o">-</span><span class="n">type</span> <span class="n">cbayes</span>
<span class="o">\</span>
-  <span class="o">-</span><span class="n">ng</span> 2 <span class="o">\</span>
-  <span class="o">-</span><span class="n">source</span> <span
class="n">hdfs</span>
+<h2 id="end-to-end-commands-to-build-a-cbayes-model-for-20-newsgroups">End to end commands
to build a CBayes model for 20 Newsgroups:</h2>
+<p>The <a href="http://svn.apache.org/repos/asf/mahout/trunk/examples/bin/classify-20newsgroups.sh">20
newsgroup example script</a> issues the following commands as outlined above. We can
build a CBayes classifier from the command line by following the process in the script: </p>
+<p><em>Be sure that <strong>MAHOUT_HOME</strong>/bin and <strong>HADOOP_HOME</strong>/bin
are in your <strong>$PATH</strong></em></p>
+<ol>
+<li>
+<p>Create a working directory for the dataset and all input/output.</p>
+<div class="codehilite"><pre>    $ export WORK_DIR=/tmp/mahout-work-<span
class="cp">${</span><span class="n">USER</span><span class="cp">}</span>
+    $ mkdir -p <span class="cp">${</span><span class="n">WORK_DIR</span><span
class="cp">}</span>
+</pre></div>
+
+
+</li>
+<li>
+<p>Download and extract the <em>20news-bydate.tar.gz</em> from the <a
href="http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz">20newsgroups
dataset</a> to the working directory.</p>
+<div class="codehilite"><pre>    $ curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz \
+        -o <span class="cp">${</span><span class="n">WORK_DIR</span><span class="cp">}</span>/20news-bydate.tar.gz
+    $ mkdir -p <span class="cp">${</span><span class="n">WORK_DIR</span><span
class="cp">}</span>/20news-bydate
+    $ cd <span class="cp">${</span><span class="n">WORK_DIR</span><span
class="cp">}</span>/20news-bydate <span class="err">&amp;&amp;</span>
tar xzf ../20news-bydate.tar.gz <span class="err">&amp;&amp;</span> cd
.. <span class="err">&amp;&amp;</span> cd ..
+    $ mkdir <span class="cp">${</span><span class="n">WORK_DIR</span><span
class="cp">}</span>/20news-all
+    $ cp -R <span class="cp">${</span><span class="n">WORK_DIR</span><span
class="cp">}</span>/20news-bydate/*/* <span class="cp">${</span><span
class="n">WORK_DIR</span><span class="cp">}</span>/20news-all
+</pre></div>
+
+
+<ul>
+<li>
+<p>If you're running on a hadoop cluster</p>
+<div class="codehilite"><pre>$ hadoop dfs -put <span class="cp">${</span><span
class="n">WORK_DIR</span><span class="cp">}</span>/20news-all <span
class="cp">${</span><span class="n">WORK_DIR</span><span class="cp">}</span>/20news-all
+</pre></div>
+
+
+</li>
+</ul>
+</li>
+<li>
+<p>Convert the full 20newsgroups dataset into a &lt; Text, Text &gt; sequence
file. </p>
+<div class="codehilite"><pre>    $ mahout seqdirectory \
+        -i <span class="cp">${</span><span class="n">WORK_DIR</span><span class="cp">}</span>/20news-all \
+        -o <span class="cp">${</span><span class="n">WORK_DIR</span><span class="cp">}</span>/20news-seq -ow
 </pre></div>
 
 
-<p>To Test a CBayes Classifier using bi-grams</p>
-<div class="codehilite"><pre>$<span class="o">&gt;</span> $<span
class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span
class="o">/</span><span class="n">mahout</span> <span class="n">testclassifier</span>
<span class="o">\</span>
-  <span class="o">-</span><span class="n">m</span> <span class="n">newsmodel</span>
<span class="o">\</span>
-  <span class="o">-</span><span class="n">d</span> 20<span class="n">news</span><span
class="o">-</span><span class="n">input</span> <span class="o">\</span>
-  <span class="o">-</span><span class="n">type</span> <span class="n">cbayes</span>
<span class="o">\</span>
-  <span class="o">-</span><span class="n">ng</span> 2 <span class="o">\</span>
-  <span class="o">-</span><span class="n">source</span> <span
class="n">hdfs</span> <span class="o">\</span>
-  <span class="o">-</span><span class="n">method</span> <span
class="n">mapreduce</span>
+</li>
+<li>
+<p>Convert and preprocess the dataset into a &lt; Text, VectorWritable &gt; sequence file containing term frequencies for each document.</p>
+<div class="codehilite"><pre>    $ mahout seq2sparse \
+        -i <span class="cp">${</span><span class="n">WORK_DIR</span><span class="cp">}</span>/20news-seq \
+        -o <span class="cp">${</span><span class="n">WORK_DIR</span><span class="cp">}</span>/20news-vectors \
+        -lnorm \
+        -nv \
+        -wt tfidf
+</pre></div>
+
+
+<p>If we wanted to use different parsing methods or transformations on the term frequency vectors, we could supply different options here, e.g. -ng 2 for bi-grams or -n 2 for L2 length normalization. See the <a href="http://mahout.apache.org/users/basics/creating-vectors-from-text.html">Creating vectors from text</a> page for a list of all seq2sparse options.</p>
+</li>
+<li>
+<p>Split the preprocessed dataset into training and testing sets.</p>
+<div class="codehilite"><pre>    $ mahout split \
+        -i <span class="cp">${</span><span class="n">WORK_DIR</span><span class="cp">}</span>/20news-vectors/tfidf-vectors \
+        --trainingOutput <span class="cp">${</span><span class="n">WORK_DIR</span><span class="cp">}</span>/20news-train-vectors \
+        --testOutput <span class="cp">${</span><span class="n">WORK_DIR</span><span class="cp">}</span>/20news-test-vectors \
+        --randomSelectionPct 40 \
+        --overwrite --sequenceFiles -xm sequential
 </pre></div>
+
+
+</li>
+<li>
+<p>Train the classifier.</p>
+<div class="codehilite"><pre>    $ mahout trainnb \
+        -i <span class="cp">${</span><span class="n">WORK_DIR</span><span class="cp">}</span>/20news-train-vectors -el \
+        -o <span class="cp">${</span><span class="n">WORK_DIR</span><span class="cp">}</span>/model \
+        -li <span class="cp">${</span><span class="n">WORK_DIR</span><span class="cp">}</span>/labelindex \
+        -ow \
+        -c
+</pre></div>
+
+
+</li>
+<li>
+<p>Test the classifier.</p>
+<div class="codehilite"><pre>    $ mahout testnb \
+        -i <span class="cp">${</span><span class="n">WORK_DIR</span><span class="cp">}</span>/20news-test-vectors \
+        -m <span class="cp">${</span><span class="n">WORK_DIR</span><span class="cp">}</span>/model \
+        -l <span class="cp">${</span><span class="n">WORK_DIR</span><span class="cp">}</span>/labelindex \
+        -ow \
+        -o <span class="cp">${</span><span class="n">WORK_DIR</span><span class="cp">}</span>/20news-testing \
+        -c
+</pre></div>
+
+
+</li>
+</ol>
    </div>
   </div>     
 </div> 
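
The prerequisites in the page above name HADOOP_HOME and MAHOUT_HOME, and the local-mode alternative sets MAHOUT_LOCAL. A minimal sketch of a pre-flight check follows, assuming a POSIX shell. The helper name `missing_vars` is hypothetical and is not part of Mahout or of the example script.

```shell
# Hypothetical helper (not part of Mahout): print the name of every
# listed environment variable that is unset or empty, one per line.
# The names are passed to eval, so only call it with trusted literals.
missing_vars() {
  for name in "$@"; do
    # Indirect lookup: expands to value=${NAME:-} for the current name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      printf '%s\n' "$name"
    fi
  done
}
```

Calling `missing_vars MAHOUT_HOME HADOOP_HOME` before running the example prints nothing when both variables are set. When running with MAHOUT_LOCAL set, checking MAHOUT_HOME alone may be enough.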


