mahout-commits mailing list archives

From build...@apache.org
Subject svn commit: r887399 - in /websites/staging/mahout/trunk/content: ./ users/classification/wikipedia-bayes-example.html
Date Wed, 20 Nov 2013 20:14:20 GMT
Author: buildbot
Date: Wed Nov 20 20:14:19 2013
New Revision: 887399

Log:
Staging update by buildbot for mahout

Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    websites/staging/mahout/trunk/content/users/classification/wikipedia-bayes-example.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Wed Nov 20 20:14:19 2013
@@ -1 +1 @@
-1543924
+1543926

Modified: websites/staging/mahout/trunk/content/users/classification/wikipedia-bayes-example.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/classification/wikipedia-bayes-example.html (original)
+++ websites/staging/mahout/trunk/content/users/classification/wikipedia-bayes-example.html Wed Nov 20 20:14:19 2013
@@ -381,7 +381,8 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <p><a name="WikipediaBayesExample-Intro"></a></p>
+    <h1 id="naive-bayes-wikipedia-example">Naive Bayes Wikipedia Example</h1>
+<p><a name="WikipediaBayesExample-Intro"></a></p>
 <h1 id="intro">Intro</h1>
 <p>The Mahout Examples source comes with tools for classifying a Wikipedia
 data dump using either the Naive Bayes or Complementary Naive Bayes
@@ -392,33 +393,47 @@ what country an unseen article should be
 <p><a name="WikipediaBayesExample-Runningtheexample"></a></p>
 <h1 id="running-the-example">Running the example</h1>
 <ol>
-<li>download the wikipedia data set <a href="-http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2.html">here</a></li>
-<li>unzip the bz2 file to get the enwiki-latest-pages-articles.xml. </li>
-<li>Create directory $MAHOUT_HOME/examples/temp and copy the xml file into
-this directory</li>
-<li>Chunk the Data into pieces: {code}$MAHOUT_HOME/bin/mahout
-wikipediaXMLSplitter -d
-$MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles10.xml -o
-wikipedia/chunks -c 64{code} {quote}<em>We strongly suggest you backup the
-results to some other place so that you don't have to do this step again in
-case it gets accidentally erased</em>{quote}</li>
-<li>This would have created the chunks in HDFS. Verify the same by executing
-{code}hadoop fs -ls wikipedia/chunks{code} and it'll list all the xml files
-as chunk-0001.xml and so on.</li>
+<li>
+<p>Download the wikipedia data set <a href="http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2.html">here</a>.</p>
+</li>
+<li>
+<p>Unzip the bz2 file to get the enwiki-latest-pages-articles.xml. </p>
+</li>
+<li>
+<p>Create directory <code>$MAHOUT_HOME/examples/temp</code> and copy the xml file into this directory</p>
+</li>
+<li>
+<p>Chunk the Data into pieces: <code>$MAHOUT_HOME/bin/mahout wikipediaXMLSplitter -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles10.xml -o wikipedia/chunks -c 64</code>
+<em>We strongly suggest you back up the results to some other place so that you don't have to do this step again in
+case it gets accidentally erased.</em></p>
+</li>
+<li>
+<p>This creates the chunks in HDFS. Verify this by executing
+<code>hadoop fs -ls wikipedia/chunks</code>, which will list the xml files as chunk-0001.xml and so on.</p>
+</li>
 <li>
 <p>Create the countries based Split of wikipedia dataset.
-{code}$MAHOUT_HOME/bin/mahout  wikipediaDataSetCreator  -i wikipedia/chunks
--o wikipediainput -c $MAHOUT_HOME/examples/src/test/resources/country.txt</p>
-<h1 id="verify-the-creation-of-input-data-set-by-executing-code-hadoop-fs-ls">Verify
the creation of input data set by executing {code} hadoop fs -ls</h1>
-<p>wikipediainput {code} and you'll be able to see part-r-00000 file inside
+<code>$MAHOUT_HOME/bin/mahout  wikipediaDataSetCreator   -i wikipedia/chunks
+-o wikipediainput -c $MAHOUT_HOME/examples/src/test/resources/country.txt</code>.</p>
+</li>
+</ol>
+<p><br><br></p>
+<p>After input preparation, start the actual training:</p>
+<ul>
+<li>
+<p>Verify the creation of the input data set by executing <code>hadoop fs -ls wikipediainput</code>; you'll see a part-r-00000 file inside the
 wikipediainput directory</p>
-<h1 id="train-the-classifier-codemahout_homebinmahout-trainclassifier-i">Train the
classifier: {code}$MAHOUT_HOME/bin/mahout trainclassifier -i</h1>
-<p>wikipediainput -o wikipediamodel{code}. The model file will be available in
+</li>
+<li>
+<p>Train the classifier: <code>$MAHOUT_HOME/bin/mahout trainclassifier -i
+wikipediainput -o wikipediamodel</code>. The model file will be available in
 the wikipediamodel folder in HDFS.</p>
-<h1 id="test-the-classifier-codemahout_homebinmahout-testclassifier-m">Test the classifier:
{code}$MAHOUT_HOME/bin/mahout testclassifier -m</h1>
-<p>wikipediamodel -d wikipediainput{code}</p>
 </li>
-</ol>
+<li>
+<p>Test the classifier: <code>$MAHOUT_HOME/bin/mahout testclassifier -m
+wikipediamodel -d wikipediainput</code></p>
+</li>
+</ul>
    </div>
   </div>     
 </div> 
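
The updated page walks through the data-preparation commands one step at a time; for convenience, here is the same sequence collected into a single shell sketch. It assumes $MAHOUT_HOME is set, HDFS is reachable, and the Wikipedia dump has already been downloaded and unpacked; the file and directory names are copied from the page (including the enwiki-latest-pages-articles10.xml input to the splitter).

    # Data preparation as described on the page above (a sketch, not a tested script).
    mkdir -p $MAHOUT_HOME/examples/temp
    cp enwiki-latest-pages-articles.xml $MAHOUT_HOME/examples/temp/

    # Split the dump into chunks (-c 64, as on the page) stored in HDFS under wikipedia/chunks.
    # The input file name below is copied verbatim from the page.
    $MAHOUT_HOME/bin/mahout wikipediaXMLSplitter \
      -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles10.xml \
      -o wikipedia/chunks -c 64
    # Back up wikipedia/chunks elsewhere so the split need not be redone if it is erased.
    hadoop fs -ls wikipedia/chunks        # expect chunk-0001.xml, chunk-0002.xml, ...

    # Build the country-labelled training input from the chunks.
    $MAHOUT_HOME/bin/mahout wikipediaDataSetCreator \
      -i wikipedia/chunks -o wikipediainput \
      -c $MAHOUT_HOME/examples/src/test/resources/country.txt
    hadoop fs -ls wikipediainput          # expect a part-r-00000 file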



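Once the input is prepared, training and testing are each a single command. Again a sketch under the same assumptions, reusing the commands exactly as they appear on the page; the output location (wikipediamodel) is in HDFS.

    # Train the Naive Bayes classifier on the country-labelled input.
    $MAHOUT_HOME/bin/mahout trainclassifier -i wikipediainput -o wikipediamodel
    hadoop fs -ls wikipediamodel          # the trained model files land here

    # Evaluate the model; the page tests against the same wikipediainput data.
    $MAHOUT_HOME/bin/mahout testclassifier -m wikipediamodel -d wikipediainput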