mahout-commits mailing list archives

From conflue...@apache.org
Subject [CONF] Apache Lucene Mahout: WikipediaBayesExample (page edited)
Date Thu, 28 May 2009 13:27:03 GMT
WikipediaBayesExample (MAHOUT) edited by Grant Ingersoll
      Page: http://cwiki.apache.org/confluence/display/MAHOUT/WikipediaBayesExample
   Changes: http://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=100721&originalVersion=4&revisedVersion=5

Content:
---------------------------------------------------------------------

h1. Intro

The Mahout examples source comes with tools for classifying a Wikipedia data dump using either
the Naive Bayes or the Complementary Naive Bayes implementation in Mahout.  The example (described
below) downloads a Wikipedia dump and splits it into chunks.  These chunks are then further
split by country.  From these splits, a classifier is trained to predict which country an unseen
article should be categorized into.
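
At a glance, the data flows through the example like this (an illustrative summary of the steps below, using the same file and directory names):

{code}
# Illustrative data flow (names match the steps below):
# enwiki-latest-pages-articles.xml    raw Wikipedia dump (local)
#   -> work/wikipedia/chunks/         fixed-size XML chunks (WikipediaXmlSplitter)
#   -> wikipediadump                  chunks copied into HDFS
#   -> wikipediainput                 articles split by country (WikipediaDatasetCreator)
#   -> wikipediamodel                 trained Bayes model (TrainClassifier)
#   -> evaluation output              model accuracy on test articles (TestClassifier)
{code}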


h1. Running the example
NOTE: Substitute the appropriate version of Mahout as needed below (i.e. replace 0.1-dev
with the appropriate value). The complete sequence of commands is also collected into a single
script sketch after the list.

# Change to the examples directory: {code}cd <MAHOUT_HOME>/examples{code}
# Download the Wikipedia dump files: {code}ant -f build-deprecated.xml enwiki-files{code}
# Chunk the data into pieces: {code}<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.jar
org.apache.mahout.classifier.bayes.WikipediaXmlSplitter -d <MAHOUT_HOME>/examples/temp/enwiki-latest-pages-articles.xml
-o <MAHOUT_HOME>/examples/work/wikipedia/chunks/ -c 64{code} {quote}*We strongly suggest
you back up the resulting chunks somewhere else so that you don't have to repeat this step
if they are accidentally erased.*{quote}
# Move the chunks to HDFS:  {code}<HADOOP_HOME>/bin/hadoop dfs -put <MAHOUT_HOME>/examples/work/wikipedia/chunks/
wikipediadump{code}
# Create the country-based split of the Wikipedia dataset: {code}<HADOOP_HOME>/bin/hadoop
jar <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.jar org.apache.mahout.classifier.bayes.WikipediaDatasetCreator
-i wikipediadump -o wikipediainput -c <MAHOUT_HOME>/examples/src/test/resources/country.txt{code}
# Train the classifier: {code}<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.job
org.apache.mahout.classifier.bayes.TrainClassifier -i wikipediainput -o wikipediamodel --gramSize
3 -classifierType bayes{code}
# Fetch the input files for testing: {code}<HADOOP_HOME>/bin/hadoop dfs -get wikipediainput
wikipediainput{code}
# Test the classifier: {code}<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.jar
org.apache.mahout.classifier.bayes.TestClassifier -p wikipediamodel -t wikipediainput{code}
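
For convenience, the whole sequence can be collected into one shell script. This is a rough sketch rather than anything shipped with Mahout: it assumes MAHOUT_HOME and HADOOP_HOME are exported, and that the jar/job file names match your build (substitute the version as noted above).

{code}
#!/bin/sh
# Sketch of the full pipeline described above; adjust paths and the version to your build.
VERSION=0.1-dev
TARGET=$MAHOUT_HOME/examples/target

cd $MAHOUT_HOME/examples
ant -f build-deprecated.xml enwiki-files

# 1. Split the raw dump into chunks locally
$HADOOP_HOME/bin/hadoop jar $TARGET/apache-mahout-examples-$VERSION.jar \
  org.apache.mahout.classifier.bayes.WikipediaXmlSplitter \
  -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml \
  -o $MAHOUT_HOME/examples/work/wikipedia/chunks/ -c 64

# 2. Copy the chunks into HDFS
$HADOOP_HOME/bin/hadoop dfs -put $MAHOUT_HOME/examples/work/wikipedia/chunks/ wikipediadump

# 3. Split the dataset by country
$HADOOP_HOME/bin/hadoop jar $TARGET/apache-mahout-examples-$VERSION.jar \
  org.apache.mahout.classifier.bayes.WikipediaDatasetCreator \
  -i wikipediadump -o wikipediainput -c $MAHOUT_HOME/examples/src/test/resources/country.txt

# 4. Train the Bayes model
$HADOOP_HOME/bin/hadoop jar $TARGET/apache-mahout-examples-$VERSION.job \
  org.apache.mahout.classifier.bayes.TrainClassifier \
  -i wikipediainput -o wikipediamodel --gramSize 3 -classifierType bayes

# 5. Fetch the test input back out of HDFS and evaluate the model
$HADOOP_HOME/bin/hadoop dfs -get wikipediainput wikipediainput
$HADOOP_HOME/bin/hadoop jar $TARGET/apache-mahout-examples-$VERSION.jar \
  org.apache.mahout.classifier.bayes.TestClassifier \
  -p wikipediamodel -t wikipediainput
{code}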


---------------------------------------------------------------------
