mahout-commits mailing list archives

Subject [CONF] Apache Lucene Mahout: WikipediaBayesExample (page edited)
Date Fri, 31 Oct 2008 16:05:02 GMT
WikipediaBayesExample (MAHOUT) edited by Grant Ingersoll


h1. Intro

The Mahout Examples source comes with tools for classifying a Wikipedia data dump using either
the Naive Bayes or Complementary Naive Bayes implementations in Mahout.  The example (described
below) gets a Wikipedia dump and then splits it up into chunks.  These chunks are then further
split by country.  From these splits, a classifier is trained to predict which country an unseen
article belongs to.

h1. Running the example
NOTE: Substitute the appropriate Mahout version in the commands below (i.e. replace 0.1-dev
with the version you are using).
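
One way to make that substitution in a single place is to hold the version in a shell variable and derive the jar path from it. This is only a minimal sketch: the variable names {{MAHOUT_VERSION}} and {{EXAMPLES_JAR}} and the placeholder path are illustrative assumptions, not something Mahout defines, and the jar name pattern is copied from the train and test commands on this page.

```shell
# Hypothetical helper variables; adjust them to your installation.
MAHOUT_HOME="${MAHOUT_HOME:-/path/to/mahout}"
MAHOUT_VERSION="${MAHOUT_VERSION:-0.1-dev}"

# Jar name pattern copied from the train/test commands on this page.
EXAMPLES_JAR="$MAHOUT_HOME/examples/build/apache-mahout-examples-$MAHOUT_VERSION.jar"

echo "$EXAMPLES_JAR"
```

Changing {{MAHOUT_VERSION}} then updates every command that refers to {{$EXAMPLES_JAR}}.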

# cd <MAHOUT_HOME>/examples
# ant enwiki-files
# Chunk the data into pieces: {code}<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/build/apache-mahout-0.1-dev-ex.jar org.apache.mahout.classifier.bayes.WikipediaXmlSplitter -d <MAHOUT_HOME>/examples/temp/enwiki-latest-pages-articles.xml -o <MAHOUT_HOME>/examples/work/wikipedia/chunks/ -c 64{code}
# Move the chunks to HDFS (the next step reads them from wikipediadump): {code}<HADOOP_HOME>/bin/hadoop dfs -put <MAHOUT_HOME>/examples/work/wikipedia/chunks/ wikipediadump{code}
# Create the country-based split of the Wikipedia dataset: {code}<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/build/apache-mahout-examples-0.1-dev.jar org.apache.mahout.classifier.bayes.WikipediaDatasetCreator -i wikipediadump -o wikipediainput -c <MAHOUT_HOME>/examples/src/test/resources/country.txt{code}
# Train the classifier: {code}<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/build/apache-mahout-examples-0.1-dev.jar org.apache.mahout.classifier.bayes.TrainClassifier -t -i wikipediainput -o wikipediamodel{code}
# Fetch the input files for testing: {code}<HADOOP_HOME>/bin/hadoop dfs -get wikipediainput wikipediainput{code}
# Test the classifier: {code}<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/build/apache-mahout-examples-0.1-dev.jar org.apache.mahout.classifier.bayes.TestClassifier -p wikipediamodel -t wikipediainput{code}
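
The split, train, fetch, and test steps above could be chained in one script. The following is a rough sketch, not a script shipped with Mahout: the names {{DRY_RUN}}, {{run}}, {{HADOOP}}, {{JAR}}, and {{PKG}} and the placeholder paths are assumptions made for illustration. It assumes the chunks are already in HDFS under wikipediadump, and by default it only prints each command so the arguments can be checked first.

```shell
#!/bin/sh
# Sketch: run the split, train, fetch, and test steps in order.
# DRY_RUN, run, HADOOP, JAR, and PKG are illustrative names, not part of Mahout or Hadoop.
set -e   # stop at the first step that fails

HADOOP_HOME="${HADOOP_HOME:-/path/to/hadoop}"
MAHOUT_HOME="${MAHOUT_HOME:-/path/to/mahout}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, anything else = execute them

HADOOP="$HADOOP_HOME/bin/hadoop"
JAR="$MAHOUT_HOME/examples/build/apache-mahout-examples-0.1-dev.jar"
PKG="org.apache.mahout.classifier.bayes"

# Print a command, then execute it unless DRY_RUN is 1.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

# Country-based split (assumes the chunks are already in HDFS as wikipediadump).
run "$HADOOP" jar "$JAR" "$PKG.WikipediaDatasetCreator" \
  -i wikipediadump -o wikipediainput \
  -c "$MAHOUT_HOME/examples/src/test/resources/country.txt"

# Train, fetch the input for testing, then test.
run "$HADOOP" jar "$JAR" "$PKG.TrainClassifier" -t -i wikipediainput -o wikipediamodel
run "$HADOOP" dfs -get wikipediainput wikipediainput
run "$HADOOP" jar "$JAR" "$PKG.TestClassifier" -p wikipediamodel -t wikipediainput
```

Running it with {{DRY_RUN=0}} executes the commands instead of printing them only; because of {{set -e}}, a failed step stops the script before the later steps run.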
