mahout-commits mailing list archives

Subject [CONF] Apache Lucene Mahout: WikipediaBayesExample (page edited)
Date Thu, 28 May 2009 13:27:03 GMT
WikipediaBayesExample (MAHOUT) edited by Grant Ingersoll


h1. Intro

The Mahout Examples source comes with tools for classifying a Wikipedia data dump using either
the Naive Bayes or Complementary Naive Bayes implementations in Mahout.  The example (described
below) downloads a Wikipedia dump and splits it into chunks.  The articles in these chunks are
then grouped by country.  From these groupings, a classifier is trained to predict which country
an unseen article should be categorized into.

h1. Running the example
NOTE: Substitute in the appropriate version of Mahout as needed below (i.e. replace 0.1-dev
with the appropriate value)

# cd <MAHOUT_HOME>/examples
# ant -f build-deprecated.xml enwiki-files
# Chunk the data into pieces: {code}<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/target/apache-mahout-0.1-dev-ex.jar
org.apache.mahout.classifier.bayes.WikipediaXmlSplitter -d <MAHOUT_HOME>/examples/temp/enwiki-latest-pages-articles.xml
-o <MAHOUT_HOME>/examples/work/wikipedia/chunks/ -c 64{code} {quote}*We strongly suggest
you back up the results to some other place so that you do not have to repeat this step if
they are accidentally erased.*{quote}
# Move the chunks to HDFS: {code}<HADOOP_HOME>/bin/hadoop dfs -put <MAHOUT_HOME>/examples/work/wikipedia/chunks/ wikipediadump{code}
# Create the country-based split of the Wikipedia dataset: {code}<HADOOP_HOME>/bin/hadoop
jar <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.jar org.apache.mahout.classifier.bayes.WikipediaDatasetCreator
-i wikipediadump -o wikipediainput -c <MAHOUT_HOME>/examples/src/test/resources/country.txt{code}
# Train the classifier: {code}<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.job
org.apache.mahout.classifier.bayes.TrainClassifier -i wikipediainput -o wikipediamodel --gramSize
3 -classifierType bayes{code}
# Fetch the input files for testing: {code}<HADOOP_HOME>/bin/hadoop dfs -get wikipediainput
wikipediainput{code}
# Test the classifier: {code}<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.jar
org.apache.mahout.classifier.bayes.TestClassifier -p wikipediamodel -t wikipediainput{code}
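The steps above can be collected into a single driver script. This is a hedged sketch, not part of the Mahout distribution: the {{MAHOUT_HOME}}/{{HADOOP_HOME}} defaults, the {{VERSION}} value, and the single consolidated jar name are assumptions (jar names varied between releases, so check your own {{examples/target/}} directory). It runs as a dry run by default, printing each command in order; clear the {{RUN}} variable to actually execute them.

```shell
#!/bin/sh
# Dry-run sketch of the Wikipedia Bayes example pipeline.
# MAHOUT_HOME, HADOOP_HOME, VERSION, and the jar name below are
# assumptions -- substitute the values for your own installation.
MAHOUT_HOME=${MAHOUT_HOME:-$HOME/mahout}
HADOOP_HOME=${HADOOP_HOME:-$HOME/hadoop}
VERSION=${VERSION:-0.1-dev}
JAR=$MAHOUT_HOME/examples/target/apache-mahout-examples-$VERSION.jar
HADOOP=$HADOOP_HOME/bin/hadoop
RUN=echo   # set RUN= (empty) to execute instead of printing

# 1. Split the XML dump into 64 MB chunks
$RUN $HADOOP jar $JAR org.apache.mahout.classifier.bayes.WikipediaXmlSplitter \
    -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml \
    -o $MAHOUT_HOME/examples/work/wikipedia/chunks/ -c 64

# 2. Copy the chunks into HDFS
$RUN $HADOOP dfs -put $MAHOUT_HOME/examples/work/wikipedia/chunks/ wikipediadump

# 3. Split the articles by country
$RUN $HADOOP jar $JAR org.apache.mahout.classifier.bayes.WikipediaDatasetCreator \
    -i wikipediadump -o wikipediainput \
    -c $MAHOUT_HOME/examples/src/test/resources/country.txt

# 4. Train the Naive Bayes classifier on trigrams
$RUN $HADOOP jar $JAR org.apache.mahout.classifier.bayes.TrainClassifier \
    -i wikipediainput -o wikipediamodel --gramSize 3 -classifierType bayes

# 5. Fetch the test input locally and evaluate the trained model
$RUN $HADOOP dfs -get wikipediainput wikipediainput
$RUN $HADOOP jar $JAR org.apache.mahout.classifier.bayes.TestClassifier \
    -p wikipediamodel -t wikipediainput
```

Because each step only depends on the output paths of the previous one, a failed step can be rerun on its own once the cause is fixed.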
