mahout-commits mailing list archives

From conflue...@apache.org
Subject [CONF] Apache Mahout > Twenty Newsgroups
Date Wed, 27 Oct 2010 03:54:01 GMT
Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Twenty Newsgroups (https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups)

Change Comment:
---------------------------------------------------------------------
removed section for mahout 0.1 under the assumption that people would be using the latest
0.4. added some intro. modified the commands to use mahout utility 

Edited by Joe Prasanna Kumar:
---------------------------------------------------------------------
h2. Twenty Newsgroups Classification Example

h2. Introduction
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned
(nearly) evenly across 20 different newsgroups. The collection has become a
popular data set for experiments in text applications of machine learning techniques, such
as text classification and text clustering. We will use the Mahout Bayes classifier to create
a model that classifies a new document into one of the 20 newsgroups.

h2. Prerequisites

* Mahout has been downloaded ([instructions here|http://cwiki.apache.org/confluence/display/MAHOUT/index#index-Installation%2FSetup])
* Maven is available
* Your environment has the following variables:
| {{HADOOP_HOME}} | Environment variable that points to the directory where Hadoop is installed |
| {{MAHOUT_HOME}} | Environment variable that points to the directory where Mahout is installed |
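
Before going further it can help to confirm that both variables are set. A minimal sketch, assuming a POSIX shell; {{check_env}} is a hypothetical helper name and not part of Mahout or Hadoop:

```shell
# check_env: succeed only if HADOOP_HOME and MAHOUT_HOME both name
# existing directories (hypothetical helper, for illustration only).
check_env() {
  status=0
  for name in HADOOP_HOME MAHOUT_HOME; do
    # Indirectly read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set" >&2
      status=1
    elif [ ! -d "$value" ]; then
      echo "$name=$value is not a directory" >&2
      status=1
    fi
  done
  return "$status"
}
```

Running {{check_env && echo ok}} prints {{ok}} only when both directories exist.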

h2. Setting up Mahout

After downloading the distribution, unzip/untar it into the directory of your choice and do:

# In the trunk directory, compile everything and create the Hadoop job:
{noformat}mvn install{noformat}

h3. For Mahout 0.2+:

# Create a directory for the 20 Newsgroups data:
{noformat}
$ mkdir $MAHOUT_HOME/examples/bin/work/
$ cd $MAHOUT_HOME/examples/bin/work/
{noformat}
# Download {{20news-bydate.tar.gz}} from the [20newsgroups dataset|http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz]
# Extract dataset:
{noformat}$ tar zxf 20news-bydate.tar.gz{noformat}
# Generate input dataset (run from {{$MAHOUT_HOME}}, because the paths below are relative to it):
{noformat}
$ $MAHOUT_HOME/bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
  -p examples/bin/work/20news-bydate/20news-bydate-train \
  -o examples/bin/work/20news-bydate/bayes-train-input \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer \
  -c UTF-8
{noformat}
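
The test command in the next section reads {{bayes-test-input}}, which the step above does not create. A minimal sketch of one way to produce it, assuming the extracted archive also contains a {{20news-bydate-test}} directory next to {{20news-bydate-train}} and that the same options apply to it; {{prepare_split}} is a hypothetical helper name, not part of Mahout:

```shell
# prepare_split SPLIT: run PrepareTwentyNewsgroups for one split.
# SPLIT is "train" or "test"; all flags are the ones used in the step above.
# Assumption: the test split lives in 20news-bydate-test beside the train split.
prepare_split() {
  split="$1"
  "$MAHOUT_HOME/bin/mahout" org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
    -p "examples/bin/work/20news-bydate/20news-bydate-$split" \
    -o "examples/bin/work/20news-bydate/bayes-$split-input" \
    -a org.apache.mahout.vectorizer.DefaultAnalyzer \
    -c UTF-8
}
```

Calling {{prepare_split train}} and then {{prepare_split test}} from {{$MAHOUT_HOME}} would write both input directories.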

h2. Running the example without Hadoop

This assumes you're running Mahout 0.2+:

# Train the classifier:
{noformat}
$ $MAHOUT_HOME/bin/mahout trainclassifier \
  -i examples/bin/work/20news-bydate/bayes-train-input \
  -o examples/bin/work/20news-bydate/bayes-model \
  -type bayes \
  -ng 1 \
  -source hdfs
{noformat}
# Test over input:
{noformat}
$ $MAHOUT_HOME/bin/mahout testclassifier \
  -m examples/bin/work/20news-bydate/bayes-model \
  -d examples/bin/work/20news-bydate/bayes-test-input \
  -type bayes \
  -ng 1 \
  -source hdfs \
  -method sequential
{noformat}

h2. Running the 20newsgroups example on a Hadoop cluster

h3. Set Up Hadoop Cluster
# Edit {{hadoop-site.xml}}; add in local settings per [Hadoop quickstart|http://hadoop.apache.org/core/docs/current/quickstart.html]
{noformat}
emacs $HADOOP_HOME/conf/hadoop-site.xml
{noformat}
# Format the HDFS
{noformat}
$ $HADOOP_HOME/bin/hadoop namenode -format
{noformat}
# Start Hadoop
{noformat}
$ $HADOOP_HOME/bin/start-all.sh
{noformat}
# Copy extracted text to HDFS
{noformat}
$ $HADOOP_HOME/bin/hadoop dfs -put $MAHOUT_HOME/examples/bin/work/20news-bydate/bayes-train-input \
  20news-input
{noformat}

h3. Train Bayes Classifier Using Tri-grams

The following runs four MapReduce jobs on Hadoop to train the classifier; expect it to take a
while on a single-node machine.

{noformat}
$ $MAHOUT_HOME/bin/mahout trainclassifier \
  -i 20news-input \
  -o newsmodel \
  -type bayes \
  -ng 3 \
  -source hdfs
{noformat}

You can monitor the status of these jobs by opening a web browser on your Job Tracker node:
[http://localhost:50030/jobtracker.jsp]
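
The {{-ng}} option sets the n-gram size: 1 in the local run, 3 (tri-grams) here. To show what a word tri-gram is, the sketch below prints every run of N consecutive words from a line of text. It is only an illustration; {{ngrams}} is a hypothetical helper, and Mahout's own tokenization and feature generation may differ:

```shell
# ngrams N TEXT: print each run of N consecutive whitespace-separated
# words in TEXT, one per line (illustration only, not Mahout's code).
ngrams() {
  n="$1"
  printf '%s\n' "$2" | awk -v n="$n" '{
    for (i = 1; i + n - 1 <= NF; i++) {
      gram = $i
      for (j = 1; j < n; j++) gram = gram " " $(i + j)
      print gram
    }
  }'
}
```

For example, {{ngrams 3 "the quick brown fox"}} prints {{the quick brown}} and {{quick brown fox}}.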

Test the classifier over the input folder:
{noformat}
$ $MAHOUT_HOME/bin/mahout testclassifier \
  -m newsmodel \
  -d 20news-input \
  -type bayes \
  -ng 3 \
  -source hdfs \
  -method mapreduce
{noformat}

Output might look like:
{noformat}
08/11/07 16:52:39 INFO bayes.TestClassifier: Done loading model: # labels: 20
08/11/07 16:52:39 INFO bayes.TestClassifier: Done generating Model
08/11/07 16:52:57 INFO bayes.TestClassifier: alt.atheism    96.9962453066333    775/799.0
08/11/07 16:53:15 INFO bayes.TestClassifier: comp.graphics    99.28057553956835    966/973.0
08/11/07 16:53:45 INFO bayes.TestClassifier: comp.os.ms-windows.misc    96.95431472081218    955/985.0
08/11/07 16:53:59 INFO bayes.TestClassifier: comp.sys.ibm.pc.hardware    99.59266802443992    978/982.0
08/11/07 16:54:10 INFO bayes.TestClassifier: comp.sys.mac.hardware    99.47970863683663    956/961.0
08/11/07 16:54:28 INFO bayes.TestClassifier: comp.windows.x    99.59183673469387    976/980.0
08/11/07 16:54:38 INFO bayes.TestClassifier: misc.forsale    98.45679012345678    957/972.0
08/11/07 16:54:50 INFO bayes.TestClassifier: rec.autos    99.4949494949495    985/990.0
08/11/07 16:55:04 INFO bayes.TestClassifier: rec.motorcycles    100.0    994/994.0
08/11/07 16:55:16 INFO bayes.TestClassifier: rec.sport.baseball    99.89939637826961    993/994.0
08/11/07 16:55:36 INFO bayes.TestClassifier: rec.sport.hockey    99.89989989989989    998/999.0
08/11/07 16:55:54 INFO bayes.TestClassifier: sci.crypt    99.39455095862765    985/991.0
08/11/07 16:56:05 INFO bayes.TestClassifier: sci.electronics    98.98063200815494    971/981.0
08/11/07 16:56:27 INFO bayes.TestClassifier: sci.med    99.79797979797979    988/990.0
08/11/07 16:56:44 INFO bayes.TestClassifier: sci.space    99.3920972644377    981/987.0
08/11/07 16:57:06 INFO bayes.TestClassifier: soc.religion.christian    99.49849548645938    992/997.0
08/11/07 16:57:24 INFO bayes.TestClassifier: talk.politics.guns    99.45054945054945    905/910.0
08/11/07 16:57:51 INFO bayes.TestClassifier: talk.politics.mideast    98.82978723404256    929/940.0
08/11/07 16:58:13 INFO bayes.TestClassifier: talk.politics.misc    89.93548387096774    697/775.0
08/11/07 16:58:25 INFO bayes.TestClassifier: talk.religion.misc    61.78343949044586    388/628.0
08/11/07 16:58:25 INFO bayes.TestClassifier: =======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :      18369   97.5621%
Incorrectly Classified Instances        :        459    2.4379%
Total Classified Instances              :      18828

=======================================================
Confusion Matrix
-------------------------------------------------------
a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p    q    r    s    t    <--Classified as
994  0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0     |  994   a     = rec.motorcycles
0    976  0    0    0    0    0    0    0    0    1    0    0    0    0    0    0    0    2    1     |  980   b     = comp.windows.x
7    0    929  1    0    0    0    0    0    0    0    0    1    0    2    0    0    0    0    0     |  940   c     = talk.politics.mideast
0    0    0    905  0    0    1    0    0    0    0    0    0    0    0    0    3    0    1    0     |  910   d     = talk.politics.guns
4    1    4    27   388  1    0    1    0    5    1    1    2    2    149  7    2    33   0    0     |  628   e     = talk.religion.misc
3    0    0    0    0    985  0    1    0    0    0    0    0    1    0    0    0    0    0    0     |  990   f     = rec.autos
0    0    0    0    0    0    993  1    0    0    0    0    0    0    0    0    0    0    0    0     |  994   g     = rec.sport.baseball
0    0    0    0    0    0    1    998  0    0    0    0    0    0    0    0    0    0    0    0     |  999   h     = rec.sport.hockey
0    0    0    0    0    0    0    0    956  0    2    0    0    0    0    0    0    0    2    1     |  961   i     = comp.sys.mac.hardware
0    0    0    0    0    0    0    0    0    981  0    0    5    0    0    1    0    0    0    0     |  987   j     = sci.space
0    0    0    0    0    0    0    0    0    0    978  0    1    0    0    0    0    0    2    1     |  982   k     = comp.sys.ibm.pc.hardware
1    0    3    36   0    1    2    1    0    5    0    697  4    0    3    3    19   0    0    0     |  775   l     = talk.politics.misc
0    2    0    0    0    0    0    0    0    0    2    0    966  0    0    0    0    0    2    1     |  973   m     = comp.graphics
1    0    0    0    0    0    0    0    0    0    6    0    0    971  0    0    0    0    3    0     |  981   n     = sci.electronics
1    0    0    0    0    0    0    0    1    0    0    0    0    0    992  1    0    1    0    1     |  997   o     = soc.religion.christian
0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    988  0    0    0    1     |  990   p     = sci.med
0    0    0    2    0    0    0    0    0    0    0    0    2    1    0    0    985  0    1    0     |  991   q     = sci.crypt
0    0    0    1    1    0    0    0    0    1    0    0    1    0    19   0    1    775  0    0     |  799   r     = alt.atheism
1    0    0    0    0    3    1    2    0    0    3    0    0    5    0    0    0    0    957  0     |  972   s     = misc.forsale
0    0    0    8    0    0    0    0    0    0    6    0    6    0    0    0    0    0    10   955   |  985   t     = comp.os.ms-windows.misc
{noformat}
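
Each percentage in the output is the number of correctly classified documents divided by the number of documents for that label, and the summary line applies the same ratio to the totals. A small sketch for re-checking the figures; {{accuracy}} is a hypothetical helper, not a Mahout command:

```shell
# accuracy CORRECT TOTAL: print CORRECT/TOTAL as a percentage with
# four decimal places, the precision used in the Summary block.
accuracy() {
  awk -v c="$1" -v t="$2" 'BEGIN { printf "%.4f\n", 100 * c / t }'
}
```

{{accuracy 18369 18828}} prints {{97.5621}}, matching the summary, and {{accuracy 388 628}} prints {{61.7834}} for {{talk.religion.misc}}.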

h2. Complementary Naive Bayes

To train a CBayes classifier using bi-grams:
{noformat}
$ $MAHOUT_HOME/bin/mahout trainclassifier \
  -i 20news-input \
  -o newsmodel \
  -type cbayes \
  -ng 2 \
  -source hdfs
{noformat}

To test a CBayes classifier using bi-grams:
{noformat}
$ $MAHOUT_HOME/bin/mahout testclassifier \
  -m newsmodel \
  -d 20news-input \
  -type cbayes \
  -ng 2 \
  -source hdfs \
  -method mapreduce
{noformat}
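
The Bayes and CBayes runs on the cluster differ only in the values passed to {{-type}} and {{-ng}}. A sketch of a wrapper that trains and then tests one configuration, using only the flags shown above. {{run_experiment}} and the {{newsmodel-<type>-<ng>}} model name are assumptions for illustration; the commands above all write to {{newsmodel}}, so separate names keep the models apart:

```shell
# run_experiment CLASSIFIER NGRAM: train, then test, one configuration.
# CLASSIFIER is "bayes" or "cbayes"; NGRAM is the value for -ng.
# Hypothetical wrapper; the model naming scheme is an assumption.
run_experiment() {
  classifier="$1"
  ngram="$2"
  model="newsmodel-$classifier-$ngram"

  "$MAHOUT_HOME/bin/mahout" trainclassifier \
    -i 20news-input -o "$model" \
    -type "$classifier" -ng "$ngram" -source hdfs &&
  "$MAHOUT_HOME/bin/mahout" testclassifier \
    -m "$model" -d 20news-input \
    -type "$classifier" -ng "$ngram" -source hdfs -method mapreduce
}
```

For instance, {{run_experiment bayes 3}} followed by {{run_experiment cbayes 2}} would repeat the two cluster runs described on this page. Testing is skipped if training fails.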
