mahout-commits mailing list archives

From conflue...@apache.org
Subject [CONF] Apache Mahout > Twenty Newsgroups
Date Sun, 19 Jun 2011 05:25:00 GMT
Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Twenty Newsgroups (https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups)

Change Comment:
---------------------------------------------------------------------
Fixing a bug in the command for putting the data to hadoop and running the test for the classifier

Edited by Tural Badirkhanli:
---------------------------------------------------------------------
h2. Twenty Newsgroups Classification Example

h2. Introduction
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned
(nearly) evenly across 20 different newsgroups. The 20 newsgroups collection has become a
popular data set for experiments in text applications of machine learning techniques, such
as text classification and text clustering. We will use the Mahout Bayes classifier to create
a model that classifies a new document into one of the 20 newsgroups.

h2. Prerequisites

* Mahout has been downloaded ([instructions here|http://cwiki.apache.org/confluence/display/MAHOUT/index#index-Installation%2FSetup])
* Maven is available
* Your environment has the following variables:
| {{HADOOP_HOME}} | Environment variable that points to the directory where Hadoop is installed |
| {{MAHOUT_HOME}} | Environment variable that points to the directory where Mahout is installed |
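
Before continuing, it can help to confirm that both variables are set and point at real directories. The following is a minimal sketch; {{check_env_dir}} is a hypothetical helper written for this page, not part of Mahout or Hadoop.

```shell
# check_env_dir NAME: hypothetical helper (not shipped with Mahout or Hadoop).
# Prints NAME=value if the variable is set and is a directory,
# otherwise prints a short diagnostic and returns non-zero.
check_env_dir() {
  _name="$1"
  # Look up the variable whose name is stored in _name.
  eval "_value=\${$_name:-}"
  if [ -z "$_value" ]; then
    echo "$_name is not set"
    return 1
  fi
  if [ ! -d "$_value" ]; then
    echo "$_name=$_value is not a directory"
    return 1
  fi
  echo "$_name=$_value"
}

# Example usage (uncomment to run):
# check_env_dir HADOOP_HOME
# check_env_dir MAHOUT_HOME
```

If either call reports a problem, fix the variable in your shell profile before running the commands in the sections that follow.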

h2. Setting up Mahout

After downloading the distribution, unzip/untar it into the directory of your choice and do:

# In the trunk directory, compile everything and create the Hadoop job:
{noformat}mvn install{noformat}

h3. For Mahout 0.2+:

# Create a directory for the 20 Newsgroups data and change into it:
{noformat}
$ mkdir -p $MAHOUT_HOME/examples/bin/work/
$ cd $MAHOUT_HOME/examples/bin/work/
{noformat}
# Download {{20news-bydate.tar.gz}} from the [20newsgroups dataset|http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz]
# Extract dataset:
{noformat}$ tar zxf 20news-bydate.tar.gz{noformat}
# Generate the training dataset (the relative paths below assume the command is run from {{$MAHOUT_HOME}}):
{noformat}
$> $MAHOUT_HOME/bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
  -p examples/bin/work/20news-bydate/20news-bydate-train \
  -o examples/bin/work/20news-bydate/bayes-train-input \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer \
  -c UTF-8
{noformat}
# Generate test dataset:
{noformat}
$> $MAHOUT_HOME/bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
  -p examples/bin/work/20news-bydate/20news-bydate-test \
  -o examples/bin/work/20news-bydate/bayes-test-input \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer \
  -c UTF-8
{noformat}
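
The download step above gives only a link. It could be scripted along the following lines. This is a sketch under assumptions: {{fetch_if_missing}} is a hypothetical helper, {{curl}} is assumed to be installed, and the URL is the one quoted in the step above (it may have moved since this page was written).

```shell
# fetch_if_missing URL DEST: hypothetical helper, not part of Mahout.
# Downloads URL to DEST with curl unless DEST already exists.
fetch_if_missing() {
  _url="$1"
  _dest="$2"
  if [ -f "$_dest" ]; then
    echo "already present: $_dest"
    return 0
  fi
  # -f: fail on HTTP errors, -L: follow redirects, -o: write to the given file
  curl -fL -o "$_dest" "$_url"
}

# Example usage, run from $MAHOUT_HOME/examples/bin/work/ (uncomment to run):
# fetch_if_missing http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz 20news-bydate.tar.gz
```

Skipping the download when the archive is already present makes the setup steps safe to repeat.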

h2. Running the example without Hadoop

This assumes you're running Mahout 0.2+:

# Train the classifier:
{noformat}
$> $MAHOUT_HOME/bin/mahout trainclassifier \
  -i examples/bin/work/20news-bydate/bayes-train-input \
  -o examples/bin/work/20news-bydate/bayes-model \
  -type bayes \
  -ng 1 \
  -source hdfs
{noformat}
# Test over input:
{noformat}
$> $MAHOUT_HOME/bin/mahout testclassifier \
  -m examples/bin/work/20news-bydate/bayes-model \
  -d examples/bin/work/20news-bydate/bayes-test-input \
  -type bayes \
  -ng 1 \
  -source hdfs \
  -method sequential
{noformat}

h2. Running the 20newsgroups example on a Hadoop cluster

h3. Set Up Hadoop Cluster
# Edit {{hadoop-site.xml}}; add in local settings per [Hadoop quickstart|http://hadoop.apache.org/core/docs/current/quickstart.html]
{noformat}
emacs $HADOOP_HOME/conf/hadoop-site.xml
{noformat}
# Format the HDFS
{noformat}
$ $HADOOP_HOME/bin/hadoop namenode -format
{noformat}
# Start Hadoop
{noformat}
$ $HADOOP_HOME/bin/start-all.sh
{noformat}
# Copy extracted text to HDFS
{noformat}
$ $HADOOP_HOME/bin/hadoop dfs -mkdir 20news-input
$ $HADOOP_HOME/bin/hadoop dfs -put $MAHOUT_HOME/examples/bin/work/20news-bydate/bayes-train-input \
  20news-input/bayes-train-input
{noformat}

h3. Train Bayes Classifier Using Tri-grams

The following runs four MapReduce jobs on Hadoop to train the classifier. This takes a
while on a single-node machine.

{noformat}
$> $MAHOUT_HOME/bin/mahout trainclassifier \
  -i 20news-input/bayes-train-input \
  -o newsmodel \
  -type bayes \
  -ng 3 \
  -source hdfs
{noformat}

You can monitor the status of these jobs by opening a web browser on your Job Tracker node:
[http://localhost:50030/jobtracker.jsp]

Test the classifier over the input folder:
{noformat}
$> $MAHOUT_HOME/bin/mahout testclassifier \
  -m newsmodel \
  -d 20news-input/bayes-train-input \
  -type bayes \
  -ng 3 \
  -source hdfs \
  -method mapreduce
{noformat}

Output might look like:
{noformat}
08/11/07 16:52:39 INFO bayes.TestClassifier: Done loading model: # labels: 20
08/11/07 16:52:39 INFO bayes.TestClassifier: Done generating Model
08/11/07 16:52:57 INFO bayes.TestClassifier: alt.atheism  96.9962453066333  775/799.0
08/11/07 16:53:15 INFO bayes.TestClassifier: comp.graphics  99.28057553956835  966/973.0
08/11/07 16:53:45 INFO bayes.TestClassifier: comp.os.ms-windows.misc  96.95431472081218  955/985.0
08/11/07 16:53:59 INFO bayes.TestClassifier: comp.sys.ibm.pc.hardware  99.59266802443992  978/982.0
08/11/07 16:54:10 INFO bayes.TestClassifier: comp.sys.mac.hardware  99.47970863683663  956/961.0
08/11/07 16:54:28 INFO bayes.TestClassifier: comp.windows.x  99.59183673469387  976/980.0
08/11/07 16:54:38 INFO bayes.TestClassifier: misc.forsale  98.45679012345678  957/972.0
08/11/07 16:54:50 INFO bayes.TestClassifier: rec.autos  99.4949494949495  985/990.0
08/11/07 16:55:04 INFO bayes.TestClassifier: rec.motorcycles  100.0  994/994.0
08/11/07 16:55:16 INFO bayes.TestClassifier: rec.sport.baseball  99.89939637826961  993/994.0
08/11/07 16:55:36 INFO bayes.TestClassifier: rec.sport.hockey  99.89989989989989  998/999.0
08/11/07 16:55:54 INFO bayes.TestClassifier: sci.crypt  99.39455095862765  985/991.0
08/11/07 16:56:05 INFO bayes.TestClassifier: sci.electronics  98.98063200815494  971/981.0
08/11/07 16:56:27 INFO bayes.TestClassifier: sci.med  99.79797979797979  988/990.0
08/11/07 16:56:44 INFO bayes.TestClassifier: sci.space  99.3920972644377  981/987.0
08/11/07 16:57:06 INFO bayes.TestClassifier: soc.religion.christian  99.49849548645938  992/997.0
08/11/07 16:57:24 INFO bayes.TestClassifier: talk.politics.guns  99.45054945054945  905/910.0
08/11/07 16:57:51 INFO bayes.TestClassifier: talk.politics.mideast  98.82978723404256  929/940.0
08/11/07 16:58:13 INFO bayes.TestClassifier: talk.politics.misc  89.93548387096774  697/775.0
08/11/07 16:58:25 INFO bayes.TestClassifier: talk.religion.misc  61.78343949044586  388/628.0
08/11/07 16:58:25 INFO bayes.TestClassifier: =======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :      18369   97.5621%
Incorrectly Classified Instances        :        459    2.4379%
Total Classified Instances              :      18828

=======================================================
Confusion Matrix
-------------------------------------------------------
a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p    q    r    s    t    <--Classified as
994  0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0     |  994   a     = rec.motorcycles
0    976  0    0    0    0    0    0    0    0    1    0    0    0    0    0    0    0    2    1     |  980   b     = comp.windows.x
7    0    929  1    0    0    0    0    0    0    0    0    1    0    2    0    0    0    0    0     |  940   c     = talk.politics.mideast
0    0    0    905  0    0    1    0    0    0    0    0    0    0    0    0    3    0    1    0     |  910   d     = talk.politics.guns
4    1    4    27   388  1    0    1    0    5    1    1    2    2    149  7    2    33   0    0     |  628   e     = talk.religion.misc
3    0    0    0    0    985  0    1    0    0    0    0    0    1    0    0    0    0    0    0     |  990   f     = rec.autos
0    0    0    0    0    0    993  1    0    0    0    0    0    0    0    0    0    0    0    0     |  994   g     = rec.sport.baseball
0    0    0    0    0    0    1    998  0    0    0    0    0    0    0    0    0    0    0    0     |  999   h     = rec.sport.hockey
0    0    0    0    0    0    0    0    956  0    2    0    0    0    0    0    0    0    2    1     |  961   i     = comp.sys.mac.hardware
0    0    0    0    0    0    0    0    0    981  0    0    5    0    0    1    0    0    0    0     |  987   j     = sci.space
0    0    0    0    0    0    0    0    0    0    978  0    1    0    0    0    0    0    2    1     |  982   k     = comp.sys.ibm.pc.hardware
1    0    3    36   0    1    2    1    0    5    0    697  4    0    3    3    19   0    0    0     |  775   l     = talk.politics.misc
0    2    0    0    0    0    0    0    0    0    2    0    966  0    0    0    0    0    2    1     |  973   m     = comp.graphics
1    0    0    0    0    0    0    0    0    0    6    0    0    971  0    0    0    0    3    0     |  981   n     = sci.electronics
1    0    0    0    0    0    0    0    1    0    0    0    0    0    992  1    0    1    0    1     |  997   o     = soc.religion.christian
0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    988  0    0    0    1     |  990   p     = sci.med
0    0    0    2    0    0    0    0    0    0    0    0    2    1    0    0    985  0    1    0     |  991   q     = sci.crypt
0    0    0    1    1    0    0    0    0    1    0    0    1    0    19   0    1    775  0    0     |  799   r     = alt.atheism
1    0    0    0    0    3    1    2    0    0    3    0    0    5    0    0    0    0    957  0     |  972   s     = misc.forsale
0    0    0    8    0    0    0    0    0    0    6    0    6    0    0    0    0    0    10   955   |  985   t     = comp.os.ms-windows.misc
{noformat}
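
To relate the figures in the sample output to each other, the percentages can be recomputed from the raw counts, and a confusion-matrix row can be scanned for its largest off-diagonal entry. The sketch below uses {{awk}}; {{accuracy_pct}} and {{top_confusion}} are hypothetical helpers written for illustration, not Mahout commands, and the numbers are copied from the sample output above.

```shell
# accuracy_pct CORRECT TOTAL: hypothetical helper.
# Prints 100 * CORRECT / TOTAL rounded to four decimals.
accuracy_pct() {
  awk -v c="$1" -v t="$2" 'BEGIN { printf "%.4f\n", 100 * c / t }'
}

# top_confusion DIAG COUNT...: hypothetical helper.
# Given the 1-based index of the diagonal cell and one matrix row,
# prints the column index and count of the largest off-diagonal entry.
top_confusion() {
  _diag="$1"
  shift
  echo "$@" | awk -v d="$_diag" '{
    best = 0; col = 0
    for (i = 1; i <= NF; i++)
      if (i != d && $i + 0 > best) { best = $i + 0; col = i }
    print col, best
  }'
}

accuracy_pct 18369 18828   # Summary block: prints 97.5621
accuracy_pct 388 628       # talk.religion.misc: prints 61.7834

# Row e (talk.religion.misc); e is the 5th label, so the diagonal index is 5.
top_confusion 5 4 1 4 27 388 1 0 1 0 5 1 1 2 2 149 7 2 33 0 0   # prints 15 149
```

Column 15 is label o, so in this sample run 149 talk.religion.misc documents were classified as soc.religion.christian, which accounts for most of that class's errors.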

h2. Complementary Naive Bayes

To train a CBayes classifier using bi-grams:
{noformat}
$> $MAHOUT_HOME/bin/mahout trainclassifier \
  -i 20news-input \
  -o newsmodel \
  -type cbayes \
  -ng 2 \
  -source hdfs
{noformat}

To test a CBayes classifier using bi-grams:
{noformat}
$> $MAHOUT_HOME/bin/mahout testclassifier \
  -m newsmodel \
  -d 20news-input \
  -type cbayes \
  -ng 2 \
  -source hdfs \
  -method mapreduce
{noformat}
