From conflue...@apache.org
Subject [CONF] Apache Lucene Mahout > TwentyNewsgroups
Date Sun, 07 Feb 2010 16:01:00 GMT
Space: Apache Lucene Mahout (http://cwiki.apache.org/confluence/display/MAHOUT)
Page: TwentyNewsgroups (http://cwiki.apache.org/confluence/display/MAHOUT/TwentyNewsgroups)


Edited by Robin Anil:
---------------------------------------------------------------------
h1. Twenty Newsgroups Classification

[Get Mahout|http://cwiki.apache.org/confluence/display/MAHOUT/index#index-Installation%2FSetup]


Assume the environment variable $HADOOP_HOME refers to the location where you checked out/installed Hadoop, and that $MAHOUT_HOME refers to the location where you checked out/installed Mahout.
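For example (a sketch; substitute the paths where you actually unpacked Hadoop and Mahout):
{noformat}
$ export HADOOP_HOME=/path/to/hadoop
$ export MAHOUT_HOME=/path/to/mahout
{noformat}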

After downloading the distribution, unzip/untar it into the directory of your choice and do:

h2. Setup:

# In trunk, mvn install  // This will compile everything and create the Hadoop job
# cd examples (see the sketch below)
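As a shell sketch (assuming the Mahout trunk is checked out at $MAHOUT_HOME):
{noformat}
$ cd $MAHOUT_HOME
$ mvn install      # compiles everything and builds the Hadoop job file
$ cd examples
{noformat}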

NOTE: For the Mahout 0.1 release, do the following:

# If you've run this before, you may want to rm -rf the work and temp directories
# ant -f build-deprecated.xml get-files  // Note: we are in the process of updating to Maven
# mkdir lib  // NOTE: the next few steps are a workaround for the interim while we fully migrate to Maven
# mvn dependency:copy-dependencies -DoutputDirectory=lib
# ant -f build-deprecated.xml extract-20news-18828 -Ddest=target
# mv 20news-18828-collapse 20news-input
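Collected as a single shell sequence, the 0.1 setup above looks like this (assuming you start in $MAHOUT_HOME/examples):
{noformat}
$ rm -rf work temp      # only if you have run this before
$ ant -f build-deprecated.xml get-files
$ mkdir lib
$ mvn dependency:copy-dependencies -DoutputDirectory=lib
$ ant -f build-deprecated.xml extract-20news-18828 -Ddest=target
$ mv 20news-18828-collapse 20news-input
{noformat}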

NOTE: After you have done this, skip to the Hadoop section below to run the 20 newsgroups example with the Mahout 0.1 release.

For Mahout releases >0.2, run the commands in the following order to execute the 20 newsgroups example without a Hadoop cluster. We assume that the 20 newsgroups dataset has been downloaded into the examples directory.

To generate the input dataset:
{noformat}
$ tar zxf 20news-18828.tar.gz   # extract the 20 newsgroups dataset
$ mkdir 20news-input
$ mvn -e exec:java -Dexec.mainClass=org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups -Dexec.args="-p 20news-18828 -o 20news-input -a org.apache.lucene.analysis.standard.StandardAnalyzer -c UTF-8"
{noformat}
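As a quick, hypothetical sanity check that the preparation step produced output (the exact file layout inside 20news-input may differ):
{noformat}
$ ls 20news-input
{noformat}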
To train the classifier:
{noformat}
$ mvn -e exec:java -Dexec.mainClass=org.apache.mahout.classifier.bayes.TrainClassifier -Dexec.args="-i 20news-input -o 20news-model -type cbayes -ng 1 -source hdfs"
{noformat}
To test the classifier over the input:
{noformat}
$ mvn -e exec:java -Dexec.mainClass=org.apache.mahout.classifier.bayes.TestClassifier -Dexec.args="-m 20news-model -d 20news-input -type cbayes -ng 1 -source hdfs -method sequential"
{noformat}

h2. Running the 20 newsgroups example on a Hadoop cluster

h3. Set Up Hadoop Cluster
# emacs $HADOOP_HOME/conf/hadoop-site.xml (add in local settings per [quickstart|http://hadoop.apache.org/core/docs/current/quickstart.html]; a sample configuration is sketched below)
# $HADOOP_HOME/bin/hadoop namenode -format  //Format the HDFS
# $HADOOP_HOME/bin/start-all.sh  //Start Hadoop
# $HADOOP_HOME/bin/hadoop dfs -put $MAHOUT_HOME/examples/20news-input 20news-input  //Copies the extracted text to HDFS
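The "local settings" in step 1 follow the Hadoop quickstart linked above; here is a minimal pseudo-distributed hadoop-site.xml sketch (host names and ports are the quickstart defaults and may need adjusting for your cluster):
{code}
<configuration>
  <!-- HDFS namenode -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <!-- JobTracker -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <!-- single-node setup: no block replication -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
{code}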
 
Example: train the Bayes classifier using tri-grams:
{code}$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.1.job org.apache.mahout.classifier.bayes.TrainClassifier -i 20news-input -o newsmodel -ng 3 -type bayes{code}
This will run four map/reduce jobs on Hadoop to train the classifier and will take a while on a single-node machine. You can monitor the status of these jobs by opening the JobTracker web UI in a browser: http://localhost:50030/jobtracker.jsp
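You can also check job progress from the command line (a sketch; uses the standard Hadoop job client of this era):
{noformat}
$ $HADOOP_HOME/bin/hadoop job -list
{noformat}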

Test the classifier over the input folder:
{code}$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.1.job org.apache.mahout.classifier.bayes.TestClassifier -p newsmodel -t work/20news-input -ng 3 -type bayes{code}

Output might look like:
{code}
08/11/07 16:52:39 INFO bayes.TestClassifier: Done loading model: # labels: 20
08/11/07 16:52:39 INFO bayes.TestClassifier: Done generating Model
08/11/07 16:52:57 INFO bayes.TestClassifier: alt.atheism	96.9962453066333	775/799.0
08/11/07 16:53:15 INFO bayes.TestClassifier: comp.graphics	99.28057553956835	966/973.0
08/11/07 16:53:45 INFO bayes.TestClassifier: comp.os.ms-windows.misc	96.95431472081218	955/985.0
08/11/07 16:53:59 INFO bayes.TestClassifier: comp.sys.ibm.pc.hardware	99.59266802443992	978/982.0
08/11/07 16:54:10 INFO bayes.TestClassifier: comp.sys.mac.hardware	99.47970863683663	956/961.0
08/11/07 16:54:28 INFO bayes.TestClassifier: comp.windows.x	99.59183673469387	976/980.0
08/11/07 16:54:38 INFO bayes.TestClassifier: misc.forsale	98.45679012345678	957/972.0
08/11/07 16:54:50 INFO bayes.TestClassifier: rec.autos	99.4949494949495	985/990.0
08/11/07 16:55:04 INFO bayes.TestClassifier: rec.motorcycles	100.0	994/994.0
08/11/07 16:55:16 INFO bayes.TestClassifier: rec.sport.baseball	99.89939637826961	993/994.0
08/11/07 16:55:36 INFO bayes.TestClassifier: rec.sport.hockey	99.89989989989989	998/999.0
08/11/07 16:55:54 INFO bayes.TestClassifier: sci.crypt	99.39455095862765	985/991.0
08/11/07 16:56:05 INFO bayes.TestClassifier: sci.electronics	98.98063200815494	971/981.0
08/11/07 16:56:27 INFO bayes.TestClassifier: sci.med	99.79797979797979	988/990.0
08/11/07 16:56:44 INFO bayes.TestClassifier: sci.space	99.3920972644377	981/987.0
08/11/07 16:57:06 INFO bayes.TestClassifier: soc.religion.christian	99.49849548645938	992/997.0
08/11/07 16:57:24 INFO bayes.TestClassifier: talk.politics.guns	99.45054945054945	905/910.0
08/11/07 16:57:51 INFO bayes.TestClassifier: talk.politics.mideast	98.82978723404256	929/940.0
08/11/07 16:58:13 INFO bayes.TestClassifier: talk.politics.misc	89.93548387096774	697/775.0
08/11/07 16:58:25 INFO bayes.TestClassifier: talk.religion.misc	61.78343949044586	388/628.0
08/11/07 16:58:25 INFO bayes.TestClassifier: =======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :      18369	   97.5621%
Incorrectly Classified Instances        :        459	    2.4379%
Total Classified Instances              :      18828

=======================================================
Confusion Matrix
-------------------------------------------------------
a	b	c	d	e	f	g	h	i	j	k	l	m	n	o	p	q	r	s	t	<--Classified as
994	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	 |  994	a     = rec.motorcycles
0	976	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	2	1	 |  980	b     = comp.windows.x
7	0	929	1	0	0	0	0	0	0	0	0	1	0	2	0	0	0	0	0	 |  940	c     = talk.politics.mideast
0	0	0	905	0	0	1	0	0	0	0	0	0	0	0	0	3	0	1	0	 |  910	d     = talk.politics.guns
4	1	4	27	388	1	0	1	0	5	1	1	2	2	149	7	2	33	0	0	 |  628	e     = talk.religion.misc
3	0	0	0	0	985	0	1	0	0	0	0	0	1	0	0	0	0	0	0	 |  990	f     = rec.autos
0	0	0	0	0	0	993	1	0	0	0	0	0	0	0	0	0	0	0	0	 |  994	g     = rec.sport.baseball
0	0	0	0	0	0	1	998	0	0	0	0	0	0	0	0	0	0	0	0	 |  999	h     = rec.sport.hockey
0	0	0	0	0	0	0	0	956	0	2	0	0	0	0	0	0	0	2	1	 |  961	i     = comp.sys.mac.hardware
0	0	0	0	0	0	0	0	0	981	0	0	5	0	0	1	0	0	0	0	 |  987	j     = sci.space
0	0	0	0	0	0	0	0	0	0	978	0	1	0	0	0	0	0	2	1	 |  982	k     = comp.sys.ibm.pc.hardware
1	0	3	36	0	1	2	1	0	5	0	697	4	0	3	3	19	0	0	0	 |  775	l     = talk.politics.misc
0	2	0	0	0	0	0	0	0	0	2	0	966	0	0	0	0	0	2	1	 |  973	m     = comp.graphics
1	0	0	0	0	0	0	0	0	0	6	0	0	971	0	0	0	0	3	0	 |  981	n     = sci.electronics
1	0	0	0	0	0	0	0	1	0	0	0	0	0	992	1	0	1	0	1	 |  997	o     = soc.religion.christian
0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	988	0	0	0	1	 |  990	p     = sci.med
0	0	0	2	0	0	0	0	0	0	0	0	2	1	0	0	985	0	1	0	 |  991	q     = sci.crypt
0	0	0	1	1	0	0	0	0	1	0	0	1	0	19	0	1	775	0	0	 |  799	r     = alt.atheism
1	0	0	0	0	3	1	2	0	0	3	0	0	5	0	0	0	0	957	0	 |  972	s     = misc.forsale
0	0	0	8	0	0	0	0	0	0	6	0	6	0	0	0	0	0	10	955	 |  985	t     = comp.os.ms-windows.misc

{code}
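The per-class numbers are simply correct/total documents for that newsgroup, and the summary accuracy is the same ratio over all documents. For example, from the run above:
{noformat}
talk.religion.misc:    388 / 628   = 61.7834%
overall:             18369 / 18828 = 97.5621% correctly classified
{noformat}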




h2. Complementary Naive Bayes

To train a CBayes classifier using bi-grams:
{code}$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.1.job org.apache.mahout.classifier.bayes.TrainClassifier -i 20news-input -o 20news-model -ng 2 -type cbayes -source <hdfs|hbase>{code}

To test a CBayes classifier using bi-grams:
{code}$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.1.job org.apache.mahout.classifier.bayes.TestClassifier -p 20news-model -t work/20news-input -ng 2 -type cbayes -source <hdfs|hbase> -method <sequential|mapreduce>{code}
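For example, filling in the placeholders with an HDFS model store and map/reduce testing, the two commands above become:
{code}$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.1.job org.apache.mahout.classifier.bayes.TrainClassifier -i 20news-input -o 20news-model -ng 2 -type cbayes -source hdfs{code}
{code}$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.1.job org.apache.mahout.classifier.bayes.TestClassifier -p 20news-model -t work/20news-input -ng 2 -type cbayes -source hdfs -method mapreduce{code}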


