mahout-user mailing list archives

Subject Re: classification example doubts
Date Fri, 19 Nov 2010 09:35:52 GMT
On Fri, Nov 19, 2010 at 1:15 PM, Divya <> wrote:

> For my first question, you say we can put our own input documents in the
> directory, and that those documents should also be in a format similar to
> bayes-train-input.
> If so, I generated my input data using PrepareTwentyNewsgroups
> and used that as my input for testclassifier,
> but didn't get the expected results.
> As far as I observed, it didn't read the files in my input directory.
> I tried replacing one of the files in the input directory with one of the
> files from the train-input directory.
> Still the same result.
> Why is it not reading my files?
> Am I missing anything?
I think something went wrong with your training.
I trained on the 20 Newsgroups data and tested it. My result is available at . Check it.

The commands I used were:
1) To prepare the data:
 bin/mahout prepare20newsgroups  -p /home/jaganadhg/20news-bydate-train/ -o
20news -c UTF-8 -a org.apache.mahout.vectorizer.DefaultAnalyzer
2) To train:
bin/mahout trainclassifier  -i 20news/ -o 20cbayesn -type cbayes -a 1.0 -ng
3) To test:
bin/mahout testclassifier -m 20bayes -d 20news -type bayes -ng 2 -method

The result is available at

> Coming to my second question: that means we are testing the classifier
> against our own inputs.
> I still don't understand.
> What I understood about classification is that we have a set of documents
> which will act as a model for classifying new documents in the system.
> Am I right?

The documents themselves do not act as the model. Mahout's TrainClassifier
creates a model from the documents provided for training.
The testclassifier command takes the following arguments:
1) The directory containing the model (specified after -m).
2) The directory containing the documents for testing the classifier
(specified after -d). Documents in this directory should be formatted the
same way as the documents we prepared for training.
3) The type of classifier algorithm (specified after -type). Here I used bayes.
4) The default category name (specified after -default). You can set it as
5) The value of Alpha_i used in training (specified after -a). By default it is
6) The source of the model directory (specified after -source). You can set
it as hdfs.
7) The n-gram size (specified after -ng). It should be the same as the one
you used in training.

A sample command with all these parameters is shown below:
bin/mahout testclassifier -d movie -m movie-model/ -type bayes  -default
unknown -a 1.0 -method sequential -source hdfs -e UTF-8 -ng 1
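
Since the model directory, classifier type, alpha value and n-gram size have to agree between training and testing, one way to avoid a mismatch is to keep them in shell variables and build both commands from those. The following is only a rough sketch: the flags are the ones used in the commands in this message, while the variable names and the movie-train directory are made up for illustration. It only prints the commands, so nothing is run until you have checked them.

```shell
#!/bin/sh
# Sketch: settings shared by training and testing live in one place so the
# two commands cannot drift apart. All values are illustrative placeholders.
MODEL_DIR="movie-model"   # model directory: -o when training, -m when testing
TRAIN_DIR="movie-train"   # hypothetical directory of prepared training documents (-i)
TEST_DIR="movie"          # prepared test documents (-d)
TYPE="bayes"              # classifier type (-type), same for both steps
NGRAM=1                   # n-gram size (-ng), same for both steps
ALPHA="1.0"               # Alpha_i (-a), same for both steps

# Build both commands from the shared settings.
TRAIN_CMD="bin/mahout trainclassifier -i $TRAIN_DIR -o $MODEL_DIR -type $TYPE -a $ALPHA -ng $NGRAM"
TEST_CMD="bin/mahout testclassifier -d $TEST_DIR -m $MODEL_DIR -type $TYPE -default unknown -a $ALPHA -method sequential -source hdfs -e UTF-8 -ng $NGRAM"

# Dry run: print the commands for review instead of executing them.
echo "$TRAIN_CMD"
echo "$TEST_CMD"
```

Once the printed commands look right, they can be run from the Mahout directory, for example by pasting them into the shell.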

> Doesn't Mahout work in the same way?
> For my third question: yes, I am looking for Mahout's API for classification.

A sample program is given below

To make it work in a real-time system you will have to do some more work. Find it :-)

