mahout-user mailing list archives

From JAGANADH G <jagana...@gmail.com>
Subject Re: classification example doubts
Date Fri, 19 Nov 2010 09:35:52 GMT
On Fri, Nov 19, 2010 at 1:15 PM, Divya <divya@k2associates.com.sg> wrote:

> For my first question, you say we can put our own input documents in a
> directory, and that those documents should be in a format similar to
> bayes-train-input.
> If so: I generated my input data using PrepareTwentyNewsgroups
> and used that as the input for testclassifier,
> but didn't get the expected results.
> As far as I could see, it didn't read the files in my input directory.
> I tried replacing one of the files in the input directory with one of the
> files from the train-input directory.
> Still the same result.
> Why is it not reading my files?
>
> Am I missing anything?
>
>
I think something went wrong with your training.
I trained on the 20 newsgroups data and tested it. My result is available at
http://pastebin.com/kGY4LmW7 . Check it.

The commands I used were:
1) To prepare the data:
bin/mahout prepare20newsgroups -p /home/jaganadhg/20news-bydate-train/ -o 20news -c UTF-8 -a org.apache.mahout.vectorizer.DefaultAnalyzer
2) To train:
bin/mahout trainclassifier -i 20news/ -o 20cbayesn -type cbayes -a 1.0 -ng 2
3) To test:
bin/mahout testclassifier -m 20bayes -d 20news -type bayes -ng 2 -method sequential

The result is available at http://pastebin.com/kGY4LmW7
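
The three steps can also be chained in one small script so that the model directory, classifier type and n-gram size are written once and reused for both training and testing. This is only a sketch: the variable names, the DRY_RUN switch and the model directory name 20news-model are made up for illustration, and it assumes the directory given to -m should be the one trainclassifier wrote with -o. Only flags that appear in the commands above are used.

```shell
#!/bin/sh
# Sketch only: the variable names and the model directory name are
# illustrative assumptions, not Mahout conventions.
MAHOUT="${MAHOUT:-bin/mahout}"
DRY_RUN="${DRY_RUN:-1}"            # 1 = print the commands, 0 = execute them

RAW_DIR="/home/jaganadhg/20news-bydate-train/"
PREPARED_DIR="20news"
MODEL_DIR="20news-model"           # hypothetical name, used for both -o and -m
CLASSIFIER_TYPE="bayes"
ALPHA="1.0"
NGRAM="2"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Prepare, train, test. Training and testing share the same settings.
pipeline() {
    run "$MAHOUT" prepare20newsgroups -p "$RAW_DIR" -o "$PREPARED_DIR" \
        -c UTF-8 -a org.apache.mahout.vectorizer.DefaultAnalyzer &&
    run "$MAHOUT" trainclassifier -i "$PREPARED_DIR" -o "$MODEL_DIR" \
        -type "$CLASSIFIER_TYPE" -a "$ALPHA" -ng "$NGRAM" &&
    run "$MAHOUT" testclassifier -m "$MODEL_DIR" -d "$PREPARED_DIR" \
        -type "$CLASSIFIER_TYPE" -ng "$NGRAM" -method sequential
}

pipeline
```

With the default DRY_RUN=1 the script only prints the three commands, so they can be checked before anything runs. Setting DRY_RUN=0 executes them from the Mahout directory.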


>
> Coming to my second question: does that mean we are testing the classifier
> against our own inputs?
> I still don't understand.
> My understanding of classification is that we have a set of documents which
> will act as the model for classifying new documents in the system.
> Am I right?
>


The documents do not act as the model. Mahout's TrainClassifier creates a
model from the documents provided for training.
The testclassifier command takes the following arguments:
1) A directory containing the model (specified after -m).
2) A directory containing the documents for testing the classifier
(specified after -d). Documents in this directory should be formatted the
same way as the documents prepared for training.
3) The type of classifier algorithm (specified after -type). Here I used bayes.
4) The default category name (specified after -default). You can set it to
"unknown".
5) The value of Alpha_i used in training (specified after -a). By default it is
1.0.
6) The source of the model directory (specified after -source). You can set it
to hdfs.
7) The n-gram size (specified after -ng). It should be the same as the one
used in training.

A sample command with all these parameters is shown below:
bin/mahout testclassifier -d movie -m movie-model/ -type bayes -default unknown -a 1.0 -method sequential -source hdfs -e UTF-8 -ng 1
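
If you test several document sets, a tiny helper that takes only the values that change and fills in the rest can help keep the flags straight. The function name test_cmd is hypothetical and not part of Mahout. It only prints a command line built from the flags in the sample command, and defaults the n-gram size to 1 as in that sample.

```shell
#!/bin/sh
# Hypothetical helper, not part of Mahout: prints a testclassifier command
# line built from the options used in the sample command.
test_cmd() {
    test_dir="$1"      # -d: directory with the documents to classify
    model_dir="$2"     # -m: directory with the trained model
    ngram="${3:-1}"    # -ng: must match the value used for training
    printf '%s\n' "bin/mahout testclassifier -d $test_dir -m $model_dir -type bayes -default unknown -a 1.0 -method sequential -source hdfs -e UTF-8 -ng $ngram"
}

# Prints the command for the movie example.
test_cmd movie movie-model/ 1
```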


> Doesn't Mahout work the same way?
>
> Third question: yes, I am looking for Mahout's API for classification.
>

A sample program is given below

http://bitbucket.org/jaganadhg/blog/src/995fa52d4fbc/bck9/java/src/org/bc/kl/ClassifierDemo.java

To use it in a real-time system you will have to do some more work. Find out :-)

-- 
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog
