mahout-user mailing list archives

From "Divya" <di...@k2associates.com.sg>
Subject RE: classification example doubts
Date Tue, 23 Nov 2010 05:30:38 GMT
Hi,

Yes, I understand the logic behind it.
First we provide a set of documents and train the classifier, which builds a
model from them.
Then, to test the classifier, we provide input data that has been prepared
in the same dataset format,
and the classifier labels that data according to the built model.

I am doing the same thing.

I am using the test input that ships with the 20news-bydate.tar.gz data set.
Extracting 20news-bydate.tar.gz gives two directories,
20news-bydate-train and 20news-bydate-test. I use the first to train
the classifier and the second to test it.


Steps I am following:
1. Extract the dataset
$ tar zxf 20news-bydate.tar.gz

2. Generate the input dataset to train the classifier
$ bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
    -p examples/bin/work/20news-bydate/20news-bydate-train \
    -o examples/bin/work/20news-bydate/bayes-train-input \
    -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8

3. Generate the input dataset to test the classifier
$ bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
    -p examples/bin/work/20news-bydate/20news-bydate-test \
    -o examples/bin/work/20news-bydate/20news-test-input \
    -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8

4. Train the classifier
$ bin/mahout trainclassifier \
    -i examples/bin/work/20news-bydate/bayes-train-input \
    -o examples/bin/work/20news-bydate/bayes-model \
    -type bayes -ng 1 -source hdfs

5. Test the classifier
$ bin/mahout testclassifier \
    -m D:/mahout-0.4/examples/bin/work/20news-bydate/bayes-model \
    -d D:/mahout-0.4/examples/bin/work/20news-test-input \
    -type bayes -ng 1 -method sequential
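
Since the symptom is that testclassifier does not seem to read the input
files, one cheap check is whether the directories passed to -m and -d exist
and contain files. The following is only a minimal sketch, assuming a POSIX
shell (the same kind of shell that runs bin/mahout); the helper name
has_files is made up for illustration and is not part of Mahout. The two
paths are the values passed in step 5. Note that the -d path in step 5
(.../work/20news-test-input) is not the same as the -o path in step 3
(.../work/20news-bydate/20news-test-input), which may or may not be related
to the problem.

```shell
#!/bin/sh
# has_files is a hypothetical helper, not a Mahout command.
# It returns 0 if the directory exists and contains at least one regular file.
has_files() {
    dir="$1"
    [ -d "$dir" ] || return 1
    [ -n "$(find "$dir" -type f | head -n 1)" ]
}

# Values copied from the -m and -d options in step 5; adjust to your layout.
MODEL_DIR="D:/mahout-0.4/examples/bin/work/20news-bydate/bayes-model"
TEST_DIR="D:/mahout-0.4/examples/bin/work/20news-test-input"

for d in "$MODEL_DIR" "$TEST_DIR"; do
    if has_files "$d"; then
        echo "ok: $d"
    else
        echo "missing or empty: $d"
    fi
done
```

If either line reports "missing or empty", the path given to testclassifier
is worth correcting before looking at the classifier itself.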

I am not getting the expected output. You can view my result at
http://pastebin.com/CicVMpST

I am still trying to figure out what is missing in my steps.

Can anyone help me?

Regards,
Divya 



-----Original Message-----
From: JAGANADH G [mailto:jaganadhg@gmail.com] 
Sent: Friday, November 19, 2010 5:36 PM
To: Divya
Cc: user@mahout.apache.org
Subject: Re: classification example doubts

On Fri, Nov 19, 2010 at 1:15 PM, Divya <divya@k2associates.com.sg> wrote:

> For my first question, you say we can put our own input documents in a
> directory, and those documents should also be in a format similar to
> bayes-train-input.
> If so, I generated my input data using PrepareTwentyNewsgroups
> and used that as my input for testclassifier,
> but I didn't get the expected results.
> As far as I could observe, it didn't read the files in my input directory.
> I tried replacing one of the files in the input directory with one of the
> files from the train-input directory.
> Still the same result.
> Why is it not reading my files?
>
> Am I missing anything?
>
>
I think something went wrong with your training.
I trained on the 20 newsgroups data and tested it. My result is available at
http://pastebin.com/kGY4LmW7 . Check it.

The commands I used:
1) To prepare the data:
bin/mahout prepare20newsgroups -p /home/jaganadhg/20news-bydate-train/ \
    -o 20news -c UTF-8 -a org.apache.mahout.vectorizer.DefaultAnalyzer
2) To train:
bin/mahout trainclassifier -i 20news/ -o 20cbayesn -type cbayes -a 1.0 -ng 2
3) To test:
bin/mahout testclassifier -m 20bayes -d 20news -type bayes -ng 2 \
    -method sequential

The result is available at http://pastebin.com/kGY4LmW7


>
> Coming to my second question: that means we are testing the classifier
> against our own inputs.
> I still don't understand.
> What I understood about classification is that we have a set of documents
> which will act as the model for classifying new documents in the system.
> Am I right?
>


The documents do not act as the model. Mahout's TrainClassifier creates a
model from the documents provided for training.
The testclassifier command takes the following arguments:
1) A directory containing the model (specified after -m).
2) A directory containing the documents for testing the classifier
(specified after -d). Documents in this directory should be formatted the
same way the training documents were prepared.
3) The type of classifier algorithm (specified after -type). Here I used
bayes.
4) The default category name (specified after -default). You can set it to
"unknown".
5) The value of alpha_i used in training (specified after -a). By default it
is 1.0.
6) The source of the model directory (specified after -source). You can set
it to hdfs.
7) The n-gram size (specified after -ng). It should be the same as the one
used in training.
A sample command with all these parameters is shown below:
bin/mahout testclassifier -d movie -m movie-model/ -type bayes \
    -default unknown -a 1.0 -method sequential -source hdfs -e UTF-8 -ng 1
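
Because the n-gram size (and the classifier type and alpha) must match
between training and testing, one way to avoid a mismatch is to keep them in
shell variables and build both command lines from the same values. This is
only a sketch: the directory names are placeholders, and the script prints
the commands for review instead of running them. The options used are the
ones that appear in this thread.

```shell
#!/bin/sh
# Shared settings: the same values feed both training and testing.
TYPE="bayes"
NGRAM=1
ALPHA="1.0"

# Placeholder directory names for illustration only.
TRAIN_INPUT="train-input"
TEST_INPUT="test-input"
MODEL="model"

train_cmd="bin/mahout trainclassifier -i $TRAIN_INPUT -o $MODEL -type $TYPE -a $ALPHA -ng $NGRAM -source hdfs"
test_cmd="bin/mahout testclassifier -m $MODEL -d $TEST_INPUT -type $TYPE -a $ALPHA -ng $NGRAM -default unknown -method sequential -source hdfs -e UTF-8"

# Print the commands so they can be checked before they are run.
echo "$train_cmd"
echo "$test_cmd"
```

With this layout the model directory written by -o in training is, by
construction, the one read by -m in testing.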


> Doesn't Mahout work in the same way?
>
> Third question: yes, I am looking for Mahout's API for classification.
>

A sample program is given below:

http://bitbucket.org/jaganadhg/blog/src/995fa52d4fbc/bck9/java/src/org/bc/kl/ClassifierDemo.java

To make it work in a real-time system you have to do some more work. Find it :-)

-- 
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog

