Step 1: You can provide your own sample data set using the prepare20news
example. Just provide your input dir. This step performs some normalization
on each file, and it is mandatory.
Step 2: Train the classifier with the normalized list of files.
You get a model dir in HDFS, which contains the trained data set.
Step 3: Test the classifier.
Using the trained model and a sample input, you can test the classifier.
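
If you want the sample input for step 3 to be documents the classifier has
not seen during training, you can set some files aside before step 1. Below
is a minimal sketch in Python. It is not part of Mahout: it assumes the raw
data is laid out as one subdirectory per category with one file per document,
and the name split_holdout and its parameters are made up for illustration.

```python
import random
import shutil
from pathlib import Path


def split_holdout(src, train_dir, test_dir, test_fraction=0.2, seed=42):
    """Copy each category's files into separate train and test trees.

    Assumes src contains one subdirectory per category (hypothetical layout).
    Returns {category: (n_train, n_test)}.
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    counts = {}
    for category in sorted(p for p in Path(src).iterdir() if p.is_dir()):
        files = sorted(f for f in category.iterdir() if f.is_file())
        rng.shuffle(files)
        n_test = int(len(files) * test_fraction)
        # Keep at least one test document when a category has two or more files
        if n_test == 0 and len(files) > 1:
            n_test = 1
        for i, f in enumerate(files):
            root = Path(test_dir) if i < n_test else Path(train_dir)
            target = root / category.name
            target.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target / f.name)
        counts[category.name] = (len(files) - n_test, n_test)
    return counts
```

You would then run the prepare step once on the train tree and once on the
test tree, train on the first and test on the second.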
Regards
Sreejith
On Fri, Nov 19, 2010 at 1:15 PM, Divya <divya@k2associates.com.sg> wrote:
> For my first question, you say we can put our own input documents in the
> directory. Should those documents also be in a format similar to
> bayes-train-input?
> If yes, then I generated my input data using PrepareTwentyNewsgroups
> and used that as my input for testclassifier,
> but didn't get the expected results.
> As I observed, it didn't read the files in my input directory.
> I tried replacing one of the files in the input directory with one of the
> files from the train-input directory.
> Still the same result.
> Why is it not reading my files?
>
> Results below :
>
> 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore:
> comp.sys.mac.hardware -121323.6282757108 547567.2698760114
> -0.2215684445551005
> 2
> 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: sci.space
> -189203.04544769705 547567.2698760114 -0.3455338838834164
> 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: rec.motorcycles
> -138625.2628242977 547567.2698760114 -0.25316572127418674
> 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: rec.autos
> -136935.18434679657 547567.2698760114 -0.25007919917821886
> 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: comp.graphics
> -161979.38306986375 547567.2698760114 -0.29581640828631267
> 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: talk.politics.misc
> -159579.70032298338 547567.2698760114 -0.29143396455949216
> 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: sci.med
> -183835.5334355675 547567.2698760114 -0.3357314133790253
> 10/11/19 10:45:12 INFO bayes.TestClassifier:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances : 0 ?%
> Incorrectly Classified Instances : 0 ?%
> Total Classified Instances : 0
>
> =======================================================
> Confusion Matrix
> -------------------------------------------------------
> a b c d e f g h i j k l m n o p q r s t <--Classified as
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 a = rec.sport.baseball
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 b = sci.crypt
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 c = rec.sport.hockey
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 d = talk.politics.guns
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 e = soc.religion.christian
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 f = sci.electronics
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 g = comp.os.ms-windows.misc
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 h = misc.forsale
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 i = talk.religion.misc
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 j = alt.atheism
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 k = comp.windows.x
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 l = talk.politics.mideast
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 m = comp.sys.ibm.pc.hardware
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 n = comp.sys.mac.hardware
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 o = sci.space
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 p = rec.motorcycles
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 q = rec.autos
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 r = comp.graphics
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 s = talk.politics.misc
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 t = sci.med
> Default Category: unknown: 20
>
>
> 10/11/19 10:45:12 INFO driver.MahoutDriver: Program took 5485 ms
>
> Am I missing anything?
>
>
> Coming to my second question: that means we are testing the classifier
> against our own inputs.
> I still don't understand.
> My understanding of classification is that we have a set of documents which
> will act as the model for classifying new documents in the system.
> Am I right?
> Doesn't Mahout work the same way?
>
> Third question: yes, I am looking for Mahout's API for classification.
>
>
> @ Jaganadh - Thanks for clearing my doubts
>
> Regards,
> Divya
>
>
> -----Original Message-----
> From: JAGANADH G [mailto:jaganadhg@gmail.com]
> Sent: Friday, November 19, 2010 3:09 PM
> To: user@mahout.apache.org
> Subject: Re: classification example doubts
>
> >
> > 1) I want to know what should go in "bayes-test-input".
> >
> >
> After preparing the 20news-group data for training you can separate some
> documents for testing your classifier.
> These documents should go to "bayes-test-input".
>
> Or you can even put a new set of documents in the directory.
>
>
> > 2) If we take Wikipedia example
> > https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html
> >
> >
> >
> > To train the classifier, we used wikipediainput to generate the model.
> >
> > To test the classifier, we again used wikipediamodel as input and
> > wikipediainput as the test documents directory.
> >
> > I don't understand why we are doing so.
> >
> >
>
> We are testing the classifier against the development set we used.
>
>
>
> > 3) The last thing I want to know: when we run testclassifier from the
> > command line, we can see the output.
> >
> > How can we make use of this output?
> >
>
>
> Are you looking for Mahout API usage for classification?
>
> --
> **********************************
> JAGANADH G
> http://jaganadhg.freeflux.net/blog
>
>