mahout-user mailing list archives

From Frank Wang <wangfan...@gmail.com>
Subject Re: FW: classification example doubts
Date Mon, 06 Dec 2010 14:14:43 GMT
Excuse me, there was a typo in my previous post: the train and test calls were
reversed.
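
With that correction applied, the two calls from the quoted post below would presumably run in the following order. This is only a sketch: the flags are copied unchanged from that post, the /path/to/mahout fallback is a placeholder, and the commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch only: prints the two calls in the corrected order (train, then test).
# Flags are copied from the quoted post; /path/to/mahout is a placeholder
# used when MAHOUT_HOME is not set.
MAHOUT="${MAHOUT_HOME:-/path/to/mahout}/bin/mahout"

corrected_steps() {
  # 1. Train: build the model from the prepared input
  echo "$MAHOUT trainclassifier -i 20news-input -o newsmodel -type bayes -ng 1 -source hdfs"
  # 2. Test: evaluate the trained model against the input
  echo "$MAHOUT testclassifier -m newsmodel -d 20news-input -type bayes -ng 1 -source hdfs -method sequential"
}

corrected_steps
```

To actually run the steps, the printed lines can be pasted into a shell once MAHOUT_HOME points at a real installation.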

On Mon, Dec 6, 2010 at 6:12 AM, Frank Wang <wangfanjie@gmail.com> wrote:

> I'm seeing this problem on Ubuntu as well.
>
> *Issue 1:*
> The test results are all 0's.
> http://pastebin.com/CicVMpST
>
> The steps are:
> 1. Train:
> $MAHOUT_HOME/bin/mahout testclassifier   -m newsmodel   -d 20news-input
> -type bayes   -ng 1   -source hdfs   -method sequential
>
> 2. Test
> $MAHOUT_HOME/bin/mahout trainclassifier   -i 20news-input   -o newsmodel
> -type bayes   -ng 1   -source hdfs
>
> The output is all 0's.
>
> *Issue 2:*
> Also, when I do
> "bin/mahout trainclassifier
> -i examples/bin/work/20news-bydate/bayes-train-input
> -o examples/bin/work/20news-bydate/bayes-model"
>
> I get the error
> "Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> Input path does not exist:
> hdfs://localhost:9000/user/root/examples/bin/work/20news-bydate/bayes-train-input"
>
> I dug into the code, and it seems that trainclassifier only accepts HDFS or
> HBase as a source. Is there a way to read files directly from a local directory?
>
>
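
On Issue 2 above: the error message itself shows that the relative input path was resolved under the HDFS home directory of the user running the job (/user/root here), not on the local disk. A minimal sketch of that resolution follows; hdfs_resolve is a hypothetical helper written for illustration, not part of Hadoop or Mahout, and the hadoop fs commands are printed rather than run.

```shell
#!/bin/sh
# Sketch only. hdfs_resolve is a hypothetical helper, not a Hadoop or Mahout tool.
# It mimics how a relative path ends up under the user's HDFS home directory.
hdfs_resolve() {
  fs="$1"; user="$2"; path="$3"
  case "$path" in
    /*) echo "$fs$path" ;;               # absolute path: used as given
    *)  echo "$fs/user/$user/$path" ;;   # relative path: placed under /user/<user>
  esac
}

REL=examples/bin/work/20news-bydate/bayes-train-input

# Prints the same location that appears in the InvalidInputException message.
hdfs_resolve hdfs://localhost:9000 root "$REL"

# Printed, not executed, so the sketch is safe to run without a cluster.
echo "hadoop fs -put $REL $REL"
echo "hadoop fs -ls $REL"
```

If the prepared data only exists on the local disk, copying it into HDFS at the same relative path (the printed hadoop fs -put line) should let trainclassifier find it; depending on the Hadoop version, the parent directories may need to be created first.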
> On Tue, Nov 23, 2010 at 2:05 AM, Divya <divya@k2associates.com.sg> wrote:
>
>> Hi,
>>
>> I am able to get results when I run the test classifier.
>> You can view my results at http://pastebin.com/D5ejTwEW
>>
>> Steps I followed:
>> 1) Generate the input data set to train the classifier
>> $ bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups
>>  -p examples/bin/work/20news-bydate/20news-bydate-train
>>  -o examples/bin/work/20news-bydate/bayes-train-input
>>  -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8
>> 2) Generate the input data set to test the classifier
>> $ bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups
>>  -p examples/bin/work/20news-bydate/20news-bydate-test
>>  -o examples/bin/work/20news-bydate/bayes-test-input
>>  -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8
>> 3) Train the classifier
>> $ bin/mahout trainclassifier
>>  -i examples/bin/work/20news-bydate/bayes-train-input
>>  -o examples/bin/work/20news-bydate/bayes-model
>> 4) Test the classifier
>> $ bin/mahout testclassifier
>>  -m examples/bin/work/20news-bydate/bayes-model
>>  -d examples/bin/work/20news-bydate/bayes-test-input
>>
>> I have not passed any parameters except the required ones.
>>
>> But when I pass other parameters such as -type bayes -ng 3 -source hdfs,
>> I do not get the expected results.
>> Can anyone please explain the reason behind this?
>>
>> Thanks
>> Regards,
>> Divya
>>
>>
>> -----Original Message-----
>> From: Divya [mailto:divya@k2associates.com.sg]
>> Sent: Tuesday, November 23, 2010 1:40 PM
>> To: 'user@mahout.apache.org'
>> Subject: RE: classification example doubts
>>
>> I am following the same steps,
>> but with no success...
>>
>> -----Original Message-----
>> From: Sreejith S [mailto:srssreejith@gmail.com]
>> Sent: Friday, November 19, 2010 4:00 PM
>> To: user@mahout.apache.org
>> Subject: Re: classification example doubts
>>
>> Step 1: You can provide your own sample data set using the prepare20news
>> example;
>>  just provide your input dir. This performs some normalization on each
>> file. This is a must.
>>
>> Step 2: Train the classifier with the normalized list of files.
>> You get a model dir which contains the trained data set in HDFS.
>>
>> Step 3: Test the classifier.
>> Using the trained model and sample input, you can test the classifier.
>>
>> Regards
>> Sreejith
>>
>>
>> On Fri, Nov 19, 2010 at 1:15 PM, Divya <divya@k2associates.com.sg> wrote:
>>
>> > For my first question, you say we can put our own input documents in the
>> > directory. Should those documents also be in a format similar to
>> > bayes-train-input?
>> > If yes, then I generated my input data using PrepareTwentyNewsgroups
>> > and used that as my input for testclassifier,
>> > but didn't get the expected results.
>> > As far as I observed, it didn't read the files in my input directory.
>> > I tried replacing one of the files in the input directory with one of the
>> > files from the train-input directory.
>> > Still the same result.
>> > Why is it not reading my files?
>> >
>> > Results below :
>> >
>> > 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: comp.sys.mac.hardware -121323.6282757108 547567.2698760114 -0.22156844455510052
>> > 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: sci.space -189203.04544769705 547567.2698760114 -0.3455338838834164
>> > 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: rec.motorcycles -138625.2628242977 547567.2698760114 -0.25316572127418674
>> > 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: rec.autos -136935.18434679657 547567.2698760114 -0.25007919917821886
>> > 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: comp.graphics -161979.38306986375 547567.2698760114 -0.29581640828631267
>> > 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: talk.politics.misc -159579.70032298338 547567.2698760114 -0.29143396455949216
>> > 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: sci.med -183835.5334355675 547567.2698760114 -0.3357314133790253
>> > 10/11/19 10:45:12 INFO bayes.TestClassifier:
>> > =======================================================
>> > Summary
>> > -------------------------------------------------------
>> > Correctly Classified Instances          :          0             ?%
>> > Incorrectly Classified Instances        :          0             ?%
>> > Total Classified Instances              :          0
>> >
>> > =======================================================
>> > Confusion Matrix
>> > -------------------------------------------------------
>> > a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  <--Classified as
>> > 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   |  0  a = rec.sport.baseball
>> > 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   |  0  b = sci.crypt
>> > 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   |  0  c = rec.sport.hockey
>> > 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   |  0  d = talk.politics.guns
>> > 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   |  0  e = soc.religion.christian
>> > 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   |  0  f = sci.electronics
>> > 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   |  0  g = comp.os.ms-windows.misc
>> > 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   |  0  h = misc.forsale
>> > 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   |  0  i = talk.religion.misc
>> > 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   |  0  j = alt.atheism
>> > 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   |  0  k = comp.windows.x
>> > 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   |  0  l = talk.politics.mideast
>> > 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   |  0  m = comp.sys.ibm.pc.hardware
>> > 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   |  0  n = comp.sys.mac.hardware
>> > 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   |  0  o = sci.space
>> > 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   |  0  p = rec.motorcycles
>> > 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   |  0  q = rec.autos
>> > 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   |  0  r = comp.graphics
>> > 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   |  0  s = talk.politics.misc
>> > 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   |  0  t = sci.med
>> > Default Category: unknown: 20
>> >
>> >
>> > 10/11/19 10:45:12 INFO driver.MahoutDriver: Program took 5485 ms
>> >
>> > Am I missing anything?
>> >
>> >
>> > Coming to my second question: that means we are testing the classifier
>> > against our own inputs.
>> > I still don't understand.
>> > What I understood about classification is that we have a set of documents
>> > which will act as the model for classifying new documents in the system.
>> > Am I right?
>> > Doesn't Mahout work in the same way?
>> >
>> > On the third question: yes, I am looking for Mahout's API for classification.
>> >
>> >
>> > @ Jaganadh - Thanks for clearing my doubts
>> >
>> > Regards,
>> > Divya
>> >
>> >
>> > -----Original Message-----
>> > From: JAGANADH G [mailto:jaganadhg@gmail.com]
>> > Sent: Friday, November 19, 2010 3:09 PM
>> > To: user@mahout.apache.org
>> > Subject: Re: classification example doubts
>> >
>> > >
>> > > 1)      I want to  know what should go in "bayes-test-input".
>> > >
>> > >
>> > After preparing the 20news-group data for training you can separate some
>> > documents for testing your classifier.
>> > These documents should go to "bayes-test-input".
>> >
>> > Or you can even put a new set of documents in the directory.
>> >
>> >
>> > > 2)      If we take Wikipedia example
>> > > https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html
>> > >
>> > >
>> > >
>> > > For trainclassifier, we used wikipediainput to generate the model.
>> > >
>> > > For testclassifier, we again used wikipediamodel as the model and
>> > > wikipediainput as the test documents directory.
>> > >
>> > > I didn't understand why we are doing so.
>> > >
>> > >
>> >
>> > We are testing the classifier against the development set we used.
>> >
>> >
>> >
>> > > 3)      Lastly, when we run testclassifier from the
>> > > command line we can see the output.
>> > >
>> > > How can we make use of this output?
>> > >
>> >
>> >
>> > Are you looking for Mahout API usage for classification?
>> >
>> > --
>> > **********************************
>> > JAGANADH G
>> > http://jaganadhg.freeflux.net/blog
>> >
>> >
>>
>>
>
