Date: Mon, 6 Dec 2010 06:12:55 -0800
Subject: Re: FW: classification example doubts
From: Frank Wang
To: user@mahout.apache.org
In-Reply-To: <00d901cb8af6$07956410$16c02c30$@com.sg>
References: <00d901cb8af6$07956410$16c02c30$@com.sg>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001517478c100a180b0496be7c56

--001517478c100a180b0496be7c56
Content-Type: text/plain; charset=ISO-8859-1

I'm seeing this problem on Ubuntu as well.

*Issue 1:*
Test result is all 0's.
http://pastebin.com/CicVMpST

The steps are:
1. Train:
$MAHOUT_HOME/bin/mahout trainclassifier -i 20news-input -o newsmodel -type bayes -ng 1 -source hdfs

2. Test:
$MAHOUT_HOME/bin/mahout testclassifier -m newsmodel -d 20news-input -type bayes -ng 1 -source hdfs -method sequential

The outputs are all 0's.

*Issue 2:*
Also, when I run

bin/mahout trainclassifier -i examples/bin/work/20news-bydate/bayes-train-input -o examples/bin/work/20news-bydate/bayes-model

I get the error

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:9000/user/root/examples/bin/work/20news-bydate/bayes-train-input

I dug into the code, and it seems that trainclassifier only accepts HDFS or HBase as a source. Is there a way to read files directly from a local directory?

On Tue, Nov 23, 2010 at 2:05 AM, Divya wrote:

> Hi,
>
> I am able to get the results when I run the test classifier.
> You can view my results at http://pastebin.com/D5ejTwEW
>
> Steps I followed:
> 1) Generate the input data set to train the classifier
> bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups -p examples/bin/work/20news-bydate/20news-bydate-train -o examples/bin/work/20news-bydate/bayes-train-input -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8
> 2) Generate the input data set to test the classifier
> bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups -p examples/bin/work/20news-bydate/20news-bydate-test -o examples/bin/work/20news-bydate/bayes-test-input -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8
> 3) Train the classifier
> bin/mahout trainclassifier -i examples/bin/work/20news-bydate/bayes-train-input -o examples/bin/work/20news-bydate/bayes-model
> 4) Test the classifier
> bin/mahout testclassifier -m examples/bin/work/20news-bydate/bayes-model -d examples/bin/work/20news-bydate/bayes-test-input
>
> I have not passed any parameters except the required ones.
>
> But when I pass other parameters, such as -type bayes -ng 3 -source hdfs,
> I do not get the expected results.
> Can anyone please explain the reason behind this?
>
> Thanks
> Regards,
> Divya
>
>
> -----Original Message-----
> From: Divya [mailto:divya@k2associates.com.sg]
> Sent: Tuesday, November 23, 2010 1:40 PM
> To: 'user@mahout.apache.org'
> Subject: RE: classification example doubts
>
> I am following the same steps,
> but no success...
>
> -----Original Message-----
> From: Sreejith S [mailto:srssreejith@gmail.com]
> Sent: Friday, November 19, 2010 4:00 PM
> To: user@mahout.apache.org
> Subject: Re: classification example doubts
>
> step 1 : You can provide your own sample data set using the prepare20news example;
> just provide your input dir. This performs some normalization on each file.
> This is a must.
>
> step 2 : Train the classifier with the normalized list of files.
> You get a model dir which contains the trained data set in HDFS.
>
> step 3 : Test the classifier
> By using the trained model and sample input you can test the classifier.
>
> Regards
> Sreejith
>
>
> On Fri, Nov 19, 2010 at 1:15 PM, Divya wrote:
>
> > For my first question, you say we can put our own input documents in the directory.
> > Should those documents also be in a format similar to bayes-train-input?
> > If yes, then I generated my input data using PrepareTwentyNewsgroups
> > and used that as my input for testclassifier,
> > but didn't get the expected results.
> > As far as I observed, it didn't read the files in my input directory.
> > I tried replacing one of the files in the input directory with one of the files
> > from the train-input directory.
> > Still the same result.
> > Why is it not reading my files?
> >
> > Results below:
> >
> > 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: comp.sys.mac.hardware -121323.6282757108 547567.2698760114 -0.22156844455510052
> > 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: sci.space -189203.04544769705 547567.2698760114 -0.3455338838834164
> > 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: rec.motorcycles -138625.2628242977 547567.2698760114 -0.25316572127418674
> > 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: rec.autos -136935.18434679657 547567.2698760114 -0.25007919917821886
> > 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: comp.graphics -161979.38306986375 547567.2698760114 -0.29581640828631267
> > 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: talk.politics.misc -159579.70032298338 547567.2698760114 -0.29143396455949216
> > 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: sci.med -183835.5334355675 547567.2698760114 -0.3357314133790253
> > 10/11/19 10:45:12 INFO bayes.TestClassifier:
> > =======================================================
> > Summary
> > -------------------------------------------------------
> > Correctly Classified Instances : 0 ?%
> > Incorrectly Classified Instances : 0 ?%
> > Total Classified Instances : 0
> >
> > =======================================================
> > Confusion Matrix
> > -------------------------------------------------------
> > a b c d e f g h i j k l m n o p q r s t <--Classified as
> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 a = rec.sport.baseball
> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 b = sci.crypt
> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 c = rec.sport.hockey
> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 d = talk.politics.guns
> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 e = soc.religion.christian
> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 f = sci.electronics
> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 g = comp.os.ms-windows.misc
> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 h = misc.forsale
> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 i = talk.religion.misc
> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 j = alt.atheism
> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 k = comp.windows.x
> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 l = talk.politics.mideast
> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 m = comp.sys.ibm.pc.hardware
> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 n = comp.sys.mac.hardware
> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 o = sci.space
> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 p = rec.motorcycles
> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 q = rec.autos
> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 r = comp.graphics
> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 s = talk.politics.misc
> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 t = sci.med
> > Default Category: unknown: 20
> >
> >
> > 10/11/19 10:45:12 INFO driver.MahoutDriver: Program took 5485 ms
> >
> > Am I missing anything?
> >
> >
> > Coming to my second question: that means we are testing the classifier
> > against our own inputs.
> > I still don't understand.
> > What I understood about classification is that we have a set of documents which
> > will act as the model for classifying new documents in the system.
> > Am I right?
> > Doesn't Mahout work the same way?
> >
> > Third question: yes, I am looking for Mahout's API for classification.
> >
> >
> > @ Jaganadh - Thanks for clearing my doubts
> >
> > Regards,
> > Divya
> >
> >
> > -----Original Message-----
> > From: JAGANADH G [mailto:jaganadhg@gmail.com]
> > Sent: Friday, November 19, 2010 3:09 PM
> > To: user@mahout.apache.org
> > Subject: Re: classification example doubts
> >
> > >
> > > 1) I want to know what should go in "bayes-test-input".
> > >
> >
> > After preparing the 20news-group data for training you can separate some
> > documents for testing your classifier.
> > These documents should go to "bayes-test-input".
> >
> > Or you can even put a new set of documents in the directory.
> >
> > >
> > > 2) If we take the Wikipedia example
> > > https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html
> > >
> > > To trainclassifier we have used Wikipediainput to generate the model.
> > >
> > > To test the classifier we again used wikipediamodel as input and Wikipedia input
> > > as the test documents directory.
> > >
> > > I didn't understand why we are doing so.
> > >
> >
> > We are testing the classifier against the development set we used.
> >
> > >
> > > 3) Last thing I want to know: when we run testclassifier
> > > from the command line we can see the output.
> > >
> > > How can we make use of this output?
> > >
> >
> > Are you looking for Mahout API usage for classification?
> >
> > --
> > **********************************
> > JAGANADH G
> > http://jaganadhg.freeflux.net/blog
> >
>
--001517478c100a180b0496be7c56--
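
A note on Issue 2 at the top of this message: the error shows the relative input path being resolved against hdfs://localhost:9000/user/root/, so one thing worth trying is to copy the locally prepared directory into HDFS under the same relative path before training. The sketch below is only an illustration under that assumption and has not been run against the setup in this thread. DRY_RUN, run and stage_and_train are made-up names, and by default the script only prints the commands it would execute. It assumes a stock hadoop client on the PATH.

```shell
#!/bin/sh
# Sketch only: stage a locally prepared input directory in HDFS, then train.
# DRY_RUN, run and stage_and_train are illustrative names, not part of Mahout.
DRY_RUN="${DRY_RUN:-1}"

# In dry-run mode print the command; otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1 = locally prepared input dir, $2 = model output dir (relative paths)
stage_and_train() {
  input="$1"
  model="$2"
  # Create the parent directory in HDFS. Assumption: on newer Hadoop
  # releases this step may need "-mkdir -p" to create missing parents.
  run hadoop fs -mkdir "$(dirname "$input")"
  # Copy the local files to the same relative path in HDFS, which is
  # where the error message says the job looks for them.
  run hadoop fs -put "$input" "$input"
  # Train with only the required options, as in the steps quoted above.
  run bin/mahout trainclassifier -i "$input" -o "$model"
}

stage_and_train examples/bin/work/20news-bydate/bayes-train-input \
                examples/bin/work/20news-bydate/bayes-model
```

Setting DRY_RUN=0 makes the script execute the commands instead of printing them. Checking the result with "hadoop fs -ls" on the input path before training would confirm the files actually landed in HDFS.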