Subject: Re: classification example doubts
From: Sreejith S
To: user@mahout.apache.org
Date: Fri, 19 Nov 2010 13:30:15 +0530

Step 1: You can provide your own sample data set using the prepare20news
example; just provide your input dir. This performs some normalization on
each file, and it is a must.

Step 2: Train the classifier with the normalized list of files. You get a
model dir in HDFS which contains the trained data set.

Step 3: Test the classifier. Using the trained model and sample input, you
can test the classifier.

Regards
Sreejith

On Fri, Nov 19, 2010 at 1:15 PM, Divya wrote:

> For my first question, you say we can put our own input documents in the
> directory. Should those documents also be in a format similar to
> bayes-train-input?
> If yes: I generated my input data using PrepareTwentyNewsgroups
> and used that as my input for testclassifier,
> but I didn't get the expected results.
> As far as I can tell, it didn't read the files in my input directory.
> I tried replacing one of the files in the input directory with one of the
> files from the train-input directory.
> Still the same result.
> Why is it not reading my files?
>
> Results below:
>
> 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: comp.sys.mac.hardware -121323.6282757108 547567.2698760114 -0.22156844455510052
> 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: sci.space -189203.04544769705 547567.2698760114 -0.3455338838834164
> 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: rec.motorcycles -138625.2628242977 547567.2698760114 -0.25316572127418674
> 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: rec.autos -136935.18434679657 547567.2698760114 -0.25007919917821886
> 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: comp.graphics -161979.38306986375 547567.2698760114 -0.29581640828631267
> 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: talk.politics.misc -159579.70032298338 547567.2698760114 -0.29143396455949216
> 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: sci.med -183835.5334355675 547567.2698760114 -0.3357314133790253
> 10/11/19 10:45:12 INFO bayes.TestClassifier:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances   : 0  ?%
> Incorrectly Classified Instances : 0  ?%
> Total Classified Instances       : 0
>
> =======================================================
> Confusion Matrix
> -------------------------------------------------------
> a b c d e f g h i j k l m n o p q r s t <--Classified as
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 a = rec.sport.baseball
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 b = sci.crypt
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 c = rec.sport.hockey
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 d = talk.politics.guns
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 e = soc.religion.christian
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 f = sci.electronics
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 g = comp.os.ms-windows.misc
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 h = misc.forsale
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 i = talk.religion.misc
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 j = alt.atheism
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 k = comp.windows.x
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 l = talk.politics.mideast
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 m = comp.sys.ibm.pc.hardware
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 n = comp.sys.mac.hardware
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 o = sci.space
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 p = rec.motorcycles
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 q = rec.autos
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 r = comp.graphics
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 s = talk.politics.misc
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 t = sci.med
> Default Category: unknown: 20
>
> 10/11/19 10:45:12 INFO driver.MahoutDriver: Program took 5485 ms
>
> Am I missing anything?
>
> Coming to my second question: that means we are testing the classifier
> against our own inputs.
> I still don't understand.
> What I understood about classification is that we have a set of documents
> which acts as the model for classifying new documents in the system.
> Am I right?
> Doesn't Mahout work the same way?
>
> Third question: yes, I am looking for Mahout's API for classification.
>
> @Jaganadh - Thanks for clearing my doubts.
>
> Regards,
> Divya
>
> -----Original Message-----
> From: JAGANADH G [mailto:jaganadhg@gmail.com]
> Sent: Friday, November 19, 2010 3:09 PM
> To: user@mahout.apache.org
> Subject: Re: classification example doubts
>
> > 1) I want to know what should go in "bayes-test-input".
>
> After preparing the 20news-group data for training, you can separate some
> documents for testing your classifier.
> These documents should go to "bayes-test-input".
>
> Or you can even put a new set of documents in the directory.
>
> > 2) If we take the Wikipedia example
> > https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html
> >
> > For trainclassifier we used Wikipediainput to generate the model.
> >
> > For testclassifier we again used wikipediamodel as input and Wikipedia
> > input as the test documents directory.
> >
> > I didn't understand why we are doing so.
>
> We are testing the classifier against the development set we used.
>
> > 3) Last thing I want to know: when we run testclassifier using the
> > command line, we can see the output.
> >
> > How can we make use of this output?
>
> Are you looking for Mahout API usage for classification?
>
> --
> **********************************
> JAGANADH G
> http://jaganadhg.freeflux.net/blog
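
P.S. A small illustration of the hold-out idea from the quoted reply (setting aside some of the prepared documents for "bayes-test-input" before training). This is only a sketch in plain Python, not Mahout code: the function name hold_out_split, the 20% test fraction, the seed, and the sample document IDs are all made up for illustration.

```python
import random


def hold_out_split(docs_by_category, test_fraction=0.2, seed=42):
    """Split each category's documents into a training part and a test part.

    docs_by_category maps a category name to a list of document IDs.
    Hypothetical helper for illustration; it is not part of Mahout.
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = {}, {}
    for category, docs in docs_by_category.items():
        shuffled = list(docs)  # copy so the caller's list is not modified
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        # Keep at least one test document when a category has two or more.
        if n_test == 0 and len(shuffled) > 1:
            n_test = 1
        test[category] = shuffled[:n_test]
        train[category] = shuffled[n_test:]
    return train, test


if __name__ == "__main__":
    # Made-up document IDs, only to show the shape of the result.
    sample = {
        "sci.space": ["doc%d" % i for i in range(10)],
        "rec.autos": ["doc%d" % i for i in range(5)],
    }
    train, test = hold_out_split(sample)
    for category in sample:
        print(category, len(train[category]), "train /",
              len(test[category]), "test")
```

The train part would be used to build the model and the test part would go into the test directory. Testing on documents the model has not seen gives a more honest accuracy figure than testing against the same set that was used for training.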