Subject: Re: Naive Bayes Classifier Sentiment Analysis
From: luca filipponi
Date: Thu, 11 Sep 2014 11:16:14 +0200
To: Mailing list mahout

I've finally implemented a Naive Bayes classifier for sentiment analysis and it works quite well, but I have a few questions.

The training phase creates a .bin file that is the model of the classifier. I've tried to read it but I can't. What does the .bin file represent?

I'm asking because I'd like to understand better how the classifier works. Where can I read something about its implementation?

Thanks in advance; your help was irreplaceable in creating my classifier.

On 29 Jul 2014, at 18:40, vaibhav srivastava wrote:

> Hi Filipponi,
> In this case testnb will not work, because the end part of its code takes the
> label to print the confusion matrix.
>
> If you want to use your model to predict the possible outcomes,
> you have to use the class "TestNaiveBayesDriver.java" to write that.
>
> and comment out this section:
>
> /*if (bestIdx != Integer.MIN_VALUE) {
>     ClassifierResult classifierResult = new ClassifierResult(labelMap.get(bestIdx), bestScore);
>     analyzer.addInstance(pair.getFirst().toString(), classifierResult);
> }
> */
>
> In that case the output file of BayesTestMapper is going to store the values
> for you; if you use seqdumper you can get the values for the key
> "471685156584292353".
>
> Or suppose:
>
> Key: /471685156584292353/ Value:/471685156584292353/:{1:0.19424138174284086,24:0.19424138174284086,25:0.1810660431557166,44:0.19424138174284086,78:0.19424138174284086
>
> NaiveBayesModel model = NaiveBayesModel.materialize(output, conf); // output path of the model
> classifier = new ComplementaryNaiveBayesClassifier(model);
> classifier.classifyFull(vector); // this returns a vector of probabilities in 1 of n-1 encoding for your label
>
> The input will be the vector
> {1:0.19424138174284086,24:0.19424138174284086,25:0.1810660431557166,44:0.19424138174284086,78:0.19424138174284086
> }
>
> Thanks
> Vaibhav.
>
> On Tue, Jul 29, 2014 at 9:06 PM, Luca Filipponi wrote:
>
>> I appreciate your help, but because of my lack of knowledge I didn't understand.
>>
>> I'll try to explain my problem better :D
>>
>> What I've done is create a sequence file starting from a csv like this
>> (these are Italian tweets :D ):
>>
>> negativo,471685156584292353, @beppe_grillo intanto .. Piangi tu ... Per adesso io rido !!!!!
>>
>> positivo,471685170698149888,RT @carlucci_cc: @valy_s renzie si preoccupa di chi gli garantisce voti...ma stanno scoprendo il prezzo di quei fottutissimi #80euro dagli ...
>>
>> neutrale,471685174426886144,Di #elezioni, di venditori di fumo e di altre schifezze... http://t.co/euFbtP7hQ1 ...
>> #Europee2014 via
>>
>> So I create a sequence file in this way:
>>
>> String[] tokens = line.split(",", 3);
>>
>> String label = tokens[0];
>> String id = tokens[1];
>> String message = tokens[2];
>> key.set("/" + label + "/" + id);
>> value.set(message);
>> writer.append(key, value);
>>
>> So I'm creating a sequence file where the key is composed like this:
>> "/label/documentID/", and the value contains the original text of the document.
>>
>> After this step I create the tfidf documents using the mahout utilities, so I have
>> a sequence file Text,VectorWritable like this:
>>
>> Key: /negativo/468437278663409666
>> Value:/negativo/468437278663409666:{143:0.2884088933275849,233:0.2884088933275849,241:0.2772479861583959,309:0.22061363650715415}
>>
>> Then I run this command on the newly created vectors:
>>
>> ./mahout trainnb -i tfidf-vectors -el -li labelindex -o model -ow -c
>>
>> And then:
>>
>> ./mahout testnb -i tfidf-vector -m model -l labelindex -ow -o trainingVectorTest-result -c
>>
>> and this is the output:
>>
>> 14/07/25 15:44:04 INFO test.TestNaiveBayesDriver: Complementary Results:
>> =======================================================
>> Summary
>> -------------------------------------------------------
>> Correctly Classified Instances          :        112       99,115%
>> Incorrectly Classified Instances        :          1        0,885%
>> Total Classified Instances              :        113
>>
>> =======================================================
>> Confusion Matrix
>> -------------------------------------------------------
>> a     b     c     <--Classified as
>> 47    0     0      |  47     a     = negativo
>> 0     41    0      |  41     b     = neutrale
>> 0     1     24     |  25     c     = positivo
>>
>> =======================================================
>> Statistics
>> -------------------------------------------------------
>> Kappa                                       0,9361
>> Accuracy                                   99,115%
>> Reliability                                    74%
>> Reliability (standard deviation)            0,4937
>>
>> What I want to do now is use the classifier on a new dataset that is
>> unlabeled, so I have a csv like this:
>>
>> 471685156584292353,@beppe_grillo intanto .. Piangi tu ... Per adesso io rido !!!!!
>>
>> So I wrote a sequence file with:
>>
>> key= /documentid/ value= Content of the document
>>
>> and then used the mahout utilities to create a tfidf-vector:
>>
>> Key: /471685156584292353/
>> Value:/471685156584292353/:{1:0.19424138174284086,24:0.19424138174284086,25:0.1810660431557166,44:0.19424138174284086,78:0.19424138174284086
>> ...
>>
>> But when I run the testnb command on this new dataset I get this exception:
>>
>> java.lang.IllegalArgumentException: Label not found: 471685156584292353
>>
>> I know this is because the documentID is recognized as the label, but I
>> don't know how to resolve it. It would be great if you could provide a
>> similar example, because I can't find anything similar.
>>
>> Thank you so much in advance, your help is really appreciated.
>>
>> Luca Filipponi.
>>
>> On 29 Jul 2014, at 16:43, vaibhav srivastava <vaibhavcse30@gmail.com> wrote:
>>
>>> Hi
>>> The sequence file format will be Text and VectorWritable.
>>> Suppose you have test documents named 1, 2, 3, 4.
>>> Then you can have the sequence file format as Key : /test/1 Value :
>>> /test/2 Value :
>>>
>>> This line in BayesTestMapper:
>>> //the key is the expected value
>>>
>>> context.write(new Text(SLASH.split(key.toString())[1]), new VectorWritable(result));
>>>
>>> and TestNaiveBayesDriver.java might help you.
>>> If you remove this part from this code you will not get the confusion
>>> matrix, and the initial labels are not required.
>>>
>>> if (bestIdx != Integer.MIN_VALUE) {
>>>     ClassifierResult classifierResult = new ClassifierResult(labelMap.get(bestIdx), bestScore);
>>>     analyzer.addInstance(pair.getFirst().toString(), classifierResult);
>>> }
>>>
>>> Your output file will contain your document name (suppose 1) and the label
>>> vector with its values.
>>>
>>> Hope this helps.
>>>
>>> Thanks,
>>> Vaibhav
>>> vaibhavcse30@gmail.com
>>>
>>> On Tue, Jul 29, 2014 at 7:16 PM, Luca Filipponi <luca.filipponi89@gmail.com> wrote:
>>>
>>>> I am using mahout 0.9; which part of the source code should I look at?
>>>>
>>>> My problem is that I don't know how the sequence file without the label
>>>> should be structured.
>>>>
>>>> Do you have any hint?
>>>>
>>>> On 29 Jul 2014, at 15:24, vaibhav srivastava <vaibhavcse30@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> If you want to create a test set and you do not want to measure accuracy,
>>>>> then you can make an instance of the classifier, load your model on that
>>>>> classifier, and then find the best score.
>>>>> Look at the naive bayes test code.
>>>>> Hope this helps. Thanks.
>>>>> On 29 Jul 2014 12:53, "Luca Filipponi" wrote:
>>>>>
>>>>>> Hi, I am trying to develop sentiment analysis on Italian tweets from
>>>>>> Twitter using the naive bayes classifier, but I'm having some trouble.
>>>>>>
>>>>>> My idea was to classify a lot of tweets as positive, negative or neutral,
>>>>>> and use that as the training set for the classifier.
>>>>>> To do that I've written a sequence file where the key is /label/tweetID
>>>>>> and the value is the text, and then the text of the whole dataset is
>>>>>> converted to tfidf vectors using the mahout utilities.
>>>>>>
>>>>>> Then I'm using the commands:
>>>>>>
>>>>>> ./mahout trainnb and ./mahout testnb to check the classifier, and the
>>>>>> score is right (I've got nearly 100% because the test set is the same as
>>>>>> the train set).
>>>>>>
>>>>>> My question is: if I want to use a test set that is unlabeled, how should
>>>>>> it be created? Because if the format isn't like:
>>>>>>
>>>>>> key = /label/
>>>>>>
>>>>>> the classifier can't find the label and I get an exception.
>>>>>>
>>>>>> But a new dataset will obviously be unlabeled, because I need to
>>>>>> classify it, so I don't know what to put in the key of the sequence file.
>>>>>>
>>>>>> I've searched online for examples, but the only ones I've found use the
>>>>>> split command on the original dataset and then test on part of that,
>>>>>> which isn't my case.
>>>>>>
>>>>>> Any idea for developing a better sentiment analysis is welcome; thanks
>>>>>> in advance for the help.
>>>
>>> --
>>> Thanks and Regards,
>>> Vaibhav Srivastava
>>> Email-id: vaibhavcse30@gmail.com
>>> Mobile no.: 9552543029
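
For reference, the step discussed in this thread (taking the per-label score vector that classifyFull returns and picking the label with the best score, as the bestIdx/bestScore snippet quoted from TestNaiveBayesDriver does) can be sketched in plain Java. This is only an illustration under assumptions: it does not use Mahout at all, BestLabelSketch and its methods are hypothetical names, the scores are made-up numbers, and it assumes that a higher score means a better match.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Mahout: picks the label with the highest score.
public class BestLabelSketch {

    // Returns the index of the largest score, or -1 if the array is empty.
    public static int bestIndex(double[] scores) {
        int bestIdx = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] > bestScore) {
                bestScore = scores[i];
                bestIdx = i;
            }
        }
        return bestIdx;
    }

    // Maps the best index to its label name; returns null when there are no scores.
    public static String bestLabel(double[] scores, Map<Integer, String> labelMap) {
        int idx = bestIndex(scores);
        return idx < 0 ? null : labelMap.get(idx);
    }

    public static void main(String[] args) {
        // Label index as it might look for the three classes in this thread (assumed order).
        Map<Integer, String> labelMap = new LinkedHashMap<>();
        labelMap.put(0, "negativo");
        labelMap.put(1, "neutrale");
        labelMap.put(2, "positivo");

        // Made-up scores standing in for the vector returned by the classifier.
        double[] scores = {-12.4, -15.1, -9.8};

        System.out.println(bestLabel(scores, labelMap)); // prints "positivo"
    }
}
```

With an unlabeled tweet there is no expected label to compare against, so only this "pick the best label" part applies; the confusion matrix step needs the true labels.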