mahout-user mailing list archives

From luca filipponi <luca.filippon...@gmail.com>
Subject Re: Naive Bayes Classifier Sentiment Analysis
Date Thu, 11 Sep 2014 09:16:14 GMT
I've finally implemented a Naive Bayes classifier for sentiment analysis and it works quite
well, but I have a few questions.

The training phase creates a .bin file that is the classifier's model. I tried to
read it but couldn't.

What does the .bin file represent?

I'm asking because I'd like to understand better how the classifier works. Where can I
read about its implementation?

Thanks in advance; your help was invaluable in building my classifier.


On 29 Jul 2014, at 18:40, vaibhav srivastava <vaibhavcse30@gmail.com> wrote:

> Hi Filipponi,
> In this case testnb will not work, because the final part of its code needs
> the label to print the confusion matrix.
> 
> If you want to use your model to predict the possible outcomes, you have
> to use the class "TestNaiveBayesDriver.java" to write that,
> 
> and comment out this section:
> /*if (bestIdx != Integer.MIN_VALUE) {
>        ClassifierResult classifierResult =
>            new ClassifierResult(labelMap.get(bestIdx), bestScore);
>        analyzer.addInstance(pair.getFirst().toString(), classifierResult);
>      }
> */
> In that case the output file of BayesTestMapper will store the values for
> you. If you use seqdumper, you can get the values for the key
> "471685156584292353".
> Or suppose you have:
> 
> Key: /471685156584292353/ Value:/471685156584292353/:{1:
> 0.19424138174284086,24:0.19424138174284086,25:0.1810660431557166,44:0.19424138174284086,78:0.19424138174284086
>    NaiveBayesModel model = NaiveBayesModel.materialize(output, conf); // output path of the model
>    classifier = new ComplementaryNaiveBayesClassifier(model);
>    classifier.classifyFull(vector); // returns a vector with one score per label
> The input will be the vector
> {1:0.19424138174284086,24:0.19424138174284086,25:0.1810660431557166,44:0.19424138174284086,78:0.19424138174284086}
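Once a score vector is available, choosing the predicted label amounts to taking the index of the highest score. The sketch below shows only that step in plain Java. It assumes the scores have been copied into a double[] in label-index order and that a higher score is better. The class name, the label order and the numbers are made up for illustration and are not taken from Mahout.

```java
class BestLabel {
    // Returns the index of the largest score; ties keep the first index.
    static int argMax(double[] scores) {
        if (scores.length == 0) {
            throw new IllegalArgumentException("No scores to compare");
        }
        int bestIdx = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[bestIdx]) {
                bestIdx = i;
            }
        }
        return bestIdx;
    }

    public static void main(String[] args) {
        // Hypothetical label order and scores, for illustration only.
        String[] labels = {"negativo", "neutrale", "positivo"};
        double[] scores = {-12.5, -3.0, -7.25};
        System.out.println("Predicted label: " + labels[argMax(scores)]);
    }
}
```

In a real job the label order would have to come from the label index written during training, not from a hard-coded array.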
> Thanks
> Vaibhav.
> 
> On Tue, Jul 29, 2014 at 9:06 PM, Luca Filipponi <luca.filipponi89@gmail.com>
> wrote:
> 
>> I appreciate your help, but because of my lack of knowledge I didn't
>> understand.
>> 
>> I'll try to explain my problem better :D
>> 
>> What I've done is create a sequence file starting from a CSV like this
>> (these are Italian tweets :D):
>> 
>> negativo,471685156584292353, @beppe_grillo intanto .. Piangi tu ... Per
>> adesso io rido !!!!!
>> 
>> positivo,471685170698149888,RT @carlucci_cc: @valy_s renzie si preoccupa
>> di chi gli garantisce voti...ma stanno scoprendo il prezzo di quei
>> fottutissimi #80euro dagli ...
>> 
>> neutrale,471685174426886144,Di #elezioni, di venditori di fumo e di altre
>> schifezze... http://t.co/euFbtP7hQ1 ... #Europee2014 via
>> 
>> So I create a sequence file in this way:
>> 
>> 
>> String[] tokens = line.split(",", 3);
>> 
>>            String label = tokens[0];
>>            String id = tokens[1];
>>            String message = tokens[2];
>>            key.set("/" + label + "/" + id);
>>            value.set(message);
>>            writer.append(key, value);
>> 
>> 
>> So I'm creating a sequence file of the form <Text,Text> where the key
>> has the form "/label/documentID" and the value contains the
>> original text of the document.
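The key-building step can be tried without Hadoop. The sketch below repeats the same split(",", 3) logic in plain Java, so commas inside the tweet text stay in the message. The class and method names are made up for illustration, and the trim() calls and the length check are additions not present in the original snippet.

```java
class TweetKeyBuilder {
    // Turns "label,id,message" into {key, value}, with key = "/label/id".
    // The limit of 3 keeps any commas inside the message intact.
    static String[] toKeyValue(String line) {
        String[] tokens = line.split(",", 3);
        if (tokens.length < 3) {
            throw new IllegalArgumentException("Expected label,id,message but got: " + line);
        }
        String key = "/" + tokens[0].trim() + "/" + tokens[1].trim();
        String value = tokens[2].trim();
        return new String[] {key, value};
    }

    public static void main(String[] args) {
        String line = "negativo,471685156584292353, @beppe_grillo intanto .. Piangi tu ... Per adesso io rido !!!!!";
        String[] kv = toKeyValue(line);
        System.out.println("Key: " + kv[0]);
        System.out.println("Value: " + kv[1]);
    }
}
```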
>> 
>> After this step I create TF-IDF vectors using the Mahout utilities, so I
>> then have a sequence file <Text,VectorWritable> like this:
>> 
>> Key: /negativo/468437278663409666
>> Value:/negativo/468437278663409666:{143:0.2884088933275849,233:0.2884088933275849,241:0.2772479861583959,309:0.22061363650715415}
>> 
>> Then I run this command on the newly created vectors:
>> 
>> ./mahout trainnb -i tfidf-vectors -el -li labelindex -o model -ow -c
>> 
>> And then:
>> 
>> ./mahout testnb -i tfidf-vector -m model -l labelindex -ow -o
>> trainingVectorTest-result -c
>> 
>> and this is the output:
>> 
>> 14/07/25 15:44:04 INFO test.TestNaiveBayesDriver: Complementary Results:
>> =======================================================
>> Summary
>> -------------------------------------------------------
>> Correctly Classified Instances          :        112    99,115%
>> Incorrectly Classified Instances        :          1    0,885%
>> Total Classified Instances              :        113
>> 
>> =======================================================
>> Confusion Matrix
>> -------------------------------------------------------
>> a    b    c    <--Classified as
>> 47   0    0     |  47    a     = negativo
>> 0    41   0     |  41    b     = neutrale
>> 0    1    24    |  25    c     = positivo
>> 
>> =======================================================
>> Statistics
>> -------------------------------------------------------
>> Kappa                                       0,9361
>> Accuracy                                    99,115%
>> Reliability                                     74%
>> Reliability (standard deviation)            0,4937
>> 
>> 
>> What I want to do now is use the classifier on a new, unlabeled dataset,
>> so I have a CSV like this:
>> 
>> 471685156584292353,@beppe_grillo intanto .. Piangi tu ... Per adesso io
>> rido !!!!!
>> 
>> So I wrote a sequence file with:
>> 
>> key= /documentid/ value= Content of the document
>> 
>> and then used the Mahout utilities to create the TF-IDF vectors:
>> 
>> Key: /471685156584292353/
>> Value:/471685156584292353/:{1:0.19424138174284086,24:0.19424138174284086,25:0.1810660431557166,44:0.19424138174284086,78:0.19424138174284086
>> ...
>> 
>> But when I use the command testnb on this new dataset I get this exception:
>> 
>> java.lang.IllegalArgumentException: Label not found: 471685156584292353
>> 
>> I know this happens because the document ID is interpreted as the
>> label, but I don't know how to solve it. It would be great if you could
>> provide a similar example, because I can't find anything similar.
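The BayesTestMapper line quoted in this thread splits the key on "/" and takes element [1] as the expected label. The plain-Java sketch below applies that same split to a labeled and an unlabeled key to show where the document ID ends up. The class name and the list of known labels are illustrative assumptions. The lookup here only imitates a label check and is not Mahout's own code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class LabelFromKey {
    private static final Pattern SLASH = Pattern.compile("/");

    // Same split as the quoted mapper line: element [1] is treated as the label.
    static String expectedLabel(String key) {
        return SLASH.split(key)[1];
    }

    public static void main(String[] args) {
        // Labels used in the training set described in this thread.
        List<String> knownLabels = Arrays.asList("negativo", "neutrale", "positivo");
        String[] keys = {"/negativo/468437278663409666", "/471685156584292353/"};
        for (String key : keys) {
            String label = expectedLabel(key);
            String note = knownLabels.contains(label) ? "known label" : "not a known label";
            System.out.println(key + " -> " + label + " (" + note + ")");
        }
    }
}
```

For the unlabeled key, element [1] is the document ID itself, which is consistent with the "Label not found" message above.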
>> 
>> Thank you so much in advance, your help is really appreciated.
>> 
>> Luca Filipponi.
>> 
>> 
>> On 29 Jul 2014, at 16:43, vaibhav srivastava <vaibhavcse30@gmail.com>
>> wrote:
>> 
>>> Hi
>>> The sequence file format will be Text and VectorWritable.
>>> Suppose you have test documents named 1, 2, 3, 4.
>>> Then you can have the sequence file as Key: /test/1 Value: <vectors1>,
>>> Key: /test/2 Value: <vectors2>
>>> 
>>> this line in BayesTestMapper
>>> //the key is the expected value
>>> 
>>>   context.write(new Text(SLASH.split(key.toString())[1]), new
>>> VectorWritable(result));
>>> 
>>> 
>>> TestNaiveBayesDriver.java might also help you. If you remove this part
>>> from the code, you will not get the confusion matrix and the initial
>>> labels are not required.
>>> 
>>> 
>>> 
>>> 
>>> if (bestIdx != Integer.MIN_VALUE) {
>>>       ClassifierResult classifierResult =
>>>           new ClassifierResult(labelMap.get(bestIdx), bestScore);
>>>       analyzer.addInstance(pair.getFirst().toString(), classifierResult);
>>>     }
>>> 
>>> 
>>> Your output file will contain the document name (say, 1) and the label
>>> vector with its values.
>>> 
>>> 
>>> Hope this helps.
>>> 
>>> Thanks,
>>> 
>>> Vaibhav
>>> 
>>> vaibhavcse30@gmail.com
>>> 
>>> 
>>> 
>>> 
>>> On Tue, Jul 29, 2014 at 7:16 PM, Luca Filipponi <luca.filipponi89@gmail.com>
>>> wrote:
>>> 
>>>> I am using Mahout 0.9; which part of the source code should I look at?
>>>> 
>>>> My problem is that I don't know how the sequence file without the label
>>>> should be structured.
>>>> 
>>>> Do you have any hints?
>>>> 
>>>> On 29 Jul 2014, at 15:24, vaibhav srivastava <vaibhavcse30@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hi,
>>>>> If you want to create a test set and do not want to measure accuracy,
>>>>> you can create an instance of the classifier, load your model into it,
>>>>> and then find the best score.
>>>>> Look at the naive Bayes test code.
>>>>> Hope this helps. Thanks.
>>>>> On 29 Jul 2014 12:53, "Luca Filipponi" <luca.filipponi89@gmail.com> wrote:
>>>>> 
>>>>>> Hi, I am trying to develop sentiment analysis on Italian tweets from
>>>>>> Twitter using the naive Bayes classifier, but I'm having some trouble.
>>>>>> 
>>>>>> My idea was to classify a lot of tweets as positive, negative or neutral,
>>>>>> and use them as the training set for the classifier. To do that I've
>>>>>> written a sequence file in the format <Text,Text>, where the key is
>>>>>> /label/tweetID and the value is the text; the text of the whole
>>>>>> dataset is then converted to TF-IDF vectors using the Mahout utilities.
>>>>>> 
>>>>>> Then I'm using the commands:
>>>>>> 
>>>>>> ./mahout trainnb and ./mahout testnb to check the classifier, and the
>>>>>> score is right (I get nearly 100% because the test set is the same as
>>>>>> the training set).
>>>>>> 
>>>>>> My question is: if I want to use an unlabeled test set, how should it
>>>>>> be created? If the format isn't like:
>>>>>> 
>>>>>> key = /label/  the classifier can't find the label and I get an
>>>>>> exception,
>>>>>> 
>>>>>> but a new dataset will obviously be unlabeled because I need to
>>>>>> classify it, so I don't know what to put in the key of the sequence file.
>>>>>> 
>>>>>> I've searched online for examples, but the only ones I've found use
>>>>>> the split command on the original dataset and then test on part of
>>>>>> it, which isn't my case.
>>>>>> 
>>>>>> 
>>>>>> Any idea for developing a better sentiment analysis is welcome; thanks
>>>>>> in advance for the help.
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 
> -- 
> Thanks and Regards,
> Vaibhav Srivastava
> Email-id: vaibhavcse30@gmail.com
> Mobile no.: 9552543029

