Subject: Re: Naive Bayes Classifier Sentiment Analysis
From: luca filipponi <luca.filipponi89@gmail.com>
Date: Thu, 11 Sep 2014 11:16:14 +0200
To: Mailing list mahout <user@mahout.apache.org>

I've finally implemented a Naive Bayes classifier for sentiment analysis
and it works quite well, but I have a few questions.

The training phase creates a .bin file that is the model of the
classifier. I've tried to read it, but I can't.

What does the .bin file represent?

I'm asking because I'd like to understand better how the classifier
works. Where can I read about its implementation?

Thanks in advance; your help was irreplaceable in creating my classifier.


On 29 Jul 2014, at 18:40, vaibhav srivastava <vaibhavcse30@gmail.com> wrote:

> Hi Filipponi,
> In this case testnb will not work, because the last part of its code needs
> the label to print the confusion matrix.
>
> If you want to use your model to predict the possible outcomes,
> you have to write that yourself, starting from the class
> "TestNaiveBayesDriver.java".
>=20
> and comment out this section:
> /*if (bestIdx != Integer.MIN_VALUE) {
>        ClassifierResult classifierResult = new ClassifierResult(labelMap.get(bestIdx), bestScore);
>        analyzer.addInstance(pair.getFirst().toString(), classifierResult);
>      }
> */
> In that case the output file of BayesTestMapper is going to store the values
> for you; if you use seqdumper you can get the values for the key
> "471685156584292353".
> Or suppose
>
> Key: /471685156584292353/ Value:/471685156584292353/:{1:0.19424138174284086,24:0.19424138174284086,25:0.1810660431557166,44:0.19424138174284086,78:0.19424138174284086
>    NaiveBayesModel model = NaiveBayesModel.materialize(output, conf); // output path of the model
>    classifier = new ComplementaryNaiveBayesClassifier(model);
>    classifier.classifyFull(vector); // this returns a vector of
> probabilities in 1 of n-1 encoding for your label. The input will be the vector
> {1:0.19424138174284086,24:0.19424138174284086,25:0.1810660431557166,44:0.19424138174284086,78:0.19424138174284086
> }
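
A minimal, self-contained sketch of the step that would follow `classifyFull`: picking the label with the highest score, in the spirit of the `bestIdx`/`bestScore` check mentioned above. It deliberately avoids the Mahout API. The class name `BestLabelSketch`, the plain `double[]` standing in for the returned vector, and the example scores are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: plain arrays stand in for Mahout's Vector and label index. */
class BestLabelSketch {

    /** Returns the index of the highest score, or -1 if the array is empty. */
    static int bestIndex(double[] scores) {
        int bestIdx = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < scores.length; i++) {
            // Take the first element unconditionally, then only strictly better ones.
            if (bestIdx < 0 || scores[i] > bestScore) {
                bestScore = scores[i];
                bestIdx = i;
            }
        }
        return bestIdx;
    }

    public static void main(String[] args) {
        // Label names as they appear in this thread; the order is an assumption.
        List<String> labels = Arrays.asList("negativo", "neutrale", "positivo");
        // Hypothetical scores, one per label; not real classifier output.
        double[] scores = {-12.4, -9.8, -15.1};
        System.out.println(labels.get(bestIndex(scores)));
    }
}
```

In a real setup the index-to-label mapping would come from the label index written during training rather than from a hard-coded list.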
> Thanks
> Vaibhav.
>
> On Tue, Jul 29, 2014 at 9:06 PM, Luca Filipponi <luca.filipponi89@gmail.com> wrote:
>
>> I appreciate your help, but due to my lack of knowledge I didn't understand.
>>
>> I'll try to explain my problem better :D
>>
>> What I've done is create a sequence file starting from a CSV like this
>> (these are Italian tweets :D):
>>
>> negativo,471685156584292353, @beppe_grillo intanto .. Piangi tu ... Per adesso io rido !!!!!
>>
>> positivo,471685170698149888,RT @carlucci_cc: @valy_s renzie si preoccupa di chi gli garantisce voti...ma stanno scoprendo il prezzo di quei fottutissimi #80euro dagli ...
>>
>> neutrale,471685174426886144,Di #elezioni, di venditori di fumo e di altre schifezze... http://t.co/euFbtP7hQ1 ... #Europee2014 via
>>
>> So I create a sequence file in this way:
>>
>> // limit 3 keeps any commas inside the tweet text in the last token
>> String[] tokens = line.split(",", 3);
>> String label = tokens[0];
>> String id = tokens[1];
>> String message = tokens[2];
>> key.set("/" + label + "/" + id);
>> value.set(message);
>> writer.append(key, value);
>>
>> So I'm creating a sequence file of the form <Text,Text>, where the key is
>> composed as "/label/documentID" and the value contains the
>> original text of the document.
>>
>> After this step I create TF-IDF vectors using the Mahout utilities, so I
>> have a sequence file <Text,VectorWritable> like this:
>>
>> Key: /negativo/468437278663409666
>> Value:/negativo/468437278663409666:{143:0.2884088933275849,233:0.2884088933275849,241:0.2772479861583959,309:0.22061363650715415}
>>
>> Then I run this command on the newly created vectors:
>>
>> ./mahout trainnb -i tfidf-vectors -el -li labelindex -o model -ow -c
>>
>> And then:
>>
>> ./mahout testnb -i tfidf-vector -m model -l labelindex -ow -o trainingVectorTest-result -c
>>
>> and this is the output:
>>
>> 14/07/25 15:44:04 INFO test.TestNaiveBayesDriver: Complementary Results:
>> =======================================================
>> Summary
>> -------------------------------------------------------
>> Correctly Classified Instances          :        112    99,115%
>> Incorrectly Classified Instances        :          1    0,885%
>> Total Classified Instances              :        113
>>
>> =======================================================
>> Confusion Matrix
>> -------------------------------------------------------
>> a    b    c    <--Classified as
>> 47   0    0     |  47    a     = negativo
>> 0    41   0     |  41    b     = neutrale
>> 0    1    24    |  25    c     = positivo
>>
>> =======================================================
>> Statistics
>> -------------------------------------------------------
>> Kappa                                       0,9361
>> Accuracy                                    99,115%
>> Reliability                                     74%
>> Reliability (standard deviation)            0,4937
>>
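
The summary figures can be re-derived from the confusion matrix above: the correctly classified instances are the diagonal entries, and the accuracy is their share of all instances. A minimal sketch, assuming rows are the true labels and columns the predicted ones; the class and method names are made up for illustration and are not part of Mahout.

```java
/** Illustrative only: recomputes the summary figures from the confusion matrix. */
class ConfusionMatrixSketch {

    /** Sum of the diagonal: instances whose predicted label equals the true label. */
    static int correct(int[][] m) {
        int sum = 0;
        for (int i = 0; i < m.length; i++) {
            sum += m[i][i];
        }
        return sum;
    }

    /** Sum of all cells: every classified instance. */
    static int total(int[][] m) {
        int sum = 0;
        for (int[] row : m) {
            for (int cell : row) {
                sum += cell;
            }
        }
        return sum;
    }

    /** Accuracy as a percentage of all classified instances. */
    static double accuracyPercent(int[][] m) {
        return 100.0 * correct(m) / total(m);
    }

    public static void main(String[] args) {
        int[][] m = {
            {47, 0, 0},   // negativo
            {0, 41, 0},   // neutrale
            {0, 1, 24},   // positivo
        };
        System.out.println(correct(m) + " of " + total(m) + " correct, accuracy "
                + accuracyPercent(m) + "%");
    }
}
```

For the matrix in this thread this gives 112 correct out of 113, matching the reported 99,115% accuracy.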
>> What I want to do now is use the classifier on a new, unlabeled dataset,
>> so I have a CSV like this:
>>
>> 471685156584292353,@beppe_grillo intanto .. Piangi tu ... Per adesso io rido !!!!!
>>
>> So I wrote a sequence file with:
>>
>> key = /documentid/ value = Content of the document
>>
>> and then used the Mahout utilities to create a TF-IDF vector:
>>
>> Key: /471685156584292353/
>> Value:/471685156584292353/:{1:0.19424138174284086,24:0.19424138174284086,25:0.1810660431557166,44:0.19424138174284086,78:0.19424138174284086
>> ...
>>
>> But when I use the testnb command on this new dataset I get this exception:
>>
>> java.lang.IllegalArgumentException: Label not found: 471685156584292353
>>
>> I know this happens because the document ID is interpreted as the
>> label, but I don't know how to resolve it. It would be great if you could
>> provide a similar example, because I can't find anything similar.
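
A minimal sketch of why the exception appears, assuming the test code takes the second element of the key split on "/" as the expected label (as the `SLASH.split(key.toString())[1]` line quoted further down in this thread suggests). The class `KeyLabelSketch` and its method are hypothetical and only mimic that split with `java.util.regex.Pattern`.

```java
import java.util.regex.Pattern;

/** Illustrative only: mimics extracting the expected label from a sequence-file key. */
class KeyLabelSketch {

    private static final Pattern SLASH = Pattern.compile("/");

    /** Returns the segment after the first "/", which is what gets treated as the label. */
    static String expectedLabel(String key) {
        return SLASH.split(key)[1];
    }

    public static void main(String[] args) {
        // Labeled key from the training set: the label segment comes first.
        System.out.println(expectedLabel("/negativo/468437278663409666")); // prints negativo

        // Unlabeled key: the document ID lands where the label is expected.
        System.out.println(expectedLabel("/471685156584292353/")); // prints 471685156584292353
    }
}
```

The second result is the document ID, which is not in the label index, consistent with the "Label not found" message.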
>>
>> Thank you so much in advance, your help is really appreciated.
>>
>> Luca Filipponi.
>>
>>
>> On 29 Jul 2014, at 16:43, vaibhav srivastava <vaibhavcse30@gmail.com> wrote:
>>
>>> Hi,
>>> The sequence file format will be Text and VectorWritable.
>>> Suppose you have test documents named 1, 2, 3, 4.
>>> Then you can have the sequence file as Key: /test/1 Value: <vectors1>,
>>> Key: /test/2 Value: <vectors2>
>>>
>>> See this line in BayesTestMapper:
>>> //the key is the expected value
>>>
>>>   context.write(new Text(SLASH.split(key.toString())[1]), new VectorWritable(result));
>>>
>>> TestNaiveBayesDriver.java might also help you. If you remove this part
>>> from the code, you will not get the confusion matrix and the initial
>>> labels are not required.
>>>
>>> if (bestIdx != Integer.MIN_VALUE) {
>>>     ClassifierResult classifierResult = new ClassifierResult(labelMap.get(bestIdx), bestScore);
>>>     analyzer.addInstance(pair.getFirst().toString(), classifierResult);
>>> }
>>>
>>> Your output file will contain your document name (say 1) and the label
>>> vector with its values.
>>>
>>> Hope this helps.
>>>
>>> Thanks,
>>> Vaibhav
>>> vaibhavcse30@gmail.com
>>>
>>> On Tue, Jul 29, 2014 at 7:16 PM, Luca Filipponi <luca.filipponi89@gmail.com> wrote:
>>>
>>>> I am using Mahout 0.9; which part of the source code should I look at?
>>>>
>>>> My problem is that I don't know how the sequence file without the label
>>>> should be structured.
>>>>
>>>> Do you have any hints?
>>>>
>>>> On 29 Jul 2014, at 15:24, vaibhav srivastava <vaibhavcse30@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> If you want to create a test set and you do not want to measure accuracy,
>>>>> you can create an instance of the classifier, load your model into
>>>>> it, and then find the best score.
>>>>> Look at the Naive Bayes test code.
>>>>> Hope this helps. Thanks.
>>>>> On 29 Jul 2014 12:53, "Luca Filipponi" <luca.filipponi89@gmail.com> wrote:
>>>>>
>>>>>> Hi, I am trying to develop sentiment analysis on Italian tweets from
>>>>>> Twitter using the Naive Bayes classifier, but I'm having some trouble.
>>>>>>
>>>>>> My idea was to label a lot of tweets as positive, negative or neutral,
>>>>>> and use them as the training set for the classifier. To do that I wrote
>>>>>> a sequence file in the format <Text,Text>, where the key is
>>>>>> /label/tweetID and the value is the text; the text of the whole
>>>>>> dataset is then converted into TF-IDF vectors using the Mahout utilities.
>>>>>>
>>>>>> Then I use the commands
>>>>>>
>>>>>> ./mahout trainnb and ./mahout testnb to check the classifier, and the
>>>>>> score is as expected (I get nearly 100% because the test set is the
>>>>>> same as the training set).
>>>>>>
>>>>>> My question is: if I want to use a test set that is unlabeled, how
>>>>>> should it be created? If the key format isn't like
>>>>>>
>>>>>> key = /label/
>>>>>>
>>>>>> the classifier can't find the label and I get an exception,
>>>>>> but a new dataset will obviously be unlabeled, because I need to
>>>>>> classify it, so I don't know what to put in the key of the sequence file.
>>>>>>
>>>>>> I've searched online for examples, but the only ones I've found
>>>>>> use the split command on the original dataset and then test on part
>>>>>> of it, which isn't my case.
>>>>>>
>>>>>> Any idea for developing a better sentiment analysis is welcome; thanks
>>>>>> in advance for the help.
>>>>>>
>>>
>>> --
>>> Thanks and Regards,
>>> Vaibhav Srivastava
>>> Email-id: vaibhavcse30@gmail.com
>>> Mobile no.: 9552543029
>>
>
> --
> Thanks and Regards,
> Vaibhav Srivastava
> Email-id: vaibhavcse30@gmail.com
> Mobile no.: 9552543029