mahout-user mailing list archives

From Andrew Palumbo <ap....@outlook.com>
Subject RE: Mahout Naive Bayes CSV Classification
Date Mon, 05 May 2014 19:51:48 GMT
Jossef,
Does your training set have any features with a zero value for all instances?
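
If it helps, here is a rough plain-Java sketch of one way to check for that. It makes no Mahout calls and assumes the CSV rows have already been parsed into a double[][] with one row per instance. The class and method names are made up for illustration and are not taken from your gist.

```java
import java.util.ArrayList;
import java.util.List;

public class ZeroFeatureCheck {

    /** Returns the indices of the columns whose value is 0 in every row. */
    static List<Integer> allZeroColumns(double[][] rows, int numFeatures) {
        boolean[] seenNonZero = new boolean[numFeatures];
        for (double[] row : rows) {
            for (int j = 0; j < numFeatures; j++) {
                if (row[j] != 0.0) {
                    seenNonZero[j] = true;
                }
            }
        }
        List<Integer> zeroColumns = new ArrayList<>();
        for (int j = 0; j < numFeatures; j++) {
            if (!seenNonZero[j]) {
                zeroColumns.add(j);
            }
        }
        return zeroColumns;
    }

    public static void main(String[] args) {
        // Toy data (hypothetical): column 1 is zero in every row.
        double[][] rows = {
            {1.0, 0.0, 3.0},
            {0.0, 0.0, 2.0}
        };
        System.out.println(allZeroColumns(rows, 3)); // prints [1]
    }
}
```

Running it on the parsed training rows with numFeatures = 41 would list any column that never takes a non-zero value.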

> Date: Mon, 5 May 2014 08:33:37 +0300
> Subject: RE: Mahout Naive Bayes CSV Classification
> From: jossef12@gmail.com
> To: user@mahout.apache.org
> 
> a link to a github gist with my java code and a small sample from the CSV
> i'm using can be found here:
> https://gist.github.com/Jossef/e6c8fc0c31f0c2bf036a
> On May 5, 2014 5:53 AM, "Andrew Palumbo" <ap.dev@outlook.com> wrote:
> 
> > Hi Jossef,
> >
> > I can answer your first two questions for you:
> >
> > > 1) Are these predicted values normal?
> >
> > Yes, negative scores are normal.
> >
> > > 2) For now, i'm assuming that the max value 'wins'. is that correct?
> >
> > That is correct. NaiveBayes uses a winner-takes-all approach to class
> > assignment based on the max score across all classes, i.e.:
> >
> > > {0:-2119.616101368751,1:-2536.217343666528}
> >
> > will be classified as 0.
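
The max-score rule quoted above can be sketched in plain Java as follows. This assumes the per-class scores returned by classifyFull have been copied into a double[] indexed by class. argMax is an illustrative helper, not a Mahout method, and the two values are the first pair of scores from the original post.

```java
public class MaxScore {

    /** Returns the index of the largest score; ties go to the lowest index. */
    static int argMax(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // scores[0] belongs to class 0, scores[1] to class 1.
        double[] scores = {-2119.616101368751, -2536.217343666528};
        System.out.println(argMax(scores)); // prints 0
    }
}
```

Since both scores are negative, the one closer to zero is the larger, so class 0 wins here.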
> >
> > > 3) When i call 'naiveBayesModel.numFeatures()' (line 96 in MahoutTest.java)
> > > it returns 40 instead of 41 features. Why is that?
> >
> > This seems odd.  Is it possible that something is getting dropped in your
> > vectorization process?
> >
> > Could you give a little more information on how you're using this?  Could
> > you please clarify what you're referring to re: (line 96 in
> > MahoutTest.java)?
> >
> > Thanks,
> >
> > Andy
> >
> > > From: jossef12@gmail.com
> > > Date: Sun, 4 May 2014 23:16:48 +0300
> > > Subject: Re: Fwd: Mahout Naive Bayes CSV Classification
> > > To: user@mahout.apache.org; ssc@apache.org
> > >
> > > Hey Sebastian,
> > >
> > > Thanks for your reply.
> > >
> > > a link to a github gist with my java code and a small sample from the CSV
> > > i'm using can be found here:
> > > https://gist.github.com/Jossef/e6c8fc0c31f0c2bf036a
> > >
> > >
> > >
> > > I wrote code to convert the csv data (41 features + class name) to a
> > > RandomAccessSparseVector and appending it into a sequence file
> > >
> > > I successfully managed to create a model from the sequence file and to
> > > run the NaiveBayes classifier with data.
> > >
> > >
> > > My problem is that i get negative results when i call
> > > 'classifier.classifyFull'
> > >
> > > e.g. :
> > >
> > >
> > > {0:-2119.616101368751,1:-2536.217343666528}
> > > {0:-3210.7575139461096,1:-4569.913127240827}
> > > {0:-2986.049040829474,1:-3473.9551320126384}
> > > {0:-2411.582039236549,1:-3487.8547154600456}
> > > {0:-25620.824856365696,1:-31625.63011412386}
> > > {0:-4601.922062356241,1:-5019.98413435188}
> > > {0:-4331.835315861215,1:-4718.881475757016}
> > > {0:-3568.9589306062785,1:-4132.310969149298}
> > > ...
> > > ...
> > >
> > >
> > >
> > >
> > > 1) Are these predicted values normal?
> > > 2) For now, i'm assuming that the max value 'wins'. is that correct?
> > > 3) When i call 'naiveBayesModel.numFeatures()' (line 96 in MahoutTest.java)
> > > it returns 40 instead of 41 features. Why is that?
> > >
> > >
> > > Thanks :)
> > >
> > >
> > >
> > >
> > >
> > > On Sun, May 4, 2014 at 2:25 PM, Sebastian Schelter <ssc@apache.org> wrote:
> > >
> > > > Hi Jossef,
> > > >
> > > > You have to vectorize and normalize your data. The input for naive bayes
> > > > is a sequencefile containing a Text object as key (your label) and a
> > > > VectorWritable that holds a vector with the data.
> > > >
> > > > Instructions to run NaiveBayes can be found here:
> > > >
> > > > https://mahout.apache.org/users/classification/bayesian.html
> > > >
> > > > --sebastian
> > > >
> > > >
> > > >
> > > > On 05/03/2014 07:40 PM, Jossef Harush wrote:
> > > >
> > > >> I have these 2 CSV files:
> > > >>
> > > >>     1. train-set.csv
> > > >>     2. test-set.csv
> > > >>
> > > >>
> > > >> Both of them are in the same structure (with different content) and
> > > >> similar
> > > >> to this example (http://i.stack.imgur.com/jsckr.png) :
> > > >>
> > > > >> Each column is a feature and the last column - class, is the name of the
> > > > >> class to predict.
> > > >>
> > > >> .
> > > >>
> > > >> *Can anyone please provide a sample code for:*
> > > >>
> > > > >>     1. Initializing Naive Bayes with a CSV file (model creation, training,
> > > > >>     required pre-processing, etc...)
> > > >>     2. For a given CSV row - predicting a class
> > > >>
> > > >>
> > > >> Thanks!
> > > >>
> > > >> .
> > > >>
> > > >> .
> > > >>
> > > >> BTW -
> > > >>
> > > > >> I'm using Mahout 0.9 and Hadoop 2.4 and I've already tried to follow these
> > > > >> links:
> > > >>
> > > >> http://web.archiveorange.com/archive/v/y0uRZw9Q4iHdjrm4Rfsu
> > > > >> http://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/
> > > >>
> > > >> .
> > > >>
> > > >>
> > > >
> > >
> > >
> > > --
> > > Sincerely,
> > >
> > > Jossef Harush.
> > > jossef.com <http://www.jossef.com>
 		 	   		  