mahout-user mailing list archives

From "Lancaster, Robert (Orbitz)" <ROBERT.LANCAS...@orbitz.com>
Subject RE: NaiveBayes and Classification of non-documents
Date Thu, 02 Jun 2011 15:28:15 GMT
I can easily turn the numerics into a collection of Booleans.  However, I'm unclear what the
input file should look like for NB in such a case.  Does anyone have an example of how such
a dataset would look?

I was able to get SGD to work for a small number of records (about 800k), which took about
40 minutes.  I'm afraid of how long that would take with the full 80 million.  It appears
that SGD runs entirely locally.  Although I knew that it is a sequential algorithm, I was expecting
it to utilize the cluster for parallelizing cross-validation, etc.


-----Original Message-----
From: Robin Anil [mailto:robin.anil@gmail.com] 
Sent: Thursday, June 02, 2011 10:14 AM
To: user@mahout.apache.org
Subject: Re: NaiveBayes and Classification of non-documents

The NB implementation doesn't handle numeric values very well. If you convert
your data to boolean features, you can construct a document out of it and use
it on NB.
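
As a rough illustration of the "construct a document" idea, here is a minimal
Python sketch that bins numeric features and emits one token per feature, so
each record becomes a short pseudo-document. The bin edges, the token naming
scheme, and the helper names (BIN_EDGES, to_tokens, to_line) are all
hypothetical, and the "label, tab, space-separated tokens" line layout is an
assumption about what the NB trainer expects; check it against the Mahout
version in use.

```python
from bisect import bisect_right

# Hypothetical bin edges for the numeric features; real edges would be
# chosen from the data (e.g. quantiles).
BIN_EDGES = {
    "featurea": [10.0, 100.0],
    "featured": [0.5],
}


def to_tokens(record):
    """Turn one record (feature name -> value) into Boolean-style tokens."""
    tokens = []
    for name, value in record.items():
        if name in BIN_EDGES:
            # Numeric feature: replace the value with the index of its bin.
            bin_index = bisect_right(BIN_EDGES[name], float(value))
            tokens.append(f"{name}_bin{bin_index}")
        else:
            # Nominal feature: one token per (feature, value) pair.
            tokens.append(f"{name}_{value}")
    return tokens


def to_line(label, record):
    # Assumed layout: label, a tab, then space-separated tokens.
    return f"{label}\t{' '.join(to_tokens(record))}"


if __name__ == "__main__":
    example = {"featurea": 42.0, "featureb": 3, "featurec": 7, "featured": 0.1}
    print(to_line(1, example))
```

Each token is either present or absent in a record, which is what makes the
features effectively Boolean from the classifier's point of view.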

A better way would be to use the Weka formatter to convert to vectors and use
the SGD classifier in Mahout. You will be pleasantly surprised by its
accuracy and speed.

Robin


On Thu, Jun 2, 2011 at 8:18 PM, Lancaster, Robert (Orbitz) <
ROBERT.LANCASTER@orbitz.com> wrote:

> I'm looking at the Mahout implementation of NaiveBayes for a classification
> task, but the language around the Mahout implementation appears to be
> document-centric.  Is it possible to use the Mahout implementation of NB for
> a classification task that doesn't involve documents?
>
> I have about 80 million records with a small number of features.  The arff
> header looks like (the numeric features could easily be nominalized if need
> be):
>
> @RELATION        relation
> @ATTRIBUTE      featurea    NUMERIC
> @ATTRIBUTE      featureb    {1,2,3,4,5,6,7}
> @ATTRIBUTE      featurec     {1,2,3,4,5,6,7}
> @ATTRIBUTE      featured     NUMERIC
> @ATTRIBUTE      featuree        NUMERIC
> @ATTRIBUTE      featuref {0,1}
> @ATTRIBUTE      target  {0,1}
>