mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Using naive bayes classification with continuous, categorical and word-like features
Date Mon, 04 Jul 2011 18:30:46 GMT
The Wikipedia page recommends binning if you have a large amount of data, and
a supervised variable extraction method if not.  Both are preprocessing steps
that discretize continuous variables.
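
As a rough illustration of the binning option, here is a minimal sketch in
plain Python (not Mahout code). The function names, the sample heights, and
the "height_bin_N" token format are all made up for the example; the idea is
only that each continuous value becomes a categorical token that a
word-oriented classifier can count.

```python
from bisect import bisect_right

def quantile_edges(values, n_bins):
    """Return n_bins - 1 cut points taken at evenly spaced ranks of the sorted data."""
    ordered = sorted(values)
    return [ordered[(i * len(ordered)) // n_bins] for i in range(1, n_bins)]

def to_bin(value, edges):
    """Map a continuous value to a bin index in 0..len(edges)."""
    return bisect_right(edges, value)

# Illustrative values only (hypothetical heights in feet).
heights = [5.0, 5.42, 5.5, 5.58, 5.75, 5.92, 5.92, 6.0]
edges = quantile_edges(heights, 4)

# Turn each height into a discrete token, e.g. "height_bin_0".
tokens = ["height_bin_%d" % to_bin(h, edges) for h in heights]
```

With tied values the bins will not be exactly equal in size, which is usually
acceptable for this purpose. The cut points should be computed on the
training data and reused unchanged for the test data.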

On Mon, Jul 4, 2011 at 11:28 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> The Mahout implementation of Naive Bayes does not handle continuous
> variables well.  The best bet is to discretize these variables, either
> individually or together, using k-means.  Then use the discrete version for
> the classifier.
>
> The random forest implementation and the SGD implementation are both
> happier with continuous variables.
>
>
> On Mon, Jul 4, 2011 at 8:01 AM, Vijay Santhanam <vijay.santhanam@gmail.com> wrote:
>
>> Hi,
>>
>> I'm new to Mahout and to many machine learning ideas, but from my reading
>> of the Wikipedia entry, it should be possible to train a Naive Bayes model
>> with continuous, categorical and word-like features:
>> http://en.wikipedia.org/wiki/Naive_Bayes_classifier
>>
>> From what I gather, the 20news and wikipedia examples currently in Mahout
>> only use a categorical target variable and text-like variables.
>>
>> I'm trying to replicate the person-gender-guesser used in the Wikipedia
>> article using Mahout.
>>
>> Can anyone give me any tips about how to:
>> * format input files (train and test) for different data types
>> * inform the trainer and classifier which features are continuous,
>> categorical and word-like
>>
>> My dataset is quite small, so I'd like to be able to process this in code
>> (using Vectors, Models, etc), but I'm quite confused about how to use the
>> classifier.bayes packages to train/create a model with all my feature types.
>>
>> Thanks in advance for any guidance.
>>
>> Cheers,
>> --
>>  Vijay Santhanam
>>  Software Engineer
>>  http://au.linkedin.com/in/vijaysanthanam
>>  0407525087
>>
>
>
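
The k-means discretization suggested in the quoted reply can be sketched for
a single variable as follows. This is plain Python rather than Mahout's own
k-means, and the function names, the sample weights, and the choice of k = 2
are assumptions made for the example. It covers the "individually" case;
clustering several variables together would use the same loop with a vector
distance.

```python
def kmeans_1d(values, k, iterations=20):
    """Plain Lloyd's algorithm on scalars; returns the sorted cluster centers."""
    ordered = sorted(values)
    # Deterministic start: pick k evenly spaced points from the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # Recompute each center as its cluster mean; keep the old center
        # if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sorted(centers)

def nearest_center(value, centers):
    """Return the index of the closest center, used as the discrete label."""
    return min(range(len(centers)), key=lambda j: abs(value - centers[j]))

# Illustrative values only (hypothetical weights).
weights = [100, 110, 115, 130, 150, 160, 170, 180]
centers = kmeans_1d(weights, 2)
labels = [nearest_center(w, centers) for w in weights]
```

The resulting labels can then be fed to the classifier as categorical
features, with the centers learned on the training set and reused for the
test set.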
