mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Using naive bayes classification with continuous, categorical and word-like features
Date Mon, 04 Jul 2011 18:28:52 GMT
The Mahout implementation of Naive Bayes does not handle continuous variables
well.  The best bet is to discretize these variables, either individually or
together using k-means, and then use the discrete version for the classifier.
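
As a rough illustration of the two discretization options above, here is a
minimal Python sketch, not Mahout code. The rows are made-up toy values
(height, weight, foot size), and the helper names (quantile_edges, to_bin,
kmeans, nearest) and the "height_bin=N" token format are assumptions for
illustration only.

```python
import random

def quantile_edges(values, n_bins):
    """Cut points that split values into roughly equal-sized bins."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * i) // n_bins] for i in range(1, n_bins)]

def to_bin(value, edges):
    """Index of the bin that value falls into, from 0 to len(edges)."""
    return sum(value >= e for e in edges)

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])))

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns the final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

# Hypothetical toy rows: (height, weight, foot size).
rows = [(6.0, 180.0, 12.0), (5.92, 190.0, 11.0), (5.58, 170.0, 12.0),
        (5.0, 100.0, 6.0), (5.5, 150.0, 8.0), (5.42, 130.0, 7.0)]

# Option 1: bin each variable on its own (here: height into 3 bins).
height_edges = quantile_edges([r[0] for r in rows], 3)
height_tokens = ["height_bin=%d" % to_bin(r[0], height_edges) for r in rows]

# Option 2: bin all variables together; the cluster id is the discrete feature.
# In practice the columns should be scaled first so that no one column dominates.
centroids = kmeans(rows, k=2)
cluster_tokens = ["cluster=%d" % nearest(r, centroids) for r in rows]
```

The resulting tokens can then be treated like any other categorical or
word-like feature when building the classifier input.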

The random forest implementation and the SGD implementation are both happier
with continuous variables.
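
To show why an SGD-trained model can take continuous values as they are, here
is a minimal Python sketch of logistic regression trained by stochastic
gradient descent. It is not Mahout's SGD API; the function names, the learning
rate and the standardized toy rows are all assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, row):
    """Probability of class 1 for one row of continuous features."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))

def train_sgd(rows, labels, epochs=200, rate=0.1, seed=0):
    """Logistic regression via SGD: one weight update per training row."""
    rng = random.Random(seed)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            err = labels[i] - predict(weights, bias, rows[i])
            bias += rate * err
            # Each continuous value scales its own weight update directly,
            # so no binning step is needed.
            weights = [w + rate * err * x for w, x in zip(weights, rows[i])]
    return weights, bias

# Hypothetical, already standardized continuous features with 0/1 labels.
rows = [(-1.0, -1.2), (-0.8, -0.9), (1.0, 1.1), (0.9, 0.8)]
labels = [0, 0, 1, 1]
weights, bias = train_sgd(rows, labels)
```

Categorical and word-like features would still need an encoding step (for
example, one indicator column per value) before they can sit in the same
feature vector as the continuous ones.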

On Mon, Jul 4, 2011 at 8:01 AM, Vijay Santhanam
<vijay.santhanam@gmail.com> wrote:

> Hi,
>
> I'm new to Mahout and to many machine learning ideas, but from my reading
> of the Wikipedia entry on the Naive Bayes classifier, it should be possible
> to train a Naive Bayes model with continuous, categorical and word-like
> features:
> http://en.wikipedia.org/wiki/Naive_Bayes_classifier
>
> As far as I can tell, the 20news and wikipedia examples currently in Mahout
> only use a categorical target variable and text-like variables.
>
> I'm trying to replicate the person-gender-guesser used in the Wikipedia
> article using Mahout.
>
> Can anyone give me any tips about how to:
> * format input files (train and test) for different data types
> * inform the trainer and classifier which features are continuous,
> categorical and word-like
>
> My dataset is quite small, so I'd like to be able to process this in code
> (using Vectors, Models, etc), but I'm quite confused about how to use the
> classifier.bayes packages to train/create a model with all my feature types.
>
> Thanks in advance for any guidance.
>
> Cheers,
> --
>  Vijay Santhanam
>  Software Engineer
>  http://au.linkedin.com/in/vijaysanthanam
>  0407525087
>
