mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Using naive bayes classification with continuous, categorical and word-like features
Date Tue, 05 Jul 2011 08:06:13 GMT
How many training examples do you have?

Sounds like you have very few.  That is definitely not the sweet spot for
on-line regression.

In any case, can you post your test code to github or something?
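
For comparison, the Wikipedia gender example models each continuous feature with a per-class Gaussian instead of discretizing it. Below is a minimal sketch of that idea in plain Python. It does not use any Mahout API. The names `TRAIN`, `fit`, `log_gaussian`, `predict` and `MODEL` are hypothetical, and the rows are toy values in the style of the Wikipedia example, not data from this thread.

```python
import math
from collections import defaultdict

# Toy rows: (height in feet, weight in lbs, shoe size), label.
# Illustrative values only; they are not data posted in the thread.
TRAIN = [
    ((6.00, 180.0, 12.0), "male"),
    ((5.92, 190.0, 11.0), "male"),
    ((5.58, 170.0, 12.0), "male"),
    ((5.92, 165.0, 10.0), "male"),
    ((5.00, 100.0, 6.0), "female"),
    ((5.50, 150.0, 8.0), "female"),
    ((5.42, 130.0, 7.0), "female"),
    ((5.75, 150.0, 9.0), "female"),
]


def fit(rows):
    """Estimate a log prior and per-feature (mean, variance) for each label."""
    by_label = defaultdict(list)
    for features, label in rows:
        by_label[label].append(features)
    model = {}
    for label, samples in by_label.items():
        n = len(samples)  # assumes at least two samples per label
        stats = []
        for column in zip(*samples):
            mean = sum(column) / n
            var = sum((x - mean) ** 2 for x in column) / (n - 1)  # sample variance
            stats.append((mean, var))
        model[label] = (math.log(n / len(rows)), stats)
    return model


def log_gaussian(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict(model, features):
    """Return the label with the highest log prior plus summed log likelihoods."""
    scores = {
        label: prior + sum(log_gaussian(x, m, v) for x, (m, v) in zip(features, stats))
        for label, (prior, stats) in model.items()
    }
    return max(scores, key=scores.get)


MODEL = fit(TRAIN)
```

Calling `predict(MODEL, (6.0, 130.0, 8.0))` scores one unseen person. With only a handful of rows the variance estimates are rough, so this is a sanity check rather than a usable model.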

On Mon, Jul 4, 2011 at 11:46 AM, Vijay Santhanam
<vijay.santhanam@gmail.com> wrote:

> Thank you Ted
>
> However, even with the default OnlineLogisticRegression I'm unable to get
> acceptable results when trying to replicate the gender-guesser discussed in
> the example at http://en.wikipedia.org/wiki/Naive_Bayes_classifier
>
> For that particular problem, do you recommend I take a
> binning/discretization approach with naive Bayes? Or continue trying to
> fine-tune the SGD algorithm?
>
> At this stage, I'm just hopelessly guessing parameters for
> OnlineLogisticRegression. Even when I iterate over the same data set many
> thousands of times, I'm unable to get a suitable model that can pick female
> or male from height, weight and shoe size.
>
> Thanks again for taking the time to answer me.
>
> -V
>
>
> On Tue, Jul 5, 2011 at 4:30 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> > The Wikipedia page recommends binning if you have a large amount of data
> > and a supervised variable extraction method if not.  These are both ways
> > of preprocessing to discretize continuous variables.
> >
> > On Mon, Jul 4, 2011 at 11:28 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >
> > > The Mahout implementation of Naive Bayes does not use continuous
> > > variables well.  The best bet is to discretize these variables, either
> > > individually or together using k-means.  Then use the discrete version
> > > for the classifier.
> > >
> > > The random forest implementation and the SGD implementation are both
> > > happier with continuous variables.
> > >
> > >
> > > On Mon, Jul 4, 2011 at 8:01 AM, Vijay Santhanam <vijay.santhanam@gmail.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> I'm new to Mahout and to many machine learning ideas, but from my
> > >> reading of the Wikipedia entry, it should be possible to train a Naive
> > >> Bayes model with continuous, categorical and word-like features:
> > >> http://en.wikipedia.org/wiki/Naive_Bayes_classifier
> > >>
> > >> From what I gather, the 20news and Wikipedia examples currently in
> > >> Mahout use only a categorical target variable and text-like variables.
> > >>
> > >> I'm trying to replicate the person-gender-guesser used in the Wikipedia
> > >> article using Mahout.
> > >>
> > >> Can anyone give me any tips about how to:
> > >> * format input files (train and test) for different data types
> > >> * inform the trainer and classifier which features are continuous,
> > >> categorical and word-like
> > >>
> > >> My dataset is quite small, so I'd like to be able to process this in
> > >> code (using Vectors, Models, etc.), but I'm quite confused about how to
> > >> use the classifier.bayes packages to train/create a model with all my
> > >> feature types.
> > >>
> > >> Thanks in advance for any guidance.
> > >>
> > >> Cheers,
> > >> --
> > >>  Vijay Santhanam
> > >>  Software Engineer
> > >>  http://au.linkedin.com/in/vijaysanthanam
> > >>  0407525087
> > >>
> > >
> > >
> >
>
>
>
> --
>  Vijay Santhanam
>  Software Engineer
>  http://au.linkedin.com/in/vijaysanthanam
>  0407525087
>
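
The discretization route suggested in the thread can be sketched without Mahout: bin each continuous feature, then train a categorical naive Bayes on the bin indices. The sketch below is plain Python and makes several assumptions that are not from the thread: equal-width bins, three bins per feature, add-one smoothing, and toy rows in the style of the Wikipedia example. All names (`make_binner`, `fit`, `predict`, `TRAIN`, `MODEL`) are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Toy rows: (height in feet, weight in lbs, shoe size), label.
# Illustrative values only; they are not data posted in the thread.
TRAIN = [
    ((6.00, 180.0, 12.0), "male"),
    ((5.92, 190.0, 11.0), "male"),
    ((5.58, 170.0, 12.0), "male"),
    ((5.92, 165.0, 10.0), "male"),
    ((5.00, 100.0, 6.0), "female"),
    ((5.50, 150.0, 8.0), "female"),
    ((5.42, 130.0, 7.0), "female"),
    ((5.75, 150.0, 9.0), "female"),
]


def make_binner(values, n_bins):
    """Return a function mapping a value to an equal-width bin index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def to_bin(x):
        idx = int((x - lo) / width)
        return min(max(idx, 0), n_bins - 1)  # clamp values outside the training range

    return to_bin


def fit(rows, n_bins=3):
    """Bin every feature column, then count bin occurrences per label."""
    columns = list(zip(*(features for features, _ in rows)))
    binners = [make_binner(col, n_bins) for col in columns]
    label_counts = Counter(label for _, label in rows)
    # bin_counts[label][feature_index][bin_index] -> number of training rows
    bin_counts = defaultdict(lambda: [Counter() for _ in binners])
    for features, label in rows:
        for i, x in enumerate(features):
            bin_counts[label][i][binners[i](x)] += 1
    return binners, label_counts, bin_counts, n_bins


def predict(model, features):
    """Return the label with the highest smoothed log posterior score."""
    binners, label_counts, bin_counts, n_bins = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        for i, x in enumerate(features):
            count = bin_counts[label][i][binners[i](x)]
            score += math.log((count + 1) / (n + n_bins))  # add-one smoothing
        if score > best_score:
            best_label, best_score = label, score
    return best_label


MODEL = fit(TRAIN)
```

The bin count is a tuning choice: with very few rows, more bins leave most cells empty and the smoothing term dominates. Quantile bins or k-means centroids could replace `make_binner` without changing the rest of the sketch.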
