mahout-user mailing list archives

From Vijay Santhanam <vijay.santha...@gmail.com>
Subject Re: Using naive bayes classification with continuous, categorical and word-like features
Date Tue, 05 Jul 2011 09:33:55 GMT
Hi Ted,

I've uploaded my code to https://gist.github.com/1064551

I bought Mahout in Action and am using your ContinuousValueEncoder and other
misc classes, but as you can see I've hardcoded most of the training data.

Yes, there are very few training samples, but from what I understand, I can
run repeated training passes over the same data to "strengthen" the model.
But this isn't working out for me.
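
To make the repeated-pass idea concrete, here is a minimal sketch in plain Java. It does not use Mahout's OnlineLogisticRegression; it is a hand-rolled logistic regression trained by SGD. The class name, the learning rate, the number of passes and the sample rows are all assumptions chosen for illustration, and none of them come from the gist. The sketch also standardizes each feature first, a common preprocessing step when features such as height and weight have very different ranges; whether that is the cause of the problem described here is not established.

```java
class SgdLogisticSketch {
    // Illustrative rows only (height in ft, weight in lb, shoe size).
    // These values are assumptions, not data from this thread.
    static final double[][] X = {
        {6.00, 180, 12}, {5.92, 190, 11}, {5.58, 170, 12}, {5.92, 165, 10},
        {5.00, 100, 6}, {5.50, 150, 8}, {5.42, 130, 7}, {5.75, 150, 9}
    };
    static final int[] Y = {1, 1, 1, 1, 0, 0, 0, 0}; // 1 = male, 0 = female

    final double[] mean;
    final double[] std;
    final double[] w;
    double bias = 0.0;

    SgdLogisticSketch(double[][] x, int[] y, int passes, double rate) {
        int d = x[0].length;
        mean = new double[d];
        std = new double[d];
        w = new double[d];
        // Per-feature mean and standard deviation (assumes no constant feature).
        for (double[] row : x)
            for (int j = 0; j < d; j++) mean[j] += row[j] / x.length;
        for (double[] row : x)
            for (int j = 0; j < d; j++)
                std[j] += (row[j] - mean[j]) * (row[j] - mean[j]) / x.length;
        for (int j = 0; j < d; j++) std[j] = Math.sqrt(std[j]);
        // Repeated passes over the same small data set, one update per row.
        for (int p = 0; p < passes; p++) {
            for (int i = 0; i < x.length; i++) {
                double[] z = scale(x[i]);
                double err = y[i] - sigmoid(score(z));
                for (int j = 0; j < d; j++) w[j] += rate * err * z[j];
                bias += rate * err;
            }
        }
    }

    // Standardize one row with the statistics learned from the training data.
    double[] scale(double[] row) {
        double[] z = new double[row.length];
        for (int j = 0; j < row.length; j++) z[j] = (row[j] - mean[j]) / std[j];
        return z;
    }

    double score(double[] z) {
        double s = bias;
        for (int j = 0; j < z.length; j++) s += w[j] * z[j];
        return s;
    }

    static double sigmoid(double t) {
        return 1.0 / (1.0 + Math.exp(-t));
    }

    // Estimated probability that a raw (unscaled) row belongs to class 1.
    double probMale(double[] row) {
        return sigmoid(score(scale(row)));
    }

    public static void main(String[] args) {
        SgdLogisticSketch m = new SgdLogisticSketch(X, Y, 1000, 0.1);
        for (int i = 0; i < X.length; i++)
            System.out.printf("label=%d p(male)=%.3f%n", Y[i], m.probMale(X[i]));
    }
}
```

With so few rows a model like this can only memorize the training set, so it says nothing about how well it would generalize.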

Thanks for taking a look.

Cheers,
V

On Tue, Jul 5, 2011 at 6:06 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> How many training examples do you have?
>
> Sounds like you have very few.  That is definitely not the sweet spot for
> on-line regression.
>
> In any case, can you post your test code to github or something?
>
> On Mon, Jul 4, 2011 at 11:46 AM, Vijay Santhanam
> <vijay.santhanam@gmail.com> wrote:
>
> > Thank you Ted
> >
> > However, even with the default OnlineLogisticRegression I'm unable to
> > get acceptable results when trying to replicate the gender-guesser
> > discussed in the example at
> > http://en.wikipedia.org/wiki/Naive_Bayes_classifier
> >
> > For that particular problem, do you recommend I take a
> > binning/discretization approach with naive Bayes? Or continue trying to
> > fine-tune the SGD algorithm?
> >
> > At this stage, I'm just hopelessly guessing parameters for
> > OnlineLogisticRegression. Even when I iterate over the same data set
> > many thousands of times, I'm unable to get a suitable model that can
> > pick a female or male from a height, weight and shoe size.
> >
> > Thanks again for taking the time to answer me.
> >
> > -V
> >
> >
> > On Tue, Jul 5, 2011 at 4:30 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >
> > > The wikipedia page recommends binning if you have a large amount of
> > > data and a supervised variable extraction method if not.  These are
> > > both ways of preprocessing to discretize continuous variables.
> > >
> > > On Mon, Jul 4, 2011 at 11:28 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> > >
> > > > The mahout implementation of Naive Bayes does not use continuous
> > > > variables well.  The best bet is to discretize these variables either
> > > > individually or together using k-means.  Then use the discrete version
> > > > for the classifier.
> > > >
> > > > The random forest implementation and the SGD implementation are both
> > > > happier with continuous variables.
> > > >
> > > >
> > > > On Mon, Jul 4, 2011 at 8:01 AM, Vijay Santhanam <vijay.santhanam@gmail.com> wrote:
> > > >
> > > >> Hi,
> > > >>
> > > > >> I'm new to Mahout and to many machine learning ideas, but from my
> > > > >> reading of the wikipedia entry on the Naive Bayes classifier, it's
> > > > >> possible to train a Naive Bayes model with continuous, categorical
> > > > >> and word-like features:
> > > > >> http://en.wikipedia.org/wiki/Naive_Bayes_classifier
> > > > >>
> > > > >> From what I gather, the 20news and wikipedia examples currently in
> > > > >> mahout only use a categorical target variable and text-like variables.
> > > > >>
> > > > >> I'm trying to replicate the person-gender-guesser used in the
> > > > >> wikipedia article using mahout.
> > > >>
> > > >> Can anyone give me any tips about how to:
> > > >> * format input files (train and test) for different data types
> > > >> * inform the trainer and classifier which features are continuous,
> > > >> categorical and word-like
> > > >>
> > > > >> My dataset is quite small, so I'd like to be able to process this in
> > > > >> code (using Vectors, Models, etc), but I'm quite confused about how to
> > > > >> use the classifier.bayes packages to train/create a model with all my
> > > > >> feature types.
> > > >>
> > > >> Thanks in advance for any guidance.
> > > >>
> > > >> Cheers,
> > > >> --
> > > >>  Vijay Santhanam
> > > >>  Software Engineer
> > > >>  http://au.linkedin.com/in/vijaysanthanam
> > > >>  0407525087
> > > >>
> > > >
> > > >
> > >
> >
> >
>
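
The discretization advice quoted above (bin each continuous variable, then hand the discrete version to the classifier) can be sketched roughly as follows. This is plain Java with no Mahout classes. The equal-width scheme, the class and method names, the value range and the bin count are all assumptions for illustration; the thread also mentions k-means and supervised methods, which this sketch does not cover.

```java
class BinningSketch {
    // Maps a value to a bin index in [0, bins - 1] using equal-width bins
    // over [min, max]. Values outside the range are clamped to the end bins.
    static int binIndex(double value, double min, double max, int bins) {
        if (max <= min || bins < 1) {
            throw new IllegalArgumentException("need max > min and bins >= 1");
        }
        int idx = (int) Math.floor((value - min) / (max - min) * bins);
        return Math.max(0, Math.min(bins - 1, idx));
    }

    // Turns a continuous feature into a word-like token such as "weight_bin1",
    // which a text-oriented classifier could then treat like any other term.
    static String token(String name, double value, double min, double max, int bins) {
        return name + "_bin" + binIndex(value, min, max, bins);
    }

    public static void main(String[] args) {
        // Hypothetical range (100 to 200) and bin count (4).
        System.out.println(token("weight", 130, 100, 200, 4)); // prints weight_bin1
    }
}
```

The bin edges would normally be chosen from the training data, and the same edges then have to be reused when classifying new rows.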



-- 
 Vijay Santhanam
 Software Engineer
 http://au.linkedin.com/in/vijaysanthanam
 0407525087
