Thank you, Ted.
However, even with the default OnlineLogisticRegression settings I'm unable to
get acceptable results when trying to replicate the gender guesser discussed
in the example at http://en.wikipedia.org/wiki/Naive_Bayes_classifier
For that particular problem, do you recommend I take a
binning/discretization approach with Naive Bayes, or continue trying to
fine-tune the SGD algorithm?
At this stage, I'm just hopelessly guessing parameters
for OnlineLogisticRegression.
Even when I iterate over the same data set many thousands of times, I'm
unable to get a suitable model that can pick female or male from
height, weight and shoe size.
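
One thing that may be worth ruling out before more parameter guessing: raw
height, weight and shoe size sit on very different scales, and plain SGD is
often sensitive to that. The sketch below is plain Java with no Mahout calls.
The class GenderSgdSketch, its method names, the learning rate and the epoch
count are all made up for illustration, and the eight rows are the training
values as I read them from the Wikipedia example. It standardizes each feature
to zero mean and unit variance and then runs ordinary logistic-regression SGD.

```java
import java.util.Arrays;

public class GenderSgdSketch {
    // Rows as read from the Wikipedia example: height (ft), weight (lb), shoe size.
    static final double[][] X = {
        {6.00, 180, 12}, {5.92, 190, 11}, {5.58, 170, 12}, {5.92, 165, 10},
        {5.00, 100, 6},  {5.50, 150, 8},  {5.42, 130, 7},  {5.75, 150, 9}
    };
    // 1 = male, 0 = female
    static final int[] Y = {1, 1, 1, 1, 0, 0, 0, 0};

    static double[] mean, std; // per-feature statistics of the training set
    static double[] w;         // weights; w[0] is the bias term

    // Rescale one raw row to zero mean and unit variance per feature.
    static double[] standardize(double[] row) {
        double[] z = new double[row.length];
        for (int j = 0; j < row.length; j++) z[j] = (row[j] - mean[j]) / std[j];
        return z;
    }

    // Logistic model: sigmoid of the weighted sum of standardized features.
    static double probMale(double[] row) {
        double[] z = standardize(row);
        double s = w[0];
        for (int j = 0; j < z.length; j++) s += w[j + 1] * z[j];
        return 1.0 / (1.0 + Math.exp(-s));
    }

    static void train(int epochs, double rate) {
        int d = X[0].length;
        mean = new double[d];
        std = new double[d];
        for (int j = 0; j < d; j++) {
            for (double[] r : X) mean[j] += r[j] / X.length;
            for (double[] r : X) std[j] += Math.pow(r[j] - mean[j], 2) / X.length;
            std[j] = Math.sqrt(std[j]);
        }
        w = new double[d + 1];
        for (int e = 0; e < epochs; e++) {
            for (int i = 0; i < X.length; i++) {
                // Gradient step on the log-likelihood for a single example.
                double err = Y[i] - probMale(X[i]);
                double[] z = standardize(X[i]);
                w[0] += rate * err;
                for (int j = 0; j < d; j++) w[j + 1] += rate * err * z[j];
            }
        }
    }

    public static void main(String[] args) {
        train(2000, 0.1);
        for (int i = 0; i < X.length; i++) {
            System.out.printf("%s -> P(male)=%.3f (label %d)%n",
                Arrays.toString(X[i]), probMale(X[i]), Y[i]);
        }
    }
}
```

If something like this separates the training rows while the Mahout run does
not, the difference is more likely in feature scaling or encoding than in the
learning-rate settings.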
Thanks again for taking the time to answer me.
V
On Tue, Jul 5, 2011 at 4:30 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> The Wikipedia page recommends binning if you have a large amount of data and
> a supervised variable extraction method if not. These are both ways of
> preprocessing to discretize continuous variables.
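
For the binning route, a minimal sketch of equal-width discretization in plain
Java might look like the following. The class EqualWidthBinner and its methods
are hypothetical and not part of Mahout, and the weights are the values as I
read them from the Wikipedia example. It turns a continuous value into a bin
index that could then be passed to a classifier as a categorical feature.

```java
public class EqualWidthBinner {
    private final double min;
    private final double width;
    private final int bins;

    /** Learns the value range from the training values and splits it into equal-width bins. */
    public EqualWidthBinner(double[] values, int bins) {
        double lo = Double.POSITIVE_INFINITY;
        double hi = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            lo = Math.min(lo, v);
            hi = Math.max(hi, v);
        }
        this.min = lo;
        this.bins = bins;
        // Guard against a zero-width range when all values are identical.
        this.width = (hi > lo) ? (hi - lo) / bins : 1.0;
    }

    /** Returns a bin index in [0, bins - 1]; out-of-range values are clamped to the end bins. */
    public int bin(double v) {
        int idx = (int) Math.floor((v - min) / width);
        return Math.max(0, Math.min(bins - 1, idx));
    }

    public static void main(String[] args) {
        double[] weights = {180, 190, 170, 165, 100, 150, 130, 150};
        EqualWidthBinner binner = new EqualWidthBinner(weights, 3);
        for (double w : weights) {
            System.out.println(w + " -> weight_bin_" + binner.bin(w));
        }
    }
}
```

With only eight rows, the number of bins has to stay small, otherwise most
bins end up empty for one class or the other.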
>
> On Mon, Jul 4, 2011 at 11:28 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> > The Mahout implementation of Naive Bayes does not use continuous variables
> > well. The best bet is to discretize these variables, either individually or
> > together, using k-means. Then use the discrete version for the classifier.
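
A rough sketch of the "individually" variant, assuming plain Java rather than
Mahout's own clustering code: the class KMeans1D, its methods and the choice of
starting centroids are all hypothetical. It runs Lloyd's algorithm on a single
continuous variable and uses the index of the nearest centroid as the discrete
value.

```java
import java.util.Arrays;

public class KMeans1D {
    /** Runs Lloyd's algorithm on scalar values and returns the k centroids in ascending order. */
    static double[] fit(double[] values, int k, int iterations) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] c = new double[k];
        // Deterministic start: spread the initial centroids across the sorted values.
        for (int j = 0; j < k; j++) {
            c[j] = sorted[(int) ((j + 0.5) * sorted.length / k)];
        }
        for (int it = 0; it < iterations; it++) {
            double[] sum = new double[k];
            int[] n = new int[k];
            for (double v : values) {
                int a = assign(c, v);
                sum[a] += v;
                n[a]++;
            }
            // Move each centroid to the mean of its members; empty clusters stay put.
            for (int j = 0; j < k; j++) {
                if (n[j] > 0) c[j] = sum[j] / n[j];
            }
        }
        Arrays.sort(c);
        return c;
    }

    /** Index of the nearest centroid; this index is the discretized feature value. */
    static int assign(double[] c, double v) {
        int best = 0;
        for (int j = 1; j < c.length; j++) {
            if (Math.abs(v - c[j]) < Math.abs(v - c[best])) best = j;
        }
        return best;
    }

    public static void main(String[] args) {
        double[] weights = {180, 190, 170, 165, 100, 150, 130, 150};
        double[] centroids = fit(weights, 2, 10);
        System.out.println("centroids: " + Arrays.toString(centroids));
        for (double w : weights) {
            System.out.println(w + " -> cluster " + assign(centroids, w));
        }
    }
}
```

Clustering the variables "together" would use the same loop with vectors and a
distance function in place of the scalar difference.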
> >
> > The random forest implementation and the SGD implementation are both
> > happier with continuous variables.
> >
> >
> > On Mon, Jul 4, 2011 at 8:01 AM, Vijay Santhanam <vijay.santhanam@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> I'm new to Mahout and to many machine learning ideas, but from my reading
> >> of the Wikipedia entry, it should be possible to train a Naive Bayes model
> >> with continuous, categorical and word-like features:
> >> http://en.wikipedia.org/wiki/Naive_Bayes_classifier
> >>
> >> From what I gather, the 20news and wikipedia examples currently in Mahout
> >> only use a categorical target variable and text-like variables.
> >>
> >> I'm trying to replicate the person gender guesser used in the Wikipedia
> >> article using Mahout.
> >>
> >> Can anyone give me any tips about how to:
> >> * format input files (train and test) for different data types
> >> * inform the trainer and classifier which features are continuous,
> >>   categorical and word-like
> >>
> >> My dataset is quite small, so I'd like to be able to process this in code
> >> (using Vectors, Models, etc.), but I'm quite confused about how to use the
> >> classifier.bayes packages to train/create a model with all my feature types.
> >>
> >> Thanks in advance for any guidance.
> >>
> >> Cheers,
> >> 
> >> Vijay Santhanam
> >> Software Engineer
> >> http://au.linkedin.com/in/vijaysanthanam
> >> 0407525087
> >>
> >
> >
>

