mahout-user mailing list archives

From Zach Richardson <z...@raveldata.com>
Subject Re: SGD vs Naive Bayes for classification
Date Fri, 09 Sep 2011 17:54:12 GMT
Hi Loic,

In my experience, when dealing with smaller datasets (i.e. fewer than, say,
1000 training examples, or even fewer than 100 per category), a linear SVM
tends to perform better than Mahout's SGD.

I would recommend either RapidMiner, if you want a nice GUI and some
configurable text-import tools, or liblinear/libsvm from the command line.
 The former will let you iterate quickly on what you are trying to do
without any custom coding.  However, depending on how you want to deploy
this, you might need to stick with liblinear/libsvm for the true
"deployable" system, since the RapidMiner libraries are all AGPL
(RapidMiner uses the libsvm library internally).

You can find examples for either online.  If you are still having problems,
I would be more than happy to share a RapidMiner pipeline for processing
documents, training a classifier, etc.
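If you go the liblinear/libsvm command-line route, the first step is getting your text into their sparse "<label> <index>:<value>" file format. A minimal stdlib-Python sketch of one way to do that (the vocabulary-building scheme and the toy documents are my own invention, not part of either tool):

```python
from collections import Counter

def build_vocab(docs):
    """Assign each distinct token a 1-based feature index
    (liblinear/libsvm feature indices start at 1)."""
    vocab = {}
    for doc in docs:
        for tok in doc.lower().split():
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab

def to_libsvm_line(label, doc, vocab):
    """One line of the sparse format: '<label> <index>:<count> ...'
    with indices in ascending order."""
    counts = Counter(vocab[t] for t in doc.lower().split() if t in vocab)
    return f"{label} " + " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))

# invented toy documents and labels
docs = ["cheap flights to paris", "paris hotel deals", "gcc linker error"]
vocab = build_vocab(docs)
lines = [to_libsvm_line(y, d, vocab) for y, d in zip([1, 1, 0], docs)]
```

Write those lines to a file and you can feed it straight to liblinear's train / predict binaries.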

Zach

On Fri, Sep 9, 2011 at 12:16 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> On Fri, Sep 9, 2011 at 8:41 AM, Loic Descotte <loic.descotte@kelkoo.com> wrote:
>
> > ... My goal is to make predictions on thousands of text entries, but with
> > as little training data as possible (categories may change often, so I
> > will not always have hundreds of entries for training in each category).
> >
>
> This is very small with respect to Mahout algorithms.  There may be better
> options.  The standard choice for small text datasets like this is linear
> SVM, but SGD should work reasonably well.  Naive Bayes may not work as well
> with such a small amount of training data.  I would avoid the adaptive SGD
> and tune the training parameters by hand.
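"Tuning by hand" can be as simple as sweeping a few fixed learning rates and keeping whichever scores best on a held-out split, instead of relying on an adaptive schedule. A self-contained toy sketch of that idea (the one-feature data and the parameter grid are invented for illustration; this is not Mahout's API):

```python
import math
import random

random.seed(0)
# toy data: one feature, label is 1 when x > 0
xs = [random.uniform(-1, 1) for _ in range(200)]
ys = [1 if x > 0 else 0 for x in xs]
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

def fit(lr, epochs=30):
    """Plain SGD on logistic loss with a fixed learning rate."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(train_x, train_y):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

def accuracy(w, b):
    """Held-out accuracy of the linear decision rule."""
    hits = sum((w * x + b > 0) == (y == 1) for x, y in zip(test_x, test_y))
    return hits / len(test_y)

# the "hand tuning": try a few fixed rates, keep the best on held-out data
best_lr = max([0.01, 0.1, 0.5, 1.0], key=lambda lr: accuracy(*fit(lr)))
```

The same sweep-and-hold-out loop applies to any other training parameter (regularization, epochs) you want to pin down by hand.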
>
> > Another question: in all the examples I've found, Naive Bayes is used to
> > analyze sets containing a lot of keywords, and to classify them in the
> > right category (e.g. the Wikipedia examples:
> > https://www.ibm.com/developerworks/java/library/j-mahout/#N10412).
> >
> > The SGD examples are a little different: instead of working on word
> > sequences, they use many predictor values, and each predictor has only
> > one value per entry.
> >
>
> That is true in Chapter 13 where SGD is introduced.  Later chapters
> illustrate the use on the 20 newsgroups data.
>
>
> > Is it possible to use the SGD algorithm (maybe better for me, because I
> > have small datasets) with only text entries (like blog posts)?
> >
>
> Yes.  This should work fine.
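To make "SGD on plain text" concrete: the text is turned into a sparse bag-of-words vector (Mahout's SGD encoders do this with the feature-hashing trick), and logistic regression is trained one example at a time. A stripped-down stdlib sketch of that idea — the toy corpus, dimensions, and parameters are invented, and this is not Mahout's actual code:

```python
import math

DIM = 2 ** 18  # hashed feature space; the size is an arbitrary choice here

def featurize(text):
    """Bag-of-words term counts, hashed into a fixed-width sparse vector."""
    v = {}
    for tok in text.lower().split():
        i = hash(tok) % DIM
        v[i] = v.get(i, 0.0) + 1.0
    return v

def predict(w, v):
    """Logistic probability of the positive class."""
    z = sum(w[i] * x for i, x in v.items())
    return 1.0 / (1.0 + math.exp(-z))

def train(docs, labels, epochs=20, lr=0.5):
    """Logistic regression by plain SGD over (text, 0/1 label) pairs."""
    w = [0.0] * DIM
    for _ in range(epochs):
        for text, y in zip(docs, labels):
            v = featurize(text)
            g = predict(w, v) - y  # gradient of log loss w.r.t. the score
            for i, x in v.items():
                w[i] -= lr * g * x
    return w

# invented toy corpus: positive vs. negative tone
docs = ["good great excellent", "fine wonderful good",
        "bad awful terrible", "poor bad dreadful"]
w = train(docs, [1, 1, 0, 0])
```

With only text as input, every document becomes such a sparse vector, so small training sets like Loic's are no obstacle to running the algorithm — only to its accuracy.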
>
> I would also consider the Luduan algorithm, which is not currently part of
> Mahout, although all the pieces are there.
>
> The basic idea is that for each binary decision you have three kinds of
> documents: unjudged documents, judged relevant documents, and judged
> non-relevant documents.  Luduan uses a log-likelihood ratio test to compare
> the judged relevant and judged non-relevant sets.  This comparison gives a
> set of search terms that are used with standard retrieval weighting such as
> tf-idf or BM25.  Term weights are determined by corpus frequencies without
> any explicit reference to the frequencies in the judged relevant or
> non-relevant documents.
>
> For some classification tasks with modest sized training data, this method
> out-performs most others.
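The log-likelihood ratio test referred to here is the G² statistic on a 2x2 contingency table (Mahout ships an implementation of it in its stats utilities). A short stdlib sketch of the statistic itself, with the term-selection interpretation in the comments (the example counts are invented):

```python
import math

def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 contingency table.
    Here: k11 = judged-relevant docs containing the term,
    k12 = relevant docs without it, k21/k22 = the same for non-relevant."""
    n = k11 + k12 + k21 + k22
    def term(obs, row, col):
        # each observed cell contributes obs * ln(obs / expected); zero cells
        # contribute nothing
        if obs == 0:
            return 0.0
        return obs * math.log(obs / (row * col / n))
    return 2.0 * (term(k11, k11 + k12, k11 + k21)
                  + term(k12, k11 + k12, k12 + k22)
                  + term(k21, k21 + k22, k11 + k21)
                  + term(k22, k21 + k22, k12 + k22))

# a term in 9 of 10 relevant docs but only 1 of 10 non-relevant docs scores
# far higher than one spread evenly across both sets, so it would be kept
# as a search term
strong, weak = llr(9, 1, 1, 9), llr(6, 4, 4, 6)
```

Ranking candidate terms by this score is what produces the query that the standard tf-idf or BM25 weighting then runs.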
>
> I can send a PDF with a more detailed description.
>
>
> > Thanks a lot for your time; tell me if I'm not clear enough in my
> > explanations :)
> >
>
> Please tell me the same.
>



-- 
Zach Richardson
Ravel, Co-founder
Austin, TX
zach@raveldata.com
512.825.6031
