mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Help with Mahout Classification
Date Thu, 03 Feb 2011 16:43:28 GMT
This usually means that something is wrong in the data or the classifier
itself.

Do you have some sample data?
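
One quick check for this symptom is to compare the label distribution of the training data with the distribution of the predictions, and the accuracy with the majority-class baseline. The following is a minimal sketch in plain Python, not Mahout code; the function name `degenerate_report` and the field names are made up for illustration.

```python
from collections import Counter


def degenerate_report(true_labels, predicted_labels):
    """Summarize whether a classifier has collapsed onto a single label.

    Hypothetical helper for illustration only; it is not part of Mahout.
    """
    n = len(true_labels)
    if n == 0 or n != len(predicted_labels):
        raise ValueError("need two non-empty label lists of equal length")

    true_counts = Counter(true_labels)
    predicted_counts = Counter(predicted_labels)

    # Accuracy of always answering with the most frequent true label.
    majority_count = true_counts.most_common(1)[0][1]
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))

    return {
        "true_counts": dict(true_counts),
        "predicted_counts": dict(predicted_counts),
        "majority_baseline": majority_count / n,
        "accuracy": correct / n,
        # True when every prediction is the same label.
        "collapsed": len(predicted_counts) == 1,
    }


if __name__ == "__main__":
    truth = ["sport"] * 2 + ["not sport"] * 8
    guesses = ["not sport"] * 10
    print(degenerate_report(truth, guesses))
```

If `collapsed` is true, or the accuracy is no better than `majority_baseline`, the model has learned little beyond the label frequencies, and the data or the training setup deserves a closer look.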

On Thu, Feb 3, 2011 at 3:55 AM, Claudia Grieco <grieco@crmpa.unisa.it> wrote:

> Thanks for your help.
> I've tried implementing several "boolean" classifiers ("sport" or "not
> sport") but they don't seem to work very well (they tend to classify
> everything as "positive" or everything as "negative"). Do you think that for
> it to return meaningful classifications, the model should be trained with an
> almost equal amount of "positive" and "negative" data?
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Monday, January 31, 2011 16:43
> To: user@mahout.apache.org
> Subject: Re: Help with Mahout Classification
>
> For 50 categories, yes.  For 5000, no.
>
> If you have 50 categories, you probably also have inter-category
> constraints (e.g. a document cannot be about football without also being
> about sports).
>
> To deal with that, training 50 independent models and then training 50
> models that get to use the output of the first 50 models as inputs might
> help (haven't tried this sort of thing for several years).
>
>
> On Mon, Jan 31, 2011 at 2:55 AM, Claudia Grieco <grieco@crmpa.unisa.it> wrote:
>
> > Hi,
> > Just one more question about the SGD classifier.
> > When you say "train one classifier per category", does it mean that for
> > every possible tag (e.g. sport) I should create a classifier that
> > classifies a document as "sport" or "not sport"? (sorry, English is not
> > my first language)
> > Do you think this approach is feasible for many categories (let's say 50)?
> > Thanks again
> > Claudia
> >
> > -----Original Message-----
> > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > Sent: Friday, January 14, 2011 17:32
> > To: user@mahout.apache.org
> > Subject: Re: Help with Mahout Classification
> >
> > If you don't have truly massive volumes, then SGD is almost certainly a
> > better choice because it is simpler.
> >
> > If you have more than 10 million training examples *per model* and
> > *after downsampling* then you should consider alternatives, but even up
> > to about 50 million training examples, SGD will do very well.  SGD is
> > currently also mostly appropriate for sparse feature vectors.
> >
> > Having multiple categories isn't a big deal.  The simplest solution is to
> > train a classifier per category.  There are more advanced arrangements,
> > though.  For instance, you can train one classifier per category (the
> > first level models), then train another classifier per category where the
> > inputs are the outputs of the first level models.  Which techniques will
> > help is highly dependent on your particular problem.
> >
> > On Fri, Jan 14, 2011 at 7:10 AM, Claudia Grieco <grieco@crmpa.unisa.it> wrote:
> >
> > > Do you think SGD will be a better choice? New documents are added to
> > > the training set very often and documents can belong to more than one
> > > category (e.g. "sport", "italy")
> >
> >
>
>
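
The two-level arrangement discussed in the thread (one binary classifier per category, then a second classifier per category whose inputs are the outputs of the first-level models) can be sketched roughly as follows. This is an illustrative toy in plain Python using a hand-written logistic regression trained by SGD; it does not use the Mahout API, and every function name, the learning rate, the epoch count and the sample documents are assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    # Clamp the score so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(rows, targets, epochs=200, rate=0.5, seed=0):
    """Binary logistic regression trained by plain SGD.

    rows are sparse feature dicts (feature name -> value), targets are 0/1.
    """
    rng = random.Random(seed)
    weights, bias = {}, 0.0
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            score = bias + sum(weights.get(f, 0.0) * v for f, v in rows[i].items())
            error = targets[i] - sigmoid(score)
            bias += rate * error
            for f, v in rows[i].items():
                weights[f] = weights.get(f, 0.0) + rate * error * v
    return weights, bias


def predict(model, row):
    weights, bias = model
    return sigmoid(bias + sum(weights.get(f, 0.0) * v for f, v in row.items()))


def train_one_per_category(rows, label_sets, categories):
    # One "category" vs "not category" model for each category.
    return {
        c: train_sgd(rows, [1.0 if c in labels else 0.0 for labels in label_sets])
        for c in categories
    }


def first_level_outputs(models, row):
    # The per-category probabilities become the second-level feature vector.
    return {c: predict(m, row) for c, m in models.items()}


def train_stacked(rows, label_sets, categories):
    first = train_one_per_category(rows, label_sets, categories)
    stacked_rows = [first_level_outputs(first, r) for r in rows]
    second = train_one_per_category(stacked_rows, label_sets, categories)
    return first, second


def classify(first, second, row, threshold=0.5):
    outputs = first_level_outputs(first, row)
    return {c for c, m in second.items() if predict(m, outputs) >= threshold}


if __name__ == "__main__":
    docs = [
        {"goal": 1.0, "match": 1.0},
        {"match": 1.0, "tennis": 1.0},
        {"election": 1.0, "vote": 1.0},
        {"vote": 1.0, "parliament": 1.0},
    ]
    labels = [{"sport", "football"}, {"sport"}, {"politics"}, {"politics"}]
    first, second = train_stacked(docs, labels, ["sport", "football", "politics"])
    print(classify(first, second, {"goal": 1.0, "match": 1.0}))
```

Because a document may carry several labels, each category gets its own yes/no decision rather than one shared argmax. In practice the second-level models would normally be trained on first-level outputs computed for held-out documents; the sketch reuses the training documents only to stay short.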
