mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Logistic Regression Tutorial
Date Thu, 28 Apr 2011 20:54:58 GMT
The TrainNewsGroups class does this not quite as nicely as is possible (it
avoids the TextValueEncoder).

I will post a simplified example on github that I just worked up for RCV1.



On Thu, Apr 28, 2011 at 1:32 PM, Chris Schilling <chris@cellixis.com> wrote:

> Benson,
>
> Chapters 14 and 15 discuss the 20 newsgroups classification example using
> bag-of-words.  In this implementation of LR, you have to manually create the
> feature vectors when iterating through the files.  The features are hashed
> into a vector of predetermined length.  The examples are very clear and easy
> to set up.  I can send you some code I wrote for a similar problem if it will
> help.
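[Editor's note: the feature hashing Chris describes can be sketched in a few lines of plain Python. This is an illustration of the hashing trick only, not Mahout's encoder API; `hash_vectorize` and `NUM_FEATURES` are made-up names for this sketch.]

```python
# Sketch of the hashing trick: each token is hashed into a slot of a
# fixed-length vector, so no dictionary pass over the corpus is needed.
# Illustrative only -- not Mahout's FeatureVectorEncoder API.
import hashlib

NUM_FEATURES = 16  # real uses pick a much larger length, often a power of two

def hash_vectorize(tokens, n=NUM_FEATURES):
    vec = [0.0] * n
    for tok in tokens:
        # stable hash, so the same token always lands in the same slot
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        vec[h % n] += 1.0
    return vec
```

Collisions (two tokens sharing a slot) are the price paid for the fixed, predetermined length; with a large enough vector they rarely hurt classification much.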
>
> Chris
>
> On Apr 28, 2011, at 1:24 PM, Benson Margulies wrote:
>
> > Chris,
> >
> > I'm looking at a recently-purchased copy of MIA.
> >
> > The LR example is all about the donut file, which has features that
> > don't look even remotely like a full-up bag-of-words vector.
> >
> > I'm missing the point of connection between the vectorization process
> > (we have some experience here running canopy/kmeans) and the LR
> > example. It's probably some simple principle that I'm failing to grasp.
> >
> > --benson
> >
> >
> > On Thu, Apr 28, 2011 at 4:02 PM, Chris Schilling <chris@cellixis.com> wrote:
> >> Benson,
> >>
> >> The latest chapters in Mahout in Action cover document classification using LR very well.
> >>
> >> Chris
> >>
> >>
> >> On Apr 28, 2011, at 12:55 PM, Benson Margulies wrote:
> >>
> >>> Mike,
> >>>
> >>> in the time available for the experiment I want to perform, all I can
> >>> imagine doing is turning each document into a bag-of-words feature
> >>> vector. So, I want to run the pipeline of lucene->vectors->... and
> >>> train a model. I confess that I don't have the time to absorb the
> >>> underlying math; in any case, I have some co-workers who can help me
> >>> with that. My problem is entirely plumbing at this point.
> >>>
> >>> --benson
> >>>
> >>>
> >>> On Thu, Apr 28, 2011 at 3:52 PM, Mike Nute <mike.nute@gmail.com> wrote:
> >>>> Benson,
> >>>>
> >>>> Lecture 3 in this one is a good intro to the logit model:
> >>>>
> >>>> http://see.stanford.edu/see/lecturelist.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1
> >>>>
> >>>> The lecture notes are pretty solid too so that might be faster.
> >>>>
> >>>> The short version: logistic regression is a GLM with the inverse link
> >>>> f^-1(x) = 1/(1 + e^(-xB)) and a binomial likelihood function.  You can
> >>>> fit it with either batch or stochastic gradient descent.
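[Editor's note: Mike's summary can be made concrete with a tiny pure-Python stochastic-gradient fit of a logistic model. This is a minimal sketch of SGD on the Bernoulli log-likelihood, not Mahout's OnlineLogisticRegression; the toy data, learning rate, and function names are all invented for illustration.]

```python
# Tiny SGD fit of logistic regression: p = sigmoid(x.B), updated with
# the log-likelihood gradient (y - p) * x. Illustrative sketch only;
# Mahout's trainers add priors, learning-rate annealing, etc.
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_train(data, n_features, epochs=200, lr=0.5):
    B = [0.0] * n_features
    rng = random.Random(42)  # fixed seed for reproducibility
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            p = sigmoid(sum(b * xi for b, xi in zip(B, x)))
            # gradient ascent on the Bernoulli log-likelihood
            for i, xi in enumerate(x):
                B[i] += lr * (y - p) * xi
    return B

# toy, linearly separable data: label is 1 when the second feature
# dominates the first (last component is a bias term)
data = [([1.0, 0.0, 1.0], 0), ([0.0, 1.0, 1.0], 1),
        ([2.0, 0.5, 1.0], 0), ([0.5, 2.0, 1.0], 1)]
B = sgd_train(list(data), 3)
```

The same update run over the whole data set at once, instead of one example at a time, gives the batch-gradient variant Mike mentions.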
> >>>>
> >>>> I've never done document classification before though, so I'm not much
> >>>> help with more complicated things like choosing the feature vector.
> >>>>
> >>>> Good Luck,
> >>>> Mike Nute
> >>>>
> >>>> On Thu, Apr 28, 2011 at 3:35 PM, Benson Margulies <bimargulies@gmail.com> wrote:
> >>>>
> >>>>> Is there a logistic regression tutorial in the house? I've got a stack
> >>>>> of files (Arabic ones, no less) and I want to train and score a
> >>>>> classifier.
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Michael Nute
> >>>> Mike.Nute@gmail.com
> >>>>
> >>
> >>
>
>
