mahout-user mailing list archives

From: Jake Mannix <jake.man...@gmail.com>
Subject: Re: Token filtering and LDA quality
Date: Wed, 25 Jan 2012 00:04:29 GMT
On Tue, Jan 24, 2012 at 3:41 PM, John Conwell <john@iamjohn.me> wrote:

> Hey Jake,
> Thanks for the tips.  That will definitely help.
>
> One more question: do you know if the topic model quality will be affected
> by the document length?


Yes, very much so.


>  I'm thinking lengths ranging from tweets (~20 words),


Tweets suck.  Trust me on this. ;)


> to emails (hundreds of words),


Fantastic size.


> to whitepapers (thousands of words)
>

Can be pretty great too.


> to books (boat loads of words).


This is too long: a single book will often contain tons and tons of topics.
But, frankly, I have not tried LDA on huge documents personally, so I can't
say from experience that it won't work; I'd just not be terribly surprised
if it didn't work well at all.  If I had a bunch of books I wanted to run LDA
on, I'd maybe treat each page or each chapter as a separate document.
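
Something like this is what I have in mind, as a rough sketch (the file name
and the "CHAPTER" heading pattern are just assumptions about how the book is
formatted; adjust them for your corpus):

  # split book.txt into one file per chapter, cutting at each chapter heading
  mkdir -p chapters
  csplit --prefix=chapters/chapter_ --suffix-format='%03d.txt' \
      book.txt '/^CHAPTER/' '{*}'

  # each chapters/chapter_NNN.txt then becomes its own document when you
  # build the SequenceFile corpus that seq2sparse / cvb0 consume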

  -jake

> What lengths'ish would degrade topic model quality?
>
> I would think tweets would kind'a suck, but what about longer docs?  Should
> they be segmented into sub-documents?
>
> Thanks,
> JohnC
>
>
> On Tue, Jan 24, 2012 at 12:33 PM, Jake Mannix <jake.mannix@gmail.com>
> wrote:
>
> > Hi John,
> >
> >  I'm not an expert in the field, but I have done a bit of work building
> > topic models with LDA, and here are some of the "tricks" I've used:
> >
> >  1) yes, remove stop words; in fact, remove all words occurring in more than
> > (say) half (or, more conservatively, 90%) of your documents, as they'll be
> > noise and just dominate your topics.
> >
> >  2) more features are better, if you have the memory for them (note that
> > mahout's LDA currently holds numTopics * numFeatures in memory in the mapper
> > tasks, which means that you are usually bounded at a few hundred thousand
> > features, maybe up as high as a million, currently).  So don't stem, and
> > throw in commonly occurring (or, more importantly, high log-likelihood)
> > bigrams and trigrams as independent features.
> >
> >  3) violate the underlying assumption of LDA (that you're dealing with raw
> > "token occurrences") and weight your vectors not as "tf" but as "tf*idf",
> > which makes rarer features more prominent and ends up making your topics
> > look a lot nicer.
> >
> > Those are the main tricks I can think of right now; see the rough sketch
> > below for how they map onto Mahout's vectorization flags.
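> >
> > Something like this, roughly (the paths are placeholders and the flag
> > values are just assumptions for illustration; double-check the options
> > against seq2sparse --help on your build):
> >
> >   # trick 1: -x 50 drops terms appearing in more than 50% of docs (maxDFPercent)
> >   # trick 2: -ng 3 -ml 50 adds bi/trigrams, keeping only high log-likelihood ones
> >   # trick 3: -wt tfidf weights the vectors by tf*idf instead of raw tf counts
> >   $MAHOUT_HOME/bin/mahout seq2sparse \
> >     -i corpus-seqfiles -o corpus-vectors \
> >     -wt tfidf -x 50 -ng 3 -ml 50
> >
> >   # rough memory check for (2): each mapper holds about
> >   # numTopics * numFeatures * 8 bytes,
> >   # e.g. 200 topics * 500,000 features * 8 bytes ~ 800 MB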
> >
> > If you're using Mahout trunk, try the new LDA impl:
> >
> >  $MAHOUT_HOME/bin/mahout cvb0 --help
> >
> > It operates on the same kind of input as the last one (i.e. a corpus which
> > is a SequenceFile<IntWritable, VectorWritable>).
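> >
> > seq2sparse keys its vectors by Text, so you need one extra step to get
> > IntWritable keys.  The rowid job is one way to do it (again just a sketch;
> > the paths are placeholders, and it's worth confirming the flags on your
> > build):
> >
> >   # map Text document keys to sequential IntWritable keys
> >   $MAHOUT_HOME/bin/mahout rowid \
> >     -i corpus-vectors/tfidf-vectors -o corpus-matrix
> >
> >   # corpus-matrix/matrix is then a SequenceFile<IntWritable, VectorWritable>
> >   # that cvb0 can read; cvb0 --help lists its own options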
> >
> >  -jake
> >
> > On Tue, Jan 24, 2012 at 12:14 PM, John Conwell <john@iamjohn.me> wrote:
> >
> > > I'm trying to find out if there are any standard best practices for
> > > document tokenization when prepping your data for LDA in order to get a
> > > higher quality topic model, and to understand how the feature space
> > > affects topic model quality.
> > >
> > > For example, will the topic model be "better" if there is a richer
> > > feature space from not stemming terms, or is it better to have a more
> > > normalized feature space by applying stemming?
> > >
> > > Is it better to filter out stop words, or keep them in?
> > >
> > > Is it better to include bi- and/or tri-grams of highly correlated terms
> > > in the feature space?
> > >
> > > In essence, what characteristics of the feature space that LDA uses as
> > > input will create a higher quality topic model?
> > >
> > > Thanks,
> > > JohnC
> > >
> >
>
>
>
> --
>
> Thanks,
> John C
>
