Subject: Re: Token filtering and LDA quality
From: John Conwell <turbocodr@gmail.com>
To: user@mahout.apache.org
Date: Tue, 24 Jan 2012 15:41:33 -0800
Hey Jake,

Thanks for the tips. That will definitely help. One more question: do you
know if topic model quality will be affected by document length? I'm
thinking of lengths ranging from tweets (~20 words), to emails (hundreds of
words), to whitepapers (thousands of words), to books (boatloads of words).
Roughly what lengths would degrade topic model quality? I would think
tweets would be pretty poor, but what about longer docs? Should they be
segmented into sub-documents?

Thanks,
JohnC

On Tue, Jan 24, 2012 at 12:33 PM, Jake Mannix wrote:

> Hi John,
>
> I'm not an expert in the field, but I have done a bit of work building
> topic models with LDA, and here are some of the "tricks" I've used:
>
> 1) Yes, remove stop words; in fact, remove all words occurring in more
> than (say) half (or, more conservatively, 90%) of your documents, as
> they'll be noise and will just dominate your topics.
>
> 2) More features is better, if you have the memory for it (note that
> Mahout's LDA currently holds numTopics * numFeatures in memory in the
> mapper tasks, which means you are usually bounded to a few hundred
> thousand features, maybe up as high as a million, currently). So don't
> stem, and throw in commonly occurring (or, more importantly,
> high-log-likelihood) bigrams and trigrams as independent features.
>
> 3) Violate the underlying assumption of LDA that you're modeling "token
> occurrences": weight your vectors not as "tf" but as "tf*idf", which
> makes rarer features more prominent and ends up making your topics look
> a lot nicer.
>
> Those are the main tricks I can think of right now.
>
> If you're using Mahout trunk, try the new LDA impl:
>
>   $MAHOUT_HOME/bin/mahout cvb0 --help
>
> It operates on the same kind of input as the last one (i.e. a corpus
> which is a SequenceFile).
>
> -jake
>
> On Tue, Jan 24, 2012 at 12:14 PM, John Conwell wrote:
>
> > I'm trying to find out if there are any standard best practices for
> > document tokenization when prepping your data for LDA, in order to get
> > a higher-quality topic model, and to understand how the feature space
> > affects topic model quality.
> >
> > For example, will the topic model be "better" with a richer feature
> > space from not stemming terms, or is it better to have a more
> > normalized feature space by applying stemming?
> >
> > Is it better to filter out stop words, or keep them in?
> >
> > Is it better to include bi- and/or tri-grams of highly correlated
> > terms in the feature space?
> >
> > In essence, what characteristics of the feature space that LDA uses
> > for input will create a higher-quality topic model?
> >
> > Thanks,
> > JohnC

-- 
Thanks,
John C
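For the archive: the three tricks quoted above (document-frequency
filtering, n-gram features, tf*idf weighting) can be sketched in a few
lines of plain Python. This is a toy illustration, not Mahout's
implementation: the function names are made up, and it adds all bigrams
rather than selecting only high-log-likelihood ones as Mahout's
seq2sparse would.

```python
# Toy sketch of the three preprocessing tricks (NOT Mahout's code):
# drop terms that occur in too many documents, add bigrams as extra
# features, and weight the remaining features by tf*idf.
import math
from collections import Counter

def tokenize(doc, max_ngram=2):
    """Lowercase unigrams plus bigrams joined with '_' (made-up scheme)."""
    words = doc.lower().split()
    tokens = list(words)
    if max_ngram >= 2:
        tokens += ["_".join(pair) for pair in zip(words, words[1:])]
    return tokens

def build_tfidf(corpus, max_df_fraction=0.5):
    """Return one tf*idf-weighted Counter (sparse vector) per document."""
    token_lists = [tokenize(d) for d in corpus]
    n_docs = len(corpus)
    # Document frequency: in how many documents does each token occur?
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    # Trick 1: drop near-stop-words occurring in > max_df_fraction of docs.
    vocab = {t for t, c in df.items() if c / n_docs <= max_df_fraction}
    vectors = []
    for tokens in token_lists:
        tf = Counter(t for t in tokens if t in vocab)
        # Trick 3: tf*idf instead of raw counts (one common idf variant).
        vectors.append(
            Counter({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
        )
    return vectors

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
vecs = build_tfidf(docs)
```

With three documents and max_df_fraction=0.5, "the" (in all three docs)
and "sat" (in two) are filtered out, while document-specific features like
"mat" or the bigram "cat_chased" survive with weight tf * ln(3/df).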