Space: Apache Lucene Mahout (http://cwiki.apache.org/confluence/display/MAHOUT)
Page: Latent Dirichlet Allocation (http://cwiki.apache.org/confluence/display/MAHOUT/Latent+Dirichlet+Allocation)
Edited by Grant Ingersoll:

h1. Overview
Latent Dirichlet Allocation (Blei et al., 2003) is a powerful learning algorithm for automatically
and jointly clustering words into "topics" and documents into mixtures of topics. It has
been successfully applied to model change in scientific fields over time (Griffiths and Steyvers,
2004; Hall et al., 2008).
A topic model is, roughly, a hierarchical Bayesian model that associates with each document
a probability distribution over
"topics", which are in turn distributions over words. For instance, a topic in a collection
of newswire documents might include words about "sports", such as "baseball", "home run", and "player",
while a document about steroid use in baseball might draw on the "sports", "drugs", and "politics" topics.
Note that the labels "sports", "drugs", and "politics" are post hoc labels assigned by a
human, and that the algorithm itself only associates words and topics with probabilities. The
task of parameter estimation in these models is to learn both what the topics are and which
documents employ them in what proportions.
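The generative story just described can be written in standard LDA notation (a sketch of the usual formulation, not specific to Mahout's implementation): each topic is a distribution over words, each document has its own mixture over topics, and each word is drawn by first picking a topic.

```latex
\phi_k     \sim \mathrm{Dirichlet}(\beta)            % per-topic distribution over words
\theta_d   \sim \mathrm{Dirichlet}(\alpha)           % per-document mixture over topics
z_{d,n}    \sim \mathrm{Multinomial}(\theta_d)       % topic of the n-th word in document d
w_{d,n}    \sim \mathrm{Multinomial}(\phi_{z_{d,n}}) % the observed word itself
```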
Another way to view a topic model is as a generalization of a mixture model, like [Dirichlet
Process Clustering]. Starting from a normal mixture model, in which we have a single global
mixture of several distributions, we instead say that _each_ document has its own mixture
distribution over the globally shared mixture components. Operationally, in Dirichlet Process
Clustering each document has its own latent variable, drawn from a global mixture, that specifies
which component it belongs to, while in LDA each word in each document has its own latent variable,
drawn from a document-wide mixture.
The idea is that we use a probabilistic mixture of a number of models to explain
some observed data. Each observed data point is assumed to have come from one of the models
in the mixture, but we don't know which. The way we deal with that is to use a so-called
latent variable that specifies which model each data point came from.
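As a toy illustration of this generative view (made-up topics and vocabulary, not Mahout's actual code), each word in a document is produced by first sampling a latent topic from the document's own topic mixture, then sampling a word from that topic's word distribution:

```java
import java.util.Random;

// Toy sketch of LDA's generative story: per-word latent topic, drawn from
// a document-specific mixture over globally shared topics.
public class LdaGenerativeSketch {
    // Two hypothetical topics, each a tiny distribution over words.
    static final String[][] TOPIC_WORDS = {
        {"baseball", "player", "run"},   // a "sports"-like topic
        {"drug", "test", "ban"}          // a "drugs"-like topic
    };

    // Draw an index from a discrete probability distribution.
    static int sample(double[] dist, Random rng) {
        double u = rng.nextDouble(), cum = 0.0;
        for (int i = 0; i < dist.length; i++) {
            cum += dist[i];
            if (u < cum) return i;
        }
        return dist.length - 1;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        // This document is mostly "sports", a little "drugs".
        double[] docTopicMixture = {0.7, 0.3};
        double[] uniformWords = {1.0 / 3, 1.0 / 3, 1.0 / 3};
        StringBuilder doc = new StringBuilder();
        for (int n = 0; n < 8; n++) {
            int topic = sample(docTopicMixture, rng); // latent variable per word
            int word = sample(uniformWords, rng);     // word from that topic
            doc.append(TOPIC_WORDS[topic][word]).append(' ');
        }
        System.out.println(doc.toString().trim());
    }
}
```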
h1. Invocation and Usage
Mahout's implementation of LDA operates on a collection of SparseVectors of word counts. These
word counts should be non-negative integers, though things will probably work fine if
you use non-negative reals. (Note that the probabilistic model doesn't make sense if you do!)
To create these vectors, it's recommended that you follow the instructions in [Creating Vectors
From Text], making sure to use TF and not TF-IDF as the scorer.
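To make the expected input concrete, here is a sketch of the kind of per-document term-frequency counts LDA consumes. A plain Map<Integer, Integer> stands in for Mahout's SparseVector of (term index, count) pairs, and the dictionary is a made-up example, not part of Mahout's API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of raw term-frequency input (TF, not TF-IDF).
public class TermFrequencySketch {
    // Assign each vocabulary term a stable integer index.
    static Map<String, Integer> dictionary(String[] vocab) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (String term : vocab) dict.put(term, dict.size());
        return dict;
    }

    // Count occurrences of dictionary terms, keyed by term index.
    static Map<Integer, Integer> termCounts(String[] tokens, Map<String, Integer> dict) {
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        for (String tok : tokens) {
            Integer idx = dict.get(tok);
            if (idx != null) counts.merge(idx, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = dictionary(new String[]{"baseball", "player", "steroid"});
        Map<Integer, Integer> doc =
            termCounts("baseball player baseball steroid".split(" "), dict);
        System.out.println(doc);  // {0=2, 1=1, 2=1}
    }
}
```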
Invocation takes the form:
{{mvn exec:java -Dexec.mainClass=org.apache.mahout.clustering.lda.LDADriver -Dexec.args="-i
<input vectors directory> -o <output working directory> -k <numTopics> --numWords
<number of words> --numReducers <number of reducers>"}}
Topic smoothing should generally be about 50/K, where K is the number of topics. The number
of words in the vocabulary can be an upper bound, though it shouldn't be too high (for memory
concerns).
Choosing the number of topics is more art than science, and it's recommended that you try
several values.
h1. Example
A full end-to-end example is located in mahout/examples/bin/build-reuters.sh. The script automatically
downloads the Reuters-21578 corpus, builds a Lucene index, converts the Lucene index to vectors,
runs LDA, and then prints out a listing of the top 100 words for each topic. All of this is
done in examples/work/.
To adapt the example yourself, note that Lucene has specialized support for Reuters,
so building your own index will require some adaptation. The rest should not
differ much.
h1. Parameter Estimation
We use mean-field variational inference to estimate the model. Variational inference can
be thought of as a generalization of [Expectation Maximization] (EM) for hierarchical Bayesian
models. In the E-Step, we infer, for each document, the posterior probability
of each topic for each word in that document. We then take the sufficient statistics and emit
them in the form of (log) pseudo-counts for each word in each topic. The M-Step simply
sums these together and (log) normalizes them, so that we have a distribution over the entire
vocabulary of the corpus for each topic.
In the implementation, the E-Step is performed in the map phase and the M-Step in the
reduce phase, with the final normalization happening as a post-processing step.
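The two steps above can be sketched as follows (a toy illustration of the math, not Mahout's mapper/reducer code, and without the log-space arithmetic the real implementation uses for numerical stability):

```java
// E-step for one word: posterior over topics is proportional to
//   p(word | topic) * p(topic | document), then normalized.
// M-step for one topic: summed pseudo-counts are normalized into a
// distribution over the whole vocabulary.
public class VariationalSketch {
    static double[] wordTopicPosterior(double[] pWordGivenTopic, double[] pTopicGivenDoc) {
        double[] post = new double[pWordGivenTopic.length];
        double norm = 0.0;
        for (int k = 0; k < post.length; k++) {
            post[k] = pWordGivenTopic[k] * pTopicGivenDoc[k]; // unnormalized posterior
            norm += post[k];
        }
        for (int k = 0; k < post.length; k++) post[k] /= norm;
        return post;
    }

    // Normalize summed pseudo-counts into a per-topic word distribution.
    static double[] normalize(double[] pseudoCounts) {
        double sum = 0.0;
        for (double c : pseudoCounts) sum += c;
        double[] dist = new double[pseudoCounts.length];
        for (int w = 0; w < dist.length; w++) dist[w] = pseudoCounts[w] / sum;
        return dist;
    }

    public static void main(String[] args) {
        // Two topics; this word is twice as likely under topic 0,
        // but the document leans toward topic 1.
        double[] post = wordTopicPosterior(new double[]{0.2, 0.1}, new double[]{0.4, 0.6});
        System.out.printf("%.3f %.3f%n", post[0], post[1]);  // 0.571 0.429
    }
}
```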
h1. References
[David M. Blei, Andrew Y. Ng, Michael I. Jordan. 2003. Latent Dirichlet Allocation.
JMLR 3:993-1022. http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf]
