mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "David Hall (JIRA)" <>
Subject [jira] Commented: (MAHOUT-123) Implement Latent Dirichlet Allocation
Date Mon, 03 Aug 2009 01:47:15 GMT


David Hall commented on MAHOUT-123:

I unfortunately haven't run it on a Hadoop cluster yet. It should "just work" if you run it
with the right Hadoop configuration. Shouldn't running it through the "hadoop" shell script
add the configuration?

I'll get it running on a hadoop cluster soon.

The code actually requires Hadoop 0.20, because Mahout has decided to move in that direction.

-- David

> Implement Latent Dirichlet Allocation
> -------------------------------------
>                 Key: MAHOUT-123
>                 URL:
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.2
>            Reporter: David Hall
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>         Attachments: lda.patch, MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch,
MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch
>   Original Estimate: 504h
>  Remaining Estimate: 504h
> (For GSoC)
> Abstract:
> Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
> algorithm for automatically and jointly clustering words into "topics"
> and documents into mixtures of topics, and it has been successfully
> applied to model change in scientific fields over time (Griffiths and
> Steyver, 2004; Hall, et al. 2008). In this project, I propose to
> implement a distributed variant of Latent Dirichlet Allocation using
> MapReduce, and, time permitting, to investigate extensions of LDA and
> possibly more efficient algorithms for distributed inference.
> Detailed Description:
> A topic model is, roughly, a hierarchical Bayesian model that
> associates with each document a probability distribution over
> "topics", which are in turn distributions over words. For instance, a
> topic in a collection of newswire might include words about "sports",
> such as "baseball", "home run", "player", and a document about steroid
> use in baseball might include "sports", "drugs", and "politics". Note
> that the labels "sports", "drugs", and "politics", are post-hoc labels
> assigned by a human, and that the algorithm itself only assigns
> associate words with probabilities. The task of parameter estimation
> in these models is to learn both what these topics are, and which
> documents employ them in what proportions.
> One of the promises of unsupervised learning algorithms like Latent
> Dirichlet Allocation (LDA; Blei et al, 2003) is the ability to take a
> massive collections of documents and condense them down into a
> collection of easily understandable topics. However, all available
> open source implementations of LDA and related topics models are not
> distributed, which hampers their utility. This project seeks to
> correct this shortcoming.
> In the literature, there have been several proposals for paralellzing
> LDA. Newman, et al (2007) proposed to create an "approximate" LDA in
> which each processors gets its own subset of the documents to run
> Gibbs sampling over. However, Gibbs sampling is slow and stochastic by
> its very nature, which is not advantageous for repeated runs. Instead,
> I propose to follow Nallapati, et al. (2007) and use a variational
> approximation that is fast and non-random.
> References:
> David M. Blei, J McAuliffe. Supervised Topic Models. NIPS, 2007.
> David M. Blei , Andrew Y. Ng , Michael I. Jordan, Latent dirichlet
> allocation, The Journal of Machine Learning Research, 3, p.993-1022,
> 3/1/2003
> T. L. Griffiths and M. Steyvers. Finding scientiļ¬c topics. Proc Natl
> Acad Sci U S A, 101 Suppl 1: 5228-5235, April 2004.
> David LW Hall, Daniel Jurafsky, and Christopher D. Manning. Studying
> the History of Ideas Using Topic Models. EMNLP, Honolulu, 2008.
> Ramesh Nallapati, William Cohen, John Lafferty, Parallelized
> variational EM for Latent Dirichlet Allocation: An experimental
> evaluation of speed and scalability, ICDM workshop on high performance
> data mining, 2007.
> Newman, D., Asuncion, A., Smyth, P., & Welling, M. Distributed
> Inference for Latent Dirichlet Allocation. NIPS, 2007.
> Xuerui Wang , Andrew McCallum, Topics over time: a non-Markov
> continuous-time model of topical trends. KDD, 2006
> Wolfe, J., Haghighi, A, and Klein, D. Fully distributed EM for very
> large datasets. ICML, 2008.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message