mahout-user mailing list archives

From Isabel Drost <isa...@apache.org>
Subject Re: LDA tutorial?
Date Thu, 03 Sep 2009 14:31:15 GMT
On Wed, 2 Sep 2009 14:38:54 -0700
Grant Ingersoll <gsingers@apache.org> wrote:

> http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html

I have followed the tutorial and was able to run LDA on the Reuters
dataset. Some questions that occurred to me:

Looking at the resulting topics, it seems that no stemming or
lemmatization was done before generating the vectors. Is that
right?

Do we have documentation on the vector format? I found
http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html but that
only describes how to generate vectors from a Lucene index. I would like
to run MAHOUT-123 on a set of vectors generated from German texts. We
already have a document processing pipeline that handles tokenisation,
stemming, term selection and the like, and I would like to reuse it. I
guess I could reuse the org.apache.mahout.utils.vector.* classes?

Isabel
