On Wed, 2 Sep 2009 14:38:54 -0700 Grant Ingersoll wrote:
> http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html

I have followed the tutorial and was able to run LDA on the Reuters dataset. Some questions occurred to me:

Looking at the resulting topics, it seems that no stemming or lemmatization was done prior to generating the vectors. Is that right?

Do we have documentation on the vector format? I found http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html but that describes how to generate vectors from Lucene.

I would like to run MAHOUT-123 on a set of vectors generated from German texts. We already have a document processing pipeline capable of tokenisation, stemming, term selection and the like, which I would like to reuse. I guess I could reuse the org.apache.mahout.utils.vector.* classes?

Isabel