mahout-user mailing list archives

From ivan obeso <sendero.lumin...@gmail.com>
Subject Re: LDA input
Date Mon, 30 Apr 2012 07:26:41 GMT
Yes. You create a sequence file using, for example, the SequenceFile.Writer
class, writing the name of each file as the key and its entire content as the
value. You can write as many files as you want into the sequence file.
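
To make the key/value layout concrete, here is a minimal sketch in plain Java
that only gathers the (file name, content) records. It deliberately does not
use Hadoop; the class name SeqFileRecords and the method collectRecords are
hypothetical names chosen for illustration. In a real job, each record would be
passed to SequenceFile.Writer's append method with Text keys and values.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: builds the records that would go into the sequence file.
public class SeqFileRecords {

    // Returns one record per regular file in dir: file name -> full text.
    static Map<String, String> collectRecords(Path dir) throws IOException {
        Map<String, String> records = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    String content =
                        new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                    records.put(file.getFileName().toString(), content);
                }
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Small throwaway corpus so the sketch runs on its own.
        Path dir = Files.createTempDirectory("corpus");
        Files.write(dir.resolve("doc1.txt"),
                    "first document".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("doc2.txt"),
                    "second document".getBytes(StandardCharsets.UTF_8));

        for (Map.Entry<String, String> record : collectRecords(dir).entrySet()) {
            // With Hadoop on the classpath, this is where the record would be
            // appended to the sequence file (key = name, value = content).
            System.out.println(record.getKey() + " -> " + record.getValue());
        }
    }
}
```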

Then you use this *.seq file as the input to DocumentProcessor.tokenizeDocuments
to tokenize the documents (you can use a stemmer here). The result is a
folder with files containing the tokens. This folder must be the input
of the DictionaryVectorizer.createTermFrequencyVectors method, which creates
the TF vectors of the corpus. Finally, the folder with the TF vectors is the
input of the LDA algorithm, which you can run with the "bin/mahout lda" tool or
call from a Java program.
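
As a rough illustration of what the tokenizing and TF vector steps compute,
here is a small sketch in plain Java. It is not Mahout's implementation and
does not call DocumentProcessor or DictionaryVectorizer: the class
TermFrequencySketch and its methods are hypothetical, the tokenizer is a
simple lowercase split with no stemming, and the vectors are dense arrays
rather than the vectors Mahout writes to disk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical, simplified stand-in for the tokenize + TF vector steps.
public class TermFrequencySketch {

    // Lowercases the text and splits it on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Assigns each distinct term an index, in order of first appearance.
    static Map<String, Integer> buildDictionary(List<List<String>> docs) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                if (!dictionary.containsKey(term)) {
                    dictionary.put(term, dictionary.size());
                }
            }
        }
        return dictionary;
    }

    // Counts how often each dictionary term occurs in one tokenized document.
    static int[] termFrequencies(List<String> doc, Map<String, Integer> dictionary) {
        int[] tf = new int[dictionary.size()];
        for (String term : doc) {
            Integer index = dictionary.get(term);
            if (index != null) {
                tf[index]++;
            }
        }
        return tf;
    }

    public static void main(String[] args) {
        List<List<String>> docs = new ArrayList<>();
        docs.add(tokenize("The cat sat on the mat"));
        docs.add(tokenize("The dog sat"));

        Map<String, Integer> dictionary = buildDictionary(docs);
        System.out.println("dictionary: " + dictionary);
        for (List<String> doc : docs) {
            System.out.println(Arrays.toString(termFrequencies(doc, dictionary)));
        }
    }
}
```

Each document becomes one vector of term counts over a shared dictionary, and
vectors of this kind are what the LDA step takes as input.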

You don't need to run clustering before the LDA algorithm, because it
performs its own clustering process.

[https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html]

On Sun, Apr 29, 2012 at 1:11 AM, Aneesha <aneeshatvm@gmail.com> wrote:

> I create sequential file and create vector for k-means. Is it the same
> input we
> need to use for Latent Dirichlet Allocation????
>
>
