mahout-user mailing list archives

From Gregory Lawrence
Subject Sequence file format for Kmeans, LDA, etc.
Date Fri, 13 Nov 2009 01:57:10 GMT

I'm trying to write a MapReduce program that converts text documents into a format suitable
for Mahout's clustering algorithms. From what I can gather, the output should be a sequence
file whose key is a long integer document index and whose value is a sparse vector of TF
counts (or TF-IDF weights). The sparse vector also has a name that identifies the document.
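
To make the record layout I have in mind concrete, here is a rough plain-Java sketch. It does not use any Mahout or Hadoop classes: Record, toRecord, indexOf and the in-memory dictionary are all hypothetical names made up for illustration, and a real MapReduce job would need some other way to agree on term indices across mappers. It only shows the shape described above: a long key, plus a named sparse vector mapping term index to TF count.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; none of these types are Mahout or Hadoop classes.
final class TfRecordSketch {

    // Hypothetical stand-in for one sequence-file record:
    // a long key plus a named sparse vector (term index -> TF count).
    static final class Record {
        final long key;
        final String name;
        final Map<Integer, Double> termCounts;

        Record(long key, String name, Map<Integer, Double> termCounts) {
            this.key = key;
            this.name = name;
            this.termCounts = termCounts;
        }
    }

    // Gives each distinct term the next free index the first time it is seen.
    // Assumption: a single shared in-memory dictionary, which a distributed job would not have.
    static int indexOf(String term, Map<String, Integer> dictionary) {
        Integer idx = dictionary.get(term);
        if (idx == null) {
            idx = dictionary.size();
            dictionary.put(term, idx);
        }
        return idx;
    }

    // Tokenizes on non-word characters and counts occurrences per term index.
    static Record toRecord(long key, String name, String text, Map<String, Integer> dictionary) {
        Map<Integer, Double> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if the text starts with punctuation
            }
            counts.merge(indexOf(token, dictionary), 1.0, Double::sum);
        }
        return new Record(key, name, counts);
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        List<Record> records = new ArrayList<>();
        records.add(toRecord(0L, "doc-a.txt", "the cat sat on the mat", dictionary));
        records.add(toRecord(1L, "doc-b.txt", "the dog sat", dictionary));
        for (Record r : records) {
            System.out.println(r.key + "\t" + r.name + "\t" + r.termCounts);
        }
    }
}
```

In this sketch the key is just a running counter and the name is the file name; whether Mahout actually needs either of them to be meaningful is exactly what I am unsure about.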

Does the long integer document index matter? I would rather avoid having to set this to something
meaningful. Do the numbers have to be unique or contiguous? Does the name of the sparse vector
matter? I noticed that it is being set as a string in LuceneIterable.
