mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Sequence file format for Kmeans, LDA, etc.
Date Fri, 13 Nov 2009 19:54:14 GMT

On Nov 12, 2009, at 8:57 PM, Gregory Lawrence wrote:

> Hi,
> 
> I'm trying to write a map-reduce program that will convert text documents into a format
> suitable for Mahout's clustering algorithms. From what I can gather, it seems like the output
> should be a sequence file with a long integer document index (key) and a sparse vector (value)
> that contains TF (or TFIDF) counts. This sparse vector also has a name that identifies the
> document.
> 
> Does the long integer document index matter?

No

> I would rather avoid having to set this to something meaningful. Do the numbers have
> to be unique or contiguous?

The index is ignored in the clustering.

> Does the name of the sparse vector matter?

Yes, as it is part of the equals() method.
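
To illustrate why the name matters, here is a minimal Java sketch of a named sparse vector whose equals() includes the name. It is not Mahout's SparseVector; the class NamedSparseVectorSketch and its fields are hypothetical and exist only to show the idea that two vectors with identical weights but different names compare as unequal.

```java
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

// Hypothetical stand-in for a named sparse vector; NOT Mahout's actual class.
final class NamedSparseVectorSketch {
    private final String name;                   // document identifier
    private final Map<Integer, Double> entries;  // term index -> weight, nonzero entries only

    NamedSparseVectorSketch(String name, Map<Integer, Double> entries) {
        this.name = name;
        this.entries = new TreeMap<>(entries);   // defensive copy
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof NamedSparseVectorSketch)) return false;
        NamedSparseVectorSketch other = (NamedSparseVectorSketch) o;
        // The name participates in equality, so identical weights under
        // different names are treated as different vectors.
        return Objects.equals(name, other.name) && entries.equals(other.entries);
    }

    @Override
    public int hashCode() {
        // Kept consistent with equals(): both use name and entries.
        return Objects.hash(name, entries);
    }
}
```

Under this assumption, giving each document's vector a distinct, stable name keeps otherwise identical vectors distinguishable.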

> I noticed that it is being set as a string in LuceneIterable.

Right. You should be able to model yours after LuceneIterable and the Driver program there.

Also, take a look at what the TfIdfDriver does for the classifier stuff. This is an M/R job
for converting text into its format. I think we can abstract that to be more general purpose
and then move it under the Utils module. The only thing that likely needs to change is whether
we output the Writable for the classifier or whether we output a Vector. That is my naive
view at this point.

-Grant
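
As a rough illustration of the per-document work such a conversion job would do, the following Java sketch turns a tokenized document into sparse TF counts. It deliberately uses no Hadoop or Mahout API. The names TermFrequencySketch, termFrequencies, and dictionary are hypothetical, and it assumes a term-to-index dictionary has already been built in an earlier pass.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; not part of Mahout or Hadoop.
final class TermFrequencySketch {
    // Counts how often each dictionary term occurs in one document.
    // Returns a sparse map of term index -> count; terms that are
    // missing from the dictionary are skipped.
    static Map<Integer, Double> termFrequencies(List<String> tokens,
                                                Map<String, Integer> dictionary) {
        Map<Integer, Double> tf = new TreeMap<>();
        for (String token : tokens) {
            Integer index = dictionary.get(token);
            if (index == null) {
                continue; // unknown term, no slot in the vector
            }
            tf.merge(index, 1.0, Double::sum); // increment the count for this term
        }
        return tf;
    }
}
```

In a real job, a mapper would wrap the resulting counts in the vector type the clustering code expects and emit them with the document name attached; those details depend on the Mahout version in use.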

