mahout-user mailing list archives

From "Richard Tomsett" <>
Subject Re: Text clustering
Date Wed, 03 Dec 2008 23:46:28 GMT
Hi Philippe,

I used the K-Means on TF-IDF vectors and wondered the same thing - about
labelling the documents. I haven't got my code on me at the moment and it
was a few months ago that I last looked at it (so I was also probably using
an older version of Mahout)... but I seem to remember that I did just as you
are suggesting and simply attached a unique ID to each document which got
passed through the map-reduce stages. This requires a bit of tinkering with
the K-Means implementation but shouldn't be too much work.
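
A rough sketch of the idea, in plain Java rather than Mahout's API (this is
not the code described above; LabelledVector and LabelledAssignment are
made-up names, and the real job would do this inside the map-reduce stages).
The point is only that the ID travels with the vector, so the cluster
assignment can report document IDs instead of anonymous vectors:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not a Mahout class: a document vector that carries its own ID. */
final class LabelledVector {
    final String docId;     // unique document ID, e.g. the email's file name
    final double[] values;  // TF-IDF weights, one slot per dictionary word

    LabelledVector(String docId, double[] values) {
        this.docId = docId;
        this.values = values;
    }
}

/** Hypothetical assignment step: groups document IDs by their nearest centroid. */
final class LabelledAssignment {

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid closest to the given point. */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double distance = squaredDistance(point, centroids[c]);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }

    /** Returns cluster index -> IDs of the documents assigned to that cluster. */
    static Map<Integer, List<String>> assign(List<LabelledVector> docs, double[][] centroids) {
        Map<Integer, List<String>> clusters = new LinkedHashMap<>();
        for (LabelledVector doc : docs) {
            int cluster = nearest(doc.values, centroids);
            // The ID is emitted alongside the cluster, so it survives the pass.
            clusters.computeIfAbsent(cluster, k -> new ArrayList<>()).add(doc.docId);
        }
        return clusters;
    }
}
```

Calling LabelledAssignment.assign(docs, centroids) after the final iteration
gives you, for each cluster, the list of document IDs that ended up in it.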

As for having massive vectors, you could try representing them as sparse
vectors rather than the dense vectors the standard Mahout K-Means algorithm
accepts. A sparse vector stores only the non-zero values, which gets rid of
all the zeros in the document vectors. See the Javadoc for details; it'll be
more reliable than my memory :-)
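
To illustrate what a sparse representation buys you, here is a minimal
sketch in plain Java. It is not Mahout's sparse vector class (check the
Javadoc for that); SparseDocVector is a made-up name, and the map-based
storage is just one assumed way of doing it. Only non-zero weights are kept,
keyed by dictionary index, so a document with a few hundred distinct words
costs a few hundred entries even if the dictionary has 45000:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sparse vector, not a Mahout class: stores only non-zero entries. */
final class SparseDocVector {
    private final int cardinality;  // dictionary size (number of dimensions)
    private final Map<Integer, Double> weights = new HashMap<>();

    SparseDocVector(int cardinality) {
        this.cardinality = cardinality;
    }

    /** Sets one weight; zeros are never stored. */
    void set(int index, double value) {
        if (index < 0 || index >= cardinality) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        if (value == 0.0) {
            weights.remove(index);
        } else {
            weights.put(index, value);
        }
    }

    /** Returns the weight at index, or 0.0 if nothing is stored there. */
    double get(int index) {
        return weights.getOrDefault(index, 0.0);
    }

    int cardinality() {
        return cardinality;
    }

    /** Number of entries actually held in memory. */
    int storedEntries() {
        return weights.size();
    }

    /** Dot product that only visits indices stored in the smaller vector. */
    double dot(SparseDocVector other) {
        Map<Integer, Double> small = weights.size() <= other.weights.size() ? weights : other.weights;
        Map<Integer, Double> large = small == weights ? other.weights : weights;
        double sum = 0.0;
        for (Map.Entry<Integer, Double> entry : small.entrySet()) {
            Double w = large.get(entry.getKey());
            if (w != null) {
                sum += entry.getValue() * w;
            }
        }
        return sum;
    }
}
```

Distance measures built on the dot product then cost time proportional to the
number of stored entries rather than to the dictionary size.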


2008/12/3 Philippe Lamarche <>

> Hi,
> I have a question concerning text clustering and the current
> K-Means/vectors implementation.
> For a school project, I did some text clustering with a subset of the Enron
> corpus. I implemented a small M/R package that transforms text into TF-IDF
> vector space, and then I used a slightly modified version of the
> syntheticcontrol K-Means example. So far, all is fine.
> However, the output of the k-means algorithm is a vector, as is the input. As I
> understand it, when text is transformed into vector space, the cardinality of
> the vector is the number of words in your global dictionary, that is, all words
> in all the texts being clustered. This can grow pretty quickly. For example,
> with only 27000 Enron emails, even after removing words that appear in only 2
> emails or fewer, the dictionary size is about 45000 words.
> My number one problem is this: how can we find out which document a vector is
> representing when it comes out of the k-means algorithm? My favorite
> solution would be to have a unique ID attached to each vector. Is there such
> an ID in the vector implementation? Is there a better solution? Is my approach
> to text clustering wrong?
> Thanks for the help,
> Philippe.
