mahout-user mailing list archives

From "Palleti, Pallavi" <pallavi.pall...@corp.aol.com>
Subject RE: Text clustering
Date Fri, 26 Dec 2008 10:14:34 GMT
Hi Philippe,

  I also had to use an ID for each vector. What I did was use
KeyValueTextInputFormat as the input format (the default is
TextInputFormat) and supply the input as ID \t Vector (the ID and the
vector are tab-separated). I then changed the final display step
(runClustering) to carry the ID along with the vector.
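
To make the ID \t Vector layout concrete, here is a minimal plain-Java sketch of how one such line splits into an ID and a vector. It is an illustration only, not the code used above and not Mahout's API: the class name IdVectorLine, the nested Labeled type, and the space-separated number format for the vector part are all assumptions. The split at the first tab mirrors what KeyValueTextInputFormat does by default when it hands the mapper a key and a value.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Mahout or Hadoop.
class IdVectorLine {

    /** A document ID paired with the values of its vector. */
    static final class Labeled {
        final String id;
        final double[] values;

        Labeled(String id, double[] values) {
            this.id = id;
            this.values = values;
        }
    }

    /**
     * Splits a line at the first tab: the text before it is the ID,
     * the text after it is the vector. The vector is assumed to be
     * written as space-separated numbers.
     */
    static Labeled parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("no tab separator: " + line);
        }
        String id = line.substring(0, tab);
        String[] parts = line.substring(tab + 1).trim().split("\\s+");
        double[] values = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Double.parseDouble(parts[i]);
        }
        return new Labeled(id, values);
    }

    public static void main(String[] args) {
        Labeled doc = parse("doc-17\t0.0 1.5 0.0 2.25");
        System.out.println(doc.id + " -> " + Arrays.toString(doc.values));
    }
}
```

Keeping the ID as the key means it travels next to the vector through each map and reduce call, so the final output step can print the two together.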


Thanks
Pallavi
-----Original Message-----
From: Richard Tomsett [mailto:indigentmartian@gmail.com] 
Sent: Thursday, December 04, 2008 5:16 AM
To: mahout-user@lucene.apache.org
Subject: Re: Text clustering

Hi Philippe,

I used K-Means on TF-IDF vectors and wondered the same thing about
labelling the documents. I haven't got my code on me at the moment, and
it was a few months ago that I last looked at it (so I was probably also
using an older version of Mahout), but I seem to remember that I did
just as you are suggesting and simply attached a unique ID to each
document, which got passed through the map-reduce stages. This requires
a bit of tinkering with the K-Means implementation but shouldn't be too
much work.

As for having massive vectors, you could try representing them as sparse
vectors rather than the dense vectors the standard Mahout K-Means
algorithm accepts, which gets rid of all the zero values in the document
vectors. See the Javadoc for details; it'll be more reliable than my
memory :-)
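
As a rough illustration of the sparse idea, the sketch below stores only the non-zero entries of a document vector in a map and computes a squared Euclidean distance over them. It is plain Java written for this example, not Mahout's own sparse vector class: the name SparseDocVector and all of its methods are hypothetical, and the cardinality of 45000 is taken from the dictionary size quoted below.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; not Mahout's vector API.
class SparseDocVector {
    private final int cardinality;
    // Only non-zero entries are stored: term index -> weight.
    private final Map<Integer, Double> nonZero = new HashMap<>();

    public SparseDocVector(int cardinality) {
        this.cardinality = cardinality;
    }

    public void set(int index, double value) {
        if (index < 0 || index >= cardinality) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        if (value == 0.0) {
            nonZero.remove(index);   // zeros are never stored
        } else {
            nonZero.put(index, value);
        }
    }

    /** Returns the stored weight, or 0.0 for any index that is not stored. */
    public double get(int index) {
        return nonZero.getOrDefault(index, 0.0);
    }

    public int storedEntries() {
        return nonZero.size();
    }

    /** Squared Euclidean distance, visiting only indices stored in either vector. */
    public double squaredDistance(SparseDocVector other) {
        Set<Integer> indices = new HashSet<>(nonZero.keySet());
        indices.addAll(other.nonZero.keySet());
        double sum = 0.0;
        for (int i : indices) {
            double d = get(i) - other.get(i);
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        SparseDocVector a = new SparseDocVector(45000);
        a.set(12, 0.5);
        a.set(40321, 2.0);
        SparseDocVector b = new SparseDocVector(45000);
        b.set(12, 1.5);
        System.out.println(a.storedEntries() + " stored entries out of 45000");
        System.out.println("squared distance = " + a.squaredDistance(b));
    }
}
```

The cost of a distance computation then depends on how many distinct terms the two documents contain, not on the size of the whole dictionary.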

Richard


2008/12/3 Philippe Lamarche <philippe.lamarche@gmail.com>

> Hi,
>
> I have a question concerning text clustering and the current
> K-Means/vectors implementation.
>
> For a school project, I did some text clustering with a subset of the
> Enron corpus. I implemented a small M/R package that transforms text
> into TF-IDF vector space, and then I used a slightly modified version
> of the syntheticcontrol K-Means example. So far, all is fine.
>
> However, the output of the k-means algorithm is a vector, as is the
> input. As I understand it, when text is transformed into vector space,
> the cardinality of the vector is the number of words in your global
> dictionary, that is, every word in every text being clustered. This
> can grow pretty quickly. For example, with only 27000 Enron emails,
> even after removing words that appear in 2 emails or fewer, the
> dictionary size is about 45000 words.
>
> My number one problem is this: how can we find out which document a
> vector represents when it comes out of the k-means algorithm? My
> favorite solution would be to have a unique ID attached to each
> vector. Is there such an ID in the vector implementation? Is there a
> better solution? Is my approach to text clustering wrong?
>
> Thanks for the help,
>
> Philippe.
>
