mahout-user mailing list archives

From Paul Ingles <p...@oobaloo.co.uk>
Subject Document Clustering
Date Tue, 07 Jul 2009 21:37:25 GMT
Hi,

I've been reading through the archives looking for inspiration on a problem I've been trying to solve at work, and was hoping I could share where my head's at and get some pointers on where to go. Since we need to cluster between 17m and 70m documents, we're planning to implement this in Hadoop.

We're trying to build clusters of (relatively small) documents, based  
on the words/terms they contain. Clusters will probably range in size  
between 10 and 1000 documents. Clusters should ultimately be populated  
by documents that contain almost identical terms (nothing clever like  
stemming done so far).

So far, I've been working down the pairwise similarity route. So,  
using a few MapReduce jobs we produce something along the lines of the  
following:

Da: [Db,0.1] [Dc,0.4]
Db: [Dc,0.5] [Df,0.9]
...
Dj:

Each row is a document, containing a vector of (related document, similarity score) tuples. Viewed another way, it's the typical matrix:

     |   A   |   B   |   C
  ---+-------+-------+------
   A |       |  0.1  |  0.4
   B |       |       |  0.5

etc. A higher number means a more closely related document.
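To make the row format concrete, here's a toy sketch in plain Python of how those per-document rows could be built (Jaccard similarity over term sets is just a placeholder here for whatever score the real MapReduce jobs compute; the doc ids and terms are made up):

```python
from itertools import combinations

def jaccard(a, b):
    """Similarity of two term sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

# Toy corpus: doc id -> set of terms (no stemming, as in our real pipeline).
docs = {
    "Da": {"apache", "hadoop", "cluster"},
    "Db": {"apache", "mahout", "cluster"},
    "Dc": {"apache", "mahout", "kmeans"},
}

# One row per document: [(other_doc, score), ...] against every later doc id.
rows = {}
for (ida, ta), (idb, tb) in combinations(sorted(docs.items()), 2):
    rows.setdefault(ida, []).append((idb, round(jaccard(ta, tb), 2)))

print(rows)
# e.g. {'Da': [('Db', 0.5), ('Dc', 0.2)], 'Db': [('Dc', 0.5)]}
```

Obviously the all-pairs loop doesn't scale to 17m+ documents, which is why we're spreading the pairwise computation across several MapReduce jobs.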

I've been trying to get my head around how to cluster these in a set  
of MapReduce jobs and I'm not quite sure how to proceed: the  
examples I've read around k-means, canopy clustering etc. all seem to  
work on multidimensional (numerical) data. Given the data above, is it  
even possible to adapt those algorithms? The potential centroids in the  
example above would be the documents themselves, and I just can't get my head  
around applying the algorithms to this kind of data.
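One idea I've toyed with instead of centroid-based methods: since our clusters should be near-identical documents, treat the similarity matrix as a graph, keep only edges above a threshold, and take connected components as clusters. A toy union-find sketch (threshold and pairs are illustrative):

```python
def cluster_by_threshold(pairs, threshold):
    """Union-find over doc ids; edges are pairs with similarity >= threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b, sim in pairs:
        find(a)  # register both docs even if the edge is dropped
        find(b)
        if sim >= threshold:
            union(a, b)

    # Group every doc under its component root.
    clusters = {}
    for doc in parent:
        clusters.setdefault(find(doc), set()).add(doc)
    return list(clusters.values())

# The example rows from above, flattened to (doc, doc, similarity) triples.
pairs = [("Da", "Db", 0.1), ("Da", "Dc", 0.4),
         ("Db", "Dc", 0.5), ("Db", "Df", 0.9)]
print(cluster_by_threshold(pairs, 0.5))
# e.g. [{'Da'}, {'Db', 'Dc', 'Df'}]
```

I'm not sure whether this transitivity (Db~Dc and Db~Df implies they share a cluster) is acceptable for our data, or how well a union-find step maps onto MapReduce, which is partly why I'm asking.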

I guess the alternative would be to step back to producing a matrix of  
terms x documents:

     |  Ta   |  Tb   |  Tc
  ---+-------+-------+------
  Da |       |  0.1  |  0.4
  Db |       |       |  0.5

and then cluster based on that? It seems similar in structure to the  
user x movie matrix that's often used for recommendations.
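To be concrete about what I mean by the term x document route: each document becomes a term-weight vector, which is the kind of numerical multidimensional input the k-means examples expect. A toy sketch of TF vectors and a single k-means assignment step (centroid choices and documents are illustrative):

```python
from collections import Counter
from math import sqrt

def tf_vector(text):
    """Raw term-frequency vector for one small document (no stemming)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v.get(t, 0) for t in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

docs = {
    "Da": tf_vector("apache hadoop cluster"),
    "Db": tf_vector("apache hadoop cluster"),
    "Dc": tf_vector("entirely different words"),
}

# One k-means assignment step: each doc goes to the most similar centroid.
centroids = {"C1": docs["Da"], "C2": docs["Dc"]}
assignment = {d: max(centroids, key=lambda c: cosine(v, centroids[c]))
              for d, v in docs.items()}
print(assignment)
# e.g. {'Da': 'C1', 'Db': 'C1', 'Dc': 'C2'}
```

In the real job the iteration (recomputing centroids, reassigning) would be the part spread across MapReduce, which I gather is roughly what Mahout's k-means does.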

I'd really appreciate people's thoughts: am I thinking about this the right way?  
Is there a better way?

Thanks,
Paul
