mahout-user mailing list archives

From Ken Krugler
Subject Suggestions for best approach to classic document clustering
Date Thu, 11 Feb 2010 02:04:17 GMT
Hi all,

Given the code currently in Mahout (+ Lucene), is there a generally
accepted best approach for clustering documents?

The assumptions are a small document set (e.g. a few thousand documents),
with the documents being representative data from web pages, all in English.

I've been fooling around with a few different combinations, e.g.
pre-processing the documents to extract keywords and using these for
clustering with k-means, canopy, and mean-shift canopy.
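
To make the keyword-then-cluster idea concrete, here is a minimal sketch in plain Python. It is not Mahout code and not necessarily the setup used above: the function names (`tokenize`, `tfidf_vectors`, `kmeans`), the tiny stopword list, the `top_k` keyword cutoff, and the farthest-first seeding are all assumptions made for illustration. It keeps the highest-weighted TF-IDF terms of each document as its "keywords" and runs k-means with cosine similarity over those sparse vectors.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with"}


def tokenize(text):
    """Lowercase, split on non-letters, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def normalize(vec):
    """Scale a sparse {term: weight} vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def tfidf_vectors(docs, top_k=5):
    """One unit-length sparse vector per document, keeping only its top_k TF-IDF terms."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF so that no term gets a zero weight.
        weights = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        # Sort by descending weight, then by term, so ties break deterministically.
        keywords = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        vectors.append(normalize(dict(keywords)))
    return vectors


def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans(vectors, k, iterations=20):
    """K-means with cosine similarity; returns a cluster index per vector."""
    # Farthest-first seeding keeps the sketch deterministic (no random restarts).
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(min(vectors, key=lambda v: max(cosine(v, c) for c in centroids)))
    assignments = None
    for _ in range(iterations):
        new = [max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors]
        if new == assignments:
            break  # converged
        assignments = new
        for j in range(k):
            total = {}
            for v, a in zip(vectors, assignments):
                if a == j:
                    for t, w in v.items():
                        total[t] = total.get(t, 0.0) + w
            if total:  # leave the centroid unchanged if its cluster is empty
                centroids[j] = normalize(total)
    return assignments


if __name__ == "__main__":
    docs = [
        "Apache Mahout clustering with k-means and canopy",
        "k-means clustering of documents in Mahout",
        "Recipe for chocolate cake with butter and sugar",
        "Baking a cake: sugar, butter, flour and chocolate",
    ]
    print(kmeans(tfidf_vectors(docs), k=2))
```

The dials in this sketch are `top_k` (how aggressively documents are reduced to keywords), `k`, and the similarity measure; canopy-style approaches replace the fixed `k` with distance thresholds instead.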

But before I have too much fun twiddling all the dials, it would be  
great to get input on good/bad options.


-- Ken

Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g
