lucene-java-user mailing list archives

From Dawid Weiss
Subject Re: Document Clustering
Date Tue, 08 Feb 2005 09:49:10 GMT

Hi Owen,

> Last year it was suggested Carrot2 could help, and it would even produce 
> good labels for the clusters.  Has this proven to be true?  

Yes, Carrot2 should help you with this. The quality of the labels it 
creates depends heavily on the quality of the input snippets, but 
so-called KWIC (keyword in context) snippets should suffice (see David 
Spencer's example with Wikipedia).
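
For readers unfamiliar with the term, a keyword-in-context snippet is 
just the matched term plus a little surrounding text. The following is a 
minimal sketch of that idea only; it is not Carrot2 or Lucene code, and 
the function name `kwic_snippet` and the `window` parameter are 
assumptions made for illustration.

```python
import re


def kwic_snippet(text, keyword, window=40):
    """Return the first case-insensitive match of `keyword` in `text`
    with up to `window` characters of context on each side, or None
    when the keyword does not occur.

    Hypothetical helper for illustration; not part of any library.
    """
    match = re.search(re.escape(keyword), text, re.IGNORECASE)
    if match is None:
        return None

    # Clamp the context window to the bounds of the text.
    start = max(0, match.start() - window)
    end = min(len(text), match.end() + window)

    # Mark the sides where the original text was cut off.
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix
```

Calling this once per search hit yields one short snippet per document, 
which is the kind of input the paragraph above describes. A real 
application would more likely cut at word boundaries and use the search 
engine's own highlighting facilities.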

There is one caveat, though: Carrot2 employs an on-line, unsupervised 
clusterer designed to work with a small number of documents and 
incomplete descriptions (snippets rather than full-text documents). It 
will _not_ work for large document collections (thousands of 
documents), simply because it was not designed for that. I guess you 
could try up to 500 snippets; beyond that, you'll be waiting a very 
long time for the result.

There are many algorithms that can cluster large document collections; 
see the proceedings of information retrieval conferences, for example.

As for David's hints:

 > I'm not sure what the complexity of the algorithm is, but for me ~100 
 > docs works ok, maybe 200, but beyond 200 you need lots more CPU and RAM.

Yes, 100 to 200 snippets is optimal for the open source clustering 
algorithm. We have a refactored and optimized version of the Lingo 
clusterer that is commercial (it also provides hierarchical clustering 
as an add-on to the open source component). But even the commercial 
version will only cluster up to 500 to 1000 snippets. As I said, our 
goal was not to cluster document collections, but rather to retrieve 
useful information from preprocessed snippets.
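
In practice, the limits above suggest clustering only the top-ranked 
hits rather than the whole result set. A minimal sketch of that 
truncation step follows; `cap_snippets` is a hypothetical helper, the 
default of 200 simply mirrors the figure quoted above, and the 
clustering call itself is left out because its API is not described 
here.

```python
from itertools import islice


def cap_snippets(ranked_snippets, limit=200):
    """Keep only the first `limit` snippets from an iterable that is
    already ordered by relevance (best hit first).

    Hypothetical helper for illustration; the default limit is an
    assumption taken from the 100 to 200 range mentioned above.
    """
    if limit < 0:
        raise ValueError("limit must be non-negative")
    # islice stops early, so a lazy result iterator is never fully read.
    return list(islice(ranked_snippets, limit))
```

The capped list can then be handed to whichever clusterer is in use, so 
that running time stays bounded no matter how many documents matched 
the query.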

