lucene-java-user mailing list archives

From David Spencer
Subject Re: Document Clustering
Date Tue, 08 Feb 2005 08:16:12 GMT
Owen Densmore wrote:

> I would like to be able to analyze my document collection (~1200 
> documents) and discover good "buckets" of categories for them.  I'm 
> pretty sure this is termed Document Clustering .. finding the emergent 
> clumps the documents fall naturally into judging from their term vectors.
> Looking at the discussion that flared roughly a year ago (last message 
> 2003-11-12) with the subject Document Clustering, it seems Lucene should 
> be able to help with this.  Has anyone had success with this recently?
> Last year it was suggested Carrot2 could help, and it would even produce 
> good labels for the clusters.  Has this proven to be true?  Our goal is 
> to use clustering to build a nifty graphic interface, probably using Flash.

Carrot2 seems to work nicely.
Demo here...

Search for something like "artificial intelligence" in my Wikipedia 
Search engine:

Then click on the "see clustered results.." link to go here:

And voilà, what seem like decent clusters.

I'm not sure what the complexity of the algorithm is, but for me ~100 
docs works OK, maybe 200; beyond 200 you need a lot more CPU and RAM.

I suggest: try it w/ ~100 docs, and if you like what you see, keep 
increasing the # of docs you give it. You might have to wait a while w/ 
all 1,200 docs...
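
One way to run that kind of incremental test is sketched below in Python. 
It is only a rough harness: naive_cluster is a made-up stand-in (a simple 
cosine-similarity grouping over term vectors), not Carrot2's algorithm or 
API, and the batch sizes and the 0.3 threshold are arbitrary assumptions. 
The idea is to swap in whatever clusterer you actually use and watch how 
the time grows with the number of docs.

```python
import math
import time
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector for one document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def naive_cluster(docs, threshold=0.3):
    """Stand-in clusterer (NOT Carrot2): each doc joins the first cluster
    whose seed doc is similar enough, otherwise it starts a new cluster."""
    vectors = [term_vector(d) for d in docs]
    clusters = []  # lists of doc indices; the first index is the seed
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def time_growing_batches(docs, cluster_fn, sizes=(100, 200, 400, 800, 1200)):
    """Cluster the first n docs for each n that fits the collection.
    Returns a list of (n, elapsed_seconds, number_of_clusters)."""
    results = []
    for n in sizes:
        if n > len(docs):
            break
        start = time.perf_counter()
        clusters = cluster_fn(docs[:n])
        results.append((n, time.perf_counter() - start, len(clusters)))
    return results
```

Call time_growing_batches(my_docs, naive_cluster) with your own list of 
document strings and print the tuples; if the elapsed time climbs much 
faster than the batch size, that tells you whether all 1,200 docs is 
realistic before you commit to the full run.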

- Dave

> Thanks for any pointers.
> Owen