lucene-java-user mailing list archives

From David Spencer <dave-lucene-u...@tropo.com>
Subject Re: Document Clustering
Date Tue, 08 Feb 2005 08:16:12 GMT
Owen Densmore wrote:

> I would like to be able to analyze my document collection (~1200 
> documents) and discover good "buckets" of categories for them.  I'm 
> pretty sure this is termed Document Clustering .. finding the emergent 
> clumps the documents fall naturally into judging from their term vectors.
> 
> Looking at the discussion that flared roughly a year ago (last message 
> 2003-11-12) with the subject Document Clustering, it seems Lucene should 
> be able to help with this.  Has anyone had success with this recently?
> 
> Last year it was suggested Carrot2 could help, and it would even produce 
> good labels for the clusters.  Has this proven to be true?  Our goal is 
> to use clustering to build a nifty graphic interface, probably using Flash.

Carrot2 seems to work nicely.
Demo here...

Search for something like "artificial intelligence" in my Wikipedia 
Search engine:

http://www.searchmorph.com/kat/wikipedia.jsp?s=artificial+intelligence

Then click on the "see clustered results.." link to go here:

http://www.searchmorph.com/kat/wikipedia-cluster.jsp?s=artificial%20intelligence

And voilà, what seem like decent clusters.

I'm not sure what the complexity of the algorithm is, but for me ~100 
docs works OK, maybe 200; beyond 200 you need a lot more CPU and RAM.

I suggest: try it w/ ~100 docs, and if you like what you see, keep 
increasing the # of docs you give it. You might have to wait a while w/ 
all 1,200 docs...
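
One way to do that ramp-up is a small timing harness. The sketch below is only an illustration, not anything from Carrot2: `cluster_fn` is a hypothetical stand-in for whatever function wraps your clustering call, and the doubling schedule and 60-second budget are arbitrary assumptions.

```python
import time


def batch_sizes(total, start=100):
    """Yield doubling batch sizes, capped at total (start, 2*start, ..., total)."""
    size = min(start, total)
    while True:
        yield size
        if size >= total:
            return
        size = min(size * 2, total)


def time_clustering(docs, cluster_fn, start=100, budget_seconds=60.0):
    """Run cluster_fn on growing prefixes of docs and record wall-clock time.

    cluster_fn is a hypothetical callable that takes a list of documents;
    plug in whatever actually invokes your clustering engine.
    Stops early once a single run takes longer than budget_seconds.
    """
    timings = []
    for n in batch_sizes(len(docs), start):
        t0 = time.perf_counter()
        cluster_fn(docs[:n])
        elapsed = time.perf_counter() - t0
        timings.append((n, elapsed))
        if elapsed > budget_seconds:
            break  # the next, larger batch would likely take longer still
    return timings
```

With 1,200 documents and the defaults this tries 100, 200, 400, 800 and then all 1,200, so you can see how the run time grows before committing to the full collection.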

- Dave






> 
> Thanks for any pointers.
> 
> Owen
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 

