Owen Densmore wrote:

> I would like to be able to analyze my document collection (~1200
> documents) and discover good "buckets" of categories for them. I'm
> pretty sure this is termed Document Clustering .. finding the emergent
> clumps the documents fall naturally into judging from their term vectors.
>
> Looking at the discussion that flared roughly a year ago (last message
> 2003-11-12) with the subject Document Clustering, it seems Lucene should
> be able to help with this. Has anyone had success with this recently?
>
> Last year it was suggested Carrot2 could help, and it would even produce
> good labels for the clusters. Has this proven to be true? Our goal is
> to use clustering to build a nifty graphic interface, probably using Flash.

Carrot2 seems to work nicely. Demo here...

Search for something like "artificial intelligence" in my Wikipedia search engine:

http://www.searchmorph.com/kat/wikipedia.jsp?s=artificial+intelligence

Then click on the "see clustered results.." link to go here:

http://www.searchmorph.com/kat/wikipedia-cluster.jsp?s=artificial%20intelligence

And voilà, what seems like decent clusters.

I'm not sure what the complexity of the algorithm is, but for me ~100 docs works OK, maybe 200; beyond 200 you need a lot more CPU and RAM.

I suggest: try it with ~100 docs, and if you like what you see, keep increasing the number of docs you give it. You might have to wait a while with all 1,200 docs...

- Dave

>
> Thanks for any pointers.
>
> Owen
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
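
P.S. As a rough illustration of the "start with ~100 docs and keep increasing" suggestion, here is a minimal Python sketch of a timing harness. It does not call Carrot2 or Lucene; `grow_batches`, `cluster`, `factor`, and `budget_s` are hypothetical names made up for this example, and `cluster` stands in for whatever clustering call you actually use.

```python
import time
from typing import Callable, List, Sequence, Tuple


def grow_batches(
    docs: Sequence[str],
    cluster: Callable[[Sequence[str]], object],  # hypothetical: your clustering call
    start: int = 100,
    factor: int = 2,          # assumed growth factor, must be > 1
    budget_s: float = 30.0,   # assumed patience limit per run, in seconds
) -> List[Tuple[int, float]]:
    """Cluster growing prefixes of docs and record (batch size, seconds) per run.

    Stops after the first run that exceeds budget_s, or once every
    document has been included.
    """
    timings: List[Tuple[int, float]] = []
    n = min(start, len(docs))
    while n > 0:
        t0 = time.perf_counter()
        cluster(docs[:n])
        elapsed = time.perf_counter() - t0
        timings.append((n, elapsed))
        if elapsed > budget_s or n == len(docs):
            break
        n = min(n * factor, len(docs))
    return timings
```

With 1,200 documents and the defaults above, the batch sizes tried would be 100, 200, 400, 800, and 1200, unless a run goes over the time budget first. The recorded timings give a feel for how cost grows before committing to the full collection.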