lucene-java-user mailing list archives

From "John Hamilton" <>
Subject RE: Document clustering using lucene
Date Thu, 15 Jun 2006 13:10:50 GMT
I've been thinking about a similar problem.  However, it seems that the similarity score returned
by a search is only meaningful relative to the other results of that same search.  You can't compare the similarity
scores from two different searches.  I think you will have to compute the similarities yourself
using the term vectors.
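
A minimal sketch of what computing the similarities yourself could look like,
written in plain Python rather than against the Lucene API. It assumes each
document's term vector has already been extracted as a list of tokens; the
names tfidf_vectors and cosine, and the smoothed idf formula, are illustrative
choices and not Lucene's own scoring. Because every vector is weighted and
normalised the same way, the resulting scores are comparable across document pairs.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        # Smoothed idf (an assumption of this sketch), always positive
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Usage would be along the lines of vecs = tfidf_vectors(token_lists) followed by
cosine(vecs[i], vecs[j]) for any pair i, j.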


-----Original Message-----
From: Prasenjit Mukherjee []
Sent: Thursday, June 15, 2006 6:51 AM
Subject: Document clustering using lucene

I want to do some document clustering on a corpus of ~100,000
documents, with an average doc size of ~7k. I have looked into carrot2,
but it seems to work only for relatively short documents and has some
scaling issues for a large corpus. Certainly, for a corpus of this
size, one cannot use a purely memory-based clustering algorithm. Hence the
possible use of lucene.

I was thinking of using lucene to create the similarity matrix (between
documents). Before adding a document (i.e. D-k) to the lucene index, we
can compute the similarity between D-k and all existing
documents by creating a Query out of D-k and running a search on the
existing index. We can take the score of each hit as the similarity
measure between that document and D-k. The result is going to be a symmetric and
sparse matrix. Now we can feed this similarity matrix to any
similarity-based clustering algorithm.
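
The query-before-add loop described above can be sketched as follows. This is
a plain-Python stand-in, not Lucene code: a small in-memory inverted index
plays the role of the lucene index, and the score is cosine similarity over
raw term frequencies instead of a Lucene search score. The function name
build_sparse_similarity and the threshold parameter are assumptions made for
illustration.

```python
import math
from collections import Counter, defaultdict


def build_sparse_similarity(docs, threshold=0.1):
    """docs: list of token lists.

    Returns {(i, j): similarity} for i < j, keeping only pairs whose
    cosine similarity is at least `threshold`, so the matrix stays sparse.
    """
    postings = defaultdict(list)  # term -> [(doc_id, unit-normalised weight)]
    sims = {}
    for k, tokens in enumerate(docs):
        tf = Counter(tokens)
        norm = math.sqrt(sum(c * c for c in tf.values()))
        if norm == 0.0:
            continue  # empty document: nothing to score or index
        vec = {t: c / norm for t, c in tf.items()}

        # "Query" the existing index with D-k: accumulate dot products
        # against every earlier document that shares at least one term.
        scores = defaultdict(float)
        for term, weight in vec.items():
            for doc_id, other_weight in postings[term]:
                scores[doc_id] += weight * other_weight

        for doc_id, score in scores.items():
            if score >= threshold:
                sims[(doc_id, k)] = score  # store upper triangle only

        # Only now add D-k to the index
        for term, weight in vec.items():
            postings[term].append((k, weight))
    return sims
```

Storing only the (i, j) entries with i < j relies on the measure being
symmetric, which holds for cosine similarity in this sketch.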

Would like to know if anyone has worked along similar lines and is
happy to share their experiences.


To unsubscribe, e-mail:
For additional commands, e-mail:
