lucene-java-user mailing list archives

From Roxana Angheluta
Subject similarity matrix - more clear
Date Tue, 30 Nov 2004 14:06:08 GMT
Dear all,

Yesterday I asked a question about getting the similarity matrix of a 
collection of documents from an index, but I got only one answer, so 
perhaps my question was not very clear.

I will try to reformulate:

I want to use Lucene to have efficient access to an index of a 
collection of documents. My final purpose is to cluster documents. 
Therefore I need to have for each pair of documents a number signifying 
the similarity between them.
A possible solution would be to use each document in turn as a 
query, run a search with an IndexSearcher, and take from the search 
results the similarity between the query (which is in fact a document) 
and all the other documents. This is highly redundant, because the 
similarity between any given pair of documents is computed more than once.
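
To make the redundancy concrete, here is a minimal sketch of the computation 
I have in mind, where each pair is scored only once and the result is mirrored. 
It is plain Java and uses no Lucene calls: the per-document term-frequency maps 
are a hypothetical stand-in for whatever term data could be read from the 
index, the class and method names are made up for illustration, and cosine 
similarity is only an assumed measure, not Lucene's own scoring.

```java
import java.util.List;
import java.util.Map;

// Sketch only: no Lucene API is used. The term-frequency maps are a
// hypothetical stand-in for per-document term data read from an index.
class SimilarityMatrixSketch {

    // Cosine similarity between two sparse term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue().doubleValue() * other.doubleValue();
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        // An empty document has no direction; treat its similarity as 0.
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int f : v.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }

    // Fills the full n x n matrix while scoring each pair (i, j) only once.
    static double[][] similarityMatrix(List<Map<String, Integer>> docs) {
        int n = docs.size();
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            sim[i][i] = 1.0; // self-similarity, set to 1 by convention
            for (int j = i + 1; j < n; j++) {
                double s = cosine(docs.get(i), docs.get(j));
                sim[i][j] = s;
                sim[j][i] = s; // symmetric, so no second computation
            }
        }
        return sim;
    }
}
```

This needs n(n-1)/2 similarity computations instead of the n(n-1) implied by 
running one query per document.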

I was wondering whether there is a simpler way to do it, since the index 
file contains all the information needed. Can anyone help me here?


PS I know about the Carrot2 project, which deals with document 
clustering, but I think it is not appropriate for me, for two reasons:
1) I need to keep the index on disk for later reuse
2) I need to be able to search the index efficiently
I thought Lucene could help me here; am I wrong?
