lucene-java-user mailing list archives

From Xiangyu Jin
Subject Re: similarity matrix - more clear
Date Tue, 30 Nov 2004 15:09:34 GMT

I have the same task as you do. As I understand it, if there are N
documents, your approach will take N^2 similarity calculations.

Although there are only N(N-1)/2 distinct document pairs, the
similarity calculation in Lucene is (as far as I understand) asymmetric:
the score of document A used as a query against document B need not
equal the score of B used as a query against A. This means you have to
calculate N(N-1) similarities. So your approach does not seem so
redundant, since you have to calculate on the order of N^2 similarities
anyway.
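
To illustrate the counting argument, here is a minimal sketch in plain
Java. It does not use Lucene and does not reproduce Lucene's scoring: the
class name PairwiseSimilarity, the whitespace tokenization and the raw
term-frequency vectors are all assumptions made for illustration. It uses
cosine similarity, which is symmetric, so only the N(N-1)/2 pairs with
i < j are computed and each value is mirrored. With an asymmetric score,
both directions would have to be computed separately.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Lucene): pairwise cosine similarity
// over simple term-frequency vectors.
final class PairwiseSimilarity {

    // Build a term-frequency map from whitespace-separated text.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                tf.merge(term, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Cosine similarity between two term-frequency vectors.
    // Symmetric by construction: cosine(a, b) == cosine(b, a).
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int va = e.getValue();
            normA += va * va;
            Integer vb = b.get(e.getKey());
            if (vb != null) {
                dot += va * vb;
            }
        }
        for (int vb : b.values()) {
            normB += vb * vb;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty document is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Fill the full N x N matrix while evaluating only N(N-1)/2 pairs.
    static double[][] matrix(List<String> docs) {
        int n = docs.size();
        List<Map<String, Integer>> vectors = new ArrayList<>();
        for (String doc : docs) {
            vectors.add(termFrequencies(doc));
        }
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            sim[i][i] = 1.0; // diagonal set to 1 by convention
            for (int j = i + 1; j < n; j++) {
                double s = cosine(vectors.get(i), vectors.get(j));
                sim[i][j] = s;
                sim[j][i] = s; // mirror instead of recomputing
            }
        }
        return sim;
    }
}
```

Calling PairwiseSimilarity.matrix(docs) on a list of document strings
returns the full matrix. The inner loop starting at j = i + 1 is what
halves the work, and that shortcut is only valid when the measure is
symmetric.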

On Tue, 30 Nov 2004, Roxana Angheluta wrote:

> Dear all,
> Yesterday I asked a question about getting the similarity matrix of a
> collection of documents from an index, but I got only one answer, so
> perhaps my question was not very clear.
> I will try to reformulate:
> I want to use Lucene to have efficient access to an index of a
> collection of documents. My final purpose is to cluster documents.
> Therefore I need to have for each pair of documents a number signifying
> the similarity between them.
> A possible solution would be to initialize in turn each document as a
> query, do a search using an IndexSearcher and to take from the search
> result the similarity between the query (which is in fact a document)
> and all the other documents. This is highly redundant, because the
> similarity between a pair of documents is computed multiple times.
> I was wondering whether there is a simpler way to do it, since the index
> file contains all the information needed. Can anyone help me here?
> Thanks,
> roxana
> PS I know about the Carrot2 project, which deals with document
> clustering, but I think it is not appropriate for me, for two reasons:
> 1) I need to keep the index on disk for later reuse
> 2) I need to be able to search the index efficiently
> I thought Lucene could help me here. Am I wrong?
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
