lucene-java-user mailing list archives

Subject Re: Better Way of calculating Cosine Similarity between documents
Date Fri, 18 May 2012 09:52:44 GMT

Can you provide a minimal example (at most five sentences)? A drop from
1 to 0.85 seems rather large to me, so unless you removed the longest
sentence with the rarest words in the collection, I suspect a bug, e.g.
you forgot to remove it from the denominator as well. It would also be
a good idea to compute the similarity without IDF weighting to see
whether you get a similar effect.
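
A minimal sketch of that check, in plain Java with no Lucene calls. It assumes the standard cosine formula (dot product divided by the product of the Euclidean norms), which may differ from the equation used in the original post. The class name, the whitespace tokenizer, the toy documents and the IDF values are all made up for illustration. Both norms are recomputed from the vectors actually being compared, so the denominator always matches the numerator.

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSketch {

    /** Cosine similarity between two sparse term-weight vectors. */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        // Norms are computed from the same vectors used for the dot product.
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    /** Euclidean length of a sparse vector. */
    public static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    /** Raw term frequencies from whitespace-separated, lower-cased text. */
    public static Map<String, Double> termFrequencies(String text) {
        Map<String, Double> tf = new HashMap<>();
        for (String tok : text.toLowerCase().split("\\s+")) {
            if (!tok.isEmpty()) {
                tf.merge(tok, 1.0, Double::sum);
            }
        }
        return tf;
    }

    /** Multiplies each tf by its idf; terms missing from the idf map keep weight 1.0. */
    public static Map<String, Double> applyIdf(Map<String, Double> tf, Map<String, Double> idf) {
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            out.put(e.getKey(), e.getValue() * idf.getOrDefault(e.getKey(), 1.0));
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy documents: B is A with its last "sentence" removed.
        String a = "the cat sat on the mat the dog sat on the log the bird sat on the branch";
        String b = "the cat sat on the mat the dog sat on the log";
        Map<String, Double> tfA = termFrequencies(a);
        Map<String, Double> tfB = termFrequencies(b);

        // Hypothetical IDF values: frequent words low, rare words high.
        Map<String, Double> idf = new HashMap<>();
        idf.put("the", 0.1);
        idf.put("on", 0.2);
        idf.put("sat", 0.5);
        idf.put("bird", 3.0);
        idf.put("branch", 3.0);

        System.out.println("tf only: " + cosine(tfA, tfB));
        System.out.println("tf-idf : " + cosine(applyIdf(tfA, idf), applyIdf(tfB, idf)));
    }
}
```

If the tf-only value stays close to 1 while the tf-idf value drops sharply, the removed sentence probably contained highly weighted rare terms. If both drop sharply for a tiny change, a bug in the norm computation is the more likely cause.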

David Nemeskey

Quoting Kasun Perera <>:

> Hi all
> I’m indexing a collection of documents using Lucene, specifying TermVector at
> the indexing time. Then I retrieve terms and their term frequencies by
> reading the index and calculate TF-IDF scores vector for each document.
> Then using TF-IDF vectors, I calculate pairwise cosine similarity between
> documents using the equation here
> This is my problem
> Say I have two identical documents “A” and “B” in this collection (A and B
> have more than 200 sentences).
> If I calculate pairwise cosine similarity between A and B it gives me
> cosine value=1 which is perfectly OK.
> But if I remove a single sentence from Doc “B”, it gives me a cosine
> similarity value of around 0.85 between these two documents. The documents
> are nearly identical, but the cosine value does not reflect that. I
> understand the problem is with the equation that I’m using.
> Is there a better way or a better equation that I can use for calculating
> cosine similarity between documents?
> --
> Regards
> Kasun Perera
