lucene-java-user mailing list archives

From nemeskey.da...@sztaki.mta.hu
Subject Re: Better Way of calculating Cosine Similarity between documents
Date Fri, 18 May 2012 09:52:44 GMT
Hi,

Can you provide a minimal example (at most five sentences)? A drop from
1 to 0.85 seems rather large to me, so unless you removed the longest
sentence with the rarest words in the collection, I suspect a bug, e.g.
you forgot to remove the sentence's terms from the denominator (the
vector norm) as well. It would also be a good idea to compute the
similarity without IDF weighting to see whether you get a similar effect.
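
A check along these lines can be sketched as follows. This is only an
illustration in plain Python, not code against the Lucene API: the five
sentences, the smoothed IDF formula log(N/df) + 1, and all function names
are assumptions made for the example. It computes the cosine with raw term
frequencies, with TF-IDF weights, and with a deliberately stale norm for
the shortened document, which is one way the denominator bug could arise.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Smoothed IDF per term: log(N / df) + 1 (an assumed variant)."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return {t: math.log(n / df[t]) + 1.0 for t in df}

def weight(tokens, idf=None):
    """Term-frequency vector, optionally multiplied by IDF."""
    tf = Counter(tokens)
    if idf is None:
        return {t: float(f) for t, f in tf.items()}
    return {t: f * idf[t] for t, f in tf.items()}

def dot(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def norm(u):
    return math.sqrt(sum(w * w for w in u.values()))

def cosine(u, v):
    denom = norm(u) * norm(v)
    return dot(u, v) / denom if denom else 0.0

# Hypothetical toy collection: B is A without its last sentence.
sentences = [
    "lucene stores term vectors in the index",
    "term vectors hold the term frequencies of a document",
    "cosine similarity compares two term vectors",
    "identical documents have cosine similarity one",
    "removing one sentence changes few term frequencies",
]
doc_a = " ".join(sentences).split()
doc_b = " ".join(sentences[:-1]).split()
doc_c = "pasta recipes need salt water and time".split()  # unrelated filler

idf = idf_weights([doc_a, doc_b, doc_c])

# 1) Without IDF weighting.
cos_tf = cosine(weight(doc_a), weight(doc_b))

# 2) With TF-IDF weighting.
cos_tfidf = cosine(weight(doc_a, idf), weight(doc_b, idf))

# 3) Simulated bug: B's norm is not recomputed after the sentence is
#    removed, so the old norm (equal to A's) stays in the denominator.
stale_norm_b = norm(weight(doc_a))
cos_stale = dot(weight(doc_a), weight(doc_b)) / (norm(weight(doc_a)) * stale_norm_b)

print(f"TF only:    {cos_tf:.3f}")
print(f"TF-IDF:     {cos_tfidf:.3f}")
print(f"stale norm: {cos_stale:.3f}")
```

In this toy case, removing one of only five sentences still leaves both
correctly computed values above 0.9, while the stale-norm variant comes out
noticeably lower. Comparing the three numbers on a real pair of documents
should show whether the drop comes from the IDF weights or from the
normalization.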

Regards,
David Nemeskey

Quoting Kasun Perera <kasunp@opensource.lk>:

> Hi all
>
> I’m indexing a collection of documents using Lucene, specifying TermVector
> at indexing time. Then I retrieve the terms and their term frequencies by
> reading the index and calculate a TF-IDF score vector for each document.
> Then, using the TF-IDF vectors, I calculate pairwise cosine similarity
> between documents using the equation here:
> http://en.wikipedia.org/wiki/Cosine_similarity.
>
> This is my problem
>
> Say I have two identical documents “A” and “B” in this collection (A and B
> have more than 200 sentences).
>
> If I calculate pairwise cosine similarity between A and B it gives me
> cosine value=1 which is perfectly OK.
>
> But if I remove a single sentence from Doc “B”, it gives me a cosine
> similarity value of around 0.85 between these two documents. The documents
> are almost identical, but the cosine value does not reflect that. I assume
> the problem is with the equation that I’m using.
>
> Is there a better way or a better equation that I can use for calculating
> cosine similarity between documents?
>
> --
> Regards
>
> Kasun Perera
>