lucene-java-user mailing list archives

From Akos Tajti <akos.ta...@gmail.com>
Subject Re: Better Way of calculating Cosine Similarity between documents
Date Fri, 18 May 2012 11:06:53 GMT
Thanks!

On Fri, May 18, 2012 at 11:19 AM, Kasun Perera <kasunp@opensource.lk> wrote:

> Hi all
>
> I’m indexing a collection of documents with Lucene, specifying TermVector at
> indexing time. I then retrieve the terms and their term frequencies by
> reading the index and calculate a TF-IDF score vector for each document.
> Using these TF-IDF vectors, I calculate the pairwise cosine similarity
> between documents using the equation at
> http://en.wikipedia.org/wiki/Cosine_similarity.
>
> This is my problem:
>
> Say I have two identical documents “A” and “B” in this collection (A and B
> have more than 200 sentences).
>
> If I calculate the pairwise cosine similarity between A and B, I get a
> cosine value of 1, which is as expected.
>
> But if I remove a single sentence from doc “B”, I get a cosine similarity
> of around 0.85 between the two documents. The documents are almost
> identical, but the cosine value does not reflect that. I believe the
> problem is with the equation that I’m using.
>
> Is there a better way or a better equation that I can use for calculating
> cosine similarity between documents?
>
> --
> Regards
>
> Kasun Perera
>
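
For reference, the computation described in the quoted message can be sketched roughly as below. This is only an illustration under stated assumptions, not the code used in the thread: the class and method names (CosineSketch, tfIdf, cosine, norm) are hypothetical, the inputs are plain maps rather than anything read through the Lucene API, and the idf weighting ln(numDocs / docFreq) is just one common choice that the original message does not specify.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; works on plain maps, not on Lucene index structures.
final class CosineSketch {

    // Builds a TF-IDF vector from raw term frequencies.
    // Assumption: weight = tf * ln(numDocs / docFreq).
    static Map<String, Double> tfIdf(Map<String, Integer> termFreqs,
                                     Map<String, Integer> docFreqs,
                                     int numDocs) {
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            Integer df = docFreqs.get(e.getKey());
            if (df == null || df == 0) {
                continue; // no document frequency known for this term
            }
            double idf = Math.log((double) numDocs / df);
            vector.put(e.getKey(), e.getValue() * idf);
        }
        return vector;
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|), over sparse vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // convention chosen here for empty or all-zero vectors
        }
        return dot / (normA * normB);
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }
}
```

With this sketch, two identical vectors give a cosine of 1 (up to floating-point rounding), and two vectors with no terms in common give 0.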
