Thanks!
On Fri, May 18, 2012 at 11:19 AM, Kasun Perera <kasunp@opensource.lk> wrote:
> Hi all
>
> I’m indexing a collection of documents using Lucene, specifying TermVector at
> indexing time. Then I retrieve the terms and their term frequencies by
> reading the index and calculate a TF-IDF score vector for each document.
> Then, using the TF-IDF vectors, I calculate the pairwise cosine similarity
> between documents using the equation here:
> http://en.wikipedia.org/wiki/Cosine_similarity
>
> This is my problem:
>
> Say I have two identical documents “A” and “B” in this collection (A and B
> have more than 200 sentences).
>
> If I calculate the pairwise cosine similarity between A and B, it gives me
> a cosine value of 1, which is perfectly OK.
>
> But if I remove a single sentence from Doc “B”, it gives me a cosine
> similarity value of around 0.85 between these two documents. The documents
> are almost identical, but the cosine value does not reflect that. I
> understand the problem is with the equation that I’m using.
>
> Is there a better way or a better equation that I can use for calculating
> cosine similarity between documents?
>
> 
> Regards
>
> Kasun Perera
>
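
For reference, here is a minimal sketch of the computation described in the
quoted message, written in plain Python rather than against the Lucene API.
The function names, the toy documents, and the idf variant
(log(N / df) + 1) are assumptions made for illustration only; they are not
taken from the original setup. The vectors are kept as term-to-weight maps,
so the dot product is always taken over matching terms and the two vectors
cannot get out of alignment.

```python
import math
from collections import Counter


def tfidf_vector(term_freqs, doc_freqs, num_docs):
    """Return a sparse {term: tf * idf} map for one document.

    The idf variant log(num_docs / df) + 1 is an assumption for this sketch.
    """
    return {
        term: tf * (math.log(num_docs / doc_freqs[term]) + 1.0)
        for term, tf in term_freqs.items()
    }


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors keyed by term."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data (hypothetical): 200 one-word "sentences"; B lacks the last one.
sentences = [f"word{i}" for i in range(200)]
tf_a = Counter(sentences)
tf_b = Counter(sentences[:-1])

# Document frequency of each term across the two-document collection.
doc_freqs = Counter()
for tf in (tf_a, tf_b):
    doc_freqs.update(tf.keys())

vec_a = tfidf_vector(tf_a, doc_freqs, num_docs=2)
vec_b = tfidf_vector(tf_b, doc_freqs, num_docs=2)

print(cosine_similarity(vec_a, vec_a))  # identical documents
print(cosine_similarity(vec_a, vec_b))  # one sentence removed
```

On this toy data the second value stays close to 1. If a real index gives a
much lower value for a one-sentence difference, it may be worth checking that
both vectors are indexed by the same terms before taking the dot product,
although other causes (for example, a few very heavily weighted terms in the
removed sentence) are also possible.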
