I'm assuming that you are computing the cosine similarity and that you have two vectors
containing <term, tfidf> pairs. The two vectors can have different sizes because they only
contain the terms that have tfidf != 0.
If you want to compute the cosine similarity between the two lists, for the dot product you
only have to consider the pairs that appear in **both vectors**: if a term doesn't appear in
one of the two, its product is 0, so it does not contribute to the final similarity score.
The norms in the denominator are computed over each vector's own terms, so the vectors never
need to be padded to the same size.
(Really old) Example: https://github.com/diegoceccarelli/dexter/blob/fb4bbcb27a13da2665f3c19d6c75bfc4f5778440/dextercore/src/main/java/it/cnr/isti/hpc/dexter/lucene/LuceneHelper.java#L386
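To make the idea concrete, here is a minimal sketch in plain Python. It is not taken from the linked code and does not use the Lucene API: it simply assumes each document's term vector has already been extracted into a dict mapping term to tfidf weight, and the names `cosine_similarity`, `doc1` and `doc2` are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {term: tfidf} dicts."""
    # Iterate over the smaller dict; only shared terms contribute to the dot product.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)

    # Each norm uses all the non-zero entries of its own vector.
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


# Hypothetical tfidf weights; the two vectors have different sizes.
doc1 = {"solr": 3.0, "search": 4.0}
doc2 = {"search": 2.0, "lucene": 1.0, "index": 2.0}

similarity = cosine_similarity(doc1, doc2)
```

Only "search" is shared, so the dot product is 4.0 * 2.0 = 8.0, the norms are 5.0 and 3.0, and the similarity is 8/15. No vector over the whole vocabulary is ever built.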
From: solruser@lucene.apache.org At: 01/06/18 17:24:07 To: solruser@lucene.apache.org
Subject: Re: Personalized search parameters
Don't we need vectors of the same size to calculate the cosine similarity?
Maybe I missed something, but following that example it looks like I have to
manually recreate the sparse vectors, because the term vector of a document
should (I may be wrong) contain only the terms that appear in that document.
Am I wrong?
Given that, I assumed (and that example goes in that direction) that we have
to manually create the sparse vector by first collecting all the terms and
then calculating the tfidf weight for each term in each document.
That's what I did, and I obtained vectors of the same dimension for each
document. I was just wondering if there is a better, more optimized way to
obtain those sparse vectors.

