lucene-solr-user mailing list archives

From "Diego Ceccarelli (BLOOMBERG/ LONDON)" <dceccarel...@bloomberg.net>
Subject Re: Personalized search parameters
Date Mon, 08 Jan 2018 14:00:24 GMT
I'm assuming that you are computing the cosine similarity and that you have two vectors
containing the pairs <term, tfidf>. The two vectors can have different sizes because they
only contain the terms that have tfidf != 0.
To compute the cosine similarity between the two lists, you only have to consider the
pairs that appear in **both the vectors** when computing the dot product: if a term doesn't
appear in one of the two, its product is 0, so it does not contribute to the final
score. (Each vector's norm is still computed over all of that vector's own terms.)

(Really old) Example: https://github.com/diegoceccarelli/dexter/blob/fb4bbcb27a13da2665f3c19d6c75bfc4f5778440/dexter-core/src/main/java/it/cnr/isti/hpc/dexter/lucene/LuceneHelper.java#L386
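
The idea above can be sketched in a few lines. This is only an illustration, not the code from the linked example: it assumes each document's <term, tfidf> pairs are already available as a Python dict, and the function name `cosine_similarity` and the sample weights are made up for the sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {term: tfidf} dicts."""
    # Iterate over the smaller dict: only terms present in both
    # vectors contribute to the dot product.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)

    # Each norm is computed over all of that vector's own non-zero weights.
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical tf-idf weights; the two dicts have different sizes.
doc1 = {"solr": 3.0, "search": 4.0}
doc2 = {"search": 2.0, "lucene": 1.0, "index": 2.0}

print(cosine_similarity(doc1, doc2))
```

Here only "search" is shared, so the dot product is 4.0 * 2.0 = 8.0, the norms are 5.0 and 3.0, and the similarity is 8/15. Padding both vectors to the full vocabulary with zeros would give the same value, so the equal-size dense vectors are not needed.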


From: solr-user@lucene.apache.org At: 01/06/18 17:24:07 To: solr-user@lucene.apache.org
Subject: Re: Personalized search parameters

Don't we need vectors of the same size to calculate the cosine similarity?
Maybe I missed something, but following that example it looks like I have to
manually recreate the sparse vectors, because the term vector of a document
should (I may be wrong) contain only the terms that appear in that document.
Am I wrong?

Given that, I assumed (and that example goes in that direction) that we have
to manually create the sparse vector by first collecting all the terms and
then calculating the tf-idf value for each term in each document.
That's what I did, and I obtained vectors of the same dimension for each
document. I was just wondering if there is a better, more optimized way to obtain
those sparse vectors.


--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

