lucene-java-user mailing list archives

From Kamal Najib <kamal.na...@mytum.de>
Subject Re: A simple Vector Space Model and TFIDF usage
Date Thu, 02 Jul 2009 08:49:34 GMT
Hello Amir,
As far as I understand, you have two sets of documents, say set1 and set2. To get the
similarity between the documents of the two sets, you index the docs of one set and search
with each doc of the other set as a query; the score of each hit then gives you the similarity
between the two documents.
So:
1. Index the docs of set1.
2. For each doc-element from set2 do:
   Create a query that contains the content text of the doc-element.
   Search with it in your indexed docs from set1.
   From the hits you get, you can read the similarity score between the
   doc-element and every hit.

The directory where your indexed docs are saved represents the vector space model you want
to build. If you want to see how Lucene computes the score, you can use the Explanation
and Similarity classes in the Lucene API; you will see that Lucene treats documents and queries
in the same way as a vector space model. In the Explanation output you can see that Lucene
uses TF, IDF and DF to compute the score.
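
As a rough illustration of the steps above, here is a minimal sketch in plain Java that does not
use Lucene at all. The class and method names (TfIdfCosineSketch, tokenize, similarityMatrix, ...)
are made up for this example, and the weighting tf * (1 + ln(N / (df + 1))) is an assumption
chosen for simplicity. Lucene's own score includes further factors, so its numbers will not be
identical to this cosine value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TfIdfCosineSketch {

    // Crude tokenizer: lowercase, split on non-word characters
    // (a stand-in for a real Lucene Analyzer).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // DF: for each term, the number of "indexed" docs that contain it.
    static Map<String, Integer> documentFrequencies(List<String> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : docs) {
            for (String term : new HashSet<>(tokenize(doc))) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // TF-IDF vector of one text. Assumed weighting: tf * (1 + ln(N / (df + 1))).
    static Map<String, Double> tfIdfVector(String text, Map<String, Integer> df, int numDocs) {
        Map<String, Double> vector = new HashMap<>();
        for (String term : tokenize(text)) {
            vector.merge(term, 1.0, Double::sum); // raw term frequency first
        }
        for (Map.Entry<String, Double> e : vector.entrySet()) {
            int docFreq = df.getOrDefault(e.getKey(), 0);
            double idf = 1.0 + Math.log((double) numDocs / (docFreq + 1));
            e.setValue(e.getValue() * idf);
        }
        return vector;
    }

    // Cosine similarity of two sparse vectors; 0.0 if either one is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        for (double w : b.values()) normB += w * w;
        if (normA == 0 || normB == 0) return 0.0;
        return dot / Math.sqrt(normA * normB);
    }

    // Step 1: "index" set1 (collect DF). Step 2: use each doc of set2 as a query.
    // Result: matrix[i][j] = similarity of set2 doc i and set1 doc j.
    static double[][] similarityMatrix(List<String> set1, List<String> set2) {
        Map<String, Integer> df = documentFrequencies(set1);
        int n = set1.size();
        List<Map<String, Double>> indexed = new ArrayList<>();
        for (String doc : set1) indexed.add(tfIdfVector(doc, df, n));

        double[][] matrix = new double[set2.size()][n];
        for (int i = 0; i < set2.size(); i++) {
            Map<String, Double> query = tfIdfVector(set2.get(i), df, n);
            for (int j = 0; j < n; j++) {
                matrix[i][j] = cosine(query, indexed.get(j));
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<String> set1 = Arrays.asList("lucene indexes text", "cats chase mice");
        List<String> set2 = Arrays.asList("lucene searches text", "dogs chase cats");
        for (double[] row : similarityMatrix(set1, set2)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Passing the same list as set1 and set2 gives the doc-doc matrix for a single collection. Note
that this sketch compares every pair directly, so the work grows with the product of the two
set sizes.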
Best regards.
Kamal.
Original Message:

Hi,
It's my first experiment with Lucene. Please help me.
I'm going to index a set of documents and create a feature vector for each of
them. This vector contains all terms belonging to the document, weighted using TF-IDF.
After that I want to compute the cosine similarity between all documents and produce
a doc-doc similarity matrix. My document set is large and it's important to have a scalable
implementation.
Would you please provide me a guideline or to-do list?
Thank you and kind regards.