lucene-dev mailing list archives

From Joaquin Delgado <joaquin.delg...@oracle.com>
Subject Re: Similarity between Terms
Date Tue, 18 Oct 2005 18:59:34 GMT
Sebastian,

There is no simple way of calculating similarity between terms in Lucene.

Normally documents are represented in the Vector Space Model (VSM), where
a weight is associated with each unique term in the document (e.g. the
term frequency, i.e. the number of times the term occurs within the
document). This representation is used internally to calculate the
similarity between documents, treating a query as a special case of a
short document. You can get these term vectors per document through the
Lucene API if the index was built with the term vectors option. You can
try building a Term vs. Documents matrix by accumulating the document
term vectors and then applying LSA or co-occurrence-based calculations
as a similarity measure, but this may be computationally very expensive
with a huge matrix. Some sampling-based techniques have been developed
(please contact me directly if you wish to learn more about them).
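
The accumulation step can be sketched roughly as follows. This is plain
Java with no Lucene calls: each Map<String, Integer> stands in for one
document's term vector (term -> frequency), however it was read from the
index, and the class name TermDocumentMatrix and its methods are
hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Lucene): accumulates per-document
// term frequencies into a sparse term x document matrix.
final class TermDocumentMatrix {
    // term -> row index
    private final Map<String, Integer> rowOfTerm = new HashMap<>();
    // one sparse row per term: document index -> term frequency
    private final List<Map<Integer, Integer>> rows = new ArrayList<>();
    private int docCount = 0;

    /** Adds one document's term vector and returns its column index. */
    int addDocument(Map<String, Integer> termFreqs) {
        int docId = docCount++;
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            Integer row = rowOfTerm.get(e.getKey());
            if (row == null) {
                // First time this term is seen: give it a new row.
                row = rows.size();
                rowOfTerm.put(e.getKey(), row);
                rows.add(new HashMap<>());
            }
            rows.get(row).put(docId, e.getValue());
        }
        return docId;
    }

    /** Dense row for a term: one weight per document, 0 where absent. */
    double[] row(String term) {
        double[] dense = new double[docCount];
        Integer row = rowOfTerm.get(term);
        if (row != null) {
            for (Map.Entry<Integer, Integer> e : rows.get(row).entrySet()) {
                dense[e.getKey()] = e.getValue();
            }
        }
        return dense;
    }

    int numTerms() { return rows.size(); }

    int numDocs() { return docCount; }
}
```

Rows are kept sparse because most terms occur in only a small fraction
of the documents; a dense T x D array is where the cost becomes
prohibitive on a large index.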

Now, regarding your comment about seeing a term as a document: if you
transpose the T x D matrix (swapping the roles of terms and documents),
you can think of a term as a document whose vector representation
contains one weight per document. Similar vector space calculations
(e.g. cosine-based similarity) can then be made between terms. This only
looks at first-degree co-occurrence, though (i.e. how many documents
share the terms), and does not capture semantic transitivity (second- or
higher-degree co-occurrence), which is very important for determining
similarity between terms (e.g. synonyms, which represent the same
concept, may be used in different subsets of documents and thus have low
first-degree co-occurrence).
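
To make the difference concrete, here is a small sketch in plain Java
(no Lucene calls). The class TermSimilarity, the three toy terms and
their weights are all made up for illustration. Comparing rows of the
co-occurrence matrix A * A^T is only one simple way to get at
second-degree co-occurrence; it is an assumption of this sketch, not LSA
and not anything Lucene provides.

```java
// Hypothetical helper: cosine similarity between term vectors, plus a
// term x term co-occurrence matrix for a second-degree comparison.
final class TermSimilarity {

    /** Cosine of the angle between two equal-length vectors (0 if either is all zeros). */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** C = A * A^T: entry (i, j) is the dot product of term rows i and j. */
    static double[][] cooccurrence(double[][] termByDoc) {
        int n = termByDoc.length;
        double[][] c = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                for (int k = 0; k < termByDoc[i].length; k++) {
                    c[i][j] += termByDoc[i][k] * termByDoc[j][k];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        // Toy data: rows are terms, columns are four documents.
        double[][] a = {
            {1, 0, 1, 0},  // "car"        (documents 0 and 2)
            {0, 1, 0, 1},  // "automobile" (documents 1 and 3)
            {1, 1, 1, 1},  // "engine"     (all documents)
        };
        // First degree: the two synonyms never share a document.
        System.out.println(cosine(a[0], a[1]));
        // Second degree: both co-occur with "engine", so their
        // co-occurrence rows overlap.
        double[][] c = cooccurrence(a);
        System.out.println(cosine(c[0], c[1]));
    }
}
```

In this toy data "car" and "automobile" have a first-degree cosine of 0
because they share no document, while their co-occurrence rows overlap
through "engine" and give a non-zero value.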

-- Joaquin

Sebastian Menge wrote:

>Hi all
>
>Given an index, how can I (if I can) get the similarity between _terms_?
>
>I read somewhere (in an intro to IR) that a term can be seen as a
>document. Can I do that with Lucene, and how would one proceed? (A code
>snippet would be great.)
>
>Thanks a lot, Sebastian.
>
>BTW: I found Lucene when looking for an LSA component. I already asked
>about that on the general list. Other people are also looking for this
>(e.g. fidde andersson). I have already been asked whether I got any
>further, so it seems that there is demand for such a component. If I
>were still a student I would try to extend Lucene to do something like
>that, but today I don't have the resources; perhaps another person has.
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>For additional commands, e-mail: java-dev-help@lucene.apache.org

