lucene-java-user mailing list archives

From Mathias Silbermann <mathias.silberm...@web.de>
Subject adapting lucene's practical scoring function
Date Thu, 25 Mar 2010 19:07:36 GMT
Dear Lucene Users,

I'd like to use Lucene to find scientific papers in the index that are
similar to a given paper from the index. This seems to be possible
either with the MoreLikeThis feature or by wrapping the given document
in a BooleanQuery composed of several other queries. In both cases the
similarity is calculated according to Lucene's Practical Scoring
Function, as defined in the JavaDoc of class Similarity.

What I am trying to do is to calculate the "semantic document
similarity". One example similarity function for that purpose is given
on page two of the paper "Corpus-based and Knowledge-based Measures of
Text Semantic Similarity" by Rada Mihalcea (formula 1). Instead of the
TF and IDF values, it uses the IDF values and the relatedness between
every pair of unique words in the two documents being compared. First,
for each unique word in document 1 it takes the relatedness to its most
related word in document 2, multiplies it by the word's IDF value, and
sums these products. The same procedure is then applied to document 2
against document 1. Finally, the two sums are averaged.
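
The procedure described above can be sketched in plain Java,
independently of Lucene. This is only a minimal sketch under
assumptions: the class and method names (SemanticSimilarity, directed,
similarity) are hypothetical, the IDF values and the word-word
relatedness are supplied by the caller (for example from pre-calculated
tables), and each sum is divided by the total IDF of its document, which
is how I read formula 1 of the paper.

```java
import java.util.Map;
import java.util.Set;
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper, not part of Lucene.
public class SemanticSimilarity {

    /**
     * One direction of the measure: for every unique word in `words`,
     * take its best relatedness to any word in `other`, weight it by
     * the word's IDF, and normalise by the total IDF of `words`
     * (normalisation is an assumption based on my reading of formula 1).
     */
    public static double directed(Set<String> words,
                                  Set<String> other,
                                  Map<String, Double> idf,
                                  ToDoubleBiFunction<String, String> relatedness) {
        double weightedSum = 0.0;
        double idfSum = 0.0;
        for (String w : words) {
            double best = 0.0;
            for (String o : other) {
                best = Math.max(best, relatedness.applyAsDouble(w, o));
            }
            // Words without a known IDF contribute nothing.
            double weight = idf.getOrDefault(w, 0.0);
            weightedSum += best * weight;
            idfSum += weight;
        }
        return idfSum == 0.0 ? 0.0 : weightedSum / idfSum;
    }

    /** Average of the two directed scores (document 1 vs. 2 and 2 vs. 1). */
    public static double similarity(Set<String> doc1,
                                    Set<String> doc2,
                                    Map<String, Double> idf,
                                    ToDoubleBiFunction<String, String> relatedness) {
        return 0.5 * (directed(doc1, doc2, idf, relatedness)
                    + directed(doc2, doc1, idf, relatedness));
    }
}
```

Note that the inner loop compares every word of one document with every
word of the other, so the cost per document pair grows with the product
of the two vocabulary sizes. That is the computational effort my
question below is concerned with.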

My question is: assuming I am able to store the WordNet words extracted
from the documents in the index and to pre-calculate the word-word
similarities, is it possible, and does it make sense (e.g. from the
point of view of computational effort), to adapt the Practical Scoring
Function to such a function of semantic document similarity? And where
(in which class) is the Practical Scoring Function implemented, i.e.
where are the values of TF, IDF, Boost, ... put together?

Regards,
Mathias Silbermann
