lucene-java-user mailing list archives

From Grant Ingersoll
Subject Re: adapting lucene's practical scoring function
Date Mon, 29 Mar 2010 13:46:16 GMT

On Mar 25, 2010, at 3:07 PM, Mathias Silbermann wrote:

> Dear Lucene Users,
> I'd like to use Lucene to find scientific papers in the index that are similar to a given
> paper from the index. This seems to be possible using the MoreLikeThis feature or by wrapping
> the given paper in a query composed of several other queries (BooleanQuery). The similarity is
> calculated according to Lucene's Practical Scoring Function, defined in the JavaDoc of class
> Similarity.
> What I am trying to do is to calculate the "semantic document similarity". One example
> function for that purpose is given on page two of the paper "Corpus-based and Knowledge-based
> Measures of Text Semantic Similarity" by Rada Mihalcea (formula 1). Instead of using the TF and
> IDF values, it uses IDF values and the relatedness between the unique words in the
> documents to compare. First, it sums up the relatedness of each unique word in document 1 to its
> most related word in document 2, multiplied by its IDF value. The same procedure is done for
> document 2. After that, the sums are averaged.


> My question is: Given I am able to store WordNet words extracted from the documents in the
> index and pre-calculate the word-word similarities, is it possible / does it make sense (e.g. from
> the (computational) effort point of view) to adapt the Practical Scoring Function to such a
> function of semantic document similarity? And where (in which class) is the Practical Scoring
> Function implemented, i.e. where are the values of TF, IDF, Boost... put together?

This stuff is all done in the Scorer for a specific query (see TermQuery/TermScorer for an
example).

Just thinking out loud here, but I think you will need to write your own Query to do this.
I'm not entirely certain what that means for you, though.  A FunctionQuery might
help, too.  It also seems possible that Lucene is a bit of overkill here, other than using
it to get IDF values.  Can't you just create a big matrix (maybe w/ Hadoop and HBase or something
similar) of your precomputed similarities and then just do lookups against it for each document?
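
To make the matrix-and-lookup idea a bit more concrete, here is a rough sketch in plain Java
(no Lucene involved). It is only an illustration under several assumptions: the class and
method names are made up, the precomputed word-word similarities live in an in-memory Map
instead of HBase, identical words are treated as having relatedness 1.0, and each directed
sum is normalized by the total IDF of its document before the two are averaged. The
normalization is how I read formula 1 of the paper; the description above only says the sums
are averaged.

```java
import java.util.Map;
import java.util.Set;

public class SemanticSimilaritySketch {

    // Hypothetical key for an unordered word pair in the precomputed matrix.
    static String pairKey(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "\u0000" + b : b + "\u0000" + a;
    }

    // Look up the precomputed relatedness; unknown pairs count as 0.0.
    // Assumption: a word is fully related (1.0) to itself.
    static double relatedness(Map<String, Double> pairSim, String a, String b) {
        if (a.equals(b)) {
            return 1.0;
        }
        return pairSim.getOrDefault(pairKey(a, b), 0.0);
    }

    // For each unique word in 'from', take its most related word in 'to',
    // weight that relatedness by the word's IDF, and normalize by the IDF total.
    static double directed(Set<String> from, Set<String> to,
                           Map<String, Double> idf, Map<String, Double> pairSim) {
        double weighted = 0.0;
        double idfSum = 0.0;
        for (String w : from) {
            double best = 0.0;
            for (String v : to) {
                best = Math.max(best, relatedness(pairSim, w, v));
            }
            double weight = idf.getOrDefault(w, 0.0);
            weighted += best * weight;
            idfSum += weight;
        }
        return idfSum == 0.0 ? 0.0 : weighted / idfSum;
    }

    // Average of the two directed scores, so the result is symmetric.
    static double similarity(Set<String> doc1, Set<String> doc2,
                             Map<String, Double> idf, Map<String, Double> pairSim) {
        return 0.5 * (directed(doc1, doc2, idf, pairSim)
                    + directed(doc2, doc1, idf, pairSim));
    }
}
```

In this setup the IDF map could be filled from Lucene's document frequencies, which is about
the only part where the index is needed. Note that the inner loop is quadratic in the number
of unique words per document pair, so for a large corpus you would probably want to restrict
the candidate documents first.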

Grant Ingersoll
