On Mar 25, 2010, at 3:07 PM, Mathias Silbermann wrote:
> Dear Lucene Users,
>
> I'd like to use Lucene to find scientific papers in the index that are similar to a given
> paper from the index. This seems to be possible using the MoreLikeThis feature or by wrapping
> the given document in a query composed of several other queries (BooleanQuery). The similarity
> is calculated according to Lucene's Practical Scoring Function, defined in the JavaDoc of
> class Similarity.
>
> What I am trying to do is calculate the "semantic document similarity". One example similarity
> function for that purpose is given on page two of the paper "Corpus-based and Knowledge-based
> Measures of Text Semantic Similarity" by Rada Mihalcea (formula 1). Instead of using the TF and
> IDF values, it uses IDF values and the relatedness between the unique words in the documents
> being compared. First, it sums up the relatedness of each unique word in document 1 to its
> most related word in document 2, multiplied by the word's IDF value. The same procedure is
> done for document 2. After that, the sums are averaged.
>
Interesting.
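
For reference, the measure as described above could be written out roughly as follows. This is
a sketch reconstructed from the description in this thread, not copied from the paper. The
symbols T_1, T_2 (the sets of unique words in the two documents) and maxSim (the relatedness of
a word to its most related word in the other document) are notation chosen here, and dividing
each sum by the total IDF is an assumption about how the sums are normalized before averaging.

```latex
\mathrm{sim}(T_1, T_2) = \frac{1}{2}\left(
  \frac{\sum_{w \in T_1} \mathrm{maxSim}(w, T_2)\,\mathrm{idf}(w)}{\sum_{w \in T_1} \mathrm{idf}(w)}
  +
  \frac{\sum_{w \in T_2} \mathrm{maxSim}(w, T_1)\,\mathrm{idf}(w)}{\sum_{w \in T_2} \mathrm{idf}(w)}
\right),
\qquad
\mathrm{maxSim}(w, T) = \max_{v \in T} \mathrm{rel}(w, v)
```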
> My question is: given that I am able to store WordNet words extracted from the documents in
> the index and pre-calculate the word-word similarities, is it possible / does it make sense
> (e.g. from the computational effort point of view) to adapt the Practical Scoring Function to
> such a function of semantic document similarity? And where (in which class) is the Practical
> Scoring Function implemented, i.e. where are the values of TF, IDF, boost, ... put together?
>
This is all done in the Scorer for a specific Query (see TermQuery/TermScorer for an example).
Just thinking out loud here, but I think you will need to write your own Query to do this.
I'm not entirely certain what that means for you, though. A FunctionQuery might help, too.
It also seems possible that Lucene is a bit of overkill here, other than for getting the IDF
values. Can't you just create a big matrix (maybe with Hadoop and HBase or something similar)
of your precomputed similarities and then just do lookups on the document?
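
A minimal sketch of the "precomputed matrix plus lookups" idea, outside of Lucene, might look
like the following. Everything in it is hypothetical: the class name SemanticSimilaritySketch,
the nested-map layout of the relatedness matrix, and the plain Map of IDF values (which could be
filled from the index) are assumptions made for illustration. It also assumes that relatedness
is symmetric, that identical words have relatedness 1.0, and that each directional sum is
normalized by the total IDF before the two are averaged.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
final class SemanticSimilaritySketch {

    // Looks up the precomputed relatedness of two words, in either direction.
    // Assumption: identical words score 1.0, unknown pairs score 0.0.
    static double relatedness(Map<String, Map<String, Double>> matrix, String a, String b) {
        if (a.equals(b)) {
            return 1.0;
        }
        Double v = matrix.getOrDefault(a, Collections.<String, Double>emptyMap()).get(b);
        if (v == null) {
            v = matrix.getOrDefault(b, Collections.<String, Double>emptyMap()).get(a);
        }
        return v == null ? 0.0 : v;
    }

    // One direction: for each word in 'from', take its best match in 'to',
    // weight it by the word's IDF, and normalize by the total IDF of 'from'.
    static double directed(Set<String> from, Set<String> to,
                           Map<String, Double> idf,
                           Map<String, Map<String, Double>> matrix) {
        double weighted = 0.0;
        double total = 0.0;
        for (String w : from) {
            double best = 0.0;
            for (String candidate : to) {
                best = Math.max(best, relatedness(matrix, w, candidate));
            }
            double weight = idf.getOrDefault(w, 0.0);
            weighted += best * weight;
            total += weight;
        }
        return total == 0.0 ? 0.0 : weighted / total;
    }

    // Average of the two directional scores.
    static double similarity(Set<String> doc1, Set<String> doc2,
                             Map<String, Double> idf,
                             Map<String, Map<String, Double>> matrix) {
        return 0.5 * (directed(doc1, doc2, idf, matrix) + directed(doc2, doc1, idf, matrix));
    }

    public static void main(String[] args) {
        // Toy data, made up for this example.
        Map<String, Double> idf = new HashMap<>();
        idf.put("car", 2.0);
        idf.put("automobile", 2.0);
        idf.put("fast", 1.0);

        Map<String, Map<String, Double>> matrix = new HashMap<>();
        matrix.put("car", Collections.singletonMap("automobile", 0.8));

        Set<String> doc1 = new HashSet<>(Arrays.asList("car", "fast"));
        Set<String> doc2 = new HashSet<>(Arrays.asList("automobile", "fast"));

        System.out.println(similarity(doc1, doc2, idf, matrix));
    }
}
```

Note that the inner loop compares every unique word of one document with every unique word of
the other, so the cost per document pair grows with the product of the two vocabulary sizes.
That is the main reason this looks more like a batch job over a matrix than a per-hit scoring
function.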
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search