lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <gsing...@apache.org>
Subject Re: Vector Space Model: New Similarity Implementation Issues
Date Thu, 28 Feb 2008 17:44:29 GMT

On Feb 28, 2008, at 9:00 AM, Dharmalingam wrote:

>
> Thanks for the reply. Sorry if my explanation is not clear. Yes, you  
> are
> correct the model is based on  Salton's VSM. However, the  
> calculation of the
> term weight and the doc norm is, in my opinion, different from  
> Lucene. If
> you look at the table given in
> http://www.miislita.com/term-vector/term-vector-3.html, they  
> calcuate the
> document norm based on the weight wi=tfi*idfi. I looked at the  
> interfaces of
> Similarity and DefaultSimilairty class. I place it below:
>
> public float lengthNorm(String fieldName, int numTerms) {
>    return (float)(1.0 / Math.sqrt(numTerms));
> }
>
> You can see that this lengthNorm for a doc is quite different from  
> that
> website norm calculation.

The lengthNorm method is different from the IDF calculation.  In the  
Similarity class, that is handled by the idf() method.  Length norm is  
an attempt to address one of the limitations listed further down in  
that paper:
"Long Documents: Very long documents make similarity measures  
difficult (vectors with small dot products and high dimensionality)"



>
>
> Similarly, the querynorm interface of DefaultSimilarity class is:
>
> /** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
>  public float queryNorm(float sumOfSquaredWeights) {
>    return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
>  }
>
> This is again different the website model.

Query norm is an attempt to allow for comparison of scores across  
queries, but I don't think one should do that anyway.


>
>
> I also have difficulities with tf interface of DefaultSimilarity:
> /** Implemented as <code>sqrt(freq)</code>. */
>  public float tf(float freq) {
>    return (float)Math.sqrt(freq);
>  }
>

These are all callback methods from within the Scorer classes that  
each Query uses.  Have a look at TermScorer for how these things get  
called.


Try this as an example:

Setup a really simple index with 1 or 2 docs each with a few words.   
Setup a simple Similarity class where you override all of these  
methods to return 1 (or some simple default)
and then index your documents and do a few queries.

Then, have a look at Searcher.explain() to see why a document scores  
the way it does.  Then, you can work to modify from there.

Here's the bigger question:  what is your ultimate goal here?  Are you  
just trying to understand Lucene at an academic/programming level or  
do you have something you are trying to achieve in terms of relevance?

-Grant

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message