lucene-dev mailing list archives

From Dmitry Serebrennikov <dmit...@earthlink.net>
Subject Re: question about TermQuery
Date Mon, 08 Oct 2001 18:00:53 GMT
Great. Thanks for checking. I'm glad that this was a false alarm.
It also seems that because sumOfSquaredWeights reassigns idf and weight 
on every call, TermQuery instances can be reused in subsequent queries, 
although they can't be used concurrently by multiple queries. Are Query 
objects generally suitable for reuse? So, for example, could they be 
used as keys for caching query results? My guess is that they can, as 
long as they are not used for executing the query.
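
To make the reuse point concrete, here is a minimal sketch of the call sequence, assuming a simplified stand-in for TermQuery. The class TermQuerySketch, the helper setUpForSearch, and passing idf in as a plain float are my own simplifications for illustration; the real method takes a Searcher and looks the idf up.

```java
// Simplified stand-in for TermQuery; hypothetical, not the actual Lucene class.
final class TermQuerySketch {
    private final float boost;
    private float idf;     // reassigned on every sumOfSquaredWeights call
    private float weight;  // reassigned on every sumOfSquaredWeights call

    TermQuerySketch(float boost) {
        this.boost = boost;
    }

    // Assumption: idf is passed in directly instead of being computed from a Searcher.
    float sumOfSquaredWeights(float idfForTerm) {
        idf = idfForTerm;
        weight = idf * boost;
        return weight * weight;  // square term weights
    }

    void normalize(float norm) {
        weight *= norm;  // normalize for query
        weight *= idf;   // factor from document
    }

    float weight() {
        return weight;
    }

    // Mirrors the order of calls in Query.scorer(): sum, then normalize once.
    static float setUpForSearch(TermQuerySketch q, float idfForTerm) {
        float sum = q.sumOfSquaredWeights(idfForTerm);
        float norm = 1.0f / (float) Math.sqrt(sum);
        q.normalize(norm);
        return q.weight();
    }

    public static void main(String[] args) {
        TermQuerySketch q = new TermQuerySketch(1.0f);
        float first = setUpForSearch(q, 2.0f);
        float second = setUpForSearch(q, 2.0f);  // same instance, second search
        System.out.println(first + " " + second);
    }
}
```

Both runs end with the same weight because the first method in the sequence overwrites idf and weight before normalize touches them. The fields are still plain mutable state, so two searches interleaving on one instance could see each other's values.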

Doug Cutting wrote:

>It is correct to include idf twice.  Recall that the weighting is roughly:
>  (tf_q * idf_t / norm_q) * (tf_d * idf_t / norm_d)
>
>The TermQuery.weight field has all of this that is not document specific,
>i.e., everything in this but tf_d and norm_d, so weight should be:
>  (tf_q * idf_t / norm_q) * idf_t
>
>The code is a little different, since we don't calculate tf_q, the frequency
>of the term in the query, assuming that it is one, and instead use a 'boost'
>factor.  But the term's idf (idf_t) should really be in there twice.
>
>The query normalization factor, 1/norm_q, is calculated based on the value
>of the sumOfSquaredWeights, and is passed back in through the normalize()
>call.
>
>normalize() is called only once per call to sumOfSquaredWeights.  These
>calls are initiated on a Query in the method Query.scorer():
>
>  static Scorer scorer(Query query, Searcher searcher, IndexReader reader)
>    throws IOException {
>    query.prepare(reader);
>    float sum = query.sumOfSquaredWeights(searcher);
>    float norm = 1.0f / (float)Math.sqrt(sum);
>    query.normalize(norm);
>    return query.scorer(reader);
>  }
>
>So it all looks okay to me.
>
>Doug
>
>>-----Original Message-----
>>From: Dmitry Serebrennikov [mailto:dmitrys@earthlink.net]
>>Sent: Sunday, October 07, 2001 2:19 PM
>>To: lucene-dev@jakarta.apache.org
>>Subject: question about TermQuery
>>
>>
>>I'm looking through the TermQuery code (and generally trying to
>>understand exactly how the searching works) and I found this code that
>>looks suspicious to me. It is very likely that I just don't understand
>>what's going on, but there is a chance that this is a bug, so I wanted
>>to ask for clarification / review from Doug and others.
>>
>>In TermQuery.normalize(float norm), weight is multiplied first
>>by the normalization factor (the argument) and then by the idf that was
>>stored in the TermQuery before. Although I can't say for sure that this
>>is wrong, it does look suspect. First, idf is already factored into
>>weight in the sumOfSquaredWeights() method, and second, if normalize is
>>called multiple times, idf will be multiplied into weight over and
>>over... Plus the comment in normalize doesn't really make sense, and the
>>way the code is written makes me think that this is a problem caused by
>>a CVS merge conflict, and that only the line "weight *= norm" should be
>>in that method. Am I right?
>>
>>======================================================
>>  final float sumOfSquaredWeights(Searcher searcher) throws IOException {
>>    idf = Similarity.idf(term, searcher);
>>    weight = idf * boost;
>>    return weight * weight;              // square term weights
>>  }
>>
>>  final void normalize(float norm) {
>>    weight *= norm;                  // normalize for query
>>    weight *= idf;                  // factor from document
>>  }
>>======================================================
>>
>>
>
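
As a check on the arithmetic in Doug's explanation, the sketch below computes a score two ways: from the full formula, and from a precomputed query-side weight that already carries both idf factors. The class name, method names, and numeric values are hypothetical and chosen only so the numbers come out exact; none of this is Lucene API.

```java
// Illustrative arithmetic only; names and values are hypothetical, not Lucene API.
final class IdfTwiceSketch {
    // Full formula from the message: (tf_q * idf_t / norm_q) * (tf_d * idf_t / norm_d)
    static float fullScore(float tfQ, float idf, float normQ, float tfD, float normD) {
        return (tfQ * idf / normQ) * (tfD * idf / normD);
    }

    // Document-independent part: (tf_q * idf_t / norm_q) * idf_t
    static float queryWeight(float tfQ, float idf, float normQ) {
        return (tfQ * idf / normQ) * idf;
    }

    // Per-document step: only tf_d and norm_d are still needed.
    static float scoreFromWeight(float weight, float tfD, float normD) {
        return weight * tfD / normD;
    }

    public static void main(String[] args) {
        float tfQ = 1.0f, idf = 2.0f, normQ = 2.0f;  // query side (tf_q assumed to be one)
        float tfD = 3.0f, normD = 4.0f;              // document side

        float weight = queryWeight(tfQ, idf, normQ);
        System.out.println(fullScore(tfQ, idf, normQ, tfD, normD));
        System.out.println(scoreFromWeight(weight, tfD, normD));
    }
}
```

With these values both lines print 1.5. Dropping the second idf factor from the weight would halve the second result, which is why the extra multiplication in normalize() is needed for the two forms to agree.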

