lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doug Cutting <cutt...@lucene.com>
Subject Re: Computing Relevancy Differently
Date Fri, 28 Feb 2003 22:52:24 GMT
Your attachment did not make it, so I cannot see your code.

If you think there's a bug, cuold you please provide a complete, 
self-contained test case?  You could, for example, model this after the 
TestSimilarity class in the test code hierarchy.

The lengthNorm(String,int) method is called when you index the document.

Doug

Terry Steichen wrote:
> Doug,
> 
> I've implemented a subclass of DefaultSimilarity (called WESimilarity.java,
> copy attached) which defines a new lengthNorm() method more or less as you
> suggested.  I then added a line prior to using my IndexWriter:
> writer.setSimilarity(new WESimilarity()), and a similar line prior to using
> my IndexSeacher: searcher.setSimilarity(new WESimilarity()).
> 
> The result:
> 1) There's no change whatsoever in the computed scores, and
> 2) The debugging messages never get printed out.
> 
> I know the WESimilarity is being used (because if I rename it I get an
> exception), but it does not appear that the new lengthNorm() method is being
> called.
> 
> It's probably some silly goof, but I can't figure out where it is.
> 
> If you (or anyone else, of course) have any ideas/suggestions, I'd
> appreciate them.
> 
> Regards,
> 
> Terry
> 
> ----- Original Message -----
> From: "Terry Steichen" <terry@net-frame.com>
> To: "Lucene Users List" <lucene-user@jakarta.apache.org>
> Sent: Monday, February 10, 2003 2:28 PM
> Subject: Re: Computing Relevancy Differently
> 
> 
> 
>>Doug,
>>
>>That's excellent.  Just what I've been looking for.  I'll start
>>experimenting shortly.
>>
>>Regards,
>>
>>Terry
>>
>>----- Original Message -----
>>From: "Doug Cutting" <cutting@lucene.com>
>>To: "Lucene Users List" <lucene-user@jakarta.apache.org>
>>Sent: Monday, February 10, 2003 1:57 PM
>>Subject: Re: Computing Relevancy Differently
>>
>>
>>
>>>Terry Steichen wrote:
>>>
>>>>Can you give me an idea of what to replace the lengthNorm() method
> 
> with
> 
>>to,
>>
>>>>for example, remove any special weight given to shorter matching
>>
>>documents?
>>
>>>The goal of the default implementation is not to give any special weight
>>>to shorter documents, but rather to remove the advantage longer
>>>documents have.  Longer documents are likely to have more matches simply
>>>because they contain more terms.  Also, for the query "foo", a document
>>>containing just "foo" is a better match than a longer one containing
>>>"foo bar baz", since the match is more exact.
>>>
>>>However, one problem with this approach can be that very short documents
>>>are in fact not very informative.  Thus a bias against very short
>>>documents is sometimes useful.
>>>
>>>
>>>>I can certainly go through a bunch of trial-and-error efforts, but it
>>
>>would
>>
>>>>help if I had some grasp of the logic initially.
>>>>
>>>>For example, from DefaultSimilarity, here's the lengthNorm() method:
>>>>
>>>>  public float lengthNorm(String fieldName, int numTerms) {
>>>>    return (float)(1.0 / Math.sqrt(numTerms));
>>>>  }
>>>>
>>>>Should I (for the purpose of eliminating any size bias) override it to
>>>>always return a 1?
>>>
>>>That's something to try, although, as mentioned above, I suspect your
>>>top hits will be dominated by long documents.  Try it.  It's really not
>>>a difficult experiment!
>>>
>>>One trick I've used to keep very short documents from dominating
>>>results, that, while good matches, are not informative documents, is to
>>>override this with something like:
>>>
>>>    public float lengthNorm(String fieldName, int numTerms) {
>>>      super.lengthNorm(fieldName, Math.max(numTerms, 100));
>>>    }
>>>
>>>This way all fields shorter than 100 terms are scored like fields
>>>containing 100 terms.  Long documents are still normalized, but search
>>>is biased a bit against very short documents.
>>>
>>>
>>>>How would I boost the headline field here? Is that how you are
> 
> supposed
> 
>>to
>>
>>>>use the (presently unused) fieldName parameter?  If that's the case, I
>>>>assume I would logically (to do what I'm trying to do) make this
> 
> factor
> 
>>>>greater than 1 for the 'headline' field, and 1 for all other fields?
>>>
>>>You could do that here too.  So, for example, you could do something
> 
> like:
> 
>>>    public float lengthNorm(String fieldName, int numTerms) {
>>>      float n = super.lengthNorm(fieldName, Math.max(numTerms, 100));
>>>      if (fieldName.equals("headline"))
>>>        n *= 4.0f;
>>>      return n;
>>>    }
>>>
>>>Equivalently, you could create your documents with something like:
>>>
>>>   Document d = new Document();
>>>   Field f = new Field.Text("headline", headline);
>>>   f.setBoost(4.0f);
>>>   ...
>>>
>>>But headlines tend to be short, and naturally benefit from the default
>>>lengthNorm implementation.  So what you really might want is something
>>
>>like:
>>
>>>    public float lengthNorm(String fieldName, int numTerms) {
>>>      if (fieldName.equals("headline"))
>>>        return 4.0f * super.lengthNorm(fieldName, numTerms);
>>>      else
>>>        return super.lengthNorm(fieldName, Math.max(numTerms, 100));
>>>    }
>>>
>>>This is probably what I'd try first.
>>>
>>>Doug
>>>
>>>
>>>---------------------------------------------------------------------
>>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>>
>>>
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>
>>
> 
> 
> 
> ------------------------------------------------------------------------
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message