lucene-java-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Any way to ignore repeated terms in TF calculation?
Date Fri, 16 Jan 2009 00:20:40 GMT

: This is not quite what I was talking about. I was talking about documents
: with a single field. I want the text "Badgers are mammals. Badgers are cute"
: to score higher than the text "Badger Badger" for the term query
: "text:badger".
: Ideally, what I want is to add another factor to the scoring at index time,
: a "sparsity factor" which should cancel out the term frequency as the
: average distance between terms nears 1.

something else you may want to consider: you can omitNorms (or alter the 
lengthNorm function) when indexing so that longer fields aren't penalized 
compared to shorter fields ... in which case a field containing "Badger 
Badger" won't score *higher* than "Badgers are mammals. Badgers are cute" 
because it won't get the short-field lengthNorm bonus ... if it met your 
use case, you could even make *longer* docs get a higher lengthNorm.
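
to make the lengthNorm effect concrete, here is a rough standalone sketch. 
It does not use any Lucene classes: it only mirrors the classic default 
formulas (tf = sqrt(freq), lengthNorm = 1/sqrt(numTerms)). The class name, 
the longerIsBetterNorm variant, and the assumption that the analyzer maps 
"Badgers" to "badger" and keeps all six tokens are illustrative only.

```java
// Standalone illustration only: no Lucene classes are used here.
// The formulas mirror the classic default similarity:
//   tf(freq) = sqrt(freq), lengthNorm(numTerms) = 1 / sqrt(numTerms)
class LengthNormSketch {

    // Classic default: shorter fields get a larger norm.
    static double defaultLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Roughly the effect of omitting norms: field length no longer matters.
    static double flatLengthNorm(int numTerms) {
        return 1.0;
    }

    // Hypothetical variant (not a Lucene function) that rewards longer fields.
    static double longerIsBetterNorm(int numTerms) {
        return Math.log(1 + numTerms);
    }

    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // idf and queryNorm are the same for both docs, so they are left out.
    static double score(int freq, double norm) {
        return tf(freq) * norm;
    }

    public static void main(String[] args) {
        // "Badger Badger": 2 terms, "badger" occurs twice.
        // "Badgers are mammals. Badgers are cute": assumed 6 terms, 2 matches.
        System.out.println("default short: " + score(2, defaultLengthNorm(2)));
        System.out.println("default long:  " + score(2, defaultLengthNorm(6)));
        System.out.println("flat short:    " + score(2, flatLengthNorm(2)));
        System.out.println("flat long:     " + score(2, flatLengthNorm(6)));
        System.out.println("custom short:  " + score(2, longerIsBetterNorm(2)));
        System.out.println("custom long:   " + score(2, longerIsBetterNorm(6)));
    }
}
```

with the default norm the two-word doc wins, with a flat norm the two docs 
tie, and only a norm that grows with length puts the longer doc ahead.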

: Sorry about the weird math, I just mean (as I said above) that the sparsity
: factor should cancel out the tf completely if avg_d<=1 and become 1 as avg_d
: gets larger.

it wouldn't exactly match your math, but a simpler approach to consider 
that might be equally effective would be counting the number of unique 
terms in this field at index time (or the ratio of unique terms to total 
terms), and then using that number as the fieldBoost (or indexing it as a 
numeric field that you build a function query on) ... then you can reward 
docs that have a higher number of unique terms, and penalize docs that 
only have a few terms repeated over and over.
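
a minimal sketch of computing those two numbers before indexing, in plain 
Java. The class and method names are made up for illustration, and the 
tokenizer is a crude stand-in (lowercase, split on non-letters, no stemming 
or stop words); in practice you would want to count the terms your real 
analyzer produces.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative helper, not part of Lucene.
class UniqueTermStats {

    // Number of distinct tokens in the text.
    static int uniqueTermCount(String text) {
        return collect(text, new int[1]).size();
    }

    // Ratio of distinct tokens to total tokens; 0 for empty text.
    static float uniqueTermRatio(String text) {
        int[] total = new int[1];
        Set<String> unique = collect(text, total);
        return total[0] == 0 ? 0f : (float) unique.size() / total[0];
    }

    // Crude tokenizer (an assumption): lowercase and split on non-letters.
    // total[0] receives the number of non-empty tokens seen.
    private static Set<String> collect(String text, int[] total) {
        Set<String> unique = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            total[0]++;
            unique.add(token);
        }
        return unique;
    }

    public static void main(String[] args) {
        String repetitive = "Badger Badger";
        String varied = "Badgers are mammals. Badgers are cute";
        System.out.println(uniqueTermCount(repetitive) + " " + uniqueTermRatio(repetitive));
        System.out.println(uniqueTermCount(varied) + " " + uniqueTermRatio(varied));
    }
}
```

either value could then be passed as the field boost when the document is 
built, or stored in its own numeric field for a function query to read.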

-Hoss



