lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Proposal for change to DefaultSimilarity's lengthNorm to fix "short document" problem
Date Thu, 07 Jul 2005 21:39:17 GMT
On Jul 7, 2005, at 1:39 PM, Mark Bennett wrote:
> Our client, Rojo, is considering overriding the default implementation of
> lengthNorm to fix the bias towards extremely short RSS documents.

Different normalization schemes are given a thorough examination in  
this 1997 paper:

http://www.cs.ust.hk/faculty/dlee/Papers/ir/ieee-sw-rank.pdf

Here is what they have to say about the ideal case, "full  
normalization":

[begin excerpt]

... a document containing {x, y, z}
will have exactly the same score as
another document containing {x, x, y,
y, z, z} because these two document
vectors have the same unit vector. We
can debate whether this is reasonable
or not, but when document lengths
vary greatly, it makes sense to take
them into account.

[end excerpt]
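
To make the excerpt's claim concrete, here is a minimal sketch of full (cosine) normalization. It assumes raw term counts as the vector weights, which is a simplification for illustration; the paper and Lucene both weight terms further. The function name `unit_vector` is made up for this example.

```python
import math
from collections import Counter

def unit_vector(terms):
    """Scale a term-count vector to length 1 (cosine normalization)."""
    counts = Counter(terms)
    # Euclidean length of the raw term-count vector.
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()}

short_doc = ["x", "y", "z"]
long_doc = ["x", "x", "y", "y", "z", "z"]

u_short = unit_vector(short_doc)
u_long = unit_vector(long_doc)

# Each term carries the same weight in both documents, up to rounding.
for term in sorted(u_short):
    print(term, u_short[term], u_long[term])
```

Doubling every count doubles the vector's length as well, so the division cancels it out and both documents end up with the same unit vector.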

Their experimental results indicate that the Lucene default,
1/sqrt(num_terms), is quite effective.  The effect of the various
normalization schemes upon precision is specific to the characteristics
of the document collection, though.  Extremely short RSS documents would
seem to be an outlying case.  Anything short of (prohibitively
expensive) full normalization requires a bias towards one length of
document.  If you assign maximum weight to the 50-term documents,
you've probably penalized dictionary definitions.  FWIW (this is my
second Lucene post, and I'm not involved with the project), I would
lean towards the clip method as a default, but it's certainly
justifiable to tweak a normalization scheme to suit your needs.
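
For a sense of how strongly 1/sqrt(num_terms) favors short inputs, a quick sketch follows. The function name `length_norm` and the sample lengths are picked for illustration only; this is not Lucene's code, just the formula as stated above.

```python
import math

def length_norm(num_terms):
    """The 1/sqrt(num_terms) length normalization factor."""
    return 1.0 / math.sqrt(num_terms)

# Sample lengths chosen for illustration only.
for n in (5, 50, 500, 5000):
    print(n, round(length_norm(n), 4))
```

The factor falls by about 3.16x for every tenfold increase in length, so a 5-term item gets roughly ten times the weight of a 500-term one before any other scoring factor is applied.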

> The "flat" and "stretch" factors are specific to my formula.  I've tried
> playing around with how gradual the curve slopes away for smaller and
> larger documents; for example, the red curve really "punishes" documents
> with less than 5 words.

Please correct me if I'm wrong, but isn't num_terms in Lucene's
1/sqrt(num_terms) the number of terms in the field, rather than the
number of terms in the document?  If that's true, then how would
adopting a different curve as default affect the relative weight of a
"title" field?
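
A rough sketch of why the distinction matters, assuming the norm really is computed per field as the question supposes. The field names and term counts below are hypothetical.

```python
import math

def length_norm(num_terms):
    """The 1/sqrt(num_terms) length normalization factor."""
    return 1.0 / math.sqrt(num_terms)

# Hypothetical document: number of terms in each field (made-up counts).
doc = {"title": 4, "body": 400}

# Per-field reading: each field is normalized by its own length.
per_field = {field: length_norm(n) for field, n in doc.items()}

# Per-document reading: one factor based on the total term count.
per_document = length_norm(sum(doc.values()))

print(per_field)
print(per_document)
```

Under the per-field reading, the 4-term title gets ten times the factor of the 400-term body, so any replacement curve that punishes inputs of fewer than 5 words would land squarely on title fields.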

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


