lucene-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/
Date Wed, 16 Nov 2005 20:32:51 GMT
Yonik Seeley wrote:
> Hmmm, very interesting idea.
> Less than one decimal digit of precision might be hard to swallow when
> you have to add scores together though:
> 
> smallfloat(score1) + smallfloat(score2) + smallfloat(score3)
> 
> Do you think that the 5/3 exponent/mantissa split is right for this,
> or would a 4/4 be better?

The float epsilon should ideally be smaller than the minimum score 
increment, and the float range should ideally be at least 100x greater 
than the maximum score increment, to permit boosting, large queries, etc.

Given a 100M document collection, the maximum idf is log(100M) = ~18, 
with a length-normalized tf of 1, for a max of 18.  So the float range 
should ideally be around 1800 or greater.

The minimum idf is 1, and the minimum normalized tf with 10k word 
documents is 1/100.  So the float epsilon should ideally be less than 1/100.
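To make the two targets above easy to re-check, here is a minimal sketch of the arithmetic. It assumes a natural-log idf and a length normalization of 1/sqrt(document length), which is what yields 1/100 for 10k-word documents. The function name and the `headroom` parameter are hypothetical, introduced only for illustration.

```python
import math

def score_requirements(num_docs, max_doc_len, headroom=100.0):
    """Estimate the range and epsilon a compact score float should cover.

    Assumptions (not confirmed by the message itself): idf is the natural
    log of the collection size, and the normalized tf of a single
    occurrence is 1/sqrt(document length).
    """
    max_idf = math.log(num_docs)            # largest idf in the collection
    max_increment = max_idf * 1.0           # length-normalized tf of 1
    min_increment = 1.0 / math.sqrt(max_doc_len)  # minimum idf of 1
    return {
        "max_increment": max_increment,
        "target_range": headroom * max_increment,  # room for boosts, large queries
        "min_increment": min_increment,            # epsilon should be below this
    }

req = score_requirements(num_docs=100_000_000, max_doc_len=10_000)
for name, value in req.items():
    print(f"{name}: {value:.4f}")
```

With these inputs the maximum increment comes out a little above 18, the target range a little above 1800, and the minimum increment at 1/100.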

5 bits of mantissa and 3 bits of exponent is closest to this, but not 
quite there, with an epsilon of 1/32 and a range of up to ~1000.
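One way to compare candidate splits is sketched below. It uses a deliberately simple model that is an assumption, not the encoding under discussion: an unsigned value with an implicit leading 1, no denormals, and exponents running from 0 to 2^e - 1. Epsilon is taken as the spacing just above 1.0, and the range as the ratio of the largest to the smallest normalized value. The resulting figures depend on the exponent bias and on whether denormals are included, so they need not match the numbers quoted above. The function name is hypothetical.

```python
def tiny_float_stats(mantissa_bits, exponent_bits):
    """Return (epsilon, max/min ratio) for a toy unsigned float format.

    Model (an assumption for illustration): value = (1 + m / 2**M) * 2**e,
    with m in [0, 2**M) and e in [0, 2**E), no denormals, no sign bit.
    """
    epsilon = 2.0 ** -mantissa_bits               # spacing just above 1.0
    max_exponent = 2 ** exponent_bits - 1
    largest = (2.0 - epsilon) * 2.0 ** max_exponent
    smallest = 1.0                                # mantissa 0, exponent 0
    return epsilon, largest / smallest

# (mantissa bits, exponent bits) for the three 8-bit splits
for m_bits, e_bits in [(5, 3), (4, 4), (3, 5)]:
    eps, ratio = tiny_float_stats(m_bits, e_bits)
    print(f"mantissa={m_bits} exponent={e_bits} epsilon={eps} range_ratio={ratio}")
```

Shifting the exponent bias moves the whole representable window up or down without changing the ratio, so the ratio and the epsilon are the two quantities to weigh against the targets.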

Did I get the math right?

Doug


