lucene-dev mailing list archives

From "Robert Muir (Commented) (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3957) Document precision requirements of setBoost calls
Date Wed, 18 Apr 2012 18:04:39 GMT


Robert Muir commented on LUCENE-3957:

I don't understand why it's long-winded; it's documented in tons of places in Lucene.
In fact, it's actually over-specified in file-formats, for example, because even in 3.5
the encoding of the normalization byte is an implementation detail of the Similarity:
it's just that you can only use a single byte.

In trunk it's definitely over-specified since, besides the above, the Similarity can use
more than a byte if it wants to.

1. Main website (scoring):
Indexing time boosts are preprocessed for storage efficiency and written to the directory
(when writing the document) in a single byte (!) as follows.
This composition of 1-byte representation of norms...
Encoding and decoding of the resulted float norm in a single byte are done by the static methods
of the class Similarity: encodeNorm() and decodeNorm(). Due to loss of precision, it is not
guaranteed that decode(encode(x)) = x, e.g. decode(encode(0.89)) = 0.75. At scoring (search)
time, this norm is brought into the score of document as norm(t, d), as shown by the formula
in Similarity. 

2. Main website (file formats):
Each byte encodes a floating point value. Bits 0-2 contain the 3-bit mantissa, and bits 3-7
contain the 5-bit exponent.

These are converted to an IEEE single float value as follows: 

3. Javadocs (Similarity):
However the resulted norm value is encoded as a single byte before being stored. At search
time, the norm byte value is read from the index directory and decoded back to a float norm
value. This encoding/decoding, while reducing index size, comes with the price of precision loss.
Compression of norm values to a single byte saves memory at search time, because once a field
is referenced at search time, its norms - for all documents - are maintained in memory.
The rationale supporting such lossy compression of norm values is that given the difficulty
(and inaccuracy) of users to express their true information need by a query, only big differences
matter.
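
To make the precision loss described in the excerpts above concrete, here is a minimal standalone sketch of a one-byte float encoding. It is an illustration, not the Similarity's actual code: the class name NormByteSketch and its methods are hypothetical, and the bit layout (shift by 21, zero exponent point 15) is an assumption modeled on the 3-bit mantissa / 5-bit exponent description quoted above. A custom Similarity may encode norms differently.

```java
public class NormByteSketch {
    // Assumed offset that centers the representable range around 1.0
    // (zero exponent point 15, as in a "3 mantissa bits, 5 exponent bits" layout).
    private static final int EXP_BIAS = (63 - 15) << 3; // 384

    /** Lossy encoding of a float into a single byte (hypothetical helper). */
    public static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        // Shift right by 21: drops the low 21 of the float's 23 mantissa bits.
        int small = bits >> (24 - 3);
        if (small <= EXP_BIAS) {
            // Zero or negative input maps to 0; positive underflow maps to the
            // smallest positive code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= EXP_BIAS + 0x100) {
            return (byte) -1; // overflow maps to the largest code (0xFF)
        }
        return (byte) (small - EXP_BIAS);
    }

    /** Decodes a byte produced by encode() back into a float. */
    public static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        // Boosts of 8.0 and 9.0 collapse to the same stored value in this sketch.
        for (float boost : new float[] {8.0f, 9.0f, 10.0f}) {
            System.out.println(boost + " -> " + decode(encode(boost)));
        }
    }
}
```

Under these assumptions, 8.0 and 9.0 encode to the same byte, while 10.0 survives the round trip. Between 8 and 16 only 8, 10, 12 and 14 are representable in this sketch, so nearby values can be rounded together.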

> Document precision requirements of setBoost calls
> -------------------------------------------------
>                 Key: LUCENE-3957
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: general/javadocs
>    Affects Versions: 3.5
>            Reporter: Jordi Salvat i Alabart
> The behaviour of index-time boosts seems pretty erratic (e.g. a boost of 8.0 produces
> the exact same score as a boost of 9.0) until you become aware that these factors end up encoded
> in a single byte, with a three-bit mantissa. This consumed a whole day of research for us,
> and I still believe we were lucky to spot it, given how deeply dug into the code & documentation
> this information is.
> I suggest adding a small note to the JavaDoc of setBoost methods in Document, Fieldable,
> FieldInvertState, and possibly AbstractField, Field, and NumericField.
> Suggested text:
> "Note that all index-time boost values end up encoded using Similarity.encodeNormValue,
> with a 3-bit mantissa -- so differences in the boost value of less than 25% may easily be
> rounded away."
