lucene-dev mailing list archives

From "Robert Muir (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3957) Document precision requirements of setBoost calls
Date Wed, 18 Apr 2012 18:04:39 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256760#comment-13256760 ]

Robert Muir commented on LUCENE-3957:
-------------------------------------

I don't understand why it's long-winded: it's documented in tons of places in Lucene.
In fact, it's actually over-specified in the file formats, for example, because even in 3.5
the encoding of the normalization byte is an implementation detail of the Similarity;
it's just that you can only use a single byte.

In trunk it's definitely over-specified since, besides the above, the Similarity can use
more than a byte if it wants to.

1. Main website (scoring): 
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/scoring.html
{noformat}
Indexing time boosts are preprocessed for storage efficiency and written to the directory
(when writing the document) in a single byte (!) as follows.
...
This composition of 1-byte representation of norms...
...
Encoding and decoding of the resulted float norm in a single byte are done by the static methods
of the class Similarity: encodeNorm() and decodeNorm(). Due to loss of precision, it is not
guaranteed that decode(encode(x)) = x, e.g. decode(encode(0.89)) = 0.75. At scoring (search)
time, this norm is brought into the score of document as norm(t, d), as shown by the formula
in Similarity. 
{noformat}
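
The precision loss described in that passage can be illustrated with a small round trip. This is a minimal sketch, not Lucene's source: the class NormRoundTrip, its encode/decode helpers, and the constants SHIFT and ZERO_POINT are assumptions chosen so that a float is truncated into a single byte. Similarity.encodeNorm() and decodeNorm() are the real entry points named in the docs.

```java
class NormRoundTrip {
    // Assumed constants for this sketch only (not quoted from the Lucene docs).
    static final int SHIFT = 21;            // discard the low 21 bits of the IEEE float
    static final int ZERO_POINT = 48 << 3;  // offset that maps the smallest kept value near byte 1

    /** Truncate a float to one byte; precision is lost by dropping low-order bits. */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> SHIFT;
        if (small <= ZERO_POINT) {
            // zero and negatives become 0; tiny positives become the smallest non-zero byte
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ZERO_POINT + 0x100) {
            return (byte) -1; // too large: clamp to the largest byte value (0xFF)
        }
        return (byte) (small - ZERO_POINT);
    }

    /** Expand the byte back to a float; the dropped bits come back as zeros. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = ((b & 0xff) << SHIFT) + (ZERO_POINT << SHIFT);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        for (float boost : new float[] {8.0f, 9.0f, 10.0f}) {
            System.out.println(boost + " -> " + decode(encode(boost)));
        }
    }
}
```

Under these assumptions, 8.0f and 9.0f land on the same byte and both decode to 8.0, which is the symptom reported in the issue, while 10.0f is far enough away to keep its own value.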

2. Main website (file formats):
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html#Normalization%20Factors
{noformat}
Each byte encodes a floating point value. Bits 0-2 contain the 3-bit mantissa, and bits 3-7
contain the 5-bit exponent.

These are converted to an IEEE single float value as follows: 
...
{noformat}
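
The byte layout in that passage can be sketched as two bit-field extractions, taking the exponent to be the five high bits (3-7) of the byte. The class NormByteFields and its helper names are hypothetical, and the conversion of the two fields to an IEEE float is deliberately left out because the quoted text elides it.

```java
class NormByteFields {
    /** Mantissa field: the three low-order bits (0-2) of the norm byte. */
    static int mantissa(byte b) {
        return b & 0x07;
    }

    /** Exponent field: the five high-order bits (3-7), read as an unsigned value. */
    static int exponent(byte b) {
        return (b & 0xff) >>> 3;
    }

    public static void main(String[] args) {
        byte norm = (byte) 0x7C; // bit pattern 01111 100
        System.out.println("exponent=" + exponent(norm) + " mantissa=" + mantissa(norm));
    }
}
```

Masking with 0xff before shifting matters here because Java bytes are signed; without it, a byte with its top bit set would yield a negative exponent field.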

3. Javadocs (Similarity): 
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/core/org/apache/lucene/search/Similarity.html
{noformat}
However the resulted norm value is encoded as a single byte before being stored. At search
time, the norm byte value is read from the index directory and decoded back to a float norm
value. This encoding/decoding, while reducing index size, comes with the price of precision
loss...
 
Compression of norm values to a single byte saves memory at search time, because once a field
is referenced at search time, its norms - for all documents - are maintained in memory.
 
The rationale supporting such lossy compression of norm values is that given the difficulty
(and inaccuracy) of users to express their true information need by a query, only big differences
matter. 
{noformat}


                
> Document precision requirements of setBoost calls
> -------------------------------------------------
>
>                 Key: LUCENE-3957
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3957
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: general/javadocs
>    Affects Versions: 3.5
>            Reporter: Jordi Salvat i Alabart
>
> The behaviour of index-time boosts seems pretty erratic (e.g. a boost of 8.0 produces
> the exact same score as a boost of 9.0) until you become aware that these factors end up encoded
> in a single byte, with a three-bit mantissa. This consumed a whole day of research for us,
> and I still believe we were lucky to spot it, given how deeply dug into the code & documentation
> this information is.
> I suggest adding a small note to the JavaDoc of setBoost methods in Document, Fieldable,
> FieldInvertState, and possibly AbstractField, Field, and NumericField.
> Suggested text:
> "Note that all index-time boost values end up encoded using Similarity.encodeNormValue,
> with a 3-bit mantissa -- so differences in the boost value of less than 25% may easily be
> rounded away."

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

