lucene-dev mailing list archives

From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: Document Boosts as floats, alternate encoding
Date Wed, 15 Dec 2004 23:09:58 GMT
On Wednesday 15 December 2004 21:46, Dan Climan wrote:
> I'm experimenting with Document boosts and I'm finding them effective for
> certain types of scoring enhancements. My concern is that, because of the
> way they are stored (i.e., as an encoded byte), there are not enough boost
> values to cover typical boosting. I've written a custom Similarity (i.e.,
> extended Similarity by overriding encodeNorm and decodeNorm), but that
> still limits me to 256 distinct boost factors. These 256 factors have to
> cover the full range of the combined effect of Document boosts, Field
> boosts, and length norms in the default calculation.
>  
> I'd be interested in at least an option to store these as floats. Any change
> requires changes to the file formats (either a new file or a new format for
> norm files (*.f??)). 
>  
> Has this been considered and rejected? Clearly, there would be additional
> memory consumption for an array of floats instead of bytes, but it seems
> like a small overall impact. Speed should not be affected as these are
> always converted to floats for use in the existing calculations.
>  
> I hesitate to embark on a change that affects file formats so if folks have
> any suggestions on alternative approaches, I'm interested in hearing about
> them.
>  
> Any suggestions for how to incorporate these and stay consistent with
> current Lucene design?

You can use an alternative encoding within the current file format.
There are still only 256 values, but the current range runs roughly from
10^-10 to 10^10. Since these values represent the square root of a field
length, this supports field lengths of up to about 10^20, far too large
for practical purposes, except when the field norm is also used for
boosting with other factors, e.g. as determined by URL link analysis.
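As a back-of-the-envelope sketch (my own numbers, not from the thread):
256 codes spread over a value range spanning 10^20, which is about 2^66,
leaves only around four distinct codes per factor of two.

```java
// Back-of-the-envelope estimate (my own numbers, not from the thread):
// 256 codes over a span of 10^20 ~= 2^66 gives ~4 codes per factor of two.
public class NormResolution {
  public static void main(String[] args) {
    // log2(10^20): how many factors of two the value range spans.
    double spanInFactorsOfTwo = Math.log(1e20) / Math.log(2); // ~66.4
    System.out.printf("codes per factor of two: %.1f%n", 256 / spanInFactorsOfTwo);
  }
}
```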

At the moment I'm considering the following encoding for DensityTermQuery,
which needs a smaller range than the current default.
The actual work is done in decodeNorm2() below.
Unfortunately I lost my notes on the range provided by decodeNorm2
as defined here. IIRC, the idea was to encode 8 different field length
values per factor of two:
0 1 2 3 4 5 6 7
8 10 12 14 16 18 20 22
24 28 32 ...
...
etc.
The inverse of the length is returned, so the term density can be computed
by multiplying instead of dividing. No square root is used in the encoding,
so there is much less room left for encoding other factors in the field norm.


/* All code under APL2:  */

class AlternateNorms extends Similarity {
/* Alternative encoding with more encoded values in actually used range: */
  public static byte encodeNorm2(float inverseLength) {
    if (inverseLength <= 0.0f) {
      return (byte)0;
    }
    if (inverseLength > 1.0f) {
      inverseLength = 1.0f;
    }
    int bit = 1 << 7;
    int res = 0;
    // find the largest byte whose decoded inverse length is still
    // greater than or equal to the given inverse length.
    while (bit > 0) { // literal binary search
      int next = (res | bit);
      if (decodeNorm2((byte) next) >= inverseLength) {
        res = next;
      }
      bit >>= 1;
    }
    return (byte) res;
  }
  
  public static float decodeNorm2(byte b) { // This defines the encoding and decoding.
    int mantissa = b & 0x07;
    int exponent = (b >> 3) & 0x1f;
    long pow2 = 1L << exponent;
    long length = ((pow2 * mantissa) + 8 * (pow2 - 1));
    return length == 0L ? 0.0f : (1.0f / length);
  }
}
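As a quick check of the scheme, here is a standalone sketch (the class name
is hypothetical, and the Lucene Similarity superclass is omitted so it
compiles on its own) that prints the decoded field lengths for the first
few bytes and verifies that encodeNorm2 round-trips every byte value:

```java
// Standalone sketch of the encoding above. The class name is hypothetical,
// and the Lucene Similarity superclass is omitted so this compiles alone.
public class AlternateNormsDemo {

  public static byte encodeNorm2(float inverseLength) {
    if (inverseLength <= 0.0f) {
      return (byte) 0;
    }
    if (inverseLength > 1.0f) {
      inverseLength = 1.0f;
    }
    int bit = 1 << 7;
    int res = 0;
    // Binary search for the largest byte whose decoded inverse length
    // is still greater than or equal to the given inverse length.
    while (bit > 0) {
      int next = (res | bit);
      if (decodeNorm2((byte) next) >= inverseLength) {
        res = next;
      }
      bit >>= 1;
    }
    return (byte) res;
  }

  public static float decodeNorm2(byte b) {
    int mantissa = b & 0x07;        // low 3 bits: 8 lengths per factor of two
    int exponent = (b >> 3) & 0x1f; // high 5 bits: which factor of two
    long pow2 = 1L << exponent;
    long length = (pow2 * mantissa) + 8 * (pow2 - 1);
    return length == 0L ? 0.0f : (1.0f / length);
  }

  public static void main(String[] args) {
    // Decoded lengths for the first bytes: 0..7, then 8,10,...,22, then 24,28,...
    for (int b = 1; b <= 17; b++) {
      System.out.println(b + " -> length " + Math.round(1.0 / decodeNorm2((byte) b)));
    }
    // Every byte should encode back to itself.
    for (int i = 0; i < 256; i++) {
      int back = encodeNorm2(decodeNorm2((byte) i)) & 0xFF;
      if (back != i) {
        throw new AssertionError("byte " + i + " round-tripped to " + back);
      }
    }
    System.out.println("round-trip ok for all 256 bytes");
  }
}
```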

I've also used this to show the norms:

/* Show the stored norms and their squares:
  stored norms vary roughly from 10^-10 to 10^10.
  Squared stored norms represent actual lengths,
  varying from 10^-20 to 10^20,
  which is far too much, as long as no ranking by document weighting is done.

  The lengths only need to vary from 10^0 to 10^8, i.e. the length itself
  might be stored encoded, instead of its square root.
*/
  static {
    for (int i = 0; i < 256; i++) {
      float norm = decodeNorm2((byte)i);
      System.out.print(" " + norm + " " + (encodeNorm2(norm * 0.99f)  & 0xFF));
      if ((i + 1) % 4 == 0) {
        System.out.println();
      }
    }
  }


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

