Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (herse.apache.org: local policy)
Date: Wed, 1 Nov 2006 10:41:54 +0200
From: "Nadav Har'El" <nyh@math.technion.ac.il>
To: java-user@lucene.apache.org
Subject: Re: Indexing floating point number
Message-ID: <20061101084154.GA3222@fermat.math.technion.ac.il>
References: <85c493340610300157t57ea76a3oa1695e8b3a533b23@mail.gmail.com>
 <c68e39170610300955h33dc0156t59054e5970a95b64@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="2fHTh5uZTiUOsy+g"
Content-Disposition: inline
In-Reply-To: <c68e39170610300955h33dc0156t59054e5970a95b64@mail.gmail.com>
User-Agent: Mutt/1.4.2.2i
Hebrew-Date: 10 Heshvan 5767

--2fHTh5uZTiUOsy+g
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Mon, Oct 30, 2006, Yonik Seeley wrote about "Re: Indexing floating point number":
> On 10/30/06, KEGan <khoon.ee.gan@gmail.com> wrote:
> >Newbie question. How do we index floating point number in Lucene, so that 
> >it
> >is sortable ? There is a built-in utility class 'NumberTools' which help
> >with indexing integer. Does Lucene has the same mechanism for floating 
> >point
> >number?
> 
> You can look at NumberUtils in Solr, it has conversions for
> int/long/float/double that make strings that sort correctly

This question was asked so many times, that I think it belongs in the FAQ.
Moreover, I wonder if we shouldn't add a floating-point version of NumberTools
to the base Lucene, instead of referring people to Solr every time this
question is asked?

I'm attaching another possible implementation. It's a class ToSortableString
with two static functions fromDouble(double) and fromInt(int). These convert
numbers into strings which sort lexicographically in the same order as the
original numbers.

The long comments in the attached code explain my encoding, the details of
which are different from Solr's. In particular differences include:
1. In my code, integers and floating point can be mixed: the encoding of
   7 (integer) and 7.0 (double) is the same.
2. In my code, we can decide how much precision (significant digits) we want
   to keep from doubles. I decided to keep only 10 significant digits (out of
   the available ~16) by default, which is more than enough for search
   application.
3. My code currently uses base 10 encoding, and therefore often generates
   longer strings than Solr's NumberTools. Moving to base 100 or even 256
   (as I suggest in the comments) can eliminate this difference.
4. My code is probably a bit slower, though I didn't actually check.
5. I did not yet implement a reverse transformation (from String back to
   number), because it practice it wasn't needed.

> (for range queries as well as sorts).

This support for floating-point range queries is a "honey trap". It sounds
really good, but when you actually try to use it, you'll notice how extremely
inefficient it is during search (unless you're lucky enough to have only a
few distinct floating point values in the given range).

To really support numeric search well, Lucene may need some additional data
structure in addition to the usual lexicon and posting lists - like some sort
tree of sub-range posting lists, or something.
But perhaps this can even be done without any changes to Lucene: for example,
imagine we index the number 2.345 as four tokens "2.345", "2.34", "2.3", "2".
Now, if we're looking for the range [2.3-2.4) we can just search for "2.3",
and if we're looking for the range [2.2-2.45) we can search for
	"2.2" OR "2.3" OR "2.40" OR "2.41" OR "2.42" OR "2.43" OR "2.44"
(note that this is an OR of just 7 posting lists, even if this range contains
thousands of distinct values).

I wonder if anybody ever done such a thing (or came up with an better
solution) in Lucene.

-- 
Nadav Har'El                        |   Wednesday, Nov 1 2006, 10 Heshvan 5767
IBM Haifa Research Lab              |-----------------------------------------
                                    |The socks in my drawer are like
http://nadav.harel.org.il           |snowflakes: No two are alike.

--2fHTh5uZTiUOsy+g
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="ToSortableString.java"

package com.ibm.search.lucene.query.parser;

import java.text.DecimalFormat;

/**
 * The functions in this class do a transformation which is currently necessary
 * for using Lucene's RangeQuery over numbers. They convert floating-point
 * numbers into strings which sort lexicographically in the same order as the
 * original numbers.
 * 
 * Unfortunately, for any number domain other than positive integers, we are
 * not aware of a way of doing this sort of conversion while keeping the
 * numbers readable. The result of the conversion (even for positive integers)
 * will be unreadable to casual readers. It is therefore recommended that if
 * you want numeric fields to be stored, not just index, store the original
 * numbers rather than the result of this conversion. It is possible to do
 * this with the aid of a special Analyzer, which will carry out this
 * conversion for the purpose of indexing but will leave the un-analyzed value
 * stored.
 * 
 * The conversions we carry out is one-to-one (up to the 10 decimal significant
 * digits of precision - as explained below), and it should be easy to carry
 * out the reverse conversion. However, we haven't implemented this yet
 * because we haven't found it necessary.
 */

public class ToSortableString {
	/**
	 * fromDouble() implements a transformation which is currently
	 * necessary for matching a range of floating point numbers using
	 * Lucene's RangeQuery. This function converts floating-point numbers
         * into strings which sort lexicographically in the same order as the
	 * original numbers.
	 * 
	 * We correctly handle numbers in double's entire range, but limit
	 * precision to 10 significant digits. This was a deliberate choice:
	 * 10 decimal digits are enough to distinguish between 32 bit
	 * integers, and more floating-point precision is hardly required in
	 * search applications.
	 */

	/*
	 * Our string representation of a floating point number will look as
	 * follows: SEEENN... Where: 1. S is the sign of the number, "n" for
	 * negative, "p" for positive. This ensures that all negative numbers
	 * come before all positive numbers. (In the alphabet, "n" comes
	 * before "p"). 2. EEE is the base-10 exponent leaving a mantissa
	 * between >=0.1, <1.0. We add 500 to it, to get all the negative and
	 * positive exponents into 3 digits (typically, a double's exponent
	 * can be between -308 and 308). For negative numbers, we negate their
	 * exponent first. The idea is that positive numbers with lower
	 * exponent come before (are less) than those with higher exponent
	 * (1e-1 is smaller than 1e2) and for negative numbers, this is
	 * reversed (-1e-1 is larger than -1e2). We assume that for Java
	 * doubles, the maximum exponent is around 300, so our addition of 500
	 * is more than enough, and ensures that the result doesn't even need
	 * padding with zeros. 3. NN... is the mantissa, >=0.1 and <1.0.
	 * When the original number is positive, we output the mantissa as a
	 * base 10 floating point without the initial "0.". For negative
	 * numbers, 1 minus the mantissa is printed instead (the result is
	 * <=0.9, >0). The mantissa is printed with a certain precision (up to
	 * 15 digits makes sense on most computers, but we don't need this
	 * sort of precision and we'll settle with 10 - enough for accurate
	 * representation of Java ints). Trailing zeros are not printed,
	 * which is ok, because "0.1" is lexicographically before "0.11".
	 * Positive and negative Infinities are also supported, and generate
	 * strings "pinf" and "n-inf", respectively, which sort after and
	 * before all numbers, respectively. NaNs (Not-a-Number) are not
	 * supported, because their sorting order relative to other numbers
	 * is undefined.
	 * 
	 * If this representation appears similar to the IEEE 754 floating
	 * point standard, it is not entirely an accident. IEEE 754
	 * representation is almost sortable, except that for a need to invert
	 * all bits for negative numbers, and that the sign bit is on for
	 * negative numbers, rather than positive (larger) numbers. Solr's 
	 * "NumberUtils", for example, uses Java's Double.floatToRawIntBits
	 * to very easily do this, and treat ints, longs, floats and doubles
	 * very similarly; In their code, a 4 byte int or float is transformed
	 * into 3 java (16-bit) chars, and a 8 byte long or double is
	 * transformed into 5 java chars. 
	 * 
	 * Note that we chose base 10 representation because it was more
	 * convinient to code and debug, as well as giving a shorter
	 * representation for floating point numbers with a short decimal
	 * representation (say, 0.12). However, using a larger base would have
	 * yielded a more compact representation. It particular, it should be
	 * easy to half the length of this function's output by converting
	 * its output to base 100.
	 */

	public static String fromDouble(double d) {
		boolean negative = d < 0;
		if (negative)
			d = -d;
		// Note: the addition of 1e-15 below is to offset rounding
		// errors in our log calculation, e.g., log(1000)/log(10)-3 
		// comes out -4e16 on the machine I tried this on. This
		// addition is not important at all, except to get nicer
		// output for integers - to convert 1000 into p5041 and not
		// p5039999999999.
		int exponent = (int) Math.floor(Math.log(d) / log10 + 1 + 1e-15);
		double mantissa = d * Math.exp(-exponent * log10);

		// If the exponent is too negative, d is zero. Just print "p0",
		// which will sort before any positive number, but after every
		// negative number.
		if (exponent <= -400)
			return "p0";
		// If the exponent is too positive, we're at infinity, either
		// positive or negative. Print "n-inf" (sorts before all
		// negative numbers we print) or "pinf" (sorts after all
		// positive numbers we print).
		if (exponent >= 400)
			return negative ? "n-inf" : "pinf";

		if (negative)
			exponent = -exponent;
		exponent += 500;

		if (negative)
			return "n" + Integer.toString(exponent)
					+ mantissaFormat(1.0 - mantissa);
		else
			return "p" + Integer.toString(exponent) + mantissaFormat(mantissa);
	}

	private static final double log10 = Math.log(10);

	// This function limits the number's precision to 10 significant
	// decimal digits. See comment above for the rationale for this choice.
	private static String mantissaFormat(double m) {
		// Unfortunately, we need to create a new "DecimalFormat"
		// object every time, because that object is not thread safe.
		// This may be a performance bottleneck.
		DecimalFormat mantFormatter = new DecimalFormat(".##########");
		// 1234567890
		String result = mantFormatter.format(m);
		if (result.charAt(0) == '.')
			return result.substring(1); // remove initial "."
		else
			// 0.999.. was printed as 1.0. We don't want that.
			return "9999999999"; // we could have also printed "x" or
		// anything that comes after all numbers.
		// 1234567890
	}

	/**
	 * fromInt() implements a transformation which is currently necessary
	 * for matching a range of integers using Lucene's RangeQuery. This
	 * function converts integers into strings which sort
	 * lexicographically in the same order as the original numbers.
	 * 
	 * fromInt() uses the same transformation as fromDouble(), so that
	 * fromInt(3) is idential to fromDouble(3.0) and comes before
	 * fromDouble(3.1), and so on.
	 */
	public static String fromInt(int i) {
		// NOTE: for simplicity, we could have just returned
		// fromDouble((double)i), instead of inventing this integer
		// implementation. But this was an interesting exercise :-)
		if (i == 0)
			return "p0";
		boolean negative = i < 0;
		String istr;
		if (negative)
			istr = Long.toString(-(long) i); // long needed for
		// -Int.MIN_VALUE...
		else
			istr = Integer.toString(i);
		int exponent = istr.length();
		// remove trailing zeros from istr (just like happens when
		// printing floating point). Also, if negative, replace each
		// digit by 9-digit (identical to taking 1-mantissa in
		// floating point).
		int len = exponent;
		while (len > 0 && istr.charAt(len - 1) == '0') {
			len--;
		}
		StringBuffer sb = new StringBuffer();
		if (negative) {
			sb.append('n');
			sb.append(Integer.toString(500 - exponent));
			for (int j = 0; j < len; j++)
				sb
						.append((char) ('0' + ('9' - istr.charAt(j)) + (j == (len - 1) ? 1
								: 0)));
		} else {
			sb.append('p');
			sb.append(Integer.toString(500 + exponent));
			sb.append(istr.substring(0, len));
		}
		return sb.toString();
	}
}


--2fHTh5uZTiUOsy+g
Content-Type: text/plain; charset=us-ascii

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
--2fHTh5uZTiUOsy+g--