lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: solution: RangeQuery with floating point numbers
Date Sun, 09 Apr 2006 23:32:36 GMT

you may want to look into the way Solr deals with
sortable ints/longs/floats/doubles...

http://svn.apache.org/viewcvs.cgi/incubator/solr/trunk/src/java/org/apache/solr/util/NumberUtils.java?view=markup

...it leverages the fact that Java will give you the raw bits of a
double/float as an long/int, and from there some (obscure) bitwise
operations make sure they sort properly.


: Date: Sun, 9 Apr 2006 14:53:31 +0300
: From: Nadav Har'El <NYH@il.ibm.com>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: solution: RangeQuery with floating point numbers
:
:
: Hi all,
:
: As Lucene's documentation explain, RangeQuery (and ConstantScoreRangeQuery)
: require their key to be strings which are lexicographically
: (alphabetically) ordered.
: "Lucene in Action", section 6.3.3 ("Handling numeric field-range queries")
: explains
: what to do when you need to range search in positive integers: in that
: case,
: you have to pad the numbers with leading zeros, so they all have the same
: length and sort correctly.
: Other people on this list suggested how to extend this trick for when you
: have
: signed integers.
:
: But I wanted to go all the way to general "double"s, and do a RangeQuery
: on them. Converting doubles into strings which sort in the correct
: order is a non-trivial exercise, but is in fact doable. To prevent others
: from needing to repeat this exercise, here is an example code that I wrote,
: that does this. The code below isn't as long as it looks: most of it is
: comments
: which hopefull explain what I've done.
:
: By the way, it is worth repeating the warning that appears everywhere
: (including Lucene In Action): while this sort of trick works, RangeQuery
: is extremely inefficient (and might not even work) when the number of
: different values that are included in the range is very large.
: ConstantScoreRangeQuery is better, but also inefficient. My code allows
: you to use these queries for floating point values, but in no way makes this
: use efficient :-(
:
: P.S. I wonder if anyone can suggest an alternative implementation for
: mantissaFormat() below, which doesn't require a DecimalFormat
: object creation.
:
:
: /**
:  *  The doubleToSortableString function does a transformation which is
:  *  currently necessary for implementing Range Constaints using Lucene's
:  *  RangeQuery. This function converts floating-point numbers into strings
:  *  which can be sorted in lexicographical order.
:  *
:  *  This conversion is easy for positive integers (which can be padded
:  *  with leading zero), but much more complicated to do right for doubles
:  *  which can be positive or negative, and have large ranges (1e-74 and
:  *  1e74 are both perfectly fine as doubles, and we should handle them).
:  *  We correctly handle numbers in double's entire range, but limit
:  *  precision to 10 significant digits (this was a deliberate choice - 10
:  *  decimal digits is enough to distinguish between 32 bit integers).
:  *
:  *  Unfortunately, we are not aware of a way of doing this sort of
:  *  conversion while keeping the numbers readable. The result of the
:  *  conversion (even for positive integers) will be unreadable to casual
:  *  readers. It is therefore recommended to "STORE" the original number
:  *  in the index, rather than the result of this conversion. It is
:  *  possible to do this with the aid of a special Analyzer, which will
:  *  carry out this conversion for the purpose of indexing (but will
:  *  leave the un-analyzed value STOREd).
:  *
:  *  The conversion we carry out is one-to-one (up to the 10 decimal
:  *  significant digits of precision), and it should be easy to carry
:  *  out the reverse conversion. However, we haven't implmented this yet
:  *  (because it shouldn't be necessary).
:  */
:
: /* Our string representation of a floating point number will look as
:  * follows:
:  *   SEEENN...
:  * Where:
:  *  1. S is the sign of the number, "n" for negative, "p" for positive.
:  *     This ensures that all negative numbers come before all positive
:  *     numbers. (In the alphabet, "n" comes before "p").
:  *  2. EEE is the base-10 exponent leaving a mantissa between >=0.1, <1.0.
:  *     We add 500 to it, to get all the negative and positive exponents
:  *     into 3 digits (typically, a double's exponent can be between -308
:  *     and 308). For negative numbers, we negate their exponent first.
:  *     The idea is that positive numbers with lower exponent come before
:  *     (are less) than those with higher exponent (1e-1 is smaller than
:  *     1e2) and for negative numbers, this is reversed (-1e-1 is larger
:  *     than -1e2).
:  *     We assume that for Java doubles, the maximum exponent is around 300,
:  *     so our addition of 500 is more than enough, and ensures that the
:  *     result doesn't even need padding with zeros.
:  *  3. NN... is the mantissa, >=0.1 and <1.0. When the original number is
:  *     positive, we output the mantissa as a base 10 floating point without
:  *     the initial "0.". For negative numbers, 1 minus the mantissa is
:  *     printed instead (the result is <=0.9, >0). The mantissa is printed
:  *     with a certain precision (up to 15 digits makes sense on most
:  *     computers, but we don't need this sort of precision and we'll settle
:  *     with 10 - enough for accurate representation of Java ints).
:  *     Trailing zeros are not printed, which is ok, because "0.1" is
:  *     lexicographically before "0.11".
:  * Positive and negative Infinities are also supported, and generate
:  * strings "pinf" and "n-inf", respectively, which sort after and before
:  * all numbers, respectively. NaNs (Not-a-Number) are not supported,
:  * because their sorting order relative to other numbers is undefined.
:  *
:  * Note that we chose base 10 representation because it was more convinient
:  * to code and debug, as well as giving a shorter representation for
:  * floating point numbers with a short decimal representation (say, 0.12).
:  * However, using a larger base would have yielded a more compact
:  * representation. It particular, it should be easy to half the length
:  * of this function's output by converting its output to base 100.
:  */
: public static String doubleToSortableString(double d){
:       boolean negative = d<0;
:       if(negative)
:             d= -d; // continue to work on the positive number
:       int exponent = (int)Math.floor(Math.log(d)/log10 + 1);
:       double mantissa = d * Math.exp(-exponent * log10);
:
:       // If the exponent is very negative, we're at 0 - just print "p0",
:       // which will sort before any positive number, but after every
:       // negative number.
:       if(exponent <= -400)
:             return "p0";
:       // If the exponent is too high, we're at infinity, either positive
:       // or negative. Print "n-inf" (sorts before all negative numbers we
:       // print) or "pinf" (sorts after all positive numbers we print).
:       if(exponent >= 400)
:             return negative ? "n-inf" : "pinf";
:
:       if(negative)
:             exponent = -exponent;
:       exponent += 500;
:
:       if(negative)
:             return "n"+Integer.toString(exponent)+
:                   mantissaFormat(1.0-mantissa);
:       else
:             return "p"+Integer.toString(exponent)+
:                   mantissaFormat(mantissa);
: }
: private static final double log10 = Math.log(10);
: // This function limits the number's precision to 10 significant decimal
: // digits. See comment above for the rationale for this choice.
: private static String mantissaFormat(double m){
:       // Unfortunately, we need to create a new "DecimalFormat" object
:       // every time, because that object is not thread safe. This may
:       // be a performance bottleneck.
:       DecimalFormat mantFormatter = new DecimalFormat(".##########");
:                                                                           //  1234567890
:       String result = mantFormatter.format(m);
:       if(result.charAt(0)=='.')
:             return result.substring(1); // remove initial "."
:       else // 0.999.. was printed as 1.0. We don't want that.
:             return "9999999999"; // we could have also printed "x" or anything that comes
after all numbers.
:                  // 1234567890
: }
:
:
:
: --
: Nadav Har'El
: nyh@il.ibm.com
: +972-4-829-6326
:
:
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
: For additional commands, e-mail: java-user-help@lucene.apache.org
:



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message