Return-Path: Delivered-To: apmail-lucene-java-user-archive@www.apache.org Received: (qmail 35574 invoked from network); 9 Apr 2006 12:02:55 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (209.237.227.199) by minotaur.apache.org with SMTP; 9 Apr 2006 12:02:55 -0000 Received: (qmail 61773 invoked by uid 500); 9 Apr 2006 12:02:46 -0000 Delivered-To: apmail-lucene-java-user-archive@lucene.apache.org Received: (qmail 61660 invoked by uid 500); 9 Apr 2006 12:02:45 -0000 Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-user@lucene.apache.org Delivered-To: mailing list java-user@lucene.apache.org Received: (qmail 61621 invoked by uid 99); 9 Apr 2006 12:02:45 -0000 Received: from asf.osuosl.org (HELO asf.osuosl.org) (140.211.166.49) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 09 Apr 2006 05:02:45 -0700 X-ASF-Spam-Status: No, hits=-0.0 required=10.0 tests=SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (asf.osuosl.org: domain of NYH@il.ibm.com designates 195.212.29.151 as permitted sender) Received: from [195.212.29.151] (HELO mtagate2.de.ibm.com) (195.212.29.151) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 09 Apr 2006 05:02:44 -0700 Received: from d12nrmr1607.megacenter.de.ibm.com (d12nrmr1607.megacenter.de.ibm.com [9.149.167.49]) by mtagate2.de.ibm.com (8.13.6/8.13.6) with ESMTP id k39C2MOn199936 for ; Sun, 9 Apr 2006 12:02:22 GMT Received: from d12av04.megacenter.de.ibm.com (d12av04.megacenter.de.ibm.com [9.149.165.229]) by d12nrmr1607.megacenter.de.ibm.com (8.12.10/NCO/VER6.8) with ESMTP id k39C3DDe240608 for ; Sun, 9 Apr 2006 14:03:13 +0200 Received: from d12av04.megacenter.de.ibm.com (loopback [127.0.0.1]) by d12av04.megacenter.de.ibm.com (8.12.11/8.13.3) with ESMTP id k39C2L5I031920 for ; Sun, 9 Apr 2006 14:02:21 +0200 Received: from d12mc102.megacenter.de.ibm.com (d12mc102.megacenter.de.ibm.com [9.149.167.114]) by d12av04.megacenter.de.ibm.com (8.12.11/8.12.11) with ESMTP id k39C2LvE031914 for ; Sun, 9 Apr 2006 14:02:21 +0200 Subject: solution: RangeQuery with floating point numbers To: java-user@lucene.apache.org X-Mailer: Lotus Notes Release 7.0 August 18, 2005 Message-ID: From: "Nadav Har'El" Date: Sun, 9 Apr 2006 14:53:31 +0300 X-MIMETrack: Serialize by Router on D12MC102/12/M/IBM(Release 7.0HF90 | November 16, 2005) at 09/04/2006 15:03:12 MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII X-Virus-Checked: Checked by ClamAV on apache.org X-Spam-Rating: minotaur.apache.org 1.6.2 0/1000/N Hi all, As Lucene's documentation explain, RangeQuery (and ConstantScoreRangeQuery) require their key to be strings which are lexicographically (alphabetically) ordered. "Lucene in Action", section 6.3.3 ("Handling numeric field-range queries") explains what to do when you need to range search in positive integers: in that case, you have to pad the numbers with leading zeros, so they all have the same length and sort correctly. Other people on this list suggested how to extend this trick for when you have signed integers. But I wanted to go all the way to general "double"s, and do a RangeQuery on them. Converting doubles into strings which sort in the correct order is a non-trivial exercise, but is in fact doable. To prevent others from needing to repeat this exercise, here is an example code that I wrote, that does this. The code below isn't as long as it looks: most of it is comments which hopefull explain what I've done. By the way, it is worth repeating the warning that appears everywhere (including Lucene In Action): while this sort of trick works, RangeQuery is extremely inefficient (and might not even work) when the number of different values that are included in the range is very large. ConstantScoreRangeQuery is better, but also inefficient. My code allows you to use these queries for floating point values, but in no way makes this use efficient :-( P.S. I wonder if anyone can suggest an alternative implementation for mantissaFormat() below, which doesn't require a DecimalFormat object creation. /** * The doubleToSortableString function does a transformation which is * currently necessary for implementing Range Constaints using Lucene's * RangeQuery. This function converts floating-point numbers into strings * which can be sorted in lexicographical order. * * This conversion is easy for positive integers (which can be padded * with leading zero), but much more complicated to do right for doubles * which can be positive or negative, and have large ranges (1e-74 and * 1e74 are both perfectly fine as doubles, and we should handle them). * We correctly handle numbers in double's entire range, but limit * precision to 10 significant digits (this was a deliberate choice - 10 * decimal digits is enough to distinguish between 32 bit integers). * * Unfortunately, we are not aware of a way of doing this sort of * conversion while keeping the numbers readable. The result of the * conversion (even for positive integers) will be unreadable to casual * readers. It is therefore recommended to "STORE" the original number * in the index, rather than the result of this conversion. It is * possible to do this with the aid of a special Analyzer, which will * carry out this conversion for the purpose of indexing (but will * leave the un-analyzed value STOREd). * * The conversion we carry out is one-to-one (up to the 10 decimal * significant digits of precision), and it should be easy to carry * out the reverse conversion. However, we haven't implmented this yet * (because it shouldn't be necessary). */ /* Our string representation of a floating point number will look as * follows: * SEEENN... * Where: * 1. S is the sign of the number, "n" for negative, "p" for positive. * This ensures that all negative numbers come before all positive * numbers. (In the alphabet, "n" comes before "p"). * 2. EEE is the base-10 exponent leaving a mantissa between >=0.1, <1.0. * We add 500 to it, to get all the negative and positive exponents * into 3 digits (typically, a double's exponent can be between -308 * and 308). For negative numbers, we negate their exponent first. * The idea is that positive numbers with lower exponent come before * (are less) than those with higher exponent (1e-1 is smaller than * 1e2) and for negative numbers, this is reversed (-1e-1 is larger * than -1e2). * We assume that for Java doubles, the maximum exponent is around 300, * so our addition of 500 is more than enough, and ensures that the * result doesn't even need padding with zeros. * 3. NN... is the mantissa, >=0.1 and <1.0. When the original number is * positive, we output the mantissa as a base 10 floating point without * the initial "0.". For negative numbers, 1 minus the mantissa is * printed instead (the result is <=0.9, >0). The mantissa is printed * with a certain precision (up to 15 digits makes sense on most * computers, but we don't need this sort of precision and we'll settle * with 10 - enough for accurate representation of Java ints). * Trailing zeros are not printed, which is ok, because "0.1" is * lexicographically before "0.11". * Positive and negative Infinities are also supported, and generate * strings "pinf" and "n-inf", respectively, which sort after and before * all numbers, respectively. NaNs (Not-a-Number) are not supported, * because their sorting order relative to other numbers is undefined. * * Note that we chose base 10 representation because it was more convinient * to code and debug, as well as giving a shorter representation for * floating point numbers with a short decimal representation (say, 0.12). * However, using a larger base would have yielded a more compact * representation. It particular, it should be easy to half the length * of this function's output by converting its output to base 100. */ public static String doubleToSortableString(double d){ boolean negative = d<0; if(negative) d= -d; // continue to work on the positive number int exponent = (int)Math.floor(Math.log(d)/log10 + 1); double mantissa = d * Math.exp(-exponent * log10); // If the exponent is very negative, we're at 0 - just print "p0", // which will sort before any positive number, but after every // negative number. if(exponent <= -400) return "p0"; // If the exponent is too high, we're at infinity, either positive // or negative. Print "n-inf" (sorts before all negative numbers we // print) or "pinf" (sorts after all positive numbers we print). if(exponent >= 400) return negative ? "n-inf" : "pinf"; if(negative) exponent = -exponent; exponent += 500; if(negative) return "n"+Integer.toString(exponent)+ mantissaFormat(1.0-mantissa); else return "p"+Integer.toString(exponent)+ mantissaFormat(mantissa); } private static final double log10 = Math.log(10); // This function limits the number's precision to 10 significant decimal // digits. See comment above for the rationale for this choice. private static String mantissaFormat(double m){ // Unfortunately, we need to create a new "DecimalFormat" object // every time, because that object is not thread safe. This may // be a performance bottleneck. DecimalFormat mantFormatter = new DecimalFormat(".##########"); // 1234567890 String result = mantFormatter.format(m); if(result.charAt(0)=='.') return result.substring(1); // remove initial "." else // 0.999.. was printed as 1.0. We don't want that. return "9999999999"; // we could have also printed "x" or anything that comes after all numbers. // 1234567890 } -- Nadav Har'El nyh@il.ibm.com +972-4-829-6326 --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For additional commands, e-mail: java-user-help@lucene.apache.org