From: Apache Wiki
To: java-commits@lucene.apache.org
Reply-To: java-dev@lucene.apache.org
Date: Sat, 14 Mar 2009 17:21:07 -0000
Message-ID: <20090314172107.29367.70059@aurora.apache.org>
Subject: [Lucene-java Wiki] Update of "SearchNumericalFields" by UweSchindler

Dear Wiki user, You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.
The following page has been changed by UweSchindler: http://wiki.apache.org/lucene-java/SearchNumericalFields

The comment on the change is: update docs

------------------------------------------------------------------------------
 == TrieRangeQuery (in contrib/search since version 2.9-dev, which is not yet released) ==
- Because Apache Lucene is a full-text search engine and not a conventional database, it cannot handle numerical ranges (e.g., a field value lies within user-defined bounds; even dates are numerical values). A contrib extension was developed that stores the numerical values in a special string-encoded format with variable precision (all numerical values like doubles, longs, and timestamps are converted to lexicographically sortable string representations and stored with different precisions, from one byte to the full 8 bytes, depending on the variant used). A range is then divided recursively into multiple intervals for searching: the center of the range is searched only with the lowest possible precision in the trie, the boundaries are matched more exactly. This reduces the number of terms dramatically. See: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/all/org/apache/lucene/search/trie/package-summary.html
+ Because Apache Lucene is a full-text search engine and not a conventional database, it cannot handle numerical ranges (e.g., a field value lies within user-defined bounds; even dates are numerical values). We have developed an extension to Apache Lucene that stores the numerical values in a special string-encoded format with variable precision (all numerical values like doubles, longs, Dates, floats, and ints are converted to lexicographically sortable string representations and stored with different precisions). For a more detailed description of how the values are stored, see TrieUtils.
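The conversion to sortable representations at different precisions can be illustrated with a small, self-contained sketch. It is not the Trie''''''Utils implementation: the class name, the method names, and the hex-based term layout (a shift prefix followed by 16 hex digits) are assumptions made only for this example. It shows the general idea of flipping the sign bit so that lexicographic order matches numeric order, and of dropping low-order bits to obtain a lower-precision term.

```java
import java.util.Locale;

// Illustrative sketch only; names and the string layout are hypothetical,
// not the format used by the contrib package.
class SortableEncodingSketch {

    /** Flips the sign bit so that unsigned order of the result matches signed order of the input. */
    public static long toSortableBits(long value) {
        return value ^ Long.MIN_VALUE;
    }

    /** Maps a double to a long whose signed order matches the numeric order of the doubles. */
    public static long doubleToSortableLong(double value) {
        long bits = Double.doubleToLongBits(value);
        // Negative doubles: invert the lower 63 bits so that larger magnitudes sort lower.
        if (bits < 0) {
            bits ^= 0x7fffffffffffffffL;
        }
        return bits;
    }

    /**
     * Encodes a long as a fixed-width string with the lowest `shift` bits dropped.
     * Terms with the same shift compare lexicographically in numeric order.
     */
    public static String encode(long value, int shift) {
        long prefix = toSortableBits(value) >>> shift;
        return String.format(Locale.ROOT, "%02d:%016x", shift, prefix);
    }

    public static void main(String[] args) {
        // Full precision: one distinct term per value.
        System.out.println(encode(-5L, 0));
        System.out.println(encode(3L, 0));
        // Reduced precision: values sharing the upper bits collapse into one term.
        System.out.println(encode(0x1234L, 8));
        System.out.println(encode(0x12ffL, 8));
        // Doubles go through the sortable long first.
        System.out.println(encode(doubleToSortableLong(-2.5), 0));
    }
}
```

Indexing each value under several shifts is what later allows a query to match a whole block of values with a single low-precision term.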
A range is then divided recursively into multiple intervals for searching: the center of the range is searched only with the lowest possible precision in the trie, while the boundaries are matched more exactly. This reduces the number of terms dramatically. See: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/all/org/apache/lucene/search/trie/package-summary.html
 This dramatically improves the performance of range queries in Apache Lucene, which no longer depends on the index size or the number of distinct values, because the number of terms to search has an upper limit that is unrelated to either of these properties.
- Trie''''''Range''''''Query can be used for date/time searches (if you need variable precision of date and time down to milliseconds), double searches (e.g. spatial search for latitudes or longitudes), and prices (if encoded as long cent values; doubles are a poor fit for prices because of rounding problems). The document fields containing the trie-encoded values are generated by the Trie''''''Utils class. The values can also be stored in the index using the trie encoding; for display, they can be converted back to the primitive types. Trie''''''Utils also supplies a factory for Sort''''''Field instances on trie-encoded fields that automatically uses an Extended''''''Field''''''Cache.Long''''''Parser for efficient sorting of the primitive types.
+ Trie''''''Range''''''Query can be used for date/time searches (if you need variable precision of date and time down to milliseconds), double searches (e.g. spatial search for latitudes or longitudes), and prices (if encoded as long cent values; doubles are a poor fit for prices because of rounding problems). The document fields containing the trie-encoded values are generated by the Trie''''''Utils class. The values can also be stored in the index using the trie encoding; for display, they can be converted back to the primitive types.
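The recursive division of a range into intervals of different precision can be sketched as follows. This is a simplified illustration under stated assumptions, not the contrib code: it handles only non-negative longs, the class `TrieRangeSplitSketch`, its `SubRange` type, and the `precisionStep` parameter name are hypothetical, and each sub-range is expressed as a prefix interval (the value with its lowest `shift` bits dropped) rather than as index terms.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names are hypothetical and input is limited to non-negative longs.
class TrieRangeSplitSketch {

    /** One sub-range: matches every value v with min <= (v >>> shift) <= max. */
    static final class SubRange {
        final int shift;
        final long min;
        final long max;

        SubRange(int shift, long min, long max) {
            this.shift = shift;
            this.min = min;
            this.max = max;
        }

        boolean contains(long value) {
            long prefix = value >>> shift;
            return prefix >= min && prefix <= max;
        }

        @Override
        public String toString() {
            return "shift=" + shift + " [" + min + ".." + max + "]";
        }
    }

    /** Splits [lower, upper] into exact edge ranges and coarser ranges toward the center. */
    public static List<SubRange> split(long lower, long upper, int precisionStep) {
        if (lower < 0 || upper < lower || precisionStep < 1 || precisionStep > 64) {
            throw new IllegalArgumentException("expects 0 <= lower <= upper and 1 <= precisionStep <= 64");
        }
        List<SubRange> out = new ArrayList<>();
        long min = lower;
        long max = upper;
        for (int shift = 0; ; shift += precisionStep) {
            long mask = ((1L << precisionStep) - 1L) << shift; // bits of the current precision level
            long diff = 1L << (shift + precisionStep);         // size of one block at the next level
            boolean hasLower = (min & mask) != 0L;   // lower bound is not aligned to the next level
            boolean hasUpper = (max & mask) != mask; // upper bound does not fill its block
            long nextMin = (hasLower ? min + diff : min) & ~mask;
            long nextMax = (hasUpper ? max - diff : max) & ~mask;
            boolean wrapped = nextMin < min || nextMax > max;  // arithmetic overflow guard
            if (shift + precisionStep >= 64 || nextMin > nextMax || wrapped) {
                // No coarser level is usable: emit what is left at this precision and stop.
                out.add(new SubRange(shift, min >>> shift, max >>> shift));
                break;
            }
            if (hasLower) {
                out.add(new SubRange(shift, min >>> shift, (min | mask) >>> shift));
            }
            if (hasUpper) {
                out.add(new SubRange(shift, (max & ~mask) >>> shift, max >>> shift));
            }
            min = nextMin;
            max = nextMax;
        }
        return out;
    }

    /** Counts how many sub-ranges match the value; a correct split gives 1 inside the range and 0 outside. */
    public static int countMatches(List<SubRange> ranges, long value) {
        int count = 0;
        for (SubRange r : ranges) {
            if (r.contains(value)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        for (SubRange r : split(3, 40, 2)) {
            System.out.println(r);
        }
    }
}
```

In this sketch the number of sub-ranges grows with the number of precision levels rather than with the number of values inside the range, which is the property the text above relies on.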
Trie''''''Utils also supplies a factory for Sort''''''Field instances on trie-encoded fields that automatically uses an Extended''''''Field''''''Cache.Long''''''Parser or Field''''''Cache.Int''''''Parser for efficient sorting of the primitive types.
-
- Currently Trie''''''Range''''''Query is only available for 64-bit values (long, double, Date); support for 32-bit values (int, float) is in preparation. Because of the trie encoding, the additional unused bits are no problem for search performance, but the index size is larger (more terms per numerical document field).
 == Other possibilities for storing numerical values in a more readable form in the index ==