Return-Path: Delivered-To: apmail-lucene-java-user-archive@www.apache.org Received: (qmail 38744 invoked from network); 28 Feb 2006 20:55:04 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (209.237.227.199) by minotaur.apache.org with SMTP; 28 Feb 2006 20:55:04 -0000 Received: (qmail 76412 invoked by uid 500); 28 Feb 2006 20:54:55 -0000 Delivered-To: apmail-lucene-java-user-archive@lucene.apache.org Received: (qmail 76368 invoked by uid 500); 28 Feb 2006 20:54:55 -0000 Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-user@lucene.apache.org Delivered-To: mailing list java-user@lucene.apache.org Received: (qmail 76336 invoked by uid 99); 28 Feb 2006 20:54:55 -0000 Received: from asf.osuosl.org (HELO asf.osuosl.org) (140.211.166.49) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 28 Feb 2006 12:54:55 -0800 X-ASF-Spam-Status: No, hits=0.0 required=10.0 tests= X-Spam-Check-By: apache.org Received-SPF: pass (asf.osuosl.org: local policy) Received: from [66.105.41.166] (HELO bettylou.configureone.net) (66.105.41.166) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 28 Feb 2006 12:54:54 -0800 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable X-MimeOLE: Produced By Microsoft Exchange V6.5 Subject: RE: Hacking proximity search: looking for feedback Date: Tue, 28 Feb 2006 14:53:20 -0600 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: Hacking proximity search: looking for feedback Thread-Index: AcY8qEM1y3OH45ZMSSuGoo3dcPmfPgAAIj1g From: "John Powers" To: X-Virus-Checked: Checked by ClamAV on apache.org X-Spam-Rating: minotaur.apache.org 1.6.2 0/1000/N I don't know if this matters, but we do all of our geolocating in sql with decent speed. All the trig is in the query itself and then we can limit top 5, top 10 etc for what we show. Is the data such that you need lucene? Can I ask what causes it to be beyond a databases ability? -----Original Message----- From: Bryzek.Michael [mailto:Michael.Bryzek@uwa.unitedway.org]=20 Sent: Tuesday, February 28, 2006 2:49 PM To: java-user@lucene.apache.org Subject: RE: Hacking proximity search: looking for feedback Jeff - This is an interesting approach. On our end, we have experimented with two variants: Variant 1: Use Range Query Rather than precomputing the boolean clauses yourself, index encoded latitude and longitude values and use a Range Query. We encode by adding 1000 to each of the values. Note: We only deal with zip codes in the US for which this encoding works fine, but is worth another look if you have a broader range of coordinates. Example: Compute the box as you describe, then encode each value and use a RangeQuery. Using the box you provided and lucene's query parser: North latitude =3D 47.77704 West longitude =3D -122.41909 South latitude =3D 47.34827 East longitude =3D -122.22031 gives us this query: latitude:[1047.34827 TO 1047.77704] AND longitude:[877.58091 TO 877.77969] Lucene then handles expanding this query into the appropriate set of Boolean Queries. Variant 2: Combine the above w/ a custom scorer For us, boxing works well to retrieve relevant documents, but our users want those documents sorted by distance. We modified the custom scoring class that Hatcher provides in Lucene in Action to compute the distance between two points for only those documents which actually match. Thus, we do our search using a Range Query, then for all matching documents we compute an actual score that incorporates the distance from the actual location from which the user is searching.=20 On our data set, we can still end up with 1000s of matching documents after boxing. Thus, we still see a bottleneck computing the score for even this smaller set of documents which we are still working through. -Mike -----Original Message----- From: Jeff Rodenburg [mailto:jeff.rodenburg@gmail.com] Sent: Tue 2/28/06 3:10 PM To: java-user@lucene.apache.org Cc:=09 Subject: Hacking proximity search: looking for feedback I've been wrestling with a way to index and search data with a geo-positional aspect. By a geo-positional search, I want to constrain search results within a given location range. Furthermore, I want to allow the user to set/change the geo-positional boundaries as needed for their search. This isn't a foreign concept to anyone who's needed to do the same. There's been some work done in this area publicly (or at least what I could find via the list archives and google), but not a lot. The guys at eventax.de have done some very impressive work. Their implementation is within the constructs of nutch; there's more here at http://wiki.apache.org/nutch/GeoPosition. Their approach is very interesting and is predicated by having data indexed in a certain format. I've considered building something based on the FunctionQuery class as well. Within this class, I could actually do the mathematical calculations (Haversine formula, anyone?) for geo-positional filtering. Range queries on this data set were an option as well. I've hit a performance wall with these approaches. The geo-positional calculations are proving to be intensive. With the combination of my data set, hardware, and OS, this just wasn't getting it done. So, I reversed my thinking. Instead of getting Lucene to do geo-math, what if I made geo-math do Lucene? Lucene is exceptional at string lookups; how could I do geo-positional search in that framework? I seized upon an approach that I've validated in testing, but wanted to get more feedback from the community. ************************************************************************ ******************************** Data structure: Latitude & longitude values in decimal format, i.e. = latitude=3D47.480227, longitude=3D-122.333670 (btw, a tire shop I recently visited). Geo definition: Boxing around a center point. It's not critical to do a radius search with a given circle. A boxed approach allows for taller or wider frames of reference, which are applicable for our use. How I'm solving: Treat the data as strings and formulate boolean query lookups based on positional comparison. Here are the steps: Indexing: 1. Break up lat/long values to individual characters, and store each field in progression. Field storage type =3D Keyword. The tire shop example: Lat1 =3D [4] Lat2 =3D [7] Lat3 =3D [.] Lat4 =3D [4] Lat5 =3D [8] ... Long1 =3D [-] Long2 =3D [1] Long3 =3D [2] ... Searching: 1. Start with box coordinates - North/West corner, South/East corner. For example, a sample box around Seattle, WA: North latitude =3D 47.77704 West longitude =3D -122.41909 South latitude =3D 47.34827 East longitude =3D -122.22031 2. Break up lat/long values in a manner similar to indexing. 3. Begin to build boolean query. Query will contain two required clauses: the North/South query, and the West/East query. Both queries are built using the same progressional evaluation of characters by position. 4. Compare North/South (top/bottom) values. Here's a pseudo-graphical chart: North =3D [4][7][.][7][7][7][0][4] South =3D [4][7][.][3][4][8][2][7] Qualifying records will have latitude values between 47.77704 and 47.34827. With lexicographical ordering in mind, I can safely include this phrase in my North/South query: (+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6) The logic is that any latitude with the first three values matching, and the fourth position containing either 4 or 5 or 6 will yield a qualifying match. What other queries are needed? Ones that match 7 or 3 in the fourth position should be considered, but they need further qualification. The further qualification is based on values from the fifth position. I can safely included the following phrases in my North/South query: (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3 Lat5:2 Lat5:1 Lat5:0) (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8 Lat5:9) The logic here is simply an extension of the first query, extended to next position in the latitude range. In the Northern latitude, with the first four positions matching those values exactly, anything less than 7 in the fifth position will yield a matching latitude. Same goes for the South (bottom) query. ************************************************************************ ******************* This approach yields a higher number of boolean query clauses, the more granular you get. In this scenario, I've estimated approximately 145 clauses within the final constructed query. In validation testing, this approach has proven to be: 1) Accurate. 2) Performant (thus far). At last, my question to everyone who cares to respond (and read this far): feedback? Thanks, -- jeff --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For additional commands, e-mail: java-user-help@lucene.apache.org