lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bryzek.Michael" <>
Subject RE: Hacking proximity search: looking for feedback
Date Tue, 28 Feb 2006 20:48:44 GMT
Jeff -

This is an interesting approach. On our end, we have experimented with
two variants:

Variant 1: Use Range Query

Rather than precomputing the boolean clauses yourself, index encoded
latitude and longitude values and use a Range Query. We encode by
adding 1000 to each of the values. Note: We only deal with zip codes
in the US for which this encoding works fine, but is worth another
look if you have a broader range of coordinates.

Example: Compute the box as you describe, then encode each value and
use a RangeQuery. Using the box you provided and lucene's query

       North latitude = 47.77704
       West longitude = -122.41909
       South latitude = 47.34827
       East longitude = -122.22031

gives us this query:

latitude:[1047.34827 TO 1047.77704] AND longitude:[877.58091 TO 877.77969]

Lucene then handles expanding this query into the appropriate set of
Boolean Queries.

Variant 2: Combine the above w/ a custom scorer

For us, boxing works well to retrieve relevant documents, but our
users want those documents sorted by distance. We modified the custom
scoring class that Hatcher provides in Lucene in Action to compute the
distance between two points for only those documents which actually
match. Thus, we do our search using a Range Query, then for all
matching documents we compute an actual score that incorporates the
distance from the actual location from which the user is searching. 

On our data set, we can still end up with 1000s of matching documents
after boxing. Thus, we still see a bottleneck computing the score for
even this smaller set of documents which we are still working through.


-----Original Message-----
From:	Jeff Rodenburg []
Sent:	Tue 2/28/06 3:10 PM
Subject:	Hacking proximity search: looking for feedback

I've been wrestling with a way to index and search data with a
geo-positional aspect.  By a geo-positional search, I want to constrain
search results within a given location range.  Furthermore, I want to allow
the user to set/change the geo-positional boundaries as needed for their
search.  This isn't a foreign concept to anyone who's needed to do the same.

There's been some work done in this area publicly (or at least what I could
find via the list archives and google), but not a lot.  The guys at have done some very impressive work.  Their implementation is
within the constructs of nutch; there's more here at  Their approach is very
interesting and is predicated by having data indexed in a certain format.
I've considered building something based on the FunctionQuery class as
well.  Within this class, I could actually do the mathematical calculations
(Haversine formula, anyone?) for geo-positional filtering.  Range queries on
this data set were an option as well.

I've hit a performance wall with these approaches.  The geo-positional
calculations are proving to be intensive.  With the combination of my data
set, hardware, and OS, this just wasn't getting it done.

So, I reversed my thinking.  Instead of getting Lucene to do geo-math, what
if I made geo-math do Lucene?  Lucene is exceptional at string lookups; how
could I do geo-positional search in that framework?  I seized upon an
approach that I've validated in testing, but wanted to get more feedback
from the community.


Data structure:
Latitude & longitude values in decimal format, i.e. latitude=47.480227,
longitude=-122.333670 (btw, a tire shop I recently visited).

Geo definition:
Boxing around a center point.  It's not critical to do a radius search with
a given circle.  A boxed approach allows for taller or wider frames of
reference, which are applicable for our use.

How I'm solving:
Treat the data as strings and formulate boolean query lookups based on
positional comparison.  Here are the steps:

1. Break up lat/long values to individual characters, and store each field
in progression.  Field storage type = Keyword.  The tire shop example:
Lat1 = [4]
Lat2 = [7]
Lat3 = [.]
Lat4 = [4]
Lat5 = [8]
Long1 = [-]
Long2 = [1]
Long3 = [2]

1. Start with box coordinates - North/West corner, South/East corner.  For
example, a sample box around Seattle, WA:
North latitude = 47.77704
West longitude = -122.41909
South latitude = 47.34827
East longitude = -122.22031
2. Break up lat/long values in a manner similar to indexing.
3. Begin to build boolean query.  Query will contain two required clauses:
the North/South query, and the West/East query.  Both queries are built
using the same progressional evaluation of characters by position.
4. Compare North/South (top/bottom) values.  Here's a pseudo-graphical

North = [4][7][.][7][7][7][0][4]
South = [4][7][.][3][4][8][2][7]

Qualifying records will have latitude values between 47.77704 and 47.34827.
With lexicographical ordering in mind, I can safely include this phrase in
my North/South query:

   (+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6)

The logic is that any latitude with the first three values matching, and the
fourth position containing either 4 or 5 or 6 will yield a qualifying match.

What other queries are needed?  Ones that match 7 or 3 in the fourth
position should be considered, but they need further qualification.  The
further qualification is based on values from the fifth position.  I can
safely included the following phrases in my North/South query:

    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3 Lat5:2
Lat5:1 Lat5:0)
    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8 Lat5:9)

The logic here is simply an extension of the first query, extended to next
position in the latitude range.  In the Northern latitude, with the first
four positions matching those values exactly, anything less than 7 in the
fifth position will yield a matching latitude.  Same goes for the South
(bottom) query.


This approach yields a higher number of boolean query clauses, the more
granular you get.  In this scenario, I've estimated approximately 145
clauses within the final constructed query.  In validation testing, this
approach has proven to be:

1) Accurate.
2) Performant (thus far).

At last, my question to everyone who cares to respond (and read this far):

-- jeff

View raw message