lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jeff Rodenburg" <jeff.rodenb...@gmail.com>
Subject Hacking proximity search: looking for feedback
Date Tue, 28 Feb 2006 20:10:43 GMT
I've been wrestling with a way to index and search data with a
geo-positional aspect.  By a geo-positional search, I want to constrain
search results within a given location range.  Furthermore, I want to allow
the user to set/change the geo-positional boundaries as needed for their
search.  This isn't a foreign concept to anyone who's needed to do the same.

There's been some work done in this area publicly (or at least what I could
find via the list archives and google), but not a lot.  The guys at
eventax.de have done some very impressive work.  Their implementation is
within the constructs of nutch; there's more here at
http://wiki.apache.org/nutch/GeoPosition.  Their approach is very
interesting and is predicated by having data indexed in a certain format.
I've considered building something based on the FunctionQuery class as
well.  Within this class, I could actually do the mathematical calculations
(Haversine formula, anyone?) for geo-positional filtering.  Range queries on
this data set were an option as well.

I've hit a performance wall with these approaches.  The geo-positional
calculations are proving to be intensive.  With the combination of my data
set, hardware, and OS, this just wasn't getting it done.

So, I reversed my thinking.  Instead of getting Lucene to do geo-math, what
if I made geo-math do Lucene?  Lucene is exceptional at string lookups; how
could I do geo-positional search in that framework?  I seized upon an
approach that I've validated in testing, but wanted to get more feedback
from the community.

********************************************************************************************************

Data structure:
Latitude & longitude values in decimal format, i.e. latitude=47.480227,
longitude=-122.333670 (btw, a tire shop I recently visited).

Geo definition:
Boxing around a center point.  It's not critical to do a radius search with
a given circle.  A boxed approach allows for taller or wider frames of
reference, which are applicable for our use.


How I'm solving:
Treat the data as strings and formulate boolean query lookups based on
positional comparison.  Here are the steps:

Indexing:
1. Break up lat/long values to individual characters, and store each field
in progression.  Field storage type = Keyword.  The tire shop example:
Lat1 = [4]
Lat2 = [7]
Lat3 = [.]
Lat4 = [4]
Lat5 = [8]
...
Long1 = [-]
Long2 = [1]
Long3 = [2]
...

Searching:
1. Start with box coordinates - North/West corner, South/East corner.  For
example, a sample box around Seattle, WA:
North latitude = 47.77704
West longitude = -122.41909
South latitude = 47.34827
East longitude = -122.22031
2. Break up lat/long values in a manner similar to indexing.
3. Begin to build boolean query.  Query will contain two required clauses:
the North/South query, and the West/East query.  Both queries are built
using the same progressional evaluation of characters by position.
4. Compare North/South (top/bottom) values.  Here's a pseudo-graphical
chart:

North = [4][7][.][7][7][7][0][4]
South = [4][7][.][3][4][8][2][7]

Qualifying records will have latitude values between 47.77704 and 47.34827.
With lexicographical ordering in mind, I can safely include this phrase in
my North/South query:

   (+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6)

The logic is that any latitude with the first three values matching, and the
fourth position containing either 4 or 5 or 6 will yield a qualifying match.

What other queries are needed?  Ones that match 7 or 3 in the fourth
position should be considered, but they need further qualification.  The
further qualification is based on values from the fifth position.  I can
safely included the following phrases in my North/South query:

    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3 Lat5:2
Lat5:1 Lat5:0)
    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8 Lat5:9)

The logic here is simply an extension of the first query, extended to next
position in the latitude range.  In the Northern latitude, with the first
four positions matching those values exactly, anything less than 7 in the
fifth position will yield a matching latitude.  Same goes for the South
(bottom) query.

*******************************************************************************************

This approach yields a higher number of boolean query clauses, the more
granular you get.  In this scenario, I've estimated approximately 145
clauses within the final constructed query.  In validation testing, this
approach has proven to be:

1) Accurate.
2) Performant (thus far).


At last, my question to everyone who cares to respond (and read this far):
feedback?

Thanks,
-- jeff

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message