Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (asf.osuosl.org: local policy)
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Subject: RE: Hacking proximity search: looking for feedback
Date: Tue, 28 Feb 2006 14:53:20 -0600
Message-ID: <F9E303076B72074AB76B2C8F78952B291C7B9B@bettylou.configureone.net>
Thread-Topic: Hacking proximity search: looking for feedback
Thread-Index: AcY8qEM1y3OH45ZMSSuGoo3dcPmfPgAAIj1g
From: "John Powers" <jpowers@configureone.com>
To: <java-user@lucene.apache.org>

I don't know if this matters, but we do all of our geolocating in sql
with decent speed.   All the trig is in the query itself and then we can
limit top 5, top 10 etc for what we show.    Is the data such that you
need lucene?    Can I ask what causes it to be beyond a databases
ability?

-----Original Message-----
From: Bryzek.Michael [mailto:Michael.Bryzek@uwa.unitedway.org]=20
Sent: Tuesday, February 28, 2006 2:49 PM
To: java-user@lucene.apache.org
Subject: RE: Hacking proximity search: looking for feedback

Jeff -

This is an interesting approach. On our end, we have experimented with
two variants:

Variant 1: Use Range Query

Rather than precomputing the boolean clauses yourself, index encoded
latitude and longitude values and use a Range Query. We encode by
adding 1000 to each of the values. Note: We only deal with zip codes
in the US for which this encoding works fine, but is worth another
look if you have a broader range of coordinates.

Example: Compute the box as you describe, then encode each value and
use a RangeQuery. Using the box you provided and lucene's query
parser:

       North latitude =3D 47.77704
       West longitude =3D -122.41909
       South latitude =3D 47.34827
       East longitude =3D -122.22031

gives us this query:

latitude:[1047.34827 TO 1047.77704] AND longitude:[877.58091 TO
877.77969]

Lucene then handles expanding this query into the appropriate set of
Boolean Queries.


Variant 2: Combine the above w/ a custom scorer

For us, boxing works well to retrieve relevant documents, but our
users want those documents sorted by distance. We modified the custom
scoring class that Hatcher provides in Lucene in Action to compute the
distance between two points for only those documents which actually
match. Thus, we do our search using a Range Query, then for all
matching documents we compute an actual score that incorporates the
distance from the actual location from which the user is searching.=20

On our data set, we can still end up with 1000s of matching documents
after boxing. Thus, we still see a bottleneck computing the score for
even this smaller set of documents which we are still working through.

-Mike


-----Original Message-----
From:	Jeff Rodenburg [mailto:jeff.rodenburg@gmail.com]
Sent:	Tue 2/28/06 3:10 PM
To:	java-user@lucene.apache.org
Cc:=09
Subject:	Hacking proximity search: looking for feedback

I've been wrestling with a way to index and search data with a
geo-positional aspect.  By a geo-positional search, I want to constrain
search results within a given location range.  Furthermore, I want to
allow
the user to set/change the geo-positional boundaries as needed for their
search.  This isn't a foreign concept to anyone who's needed to do the
same.

There's been some work done in this area publicly (or at least what I
could
find via the list archives and google), but not a lot.  The guys at
eventax.de have done some very impressive work.  Their implementation is
within the constructs of nutch; there's more here at
http://wiki.apache.org/nutch/GeoPosition.  Their approach is very
interesting and is predicated by having data indexed in a certain
format.
I've considered building something based on the FunctionQuery class as
well.  Within this class, I could actually do the mathematical
calculations
(Haversine formula, anyone?) for geo-positional filtering.  Range
queries on
this data set were an option as well.

I've hit a performance wall with these approaches.  The geo-positional
calculations are proving to be intensive.  With the combination of my
data
set, hardware, and OS, this just wasn't getting it done.

So, I reversed my thinking.  Instead of getting Lucene to do geo-math,
what
if I made geo-math do Lucene?  Lucene is exceptional at string lookups;
how
could I do geo-positional search in that framework?  I seized upon an
approach that I've validated in testing, but wanted to get more feedback
from the community.

************************************************************************
********************************

Data structure:
Latitude & longitude values in decimal format, i.e. =
latitude=3D47.480227,
longitude=3D-122.333670 (btw, a tire shop I recently visited).

Geo definition:
Boxing around a center point.  It's not critical to do a radius search
with
a given circle.  A boxed approach allows for taller or wider frames of
reference, which are applicable for our use.


How I'm solving:
Treat the data as strings and formulate boolean query lookups based on
positional comparison.  Here are the steps:

Indexing:
1. Break up lat/long values to individual characters, and store each
field
in progression.  Field storage type =3D Keyword.  The tire shop example:
Lat1 =3D [4]
Lat2 =3D [7]
Lat3 =3D [.]
Lat4 =3D [4]
Lat5 =3D [8]
...
Long1 =3D [-]
Long2 =3D [1]
Long3 =3D [2]
...

Searching:
1. Start with box coordinates - North/West corner, South/East corner.
For
example, a sample box around Seattle, WA:
North latitude =3D 47.77704
West longitude =3D -122.41909
South latitude =3D 47.34827
East longitude =3D -122.22031
2. Break up lat/long values in a manner similar to indexing.
3. Begin to build boolean query.  Query will contain two required
clauses:
the North/South query, and the West/East query.  Both queries are built
using the same progressional evaluation of characters by position.
4. Compare North/South (top/bottom) values.  Here's a pseudo-graphical
chart:

North =3D [4][7][.][7][7][7][0][4]
South =3D [4][7][.][3][4][8][2][7]

Qualifying records will have latitude values between 47.77704 and
47.34827.
With lexicographical ordering in mind, I can safely include this phrase
in
my North/South query:

   (+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6)

The logic is that any latitude with the first three values matching, and
the
fourth position containing either 4 or 5 or 6 will yield a qualifying
match.

What other queries are needed?  Ones that match 7 or 3 in the fourth
position should be considered, but they need further qualification.  The
further qualification is based on values from the fifth position.  I can
safely included the following phrases in my North/South query:

    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3
Lat5:2
Lat5:1 Lat5:0)
    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8
Lat5:9)

The logic here is simply an extension of the first query, extended to
next
position in the latitude range.  In the Northern latitude, with the
first
four positions matching those values exactly, anything less than 7 in
the
fifth position will yield a matching latitude.  Same goes for the South
(bottom) query.

************************************************************************
*******************

This approach yields a higher number of boolean query clauses, the more
granular you get.  In this scenario, I've estimated approximately 145
clauses within the final constructed query.  In validation testing, this
approach has proven to be:

1) Accurate.
2) Performant (thus far).


At last, my question to everyone who cares to respond (and read this
far):
feedback?

Thanks,
-- jeff


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org