Date: Wed, 27 Jul 2005 19:42:18 -0600 (MDT)
From: Barry Carter <barry.carter@bigfoot.com>
To: java-user@lucene.apache.org
Subject: Lucene vs Derby (vs MySQL) for spatial indexing

Does Lucene optimize range queries that use Sort and/or limit the number 
of hits?

My situation: I have a listing of 2 million cities, with the name,
latitude, longitude, and population of each city. I want to efficiently
find the 50 most populous cities between (for example) latitudes 35.2 and
41.7 and longitudes 19.8 and 27.9.

Assuming I normalize the data to be lexically sorted (in other words, I'll
write 7.52 as 007.52, so it comes before 111.01 instead of after it), can
I use a range query on the latitude and longitude fields (limiting the
number of hits to 50, and sorting by population descending) to efficiently
find what I want?
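
A minimal sketch of the normalization described above, in plain Java with no
Lucene calls. The class and method names (GeoKey, encodeLat, encodeLon, pad)
are hypothetical. Shifting the values by +90 and +180 is an added assumption,
not something stated above: it keeps negative coordinates from breaking the
lexical order, since a leading minus sign would not sort numerically.

```java
import java.util.Locale;

// Hypothetical helper (not a Lucene API): turns coordinates into fixed-width,
// zero-padded strings whose lexical order matches their numeric order.
final class GeoKey {
    private GeoKey() {}

    // Assumption: latitude -90..90 is shifted to 0..180 so no minus sign is needed.
    static String encodeLat(double lat) {
        return pad(lat + 90.0);
    }

    // Assumption: longitude -180..180 is shifted to 0..360.
    static String encodeLon(double lon) {
        return pad(lon + 180.0);
    }

    // Width 6 with two decimals: 7.52 -> "007.52", 111.01 -> "111.01".
    static String pad(double nonNegative) {
        return String.format(Locale.ROOT, "%06.2f", nonNegative);
    }

    public static void main(String[] args) {
        // Range bounds are encoded the same way as the indexed values.
        System.out.println(encodeLat(35.2) + " .. " + encodeLat(41.7));
        System.out.println(encodeLon(19.8) + " .. " + encodeLon(27.9));
    }
}
```

Both the indexed values and the range bounds would need to go through the same
encoding, otherwise the string comparison no longer matches the numeric one.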

If sorting isn't efficient, can I simply boost each record by its
population (so that high population cities are returned first) and then
limit the number of hits (so I see only the 50 most populous cities in a
given area)?
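
To make the "biggest first, then stop at the limit" idea concrete, here is a
rough in-memory sketch in plain Java. It is not how Lucene, Derby, or MySQL
work internally; TopCities, City, byPopulationDesc, and topInBox are
hypothetical names used only for illustration. The cities are sorted by
population once, and each query walks that list and stops as soon as it has
enough hits inside the bounding box.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory sketch (no Lucene or JDBC involved).
final class TopCities {
    static final class City {
        final String name;
        final double lat;
        final double lon;
        final long population;

        City(String name, double lat, double lon, long population) {
            this.name = name;
            this.lat = lat;
            this.lon = lon;
            this.population = population;
        }
    }

    // Sort once, largest population first; the input list is left untouched.
    static List<City> byPopulationDesc(List<City> cities) {
        List<City> sorted = new ArrayList<>(cities);
        sorted.sort(Comparator.comparingLong((City c) -> c.population).reversed());
        return sorted;
    }

    // Walk the pre-sorted list and stop once `limit` cities fall inside the box.
    static List<City> topInBox(List<City> sortedDesc,
                               double minLat, double maxLat,
                               double minLon, double maxLon,
                               int limit) {
        List<City> hits = new ArrayList<>();
        for (City c : sortedDesc) {
            if (hits.size() >= limit) {
                break; // early exit: every remaining city is no more populous
            }
            if (c.lat > minLat && c.lat < maxLat && c.lon > minLon && c.lon < maxLon) {
                hits.add(c);
            }
        }
        return hits;
    }
}
```

This exits early when the box is large (a zoomed-out map), but for a small box
containing few cities it can still scan most of the list, so it is only one
half of the problem.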

I tried this in Derby, the code being:

import java.sql.*;
Connection conn = DriverManager.getConnection("jdbc:derby:test;create=false");
Statement s = conn.createStatement();
s.setMaxRows(50); // return at most 50 rows
ResultSet rs = s.executeQuery("SELECT * FROM cities where lat>35.2 and lat<41.7 and lon>19.8 and lon<27.9 ORDER BY population desc");

but Derby inefficiently looks at ALL the cities matching my criteria (even
with indexes on lat and lon and population) before returning the top 50
(this is really bad when the condition is "lat>-90 and lat<90 and lon>-180
and lon<180", for example, because then every city matches).

The MySQL equivalent ("SELECT * FROM cities where lat>35.2 and lat<41.7
and lon>19.8 and lon<27.9 ORDER BY population desc LIMIT 50") with the
same indexes is more efficient (it uses the LIMIT condition to optimize
the query), and using MySQL with spatial indexes is even more efficient.
However, I'm doing this as part of a Java application, so I need something
that can be embedded in Java.

Is this a reasonable use of Lucene? Or is coercing Lucene into doing
range-based numeric queries a bad idea?

(In case anyone's interested, I'm writing a zoomable/pannable world map,
so finding the biggest cities in a given area quickly is important)

