I've implemented a basic geospatial search against a Cassandra dataset by keeping a column family of items indexed by geohash (http://en.wikipedia.org/wiki/Geohash). Essentially, to search for items within a given area, you calculate a geohash that covers the entire area (but is still as specific as possible) and use it to do a prefix query (i.e. range scan) on the index.

It has a weakness in that it fares badly in, for example, London, where your search area may straddle the boundary between positive and negative longitude. Hashes for points either side of this boundary (e.g. http://geohash.org/u10hb7951 and http://geohash.org/gcpuzewfz) have no prefix in common and so you'll end up doing a prefix scan on the empty string, pulling in everything. The same problem will arise (albeit less dramatically) in areas that contain points with only a short prefix in common.

Off the top of my head, there are two things you can do about this. Firstly, you can use a less concise geohash implementation that subdivides the world less quickly, increasing your chances of finding a closely-fitting bounding hash (i.e. common prefix) for any given area. Secondly, rather than find a single bounding hash, you could generate multiple hashes that cover the area, perform a prefix query for each of them and aggregate the results. This blog post describes the latter approach in more detail, along with some theoretical optimisations:


I also plan to investigate Local Lucene's approach, which uses Cartesian tiers:


On Fri, Mar 19, 2010 at 1:20 PM, Brandon Williams <driftx@gmail.com> wrote:
On Fri, Mar 19, 2010 at 9:06 AM, Joseph Stein <cryptcom@gmail.com> wrote:
Hi All, has anyone ever done geo coding to find distance based results
from storing long/lats with a starting long/lat and variable?

This thread might be helpful: