On Mon, Mar 1, 2010 at 3:58 PM, Ian Dees <ian.dees@gmail.com> wrote:
Hi List,

I was attempting to play around with Cassandra last month using the example of the OpenStreetMap database as a target to frame my learning. Now that I have a little extra time to start this endeavor again, I'm wondering if you could help me understand how the OpenStreetMap data model could fit into the Cassandra model.

You've chosen something that is fairly complicated for playing around. :)

A quick overview of the data model I have in mind:
1. There is a "node" which has:
     - Location (lat/lon)
     - Numeric id 
     - Tags (list of key/value pairs) 
2. A "way" has:
     - An ordered list of "node"
     - Numeric id
     - Tags (list of key/value pairs) 
There is one step beyond this, but I'm wondering if you could help me fit this simple first step into Cassandra.

There is almost always more than one way to model an application.  How you need to query your data is usually the most important factor when modelling. 

My queries would be something like the following:
1. What are the nodes in a given bounding box, what are the ways attached to those nodes?

This is the greatest constraint.  You need to query in two dimensions, so to simplify this I might suggest storing the node coordinates in a Z-order curve: http://en.wikipedia.org/wiki/Z-order_(curve)  This will reduce the dimensionality so that you can more easily range scan or slice without having to do multiple queries and then perform intersection in the client.  There is a research paper on this technique in range-capable DHTs here: http://www.geo.unizh.ch/~rsp/gir06/papers/individual/soro.pdf

Another approach that was suggested to me on irc might be to partition areas into fixed-size 'chunks', identified by the upper-left corner.  These would become the row keys, with the columns being the node keys.  Since the chunks all have the same height and width, it's relatively straightforward to convert a bounding box to the short list of chunks that you need to query, but you'll have to do some filtering client-side to meet the exact bounds.
Once you have the nodes, finding their ways in another column family where the rows are nodes and their ways are columns should be relatively easy.

2. What are the tags and nodes for a way with a Numeric id of x?

Use column families where the row keys are the numeric IDs, and the columns are the tags or nodes.

3. What are the nodes that have a tag key of "foo"? How about nodes that have "foo" = "bar"?

Use another column family where the row keys are the tags and the columns are the node ids.  For the 'foo=bar' situtation, make that the tag name.