incubator-cassandra-user mailing list archives

From Peter Hsu <pe...@motivecast.com>
Subject Data modeling question
Date Sat, 30 Jun 2012 00:13:10 GMT
I have a question about the best way to model and store my data.

The data
I have millions of nodes, each with a different Cartesian coordinate.  The keys for the nodes
are hashes of the coordinate.

My search is a proximity search: I'd like to find all the nodes within a given distance of
a particular node.  I can create an arbitrary grouping that puts any number of nodes
together, based on proximity.

e.g.
group 0 contains all points from (0,0) to (10,10)
group 1 contains all points from (10,0) to (20,10)
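
A minimal sketch of one such grouping in Python, assuming square cells of side 10 (as in the example above) with half-open boundaries, so that a point at x = 10 falls in the second cell rather than the first. The name `group_of` and the (column, row) group identifier are hypothetical, chosen only for illustration.

```python
CELL_SIZE = 10  # assumed cell side length, taken from the example groups


def group_of(x, y, cell_size=CELL_SIZE):
    """Return the (column, row) index of the grid cell containing (x, y).

    Cells are half-open: [0, 10) x [0, 10) is cell (0, 0), and so on.
    """
    return (int(x // cell_size), int(y // cell_size))


print(group_of(3, 7))    # a point inside the first example group
print(group_of(10, 0))   # on the shared boundary, assigned to the next cell
```

Whether boundary points belong to the lower or the upper cell is a design choice; the only requirement is that each point maps to exactly one group.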

For each coordinate, I store various metadata:
 8 columns: 4 UTF8Type (~20 bytes each) and 4 DoubleType

The query
I need a proximity search to return all data within a range from a selected node.  The typical
read size is ~100 distinct rows (e.g. a 10x10 grid around the selected node).  Since it's
on a coordinate system, I know ahead of time exactly which 100 rows I need.
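
To illustrate why the keys are known ahead of time, here is a rough sketch that enumerates them. It assumes integer coordinates and uses a plain 'x,y' string as a stand-in for the real coordinate hash; `window_keys` is a hypothetical helper, and the choice of offsets (-5 to +4 around the node) is just one way to centre an even-sized window.

```python
def window_keys(cx, cy, size=10):
    """Row keys for a size x size block of integer coordinates around (cx, cy)."""
    x0 = cx - size // 2
    y0 = cy - size // 2
    return [
        f"{x},{y}"
        for x in range(x0, x0 + size)
        for y in range(y0, y0 + size)
    ]


keys = window_keys(50, 50)
print(len(keys))  # number of rows the query has to fetch
```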

The modeling options

Option 1:
 - single column family, with key being the coordinate hash

e.g.
'0,0' : { meta }
'0,1' : { meta }
…
'10,20' : { meta }

 - query for 100 rows in parallel

 - I think this option is poor because it amounts to 100 non-sequential reads?
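
A sketch of what "100 rows in parallel" could look like on the client side. This does not use any real Cassandra client API: `fetch_row` is a placeholder for a single-row read, and `FAKE_STORE` is an in-memory dictionary standing in for the column family.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the column family: row key -> metadata (hypothetical data).
FAKE_STORE = {
    f"{x},{y}": {"name": f"node-{x}-{y}"}
    for x in range(20)
    for y in range(20)
}


def fetch_row(key):
    """Placeholder for a single-row read; a real client call would go here."""
    return key, FAKE_STORE.get(key)


def fetch_rows_parallel(keys, max_workers=16):
    """Issue one read per key concurrently and collect the rows that exist."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(fetch_row, keys)
        return {key: row for key, row in results if row is not None}


wanted = [f"{x},{y}" for x in range(5, 15) for y in range(5, 15)]
rows = fetch_rows_parallel(wanted)
print(len(rows))
```

Each key still costs one independent read, which is the concern raised above; the thread pool only overlaps the waiting.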

Option 2:
 - group my data into super columns, with key being the grouping

e.g.
'0' {
  '0,0' : { meta }
  ...
  '10,10' : { meta }
}
'1' {
  '10,0' : { meta }
  ...
  '20,10' : { meta }
}


 - query by the appropriate grouping
 - since I can't guarantee the query won't fall near the boundary of a grouping, I'm looking
at querying up to 4 different super column rows for each query
 - this seems reasonable, since I'm doing bulk sequential reads, but there is some overhead in
terms of pre-filtering and post-filtering
 - inflexible if I want to change the size of the proximity search
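
The pre-filtering and post-filtering steps can be sketched as follows, assuming square cells of side 10 with half-open boundaries and a query window given by inclusive corner coordinates. `groups_for_window` and `post_filter` are hypothetical names; the group identifier is a (column, row) pair rather than the single integer used in the example.

```python
def groups_for_window(x_min, y_min, x_max, y_max, cell_size=10):
    """Pre-filter: list every grid cell the query window overlaps."""
    gx0, gx1 = int(x_min // cell_size), int(x_max // cell_size)
    gy0, gy1 = int(y_min // cell_size), int(y_max // cell_size)
    return [
        (gx, gy)
        for gx in range(gx0, gx1 + 1)
        for gy in range(gy0, gy1 + 1)
    ]


def post_filter(points, x_min, y_min, x_max, y_max):
    """Post-filter: keep only the points that lie inside the window."""
    return [
        (x, y)
        for (x, y) in points
        if x_min <= x <= x_max and y_min <= y <= y_max
    ]


# A 10x10 window that straddles a cell corner touches four groups.
print(groups_for_window(5, 5, 14, 14))
# A window aligned with the grid touches only one.
print(groups_for_window(0, 0, 9, 9))
```

With a window no larger than one cell, at most four groups are ever read, which matches the estimate above.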

Option 3:
 - create a secondary index based on the grouping

e.g.
'0,0' : { meta, group='0' }
'0,1' : { meta, group='0' }
…
'10,20' : { meta, group='1' }

 - query by secondary index
 - same as above, will return some extra data, and will need to do filtering
 - no idea how Cassandra stores this data internally, but will the data access here be sequential?
 - a little more flexible in terms of proximity search: I can create multiple grouping types
based on the size of the search
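
As a rough illustration of "multiple grouping types", each row could carry one group label per cell size, and a query would use the label whose cell size best fits the search radius. The column names (`group_10`, `group_50`, `group_100`), the cell sizes, and the 'col:row' label format are all assumptions made up for this sketch.

```python
def group_labels(x, y, cell_sizes=(10, 50, 100)):
    """Map each cell size to a group label for the point (x, y)."""
    return {
        f"group_{size}": f"{int(x // size)}:{int(y // size)}"
        for size in cell_sizes
    }


# Values that would be written alongside the metadata for the node at (120, 35).
print(group_labels(120, 35))
```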

Option 4:
 - composite queries?
 -- I haven't had time to read up much on this, so I'm not sure whether it would help for my
use case.

Questions
 - I know there are pros and cons to each approach with respect to the flexibility of my search
size, but assuming my search proximity size is fixed, which method provides the best performance?
 - I guess the main question is: will querying by secondary index be efficient enough, or is
it worth grouping the data into super columns?
 - Is there a better way to model the data that I haven't thought of?


